[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719] Fix data inconsistency issues caused by concurrent clustering and delete partition.
hudi-bot commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1685199608 ## CI report: * 4f0de8a6d00fe72108a12d8316cb1d38389d6b31 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19355) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19362) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19365) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19376) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719] Fix data inconsistency issues caused by concurrent clustering and delete partition.
hudi-bot commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1685165782 ## CI report: * 4f0de8a6d00fe72108a12d8316cb1d38389d6b31 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19355) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19362) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19365) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19376)
[GitHub] [hudi] majian1998 commented on pull request #9472: [HUDI-6719] Fix data inconsistency issues caused by concurrent clustering and delete partition.
majian1998 commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1685163877 @hudi-bot run azure
[hudi] branch master updated: [MINOR] Close record readers after use during tests (#9457)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 5a7b5f28d99 [MINOR] Close record readers after use during tests (#9457) 5a7b5f28d99 is described below commit 5a7b5f28d99d16a7ca363a490a70702a87d85a89 Author: voonhous AuthorDate: Sun Aug 20 09:45:51 2023 +0800 [MINOR] Close record readers after use during tests (#9457) --- .../test/java/org/apache/hudi/testutils/HoodieMergeOnReadTestUtils.java | 1 + 1 file changed, 1 insertion(+) diff --git a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/testutils/HoodieMergeOnReadTestUtils.java b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/testutils/HoodieMergeOnReadTestUtils.java index 6f787db6069..7185115a4d5 100644 --- a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/testutils/HoodieMergeOnReadTestUtils.java +++ b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/testutils/HoodieMergeOnReadTestUtils.java @@ -166,6 +166,7 @@ public class HoodieMergeOnReadTestUtils { .forEach(fieldsPair -> newRecord.set(fieldsPair.getKey(), values[fieldsPair.getValue().pos()])); records.add(newRecord.build()); } +recordReader.close(); } } catch (IOException ie) { LOG.error("Read records error", ie);
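The one-line fix above closes the record reader explicitly after the read loop. For illustration only (using a hypothetical stand-in reader type, not the Hudi test utility), the same guarantee can be expressed with try-with-resources, which closes the reader on every exit path, including when reading throws:

```java
// Sketch with a hypothetical RecordReader; the real readers wrap file handles.
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

public class ReaderCloseSketch {
    // Minimal stand-in for a record reader that yields three records.
    static class RecordReader implements Closeable {
        private boolean closed = false;
        private int remaining = 3;
        boolean next() { return remaining-- > 0; }
        int current() { return 2 - remaining; }
        @Override public void close() { closed = true; }
        boolean isClosed() { return closed; }
    }

    static List<Integer> readAll(RecordReader reader) {
        List<Integer> records = new ArrayList<>();
        // try-with-resources invokes close() even if reading throws,
        // which a bare close() call after the loop does not guarantee.
        try (RecordReader r = reader) {
            while (r.next()) {
                records.add(r.current());
            }
        }
        return records;
    }

    public static void main(String[] args) {
        RecordReader reader = new RecordReader();
        List<Integer> records = readAll(reader);
        System.out.println(records.size() + " records, closed=" + reader.isClosed());
    }
}
```

In test code the explicit `close()` is fine; try-with-resources simply removes the leak if an exception is raised mid-iteration.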
[GitHub] [hudi] danny0405 merged pull request #9457: [MINOR] Close record readers after use during tests
danny0405 merged PR #9457: URL: https://github.com/apache/hudi/pull/9457
[GitHub] [hudi] danny0405 commented on pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive
danny0405 commented on PR #9416: URL: https://github.com/apache/hudi/pull/9416#issuecomment-1685150559 Looks good from my side, cc @yihua for the final review.
[GitHub] [hudi] guanziyue commented on a diff in pull request #4913: [HUDI-1517] create marker file for every log file
guanziyue commented on code in PR #4913: URL: https://github.com/apache/hudi/pull/4913#discussion_r1299270668 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java: ## @@ -273,4 +280,31 @@ protected static Option toAvroRecord(HoodieRecord record, Schema return Option.empty(); } } + + protected class AppendLogWriteCallback implements HoodieLogFileWriteCallback { +// Here we distinguish log files created from log files being appended. Consider the following scenario: +// An appending task writes to a log file. +// (1) append to existing file file_instant_writetoken1.log.1 +// (2) roll over and create file file_instant_writetoken2.log.2 +// Then this task fails and is retried by a new task. +// (3) append to existing file file_instant_writetoken1.log.1 +// (4) roll over and create file file_instant_writetoken3.log.2 +// Finally, file_instant_writetoken2.log.2 should not be committed to hudi; we use a marker file to delete it. +// Keep in mind that a log file is not always fail-safe unless it never rolls over + Review Comment: > oh, I get it. in hdfs-like systems, if we are using direct-style markers and two different writers try to append to the same log file (either sequentially or concurrently), we will be calling the append-type marker for the same log file more than once, and direct-style markers will fail since the marker file already exists. is my understanding correct? did we make any fix on this end or not yet? I mean, I understand we have reverted this patch in latest master, but for MDT purposes I am looking to see if we can re-add this patch (per-log-file marker). So I am trying to understand any gaps or failures we need to handle before we can add per-log-file marker support. Yes! You are correct. This was finally fixed by https://github.com/apache/hudi/pull/9003/files. Unfortunately, that PR was reverted due to another failure. Without MDT, the FileSystem-based FileSystemView can actually 'see' some uncommitted files, such as log files still being written. And according to the current FileGroup definition, an uncommitted log file is considered valid as long as it has a committed base instant time. Such an uncommitted file is handled correctly on the read path because hudi can detect that the instant time in a log block read from this log file is invalid. However, with this PR, we may delete an invalid log file as the commit is about to finish, while a reading job may still require this file to exist. In theory, such an error should not occur with MDT because MDT will not expose this file until it is committed. For the FileSystem-based FileSystemView, I could not come up with a fix in a short time.
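The retry collision described in this thread can be boiled down to a small sketch. Everything here is illustrative (a set standing in for the marker directory, hypothetical method names, not Hudi's actual marker implementation): a direct-style marker is create-if-absent, so a retried task that appends to the same log file trips over the marker left by its first attempt.

```java
// Hypothetical model of direct-style append markers; not Hudi code.
import java.util.HashSet;
import java.util.Set;

public class DirectMarkerSketch {
    // Stand-in for the marker directory on a strongly consistent store.
    private final Set<String> markers = new HashSet<>();

    // Direct-style creation: succeeds only if the marker does not already exist.
    boolean createAppendMarker(String logFile) {
        return markers.add("APPEND." + logFile);
    }

    public static void main(String[] args) {
        DirectMarkerSketch fs = new DirectMarkerSketch();
        // First attempt appends to the log file and leaves a marker behind.
        boolean first = fs.createAppendMarker("file_instant_writetoken1.log.1");
        // The task fails; the retry appends to the SAME log file and collides.
        boolean retry = fs.createAppendMarker("file_instant_writetoken1.log.1");
        System.out.println("first=" + first + " retry=" + retry);
    }
}
```

A fix along the lines discussed above would have to treat re-creating an append marker for the same log file as idempotent rather than as a failure.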
[GitHub] [hudi] hudi-bot commented on pull request #9482: [HUDI-6728] Update BigQuery manifest sync to support schema evolution
hudi-bot commented on PR #9482: URL: https://github.com/apache/hudi/pull/9482#issuecomment-1685108892 ## CI report: * 87b20e5e4cc44ed70c52ee9ae0f746542f144e52 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19374)
[GitHub] [hudi] hudi-bot commented on pull request #9485: [HUDI-6730] Enable hoodie configuration using the --conf option with the "spark." prefix
hudi-bot commented on PR #9485: URL: https://github.com/apache/hudi/pull/9485#issuecomment-1685088898 ## CI report: * 79060b391199c430b6d0ae8d7e63a10dfb2a853f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19372)
[GitHub] [hudi] hudi-bot commented on pull request #9482: [HUDI-6728] Update BigQuery manifest sync to support schema evolution
hudi-bot commented on PR #9482: URL: https://github.com/apache/hudi/pull/9482#issuecomment-1685079217 ## CI report: * 641d974b5e43f37f8ed429e75e817ba8a5a8376e Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19373) * 87b20e5e4cc44ed70c52ee9ae0f746542f144e52 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19374)
[GitHub] [hudi] nsivabalan commented on a diff in pull request #4913: [HUDI-1517] create marker file for every log file
nsivabalan commented on code in PR #4913: URL: https://github.com/apache/hudi/pull/4913#discussion_r1299237297 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java: ## @@ -273,4 +280,31 @@ protected static Option toAvroRecord(HoodieRecord record, Schema return Option.empty(); } } + + protected class AppendLogWriteCallback implements HoodieLogFileWriteCallback { +// Here we distinguish log files created from log files being appended. Consider the following scenario: +// An appending task writes to a log file. +// (1) append to existing file file_instant_writetoken1.log.1 +// (2) roll over and create file file_instant_writetoken2.log.2 +// Then this task fails and is retried by a new task. +// (3) append to existing file file_instant_writetoken1.log.1 +// (4) roll over and create file file_instant_writetoken3.log.2 +// Finally, file_instant_writetoken2.log.2 should not be committed to hudi; we use a marker file to delete it. +// Keep in mind that a log file is not always fail-safe unless it never rolls over + Review Comment: oh, I get it. in hdfs-like systems, if we are using direct-style markers and two different writers try to append to the same log file (either sequentially or concurrently), we will be calling the append-type marker for the same log file more than once, and direct-style markers will fail since the marker file already exists. is my understanding correct? did we make any fix on this end or not yet? I mean, I understand we have reverted this patch in latest master, but for MDT purposes I am looking to see if we can re-add this patch (per-log-file marker). So I am trying to understand any gaps or failures we need to handle before we can add per-log-file marker support.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #4913: [HUDI-1517] create marker file for every log file
nsivabalan commented on code in PR #4913: URL: https://github.com/apache/hudi/pull/4913#discussion_r1299237090 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java: ## @@ -273,4 +280,31 @@ protected static Option toAvroRecord(HoodieRecord record, Schema return Option.empty(); } } + + protected class AppendLogWriteCallback implements HoodieLogFileWriteCallback { +// Here we distinguish log files created from log files being appended. Consider the following scenario: +// An appending task writes to a log file. +// (1) append to existing file file_instant_writetoken1.log.1 +// (2) roll over and create file file_instant_writetoken2.log.2 +// Then this task fails and is retried by a new task. +// (3) append to existing file file_instant_writetoken1.log.1 +// (4) roll over and create file file_instant_writetoken3.log.2 +// Finally, file_instant_writetoken2.log.2 should not be committed to hudi; we use a marker file to delete it. +// Keep in mind that a log file is not always fail-safe unless it never rolls over + Review Comment: sorry, can you guys help me understand why preLogOpen would throw here?
[GitHub] [hudi] hudi-bot commented on pull request #9482: [HUDI-6728] Update BigQuery manifest sync to support schema evolution
hudi-bot commented on PR #9482: URL: https://github.com/apache/hudi/pull/9482#issuecomment-1685061547 ## CI report: * 75434ba7c835be022517f59805a12fc80da0d249 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19363) * 641d974b5e43f37f8ed429e75e817ba8a5a8376e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19373) * 87b20e5e4cc44ed70c52ee9ae0f746542f144e52 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9484: [HUDI-6729] Fix get partition values from path for non-string type partition column
hudi-bot commented on PR #9484: URL: https://github.com/apache/hudi/pull/9484#issuecomment-1685057712 ## CI report: * 905cc6b4eff305d54e52f4c1ac2d44d449e9afc5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19371)
[GitHub] [hudi] hudi-bot commented on pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive
hudi-bot commented on PR #9416: URL: https://github.com/apache/hudi/pull/9416#issuecomment-1685057662 ## CI report: * 5bd469384744d76c63e658e043e4dac6a6fd5ac3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19370)
[GitHub] [hudi] hudi-bot commented on pull request #9483: [HUDI-6156] Prevent leaving tmp file in timeline, delete tmp file whe…
hudi-bot commented on PR #9483: URL: https://github.com/apache/hudi/pull/9483#issuecomment-1685057703 ## CI report: * f424ca9897807f1bdcb7886dd6bb402e0968f04f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19369)
[GitHub] [hudi] hudi-bot commented on pull request #9482: [HUDI-6728] Update BigQuery manifest sync to support schema evolution
hudi-bot commented on PR #9482: URL: https://github.com/apache/hudi/pull/9482#issuecomment-1685042098 ## CI report: * 75434ba7c835be022517f59805a12fc80da0d249 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19363) * 641d974b5e43f37f8ed429e75e817ba8a5a8376e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19373)
[GitHub] [hudi] hudi-bot commented on pull request #9467: [HUDI-6621] Fix downgrade handler for 0.14.0
hudi-bot commented on PR #9467: URL: https://github.com/apache/hudi/pull/9467#issuecomment-1685042082 ## CI report: * d20d5b2e45e0eccf8f3ec40077696eecf9dfc4bb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19368)
[GitHub] [hudi] hudi-bot commented on pull request #9482: [HUDI-6728] Update BigQuery manifest sync to support schema evolution
hudi-bot commented on PR #9482: URL: https://github.com/apache/hudi/pull/9482#issuecomment-1685040652 ## CI report: * 75434ba7c835be022517f59805a12fc80da0d249 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19363) * 641d974b5e43f37f8ed429e75e817ba8a5a8376e UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9485: [HUDI-6730] Enable hoodie configuration using the --conf option with the "spark." prefix
hudi-bot commented on PR #9485: URL: https://github.com/apache/hudi/pull/9485#issuecomment-1685030774 ## CI report: * 79060b391199c430b6d0ae8d7e63a10dfb2a853f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19372)
[GitHub] [hudi] hudi-bot commented on pull request #9485: [HUDI-6730] Enable hoodie configuration using the --conf option with the "spark." prefix
hudi-bot commented on PR #9485: URL: https://github.com/apache/hudi/pull/9485#issuecomment-1685029379 ## CI report: * 79060b391199c430b6d0ae8d7e63a10dfb2a853f UNKNOWN
[jira] [Updated] (HUDI-6730) Enable hoodie configuration using the --conf option with the "spark." prefix.
[ https://issues.apache.org/jira/browse/HUDI-6730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6730: Labels: pull-request-available (was: ) > Key: HUDI-6730 > URL: https://issues.apache.org/jira/browse/HUDI-6730 > Project: Apache Hudi > Issue Type: Improvement > Reporter: Wechar > Priority: Major > Labels: pull-request-available -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] wecharyu opened a new pull request, #9485: [HUDI-6730] Enable hoodie configuration using the --conf option with the "spark." prefix
wecharyu opened a new pull request, #9485: URL: https://github.com/apache/hudi/pull/9485 ### Change Logs When submitting a Spark job with `--conf` options, Spark only accepts option keys that start with the "spark." prefix, so we can extract hoodie configs from SQL conf entries that start with "spark.hoodie.". ### Impact Users can set hoodie conf via `--conf spark.hoodie.xxx=xxx` when submitting a Spark job. ### Risk level (write none, low medium or high below) Low. ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
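The extraction this PR describes amounts to stripping the leading "spark." from matching conf keys so the remaining "hoodie." key can be passed through. A minimal sketch with hypothetical names (this is not the PR's actual code):

```java
// Illustrative only: pull hoodie options out of Spark conf entries that were
// smuggled in under the "spark." prefix, e.g. "spark.hoodie.xxx=yyy".
import java.util.HashMap;
import java.util.Map;

public class SparkPrefixSketch {
    static final String SPARK_PREFIX = "spark.";
    static final String HOODIE_PREFIX = "spark.hoodie.";

    static Map<String, String> extractHoodieConfigs(Map<String, String> sparkConf) {
        Map<String, String> hoodie = new HashMap<>();
        for (Map.Entry<String, String> e : sparkConf.entrySet()) {
            if (e.getKey().startsWith(HOODIE_PREFIX)) {
                // Strip only the leading "spark."; the "hoodie." key stays intact.
                hoodie.put(e.getKey().substring(SPARK_PREFIX.length()), e.getValue());
            }
        }
        return hoodie;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.hoodie.datasource.write.operation", "upsert");
        conf.put("spark.executor.memory", "4g"); // ordinary Spark option, ignored
        System.out.println(extractHoodieConfigs(conf));
    }
}
```

Non-hoodie Spark options pass through untouched; only keys under the "spark.hoodie." namespace are rewritten.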
[jira] [Updated] (HUDI-6730) Enable hoodie configuration using the --conf option with the "spark." prefix.
[ https://issues.apache.org/jira/browse/HUDI-6730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wechar updated HUDI-6730: Summary: Enable hoodie configuration using the --conf option with the "spark." prefix. (was: Enable hoodie configuration using the --conf option with the spark. prefix.) > Key: HUDI-6730 > URL: https://issues.apache.org/jira/browse/HUDI-6730 > Project: Apache Hudi > Issue Type: Improvement > Reporter: Wechar > Priority: Major
[jira] [Created] (HUDI-6730) Enable hoodie configuration using the --conf option with the spark. prefix.
Wechar created HUDI-6730: Summary: Enable hoodie configuration using the --conf option with the spark. prefix. Key: HUDI-6730 URL: https://issues.apache.org/jira/browse/HUDI-6730 Project: Apache Hudi Issue Type: Improvement Reporter: Wechar
[GitHub] [hudi] hudi-bot commented on pull request #9484: [HUDI-6729] Fix get partition values from path for non-string type partition column
hudi-bot commented on PR #9484: URL: https://github.com/apache/hudi/pull/9484#issuecomment-1685009140 ## CI report: * 4d503b60e26faf4f879e09f266255d6c9af98afc5 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19367) * 905cc6b4eff305d54e52f4c1ac2d44d449e9afc5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19371)
[GitHub] [hudi] hudi-bot commented on pull request #9477: [HUDI-6726] Fix connection leaks related to file reader and iterator close
hudi-bot commented on PR #9477: URL: https://github.com/apache/hudi/pull/9477#issuecomment-1685009045 ## CI report: * 2fe4b6b8c722c26e4d970e8613be2f73e4b4eb4f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19364)
[GitHub] [hudi] hudi-bot commented on pull request #9484: [HUDI-6729] Fix get partition values from path for non-string type partition column
hudi-bot commented on PR #9484: URL: https://github.com/apache/hudi/pull/9484#issuecomment-1684998275 ## CI report: * 4d503b60e26faf4f879e09f266255d6c9af98afc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19367) * 905cc6b4eff305d54e52f4c1ac2d44d449e9afc5 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9477: [HUDI-6726] Fix connection leaks related to file reader and iterator close
hudi-bot commented on PR #9477: URL: https://github.com/apache/hudi/pull/9477#issuecomment-1684998127

## CI report:

* 2fe4b6b8c722c26e4d970e8613be2f73e4b4eb4f UNKNOWN
[GitHub] [hudi] nsivabalan closed pull request #9457: [MINOR] Close record readers after use during tests
nsivabalan closed pull request #9457: [MINOR] Close record readers after use during tests
URL: https://github.com/apache/hudi/pull/9457
[GitHub] [hudi] nsivabalan commented on pull request #9457: [MINOR] Close record readers after use during tests
nsivabalan commented on PR #9457: URL: https://github.com/apache/hudi/pull/9457#issuecomment-1684948315

test failure is unrelated. Landing the patch.
[GitHub] [hudi] hudi-bot commented on pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive
hudi-bot commented on PR #9416: URL: https://github.com/apache/hudi/pull/9416#issuecomment-1684945535

## CI report:

* 642c6dd967978781d41b74138f89fae26192056b Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19263)
* 5bd469384744d76c63e658e043e4dac6a6fd5ac3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19370)
[GitHub] [hudi] codope commented on a diff in pull request #9444: [HUDI-6692] Do not allow switching from Primary keyed table to primary key less table
codope commented on code in PR #9444: URL: https://github.com/apache/hudi/pull/9444#discussion_r1299186957

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:

```diff
@@ -956,9 +956,7 @@ object DataSourceOptionsHelper {
    */
   def fetchMissingWriteConfigsFromTableConfig(tableConfig: HoodieTableConfig, params: Map[String, String]) : Map[String, String] = {
     val missingWriteConfigs = scala.collection.mutable.Map[String, String]()
-    if (!params.contains(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key()) && tableConfig.getRawRecordKeyFieldProp != null) {
-      missingWriteConfigs ++= Map(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key() -> tableConfig.getRawRecordKeyFieldProp)
-    }
```

Review Comment:

I am not following the fix here. I think this is a valid block. If some batch did not have the record key in the write config, then why not infer it from the table config if it is present? I believe we resolve the configs before setting the write operation.
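The block under discussion implements a simple fallback: when the incoming write params lack a config that the persisted table config carries, copy it over. Below is a minimal, self-contained sketch of that pattern; the config key string mirrors Hudi's record key option, but the class and method names here are hypothetical, not the actual Hudi API:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ConfigFallbackDemo {
    // Hypothetical stand-in for KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key()
    static final String RECORDKEY_FIELD = "hoodie.datasource.write.recordkey.field";

    // If the write params do not carry the record key config but the table
    // config does, return it as a "missing" config to be merged in later.
    static Map<String, String> fetchMissing(Map<String, String> params, Map<String, String> tableConfig) {
        Map<String, String> missing = new HashMap<>();
        if (!params.containsKey(RECORDKEY_FIELD) && tableConfig.get(RECORDKEY_FIELD) != null) {
            missing.put(RECORDKEY_FIELD, tableConfig.get(RECORDKEY_FIELD));
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, String> tableCfg = Collections.singletonMap(RECORDKEY_FIELD, "uuid");
        // Batch without a record key in its write config: inferred from table config
        System.out.println(fetchMissing(new HashMap<>(), tableCfg)); // {hoodie.datasource.write.recordkey.field=uuid}
        // Batch that already sets it: nothing to fill in
        System.out.println(fetchMissing(tableCfg, tableCfg)); // {}
    }
}
```

This is the crux of the reviewer's question: removing the block means a batch that omits the record key can no longer inherit it from the table config.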
[GitHub] [hudi] hudi-bot commented on pull request #9483: [HUDI-6156] Prevent leaving tmp file in timeline, delete tmp file whe…
hudi-bot commented on PR #9483: URL: https://github.com/apache/hudi/pull/9483#issuecomment-1684944274

## CI report:

* 373fb78cc587229fd9210edc0b9102101b3a3deb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19366)
* f424ca9897807f1bdcb7886dd6bb402e0968f04f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19369)
[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
hudi-bot commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1684944263

## CI report:

* 4f0de8a6d00fe72108a12d8316cb1d38389d6b31 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19355) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19362) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19365)
[GitHub] [hudi] hudi-bot commented on pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive
hudi-bot commented on PR #9416: URL: https://github.com/apache/hudi/pull/9416#issuecomment-1684944240

## CI report:

* 642c6dd967978781d41b74138f89fae26192056b Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19263)
* 5bd469384744d76c63e658e043e4dac6a6fd5ac3 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9483: [HUDI-6156] Prevent leaving tmp file in timeline, delete tmp file whe…
hudi-bot commented on PR #9483: URL: https://github.com/apache/hudi/pull/9483#issuecomment-1684942784

## CI report:

* 373fb78cc587229fd9210edc0b9102101b3a3deb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19366)
* f424ca9897807f1bdcb7886dd6bb402e0968f04f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9467: [HUDI-6621] Fix downgrade handler for 0.14.0
hudi-bot commented on PR #9467: URL: https://github.com/apache/hudi/pull/9467#issuecomment-1684942768

## CI report:

* 50d4aea5e545b5094368e3a192ffb5fd2008c481 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19349)
* d20d5b2e45e0eccf8f3ec40077696eecf9dfc4bb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19368)
[GitHub] [hudi] codope commented on a diff in pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
codope commented on code in PR #9422: URL: https://github.com/apache/hudi/pull/9422#discussion_r1299184657

## hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestMORColstats.java:

```diff
@@ -0,0 +1,481 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.functional;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.client.SparkRDDWriteClient;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieCompactionConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.testutils.HoodieSparkClientTestBase;
+
+import org.apache.spark.SparkException;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.testutils.RawTripTestPayload.recordToString;
+import static org.apache.hudi.config.HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS;
+import static org.apache.spark.sql.SaveMode.Append;
+import static org.apache.spark.sql.SaveMode.Overwrite;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+
+/**
+ * Test mor with colstats enabled in scenarios to ensure that files
+ * are being appropriately read or not read.
+ * The strategy employed is to corrupt targeted base files. If we want
+ * to prove the file is read, we assert that an exception will be thrown.
+ * If we want to prove the file is not read, we expect the read to
+ * successfully execute.
+ */
+public class TestMORColstats extends HoodieSparkClientTestBase {
+
+  private static String matchCond = "trip_type = 'UBERX'";
+  private static String nonMatchCond = "trip_type = 'BLACK'";
+  private static String[] dropColumns = {"_hoodie_commit_time", "_hoodie_commit_seqno",
+      "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name"};
+
+  private Boolean shouldOverwrite;
+  Map options;
+  @TempDir
+  public java.nio.file.Path basePath;
+
+  @BeforeEach
+  public void setUp() throws Exception {
+    initSparkContexts();
+    dataGen = new HoodieTestDataGenerator();
+    shouldOverwrite = true;
+    options = getOptions();
+    Properties props = new Properties();
+    props.putAll(options);
+    try {
+      metaClient = HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, basePath.toString(), props);
+    } catch (IOException e) {
+      throw new RuntimeException(e);
+    }
+  }
+
+  @AfterEach
+  public void tearDown() throws IOException {
+    cleanupSparkContexts();
+    cleanupTestDataGenerator();
+    metaClient = null;
+  }
+
+  /**
+   * Create two files, one should be excluded by colstats
+   */
+  @Test
+  public void testBaseFileOnly() {
+    Dataset inserts = makeInsertDf("000", 100);
+    Dataset batch1 = inserts.where(matchCond);
+    Dataset batch2 = inserts.where(nonMatchCond);
+    doWrite(batch1);
+    doWrite(batch2);
+    List filesToCorrupt = getFilesToCorrupt();
+    assertEquals(1, filesToCorrupt.size());
+    filesToCorrupt.forEach(TestMORColstats::corruptFile);
+    assertEquals(0, readMatchingRecords().except(batch1).count());
+    //Read without data skipping to show that it will fail
+    //Reading with data skipping succeeded so that means that data skipping is working and the corrupted
+    //file was no
```
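The corruption strategy described in the test's javadoc can be reduced to a tiny sketch: overwrite a base file with bytes no format-aware reader could parse, so any subsequent read of that file must fail. This is an illustration of the technique only; the `corruptFile` helper in the actual test may differ:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CorruptFileDemo {
    // Replace the file's contents with garbage so any format-aware reader
    // (e.g. a Parquet reader) that touches it will throw.
    static void corruptFile(Path file) throws IOException {
        Files.write(file, "not-a-parquet-file".getBytes()); // truncates and rewrites by default
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("basefile", ".parquet");
        Files.write(f, new byte[] {'P', 'A', 'R', '1'}); // pretend this is a valid base file
        corruptFile(f);
        System.out.println(new String(Files.readAllBytes(f))); // not-a-parquet-file
        Files.delete(f);
    }
}
```

If a query with data skipping succeeds while the same query without skipping fails on the corrupted file, the column stats index demonstrably pruned that file.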
[GitHub] [hudi] Zouxxyy commented on a diff in pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive
Zouxxyy commented on code in PR #9416: URL: https://github.com/apache/hudi/pull/9416#discussion_r1299184229

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:

```diff
@@ -452,107 +431,137 @@ private Stream getCommitInstantsToArchive() throws IOException {
         ? CompactionUtils.getOldestInstantToRetainForCompaction(
             table.getActiveTimeline(), config.getInlineCompactDeltaCommitMax())
         : Option.empty();
+    oldestInstantToRetainCandidates.add(oldestInstantToRetainForCompaction);

-    // The clustering commit instant can not be archived unless we ensure that the replaced files have been cleaned,
+    // 3. The clustering commit instant can not be archived unless we ensure that the replaced files have been cleaned,
     // without the replaced files metadata on the timeline, the fs view would expose duplicates for readers.
     // Meanwhile, when inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline
     // to check whether the file slice generated in pending clustering after archive isn't committed.
     Option oldestInstantToRetainForClustering =
         ClusteringUtils.getOldestInstantToRetainForClustering(table.getActiveTimeline(), table.getMetaClient());
+    oldestInstantToRetainCandidates.add(oldestInstantToRetainForClustering);
+
+    // 4. If metadata table is enabled, do not archive instants which are more recent than the last compaction on the
+    // metadata table.
+    if (table.getMetaClient().getTableConfig().isMetadataTableAvailable()) {
+      try (HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(table.getContext(), config.getMetadataConfig(), config.getBasePath())) {
+        Option latestCompactionTime = tableMetadata.getLatestCompactionTime();
+        if (!latestCompactionTime.isPresent()) {
+          LOG.info("Not archiving as there is no compaction yet on the metadata table");
+          return Collections.emptyList();
+        } else {
+          LOG.info("Limiting archiving of instants to latest compaction on metadata table at " + latestCompactionTime.get());
+          oldestInstantToRetainCandidates.add(Option.of(new HoodieInstant(
+              HoodieInstant.State.COMPLETED, COMPACTION_ACTION, latestCompactionTime.get())));
+        }
+      } catch (Exception e) {
+        throw new HoodieException("Error limiting instant archival based on metadata table", e);
+      }
+    }
+
+    // 5. If this is a metadata table, do not archive the commits that live in data set
+    // active timeline. This is required by metadata table,
+    // see HoodieTableMetadataUtil#processRollbackMetadata for details.
+    if (table.isMetadataTable()) {
+      HoodieTableMetaClient dataMetaClient = HoodieTableMetaClient.builder()
+          .setBasePath(HoodieTableMetadata.getDatasetBasePath(config.getBasePath()))
+          .setConf(metaClient.getHadoopConf())
+          .build();
+      Option qualifiedEarliestInstant =
+          TimelineUtils.getEarliestInstantForMetadataArchival(
+              dataMetaClient.getActiveTimeline(), config.shouldArchiveBeyondSavepoint());
+
+      // Do not archive the instants after the earliest commit (COMMIT, DELTA_COMMIT, and
+      // REPLACE_COMMIT only, considering non-savepoint commit only if enabling archive
+      // beyond savepoint) and the earliest inflight instant (all actions).
+      // This is required by metadata table, see HoodieTableMetadataUtil#processRollbackMetadata
+      // for details.
+      // Todo: Remove #7580
+      // Note that we cannot blindly use the earliest instant of all actions, because CLEAN and
+      // ROLLBACK instants are archived separately apart from commits (check
+      // HoodieTimelineArchiver#getCleanInstantsToArchive). If we do so, a very old completed
+      // CLEAN or ROLLBACK instant can block the archive of metadata table timeline and causes
+      // the active timeline of metadata table to be extremely long, leading to performance issues
+      // for loading the timeline.
+      oldestInstantToRetainCandidates.add(qualifiedEarliestInstant);
+    }
+
+    // Choose the instant in oldestInstantToRetainCandidates with the smallest
+    // timestamp as oldestInstantToRetain.
+    java.util.Optional oldestInstantToRetain = oldestInstantToRetainCandidates
+        .stream()
+        .filter(Option::isPresent)
+        .map(Option::get)
+        .min(HoodieInstant.COMPARATOR);

-    // Actually do the commits
-    Stream instantToArchiveStream = commitTimeline.getInstantsAsStream()
+    // Step2: We cannot archive any commits which are made after the first savepoint present,
+    // unless HoodieArchivalConfig#ARCHIVE_BEYOND_SAVEPOINT is enabled.
+    Option firstSavepoint = table.getComp
```
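Stripped of the Hudi-specific types, the candidate-selection step in the hunk above boils down to: gather several optional retention bounds, discard the absent ones, and keep the smallest timestamp. A minimal sketch under those assumptions (timestamps and the class name are illustrative; Hudi instant timestamps compare correctly as strings because they are fixed-width):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class OldestInstantDemo {
    // Each candidate is an optional lower bound on what must stay in the
    // active timeline; the oldest present bound wins.
    static Optional<String> oldestToRetain(List<Optional<String>> candidates) {
        return candidates.stream()
                .filter(Optional::isPresent)
                .map(Optional::get)
                .min(Comparator.naturalOrder());
    }

    public static void main(String[] args) {
        List<Optional<String>> candidates = Arrays.asList(
                Optional.of("20230818103000"), // e.g. a compaction retention bound
                Optional.empty(),              // e.g. no clustering bound this round
                Optional.of("20230817090000")  // e.g. a metadata-table bound
        );
        System.out.println(oldestToRetain(candidates).get()); // 20230817090000
    }
}
```

Everything older than this minimum is then eligible for archival, subject to the savepoint check that follows in the diff.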
[jira] [Updated] (HUDI-6621) Add a downgrade step from 6 to 5 to detect new delete blocks
[ https://issues.apache.org/jira/browse/HUDI-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-6621:
---------------------------------
    Labels: pull-request-available  (was: )

> Add a downgrade step from 6 to 5 to detect new delete blocks
> ------------------------------------------------------------
>
>                 Key: HUDI-6621
>                 URL: https://issues.apache.org/jira/browse/HUDI-6621
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
>
> In table version 6, we introduce a new delete block format (v3) with Avro
> serde (HUDI-5760). For downgrading a table from v6 to v5, we need to perform
> compaction to handle v3 delete blocks created using the new format.
> Also with the addition of record index field in Metadata table schema, the
> downgrade needs to delete the metadata table to avoid column drop errors
> after downgrade.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #9467: [HUDI-6621] Fix downgrade handler for 0.14.0
hudi-bot commented on PR #9467: URL: https://github.com/apache/hudi/pull/9467#issuecomment-1684934537

## CI report:

* 50d4aea5e545b5094368e3a192ffb5fd2008c481 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19349)
* d20d5b2e45e0eccf8f3ec40077696eecf9dfc4bb UNKNOWN
[jira] [Updated] (HUDI-6621) Add a downgrade step from 6 to 5 to detect new delete blocks
[ https://issues.apache.org/jira/browse/HUDI-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-6621:
------------------------------
    Description: 
In table version 6, we introduce a new delete block format (v3) with Avro serde (HUDI-5760). For downgrading a table from v6 to v5, we need to perform compaction to handle v3 delete blocks created using the new format.

Also with the addition of record index field in Metadata table schema, the downgrade needs to delete the metadata table to avoid column drop errors after downgrade.

  was: In table version 6, we introduce a new delete block format (v3) with Avro serde (HUDI-5760). For downgrading a table from v6 to v5, we need to perform compaction to handle v3 delete blocks created using the new format.

> Add a downgrade step from 6 to 5 to detect new delete blocks
> ------------------------------------------------------------
>
>                 Key: HUDI-6621
>                 URL: https://issues.apache.org/jira/browse/HUDI-6621
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>             Fix For: 0.14.0
>
>
> In table version 6, we introduce a new delete block format (v3) with Avro
> serde (HUDI-5760). For downgrading a table from v6 to v5, we need to perform
> compaction to handle v3 delete blocks created using the new format.
> Also with the addition of record index field in Metadata table schema, the
> downgrade needs to delete the metadata table to avoid column drop errors
> after downgrade.
[jira] [Updated] (HUDI-6621) Add a downgrade step from 6 to 5 to detect new delete blocks
[ https://issues.apache.org/jira/browse/HUDI-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-6621:
------------------------------
    Description: 
In table version 6, we introduce a new delete block format (v3) with Avro serde (HUDI-5760). For downgrading a table from v6 to v5, we need to perform compaction to handle v3 delete blocks created using the new format.

  (was: In table version 6, we introduce a new delete block format (v3) with Avro serde (HUDI-5760). For downgrading a table from v6 to v5, we need to check any v3 delete blocks using the new format and ask user to manually restore to a commit before any file slice with a v3 delete block.)

> Add a downgrade step from 6 to 5 to detect new delete blocks
> ------------------------------------------------------------
>
>                 Key: HUDI-6621
>                 URL: https://issues.apache.org/jira/browse/HUDI-6621
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>             Fix For: 0.14.0
>
>
> In table version 6, we introduce a new delete block format (v3) with Avro
> serde (HUDI-5760). For downgrading a table from v6 to v5, we need to perform
> compaction to handle v3 delete blocks created using the new format.
[jira] [Closed] (HUDI-6717) Fix downgrade handler for 0.14.0
[ https://issues.apache.org/jira/browse/HUDI-6717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain closed HUDI-6717.
-----------------------------
    Resolution: Duplicate

> Fix downgrade handler for 0.14.0
> --------------------------------
>
>                 Key: HUDI-6717
>                 URL: https://issues.apache.org/jira/browse/HUDI-6717
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Lokesh Jain
>            Assignee: Lokesh Jain
>            Priority: Major
>              Labels: pull-request-available
>
> Since the log block version (due to delete block change) has been upgraded in
> 0.14.0, the delete blocks can not be read in 0.13.0 or earlier.
> Similarly the addition of record level index field in metadata table leads to
> column drop error on downgrade. The Jira aims to fix the downgrade handler to
> trigger compaction and delete metadata table if user wishes to downgrade from
> version six (0.14.0) to version 5 (0.13.0).
[GitHub] [hudi] lokeshj1703 commented on a diff in pull request #9467: [HUDI-6717] Fix downgrade handler for 0.14.0
lokeshj1703 commented on code in PR #9467: URL: https://github.com/apache/hudi/pull/9467#discussion_r1299179473

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java:

```diff
@@ -39,20 +47,26 @@
 import static org.apache.hudi.common.table.HoodieTableConfig.TABLE_METADATA_PARTITIONS;
 import static org.apache.hudi.common.table.HoodieTableConfig.TABLE_METADATA_PARTITIONS_INFLIGHT;
-import static org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTablePartition;

 /**
  * Downgrade handle to assist in downgrading hoodie table from version 6 to 5.
  * To ensure compatibility, we need recreate the compaction requested file to
  * .aux folder.
+ * Since version 6 includes a new schema field for metadata table(MDT),
+ * the MDT needs to be deleted during downgrade to avoid column drop error.
+ * Also log block version was upgraded in version 6, therefore full compaction needs
+ * to be completed during downgrade to avoid write failures.
  */
 public class SixToFiveDowngradeHandler implements DowngradeHandler {

   @Override
   public Map downgrade(HoodieWriteConfig config, HoodieEngineContext context, String instantTime, SupportsUpgradeDowngrade upgradeDowngradeHelper) {
     final HoodieTable table = upgradeDowngradeHelper.getTable(config, context);
-    removeRecordIndexIfNeeded(table, context);
+    // Since version 6 includes a new schema field for metadata table(MDT), the MDT needs to be deleted during downgrade to avoid column drop error.
+    HoodieTableMetadataUtil.deleteMetadataTable(config.getBasePath(), context);
+    runCompaction(table, context, config, upgradeDowngradeHelper);
```

Review Comment:

Addressed
[GitHub] [hudi] lokeshj1703 commented on a diff in pull request #9467: [HUDI-6717] Fix downgrade handler for 0.14.0
lokeshj1703 commented on code in PR #9467: URL: https://github.com/apache/hudi/pull/9467#discussion_r1299179444 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java: ## @@ -65,13 +79,31 @@ public Map downgrade(HoodieWriteConfig config, HoodieEng } /** - * Record-level index, a new partition in metadata table, was first added in - * 0.14.0 ({@link HoodieTableVersion#SIX}. Any downgrade from this version - * should remove this partition. + * Utility method to run compaction for MOR table as part of downgrade step. */ - private static void removeRecordIndexIfNeeded(HoodieTable table, HoodieEngineContext context) { -HoodieTableMetaClient metaClient = table.getMetaClient(); -deleteMetadataTablePartition(metaClient, context, MetadataPartitionType.RECORD_INDEX, false); + private void runCompaction(HoodieTable table, HoodieEngineContext context, HoodieWriteConfig config, + SupportsUpgradeDowngrade upgradeDowngradeHelper) { +try { + if (table.getMetaClient().getTableType() == HoodieTableType.MERGE_ON_READ) { +// The log block version has been upgraded in version six so compaction is required for downgrade. +// set required configs for scheduling compaction. 
+ HoodieInstantTimeGenerator.setCommitTimeZone(table.getMetaClient().getTableConfig().getTimelineTimezone()); +HoodieWriteConfig compactionConfig = HoodieWriteConfig.newBuilder().withProps(config.getProps()).build(); +compactionConfig.setValue(HoodieCompactionConfig.INLINE_COMPACT.key(), "true"); + compactionConfig.setValue(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS.key(), "1"); + compactionConfig.setValue(HoodieCompactionConfig.INLINE_COMPACT_TRIGGER_STRATEGY.key(), CompactionTriggerStrategy.NUM_COMMITS.name()); + compactionConfig.setValue(HoodieCompactionConfig.COMPACTION_STRATEGY.key(), UnBoundedCompactionStrategy.class.getName()); +compactionConfig.setValue(HoodieMetadataConfig.ENABLE.key(), "false"); +EmbeddedTimelineServerHelper.createEmbeddedTimelineService(context, config); Review Comment: Addressed. This was required earlier but is no longer needed. ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java: ## @@ -39,20 +47,26 @@ import static org.apache.hudi.common.table.HoodieTableConfig.TABLE_METADATA_PARTITIONS; import static org.apache.hudi.common.table.HoodieTableConfig.TABLE_METADATA_PARTITIONS_INFLIGHT; -import static org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTablePartition; /** * Downgrade handler to assist in downgrading a hoodie table from version 6 to 5. * To ensure compatibility, we need to recreate the compaction requested file in the * .aux folder. + * Since version 6 includes a new schema field for the metadata table (MDT), + * the MDT needs to be deleted during downgrade to avoid a column drop error. + * Also, the log block version was upgraded in version 6, so full compaction needs + * to be completed during downgrade to avoid write failures. Review Comment: Addressed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [hudi] hudi-bot commented on pull request #9484: [HUDI-6729] Fix get partition values from path for non-string type partition column
hudi-bot commented on PR #9484: URL: https://github.com/apache/hudi/pull/9484#issuecomment-1684924952 ## CI report: * 4d503b60e26faf4f879e09f266255d6c9af98afc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19367) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9484: [HUDI-6729] Fix get partition values from path for non-string type partition column
hudi-bot commented on PR #9484: URL: https://github.com/apache/hudi/pull/9484#issuecomment-1684923777 ## CI report: * 4d503b60e26faf4f879e09f266255d6c9af98afc UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6729) Fix get partition values from path for non-string type partition column
[ https://issues.apache.org/jira/browse/HUDI-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6729: - Labels: pull-request-available (was: ) > Fix get partition values from path for non-string type partition column > --- > > Key: HUDI-6729 > URL: https://issues.apache.org/jira/browse/HUDI-6729 > Project: Apache Hudi > Issue Type: Bug > Components: hudi-utilities >Reporter: Wechar >Priority: Major > Labels: pull-request-available > > When we enable {{hoodie.datasource.read.extract.partition.values.from.path}} > to get partition values from the path instead of the data file, an exception is thrown > if the partition column is not a string type: > {code:bash} > Caused by: java.lang.ClassCastException: > org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer > at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:103) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getInt(rows.scala:41) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getInt$(rows.scala:41) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getInt(rows.scala:195) > at > org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:97) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:245) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:264) > at > org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:314) > at > org.apache.hudi.HoodieDataSourceHelper$.$anonfun$buildHoodieParquetReader$1(HoodieDataSourceHelper.scala:67) > at > org.apache.hudi.HoodieBaseRelation.$anonfun$createBaseFileReader$2(HoodieBaseRelation.scala:602) > at > 
org.apache.hudi.HoodieBaseRelation$BaseFileReader.apply(HoodieBaseRelation.scala:680) > at > org.apache.hudi.HoodieBaseRelation$.$anonfun$projectReader$1(HoodieBaseRelation.scala:706) > at > org.apache.hudi.HoodieBaseRelation$.$anonfun$projectReader$2(HoodieBaseRelation.scala:711) > at > org.apache.hudi.HoodieBaseRelation$BaseFileReader.apply(HoodieBaseRelation.scala:680) > at > org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:96) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] wecharyu opened a new pull request, #9484: [HUDI-6729] Fix get partition values from path for non-string type partition column
wecharyu opened a new pull request, #9484: URL: https://github.com/apache/hudi/pull/9484 ### Change Logs When we enable `hoodie.datasource.read.extract.partition.values.from.path` to get partition values from the path instead of the data file, an exception is thrown if the partition column is not a string type. This patch fixes the issue by casting the partition value string to the target data type, following Spark's approach. ```bash Caused by: java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:103) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getInt(rows.scala:41) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getInt$(rows.scala:41) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getInt(rows.scala:195) at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:97) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:245) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:264) at org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:314) at org.apache.hudi.HoodieDataSourceHelper$.$anonfun$buildHoodieParquetReader$1(HoodieDataSourceHelper.scala:67) at org.apache.hudi.HoodieBaseRelation.$anonfun$createBaseFileReader$2(HoodieBaseRelation.scala:602) at org.apache.hudi.HoodieBaseRelation$BaseFileReader.apply(HoodieBaseRelation.scala:680) at org.apache.hudi.HoodieBaseRelation$.$anonfun$projectReader$1(HoodieBaseRelation.scala:706) at org.apache.hudi.HoodieBaseRelation$.$anonfun$projectReader$2(HoodieBaseRelation.scala:711) at org.apache.hudi.HoodieBaseRelation$BaseFileReader.apply(HoodieBaseRelation.scala:680) at 
org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:96) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) ``` ### Impact No ### Risk level (write none, low medium or high below) None ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
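The fix described in PR #9484 follows Spark's behavior of converting the raw string extracted from a partition path (e.g. `.../month=8/...`) into the partition column's declared type before it reaches the vectorized reader, instead of passing a `UTF8String` through. The sketch below is a minimal standalone illustration of that idea; the `castPartitionValue` helper and its type handling are hypothetical, not Hudi's or Spark's actual API:

```java
import java.time.LocalDate;

public class PartitionValueCast {

    // Illustrative: convert the raw string taken from a partition path segment
    // to the column's declared type. Without this, a reader expecting an int
    // column receives a string and fails with a ClassCastException, as in the
    // stack trace above.
    static Object castPartitionValue(String raw, Class<?> targetType) {
        if ("__HIVE_DEFAULT_PARTITION__".equals(raw)) {
            return null; // Hive's marker for a null partition value
        }
        if (targetType == Integer.class) return Integer.parseInt(raw);
        if (targetType == Long.class) return Long.parseLong(raw);
        if (targetType == Double.class) return Double.parseDouble(raw);
        if (targetType == Boolean.class) return Boolean.parseBoolean(raw);
        if (targetType == LocalDate.class) return LocalDate.parse(raw);
        return raw; // string columns need no conversion
    }

    public static void main(String[] args) {
        // "8" from a path like ".../month=8/..." becomes a proper Integer.
        Object month = castPartitionValue("8", Integer.class);
        System.out.println(month.getClass().getSimpleName() + ":" + month); // prints "Integer:8"
    }
}
```

In the real code path the target type comes from the table schema, and Spark's partition discovery additionally covers types omitted here (decimals, timestamps, etc.).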
[jira] [Created] (HUDI-6729) Fix get partition values from path for non-string type partition column
Wechar created HUDI-6729: Summary: Fix get partition values from path for non-string type partition column Key: HUDI-6729 URL: https://issues.apache.org/jira/browse/HUDI-6729 Project: Apache Hudi Issue Type: Bug Components: hudi-utilities Reporter: Wechar When we enable {{hoodie.datasource.read.extract.partition.values.from.path}} to get partition values from the path instead of the data file, an exception is thrown if the partition column is not a string type: {code:bash} Caused by: java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:103) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getInt(rows.scala:41) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getInt$(rows.scala:41) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getInt(rows.scala:195) at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:97) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:245) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:264) at org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:314) at org.apache.hudi.HoodieDataSourceHelper$.$anonfun$buildHoodieParquetReader$1(HoodieDataSourceHelper.scala:67) at org.apache.hudi.HoodieBaseRelation.$anonfun$createBaseFileReader$2(HoodieBaseRelation.scala:602) at org.apache.hudi.HoodieBaseRelation$BaseFileReader.apply(HoodieBaseRelation.scala:680) at org.apache.hudi.HoodieBaseRelation$.$anonfun$projectReader$1(HoodieBaseRelation.scala:706) at org.apache.hudi.HoodieBaseRelation$.$anonfun$projectReader$2(HoodieBaseRelation.scala:711) at 
org.apache.hudi.HoodieBaseRelation$BaseFileReader.apply(HoodieBaseRelation.scala:680) at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:96) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #9483: [HUDI-6156] Prevent leaving tmp file in timeline, delete tmp file whe…
hudi-bot commented on PR #9483: URL: https://github.com/apache/hudi/pull/9483#issuecomment-1684913205 ## CI report: * 373fb78cc587229fd9210edc0b9102101b3a3deb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19366) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9477: [HUDI-6726] Fix connection leaks related to file reader and iterator close
hudi-bot commented on PR #9477: URL: https://github.com/apache/hudi/pull/9477#issuecomment-1684913193 ## CI report: * 2fe4b6b8c722c26e4d970e8613be2f73e4b4eb4f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19364) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9483: [HUDI-6156] Prevent leaving tmp file in timeline, delete tmp file whe…
hudi-bot commented on PR #9483: URL: https://github.com/apache/hudi/pull/9483#issuecomment-1684905560 ## CI report: * 373fb78cc587229fd9210edc0b9102101b3a3deb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19366) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9483: [HUDI-6156] Prevent leaving tmp file in timeline, delete tmp file whe…
hudi-bot commented on PR #9483: URL: https://github.com/apache/hudi/pull/9483#issuecomment-1684904260 ## CI report: * 373fb78cc587229fd9210edc0b9102101b3a3deb UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hbgstc123 opened a new pull request, #9483: [HUDI-6156] Prevent leaving tmp file in timeline, delete tmp file whe…
hbgstc123 opened a new pull request, #9483: URL: https://github.com/apache/hudi/pull/9483 …n rename throw exception. ### Change Logs Follow-up to the former PR: try to delete the tmp file when rename throws an exception. ### Impact none ### Risk level (write none, low medium or high below) none ### Documentation Update none ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
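The pattern PR #9483 describes — write the instant content to a tmp file, rename it into the timeline, and delete the tmp file if the rename throws — can be sketched as below. This is a generic illustration using `java.nio`, not Hudi's actual timeline code; the class, method, and file names are made up:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicTimelineWrite {

    // Write the content to a ".tmp" sibling first, then atomically rename it
    // into place. If the rename throws, delete the tmp file so no stray
    // artifact is left behind in the timeline directory.
    static void createImmutableFile(Path target, byte[] content) {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try {
            Files.write(tmp, content);
            try {
                Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
            } catch (IOException e) {
                Files.deleteIfExists(tmp); // cleanup on rename failure
                throw e;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Self-check: after a successful write the target exists and no tmp file remains.
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("timeline");
            Path instant = dir.resolve("20230818120000.commit");
            createImmutableFile(instant, "commit metadata".getBytes());
            return Files.exists(instant) && !Files.exists(dir.resolve("20230818120000.commit.tmp"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "true"
    }
}
```

The write-then-rename step keeps readers from ever observing a half-written instant file; the cleanup in the catch block is the part this PR adds, so a failed rename does not leave a tmp file in the timeline.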
[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
hudi-bot commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1684902865 ## CI report: * 4f0de8a6d00fe72108a12d8316cb1d38389d6b31 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19355) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19362) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19365) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] majian1998 commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
majian1998 commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1684902220 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
hudi-bot commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1684878994 ## CI report: * 4f0de8a6d00fe72108a12d8316cb1d38389d6b31 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19355) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19362) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9477: [HUDI-6726] Fix connection leaks related to file reader and iterator close
hudi-bot commented on PR #9477: URL: https://github.com/apache/hudi/pull/9477#issuecomment-1684876834 ## CI report: * c90d959664437e13d53ce3c9810f824eaf396262 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19356) * 2fe4b6b8c722c26e4d970e8613be2f73e4b4eb4f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19364) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org