[GitHub] [hudi] hudi-bot commented on pull request #6225: [HUDI-4487] support to create ro/rt table by spark sql
hudi-bot commented on PR #6225: URL: https://github.com/apache/hudi/pull/6225#issuecomment-1200087165 ## CI report: * de8c1ae0ed8433f13e2f2e3087bc31499a9b3c05 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10429) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] soma1712 commented on issue #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception
soma1712 commented on issue #6249: URL: https://github.com/apache/hudi/issues/6249#issuecomment-1200084568 hudi_read.txt is actually a .py file. As the system did not support uploading a .py file, I had to rename it to .txt. [hudi_read.txt](https://github.com/apache/hudi/files/9224774/hudi_read.txt) [results.txt](https://github.com/apache/hudi/files/9224775/results.txt)
[GitHub] [hudi] hudi-bot commented on pull request #6225: [HUDI-4487] support to create ro/rt table by spark sql
hudi-bot commented on PR #6225: URL: https://github.com/apache/hudi/pull/6225#issuecomment-1200078793 ## CI report: * de8c1ae0ed8433f13e2f2e3087bc31499a9b3c05 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10429) * Unknown: [CANCELED](TBD)
[GitHub] [hudi] hudi-bot commented on pull request #6225: [HUDI-4487] support to create ro/rt table by spark sql
hudi-bot commented on PR #6225: URL: https://github.com/apache/hudi/pull/6225#issuecomment-1200078230 ## CI report: * de8c1ae0ed8433f13e2f2e3087bc31499a9b3c05 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10429) * Unknown: [CANCELED](TBD)
[GitHub] [hudi] hudi-bot commented on pull request #6225: [HUDI-4487] support to create ro/rt table by spark sql
hudi-bot commented on PR #6225: URL: https://github.com/apache/hudi/pull/6225#issuecomment-1200077692 ## CI report: * de8c1ae0ed8433f13e2f2e3087bc31499a9b3c05 UNKNOWN
[hudi] branch master updated: [MINOR] Fix convertPathWithScheme tests (#6251)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new c9725899c3 [MINOR] Fix convertPathWithScheme tests (#6251)
c9725899c3 is described below

commit c9725899c3f9516412dcc683875d81ac226d9b45
Author: Y Ethan Guo
AuthorDate: Fri Jul 29 19:26:30 2022 -0700

    [MINOR] Fix convertPathWithScheme tests (#6251)
---
 .../test/java/org/apache/hudi/common/fs/TestStorageSchemes.java | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hudi-common/src/test/java/org/apache/hudi/common/fs/TestStorageSchemes.java b/hudi-common/src/test/java/org/apache/hudi/common/fs/TestStorageSchemes.java
index 9b173254ac..354ad6d0cc 100644
--- a/hudi-common/src/test/java/org/apache/hudi/common/fs/TestStorageSchemes.java
+++ b/hudi-common/src/test/java/org/apache/hudi/common/fs/TestStorageSchemes.java
@@ -69,6 +69,11 @@ public class TestStorageSchemes {
     assertEquals(s3TablePath3, HoodieWrapperFileSystem.convertPathWithScheme(s3TablePath3, "s3"));
     Path hdfsTablePath = new Path("hdfs://sandbox.foo.com:8020/test.1234/table1");
-    System.out.println(HoodieWrapperFileSystem.convertPathWithScheme(hdfsTablePath, "hdfs"));
+    assertEquals(hdfsTablePath, HoodieWrapperFileSystem.convertPathWithScheme(hdfsTablePath, "hdfs"));
+
+    Path localTablePath = new Path("file:/var/table1");
+    Path localTablePathNoPrefix = new Path("/var/table1");
+    assertEquals(localTablePath, HoodieWrapperFileSystem.convertPathWithScheme(localTablePath, "file"));
+    assertEquals(localTablePath, HoodieWrapperFileSystem.convertPathWithScheme(localTablePathNoPrefix, "file"));
   }
 }
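The behavior these test assertions pin down — a path that already carries the target scheme passes through unchanged, and a schemeless path gets the scheme prepended — can be sketched with plain strings. This is a hypothetical standalone helper for illustration only; the real `HoodieWrapperFileSystem.convertPathWithScheme` operates on Hadoop `Path` objects.

```java
public class SchemeConverter {
    // Replace (or prepend) the URI scheme of a path string.
    // A schemeless path like "/var/table1" becomes "file:/var/table1";
    // a path that already has a scheme gets it swapped for the new one.
    static String convertPathWithScheme(String path, String newScheme) {
        int idx = path.indexOf(":/");
        String schemeless = idx >= 0 ? path.substring(idx + 1) : path;
        return newScheme + ":" + schemeless;
    }

    public static void main(String[] args) {
        // The three cases the test above asserts:
        System.out.println(convertPathWithScheme("/var/table1", "file"));      // file:/var/table1
        System.out.println(convertPathWithScheme("file:/var/table1", "file")); // file:/var/table1
        System.out.println(convertPathWithScheme(
            "hdfs://sandbox.foo.com:8020/test.1234/table1", "hdfs"));          // unchanged
    }
}
```

Note that the test being fixed previously only printed the converted path instead of asserting on it, which is why it could never fail.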
[GitHub] [hudi] xushiyan merged pull request #6251: [MINOR] Fix convertPathWithScheme tests
xushiyan merged PR #6251: URL: https://github.com/apache/hudi/pull/6251
[GitHub] [hudi] xushiyan merged pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
xushiyan merged PR #6250: URL: https://github.com/apache/hudi/pull/6250
[hudi] branch master updated: [HUDI-4507] Improve file name extraction logic in metadata utils (#6250)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 0f703a7e15 [HUDI-4507] Improve file name extraction logic in metadata utils (#6250)
0f703a7e15 is described below

commit 0f703a7e15833493037f7f7a07882cd73044ee65
Author: Y Ethan Guo
AuthorDate: Fri Jul 29 19:25:57 2022 -0700

    [HUDI-4507] Improve file name extraction logic in metadata utils (#6250)
---
 .../java/org/apache/hudi/common/fs/FSUtils.java     | 18 ++
 .../hudi/metadata/HoodieTableMetadataUtil.java      | 21 -
 .../java/org/apache/hudi/common/fs/TestFSUtils.java | 12 ++--
 3 files changed, 32 insertions(+), 19 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
index cfc143e3d0..d940f3bb45 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
@@ -615,6 +615,24 @@ public class FSUtils {
     return StringUtils.isNullOrEmpty(partitionPath) ? basePath : new Path(basePath, partitionPath);
   }

+  /**
+   * Extracts the file name from the relative path based on the table base path. For example:
+   * "/2022/07/29/file1.parquet", "/2022/07/29" -> "file1.parquet"
+   * "2022/07/29/file2.parquet", "2022/07/29" -> "file2.parquet"
+   * "/file3.parquet", "" -> "file3.parquet"
+   * "file4.parquet", "" -> "file4.parquet"
+   *
+   * @param filePathWithPartition the relative file path based on the table base path.
+   * @param partition the relative partition path. For partitioned table, `partition` contains the relative partition path;
+   *                  for non-partitioned table, `partition` is empty
+   * @return Extracted file name in String.
+   */
+  public static String getFileName(String filePathWithPartition, String partition) {
+    int offset = StringUtils.isNullOrEmpty(partition)
+        ? (filePathWithPartition.startsWith("/") ? 1 : 0) : partition.length() + 1;
+    return filePathWithPartition.substring(offset);
+  }
+
   /**
    * Get DFS full partition path (e.g. hdfs://ip-address:8020:/)
    */

diff --git a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
index d41f09990e..2c5b8db0ed 100644
--- a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
+++ b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
@@ -325,16 +325,13 @@ public class HoodieTableMetadataUtil {
           return map;
         }

-        int offset = partition.equals(NON_PARTITIONED_NAME)
-            ? (pathWithPartition.startsWith("/") ? 1 : 0)
-            : partition.length() + 1;
-        String filename = pathWithPartition.substring(offset);
+        String fileName = FSUtils.getFileName(pathWithPartition, partitionStatName);

         // Since write-stats are coming in no particular order, if the same
         // file have previously been appended to w/in the txn, we simply pick max
         // of the sizes as reported after every write, since file-sizes are
         // monotonically increasing (ie file-size never goes down, unless deleted)
-        map.merge(filename, stat.getFileSizeInBytes(), Math::max);
+        map.merge(fileName, stat.getFileSizeInBytes(), Math::max);
         return map;
       },
@@ -410,12 +407,7 @@
         return Collections.emptyListIterator();
       }

-      // For partitioned table, "partition" contains the relative partition path;
-      // for non-partitioned table, "partition" is empty
-      int offset = StringUtils.isNullOrEmpty(partition)
-          ? (pathWithPartition.startsWith("/") ? 1 : 0) : partition.length() + 1;
-
-      final String fileName = pathWithPartition.substring(offset);
+      String fileName = FSUtils.getFileName(pathWithPartition, partition);
       if (!FSUtils.isBaseFile(new Path(fileName))) {
         return Collections.emptyListIterator();
       }
@@ -1162,13 +1154,8 @@
       HoodieTableMetaClient datasetMetaClient, List columnsToIndex, boolean isDeleted) {
-    String partitionName = getPartitionIdentifier(partitionPath);
-    // NOTE: We
[GitHub] [hudi] xiarixiaoyao commented on issue #6243: [SUPPORT] sparksql mergeinto sqlstatment 'update set' not effect
xiarixiaoyao commented on issue #6243: URL: https://github.com/apache/hudi/issues/6243#issuecomment-1200067268 @fujianhua168 the reason is that you have configured preCombineField = 'ts'. The old record has a bigger ts (1000) than the new record's ts (900), so Hudi will not merge the new record.
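The preCombine semantics described above — when two records share a key, the record with the larger ordering value wins — can be sketched as follows. The `Row` class and `preCombine` method here are illustrative stand-ins, not Hudi's actual `HoodieRecordPayload` API, and exact tie-breaking varies by payload class:

```java
public class PreCombineDemo {
    // A tiny stand-in for a record payload: key, ordering field (ts), value.
    static final class Row {
        final String key; final long ts; final String value;
        Row(String key, long ts, String value) { this.key = key; this.ts = ts; this.value = value; }
    }

    // preCombine: keep the record with the larger ordering value.
    static Row preCombine(Row existing, Row incoming) {
        return incoming.ts > existing.ts ? incoming : existing;
    }

    public static void main(String[] args) {
        Row existing = new Row("id1", 1000L, "old");
        Row incoming = new Row("id1", 900L, "new");
        // The stored record's ts (1000) beats the incoming ts (900), so the
        // MERGE INTO ... UPDATE SET appears to have no effect, as reported.
        System.out.println(preCombine(existing, incoming).value); // prints "old"
    }
}
```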
[GitHub] [hudi] hudi-bot commented on pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
hudi-bot commented on PR #6250: URL: https://github.com/apache/hudi/pull/6250#issuecomment-1200066863 ## CI report: * 129007ab3840f01ccafaf5ef73275301fcd6799f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10464)
[jira] [Created] (HUDI-4508) Fix bug: a fileSlice with no baseFile throws an exception during a read-optimized query on a MOR table
sherhomhuang created HUDI-4508: -- Summary: Fix bug: a fileSlice with no baseFile throws an exception during a read-optimized query on a MOR table Key: HUDI-4508 URL: https://issues.apache.org/jira/browse/HUDI-4508 Project: Apache Hudi Issue Type: Bug Components: hive, trino-presto Reporter: sherhomhuang Assignee: sherhomhuang When reading a partition containing a fileSlice without a baseFile, a read-optimized query on a MOR table throws an exception. It should not throw an exception; it should instead return no results for that fileSlice. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] leesf commented on a diff in pull request #6245: [HUDI-4506] make BucketIndexPartitioner distribute data more balance
leesf commented on code in PR #6245: URL: https://github.com/apache/hudi/pull/6245#discussion_r933703835 ## .idea/vcs.xml: ## @@ -1,20 +1,4 @@
[GitHub] [hudi] hudi-bot commented on pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
hudi-bot commented on PR #6250: URL: https://github.com/apache/hudi/pull/6250#issuecomment-1200057185 ## CI report: * 129007ab3840f01ccafaf5ef73275301fcd6799f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10464)
[GitHub] [hudi] hudi-bot commented on pull request #6251: [MINOR] Fix convertPathWithScheme tests
hudi-bot commented on PR #6251: URL: https://github.com/apache/hudi/pull/6251#issuecomment-1200056245 ## CI report: * c52b9f9ac77b8493cdb0ae012a790b92cd0e4dcc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10465)
[GitHub] [hudi] hudi-bot commented on pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
hudi-bot commented on PR #6250: URL: https://github.com/apache/hudi/pull/6250#issuecomment-1200056238 ## CI report: * 129007ab3840f01ccafaf5ef73275301fcd6799f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x and more
hudi-bot commented on PR #6227: URL: https://github.com/apache/hudi/pull/6227#issuecomment-1200056209 ## CI report: * c771a314e72284d22cd682a48eb0013aaf09b3cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10466)
[GitHub] [hudi] xushiyan commented on a diff in pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
xushiyan commented on code in PR #6250: URL: https://github.com/apache/hudi/pull/6250#discussion_r933703835 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -325,16 +325,13 @@ public static List convertMetadataToFilesPartitionRecords(HoodieCo return map; } -int offset = partition.equals(NON_PARTITIONED_NAME) -? (pathWithPartition.startsWith("/") ? 1 : 0) -: partition.length() + 1; -String filename = pathWithPartition.substring(offset); +String fileName = FSUtils.getFileName(pathWithPartition, partitionStatName); Review Comment: ok i see. i was confused by the var name `partition`, which should actually be called `partitionIdentifier`
[GitHub] [hudi] yihua commented on a diff in pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
yihua commented on code in PR #6250: URL: https://github.com/apache/hudi/pull/6250#discussion_r933697289 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -325,16 +325,13 @@ public static List convertMetadataToFilesPartitionRecords(HoodieCo return map; } -int offset = partition.equals(NON_PARTITIONED_NAME) -? (pathWithPartition.startsWith("/") ? 1 : 0) -: partition.length() + 1; -String filename = pathWithPartition.substring(offset); +String fileName = FSUtils.getFileName(pathWithPartition, partitionStatName); Review Comment: We cannot use `partition` here, which is generated by `getPartitionIdentifier(partitionStatName)`, changing the empty relative partition path to the `.` partition identifier. `getFileName()` expects the plain relative partition path, instead of the partition identifier used in the metadata table.
[GitHub] [hudi] yihua commented on issue #6236: [SUPPORT] facing an issue on querying Data in Hudi version 0.10.1 using AWS glue
yihua commented on issue #6236: URL: https://github.com/apache/hudi/issues/6236#issuecomment-1200049384 @svaddoriya Have you tried to increase Spark memory settings? @rahil-c @zhedoubushishi @umehrot2 do you have any suggestions or best practices for querying the Hudi table with AWS Glue?
[GitHub] [hudi] xushiyan commented on a diff in pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
xushiyan commented on code in PR #6250: URL: https://github.com/apache/hudi/pull/6250#discussion_r933696374 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -325,16 +325,13 @@ public static List convertMetadataToFilesPartitionRecords(HoodieCo return map; } -int offset = partition.equals(NON_PARTITIONED_NAME) -? (pathWithPartition.startsWith("/") ? 1 : 0) -: partition.length() + 1; -String filename = pathWithPartition.substring(offset); +String fileName = FSUtils.getFileName(pathWithPartition, partitionStatName); Review Comment: the new util getFileName() uses `partition`, right? Why not pass `partition` here, as was previously done?
[GitHub] [hudi] yihua commented on issue #6243: [SUPPORT] sparksql mergeinto sqlstatment 'update set' not effect
yihua commented on issue #6243: URL: https://github.com/apache/hudi/issues/6243#issuecomment-1200048651 @xiarixiaoyao @YannByron @XuQianJin-Stars can any of you help here?
[GitHub] [hudi] yihua commented on issue #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception
yihua commented on issue #6249: URL: https://github.com/apache/hudi/issues/6249#issuecomment-1200047941 @soma1712 could you share how you read the Hudi table in `s3://pythonscripts/hudi_read.py` and the full stacktrace as well? Which Hudi release do you use?
[GitHub] [hudi] hudi-bot commented on pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x and more
hudi-bot commented on PR #6227: URL: https://github.com/apache/hudi/pull/6227#issuecomment-1200043467 ## CI report: * 745c015e848fde5d7a78c21e828af97705efa0d0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10463) * c771a314e72284d22cd682a48eb0013aaf09b3cb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10466)
[GitHub] [hudi] hudi-bot commented on pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x and more
hudi-bot commented on PR #6227: URL: https://github.com/apache/hudi/pull/6227#issuecomment-1200042181 ## CI report: * 745c015e848fde5d7a78c21e828af97705efa0d0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10463) * c771a314e72284d22cd682a48eb0013aaf09b3cb UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
hudi-bot commented on PR #6250: URL: https://github.com/apache/hudi/pull/6250#issuecomment-1200016266 ## CI report: * 129007ab3840f01ccafaf5ef73275301fcd6799f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10464)
[GitHub] [hudi] hudi-bot commented on pull request #6251: [MINOR] Fix convertPathWithScheme tests
hudi-bot commented on PR #6251: URL: https://github.com/apache/hudi/pull/6251#issuecomment-1200012764 ## CI report: * c52b9f9ac77b8493cdb0ae012a790b92cd0e4dcc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10465)
[GitHub] [hudi] hudi-bot commented on pull request #6251: [MINOR] Fix convertPathWithScheme tests
hudi-bot commented on PR #6251: URL: https://github.com/apache/hudi/pull/6251#issuecomment-1200011183 ## CI report: * c52b9f9ac77b8493cdb0ae012a790b92cd0e4dcc UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x
hudi-bot commented on PR #6227: URL: https://github.com/apache/hudi/pull/6227#issuecomment-129286 ## CI report: * 745c015e848fde5d7a78c21e828af97705efa0d0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10463)
[GitHub] [hudi] yihua opened a new pull request, #6251: [MINOR] Fix convertPathWithScheme tests
yihua opened a new pull request, #6251: URL: https://github.com/apache/hudi/pull/6251 ## What is the purpose of the pull request This PR fixes the tests of `HoodieWrapperFileSystem.convertPathWithScheme`. ## Brief change log - Fixes tests in `TestStorageSchemes` ## Verify this pull request This change adds tests in `TestStorageSchemes`. ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] hudi-bot commented on pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
hudi-bot commented on PR #6250: URL: https://github.com/apache/hudi/pull/6250#issuecomment-1199968512 ## CI report: * 129007ab3840f01ccafaf5ef73275301fcd6799f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10464)
[GitHub] [hudi] hudi-bot commented on pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x
hudi-bot commented on PR #6227: URL: https://github.com/apache/hudi/pull/6227#issuecomment-1199968461 ## CI report: * 745c015e848fde5d7a78c21e828af97705efa0d0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10463)
[GitHub] [hudi] yihua commented on a diff in pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table
yihua commented on code in PR #6113: URL: https://github.com/apache/hudi/pull/6113#discussion_r933637330 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -409,8 +409,11 @@ public static HoodieData convertMetadataToBloomFilterRecords( LOG.error("Failed to find path in write stat to update metadata table " + hoodieWriteStat); return Collections.emptyListIterator(); } - int offset = partition.equals(NON_PARTITIONED_NAME) ? (pathWithPartition.startsWith("/") ? 1 : 0) : - partition.length() + 1; + + // For partitioned table, "partition" contains the relative partition path; + // for non-partitioned table, "partition" is empty + int offset = StringUtils.isNullOrEmpty(partition) Review Comment: Addressed in #6250. `String.replace` could be slow so I still use the current logic. I moved it into a util method.
[jira] [Updated] (HUDI-4496) ORC fails w/ Spark 3.1
[ https://issues.apache.org/jira/browse/HUDI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4496: - Labels: pull-request-available (was: ) > ORC fails w/ Spark 3.1 > -- > > Key: HUDI-4496 > URL: https://issues.apache.org/jira/browse/HUDI-4496 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > > After running TestHoodieSparkSqlWriter test for different Spark versions, > discovered that Orc version was incorrectly put as compile time dep on the > classpath, breaking Orc writing in Hudi in Spark 3.1: > https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true
[GitHub] [hudi] hudi-bot commented on pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
hudi-bot commented on PR #6250: URL: https://github.com/apache/hudi/pull/6250#issuecomment-1199966586 ## CI report: * 129007ab3840f01ccafaf5ef73275301fcd6799f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x
hudi-bot commented on PR #6227: URL: https://github.com/apache/hudi/pull/6227#issuecomment-1199966546 ## CI report: * 745c015e848fde5d7a78c21e828af97705efa0d0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] yihua commented on a diff in pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
yihua commented on code in PR #6250: URL: https://github.com/apache/hudi/pull/6250#discussion_r933634375 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -1162,13 +1154,8 @@ private static Stream getColumnStatsRecords(String partitionPath, HoodieTableMetaClient datasetMetaClient, List columnsToIndex, boolean isDeleted) { -String partitionName = getPartitionIdentifier(partitionPath); -// NOTE: We have to chop leading "/" to make sure Hadoop does not treat it like -// absolute path String filePartitionPath = filePath.startsWith("/") ? filePath.substring(1) : filePath; -String fileName = partitionName.equals(NON_PARTITIONED_NAME) -? filePartitionPath -: filePartitionPath.substring(partitionName.length() + 1); +String fileName = FSUtils.getFileName(filePath, partitionPath); Review Comment: The same here, using `partitionPath` directly instead of the partition identifier. ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -325,16 +325,13 @@ public static List convertMetadataToFilesPartitionRecords(HoodieCo return map; } -int offset = partition.equals(NON_PARTITIONED_NAME) -? (pathWithPartition.startsWith("/") ? 1 : 0) -: partition.length() + 1; -String filename = pathWithPartition.substring(offset); +String fileName = FSUtils.getFileName(pathWithPartition, partitionStatName); Review Comment: Before the change, the `partition` identifier was used instead of `partitionStatName`. For a partitioned table, there is no difference; for a non-partitioned table, the `partition` identifier is `.` while `partitionStatName` could be empty or `/`. The new logic depends on `partitionStatName` instead of the `partition` identifier, and the file name extracted is not affected.
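For readers following the `partition` identifier vs `partitionStatName` distinction above: a toy sketch of the mapping described in the comment (inferred from the discussion, not Hudi source):

```python
NON_PARTITIONED_NAME = "."  # marker the comment says identifies a non-partitioned table

def partition_identifier(relative_partition_path: str) -> str:
    # An empty relative path (non-partitioned table) maps to the "." marker;
    # a real partition path passes through unchanged.
    return NON_PARTITIONED_NAME if not relative_partition_path else relative_partition_path

print(partition_identifier(""))            # .
print(partition_identifier("2022/01/01"))  # 2022/01/01
```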
[jira] [Updated] (HUDI-4507) Improve file name extraction logic in metadata utils
[ https://issues.apache.org/jira/browse/HUDI-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4507: - Labels: pull-request-available (was: ) > Improve file name extraction logic in metadata utils > > > Key: HUDI-4507 > URL: https://issues.apache.org/jira/browse/HUDI-4507 > Project: Apache Hudi > Issue Type: Improvement > Components: code-quality >Reporter: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > > https://github.com/apache/hudi/pull/6113#discussion_r929275152
[GitHub] [hudi] yihua opened a new pull request, #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
yihua opened a new pull request, #6250: URL: https://github.com/apache/hudi/pull/6250 ## What is the purpose of the pull request This PR improves file name extraction logic in metadata utils by adding a new util method. ## Brief change log - Adds a new util method `FSUtils.getFileName` - Refactors the logic of extracting file names in `HoodieTableMetadataUtil` - Adds a unit test for the new util method ## Verify this pull request This change adds a new test as mentioned above. ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[jira] [Commented] (HUDI-4496) ORC fails w/ Spark 3.1
[ https://issues.apache.org/jira/browse/HUDI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573171#comment-17573171 ] Alexey Kudinkin commented on HUDI-4496: --- [https://github.com/apache/hudi/pull/6227] > ORC fails w/ Spark 3.1 > -- > > Key: HUDI-4496 > URL: https://issues.apache.org/jira/browse/HUDI-4496 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > > After running TestHoodieSparkSqlWriter test for different Spark versions, > discovered that Orc version was incorrectly put as compile time dep on the > classpath, breaking Orc writing in Hudi in Spark 3.1: > https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true
[jira] [Updated] (HUDI-4496) ORC fails w/ Spark 3.1
[ https://issues.apache.org/jira/browse/HUDI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4496: -- Status: Patch Available (was: In Progress) > ORC fails w/ Spark 3.1 > -- > > Key: HUDI-4496 > URL: https://issues.apache.org/jira/browse/HUDI-4496 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > > After running TestHoodieSparkSqlWriter test for different Spark versions, > discovered that Orc version was incorrectly put as compile time dep on the > classpath, breaking Orc writing in Hudi in Spark 3.1: > https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true
[jira] [Updated] (HUDI-4496) ORC fails w/ Spark 3.1
[ https://issues.apache.org/jira/browse/HUDI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4496: -- Story Points: 2 (was: 1) > ORC fails w/ Spark 3.1 > -- > > Key: HUDI-4496 > URL: https://issues.apache.org/jira/browse/HUDI-4496 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > > After running TestHoodieSparkSqlWriter test for different Spark versions, > discovered that Orc version was incorrectly put as compile time dep on the > classpath, breaking Orc writing in Hudi in Spark 3.1: > https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true
[jira] [Updated] (HUDI-4496) ORC fails w/ Spark 3.1
[ https://issues.apache.org/jira/browse/HUDI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4496: -- Status: In Progress (was: Open) > ORC fails w/ Spark 3.1 > -- > > Key: HUDI-4496 > URL: https://issues.apache.org/jira/browse/HUDI-4496 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > > After running TestHoodieSparkSqlWriter test for different Spark versions, > discovered that Orc version was incorrectly put as compile time dep on the > classpath, breaking Orc writing in Hudi in Spark 3.1: > https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true
[jira] [Updated] (HUDI-4507) Improve file name extraction logic in metadata utils
[ https://issues.apache.org/jira/browse/HUDI-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4507: Summary: Improve file name extraction logic in metadata utils (was: Improve filename extraction logic in metadata utils) > Improve file name extraction logic in metadata utils > > > Key: HUDI-4507 > URL: https://issues.apache.org/jira/browse/HUDI-4507 > Project: Apache Hudi > Issue Type: Improvement > Components: code-quality >Reporter: Ethan Guo >Priority: Major > Fix For: 0.12.0 > > > https://github.com/apache/hudi/pull/6113#discussion_r929275152
[jira] [Updated] (HUDI-4507) Improve filename extraction logic in metadata utils
[ https://issues.apache.org/jira/browse/HUDI-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4507: Fix Version/s: 0.12.0 > Improve filename extraction logic in metadata utils > --- > > Key: HUDI-4507 > URL: https://issues.apache.org/jira/browse/HUDI-4507 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > Fix For: 0.12.0 > >
[jira] [Updated] (HUDI-4507) Improve filename extraction logic in metadata utils
[ https://issues.apache.org/jira/browse/HUDI-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4507: Component/s: code-quality > Improve filename extraction logic in metadata utils > --- > > Key: HUDI-4507 > URL: https://issues.apache.org/jira/browse/HUDI-4507 > Project: Apache Hudi > Issue Type: Improvement > Components: code-quality >Reporter: Ethan Guo >Priority: Major > Fix For: 0.12.0 > >
[jira] [Updated] (HUDI-4507) Improve filename extraction logic in metadata utils
[ https://issues.apache.org/jira/browse/HUDI-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4507: Description: https://github.com/apache/hudi/pull/6113#discussion_r929275152 > Improve filename extraction logic in metadata utils > --- > > Key: HUDI-4507 > URL: https://issues.apache.org/jira/browse/HUDI-4507 > Project: Apache Hudi > Issue Type: Improvement > Components: code-quality >Reporter: Ethan Guo >Priority: Major > Fix For: 0.12.0 > > > https://github.com/apache/hudi/pull/6113#discussion_r929275152
[jira] [Created] (HUDI-4507) Improve filename extraction logic in metadata utils
Ethan Guo created HUDI-4507: --- Summary: Improve filename extraction logic in metadata utils Key: HUDI-4507 URL: https://issues.apache.org/jira/browse/HUDI-4507 Project: Apache Hudi Issue Type: Improvement Reporter: Ethan Guo
[GitHub] [hudi] neerajpadarthi commented on issue #6232: [SUPPORT] Hudi V0.9 truncating second precision for timestamp columns
neerajpadarthi commented on issue #6232: URL: https://github.com/apache/hudi/issues/6232#issuecomment-1199914603 @yihua Hey, I have verified the same in Hudi 0.10.1 but no luck; the precision is still getting truncated. Below are the configs, Spark session details, and Spark/Hudi outputs. Could you please verify and let me know if anything is missing here? Thanks

===Environment Details
EMR: emr-6.6.0
Hudi version : 0.10.1
Spark version : Spark 3.2.0
Hive version : Hive 3.1.2
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no

===Spark Configs
def create_spark_session():
    spark = SparkSession \
        .builder \
        .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
        .config("spark.sql.parquet.writeLegacyFormat", "true") \
        .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS") \
        .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY") \
        .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY") \
        .enableHiveSupport() \
        .getOrCreate()
    return spark

===Hudi Configs
db_name = <>
tableName = <>
pk = <>
de_dup = <>
commonConfig = {'hoodie.datasource.hive_sync.database': db_name, 'hoodie.table.name': tableName, 'hoodie.datasource.hive_sync.support_timestamp': 'true', 'hoodie.datasource.write.recordkey.field': pk, 'hoodie.datasource.write.precombine.field': de_dup, 'hoodie.datasource.hive_sync.enable': 'true', 'hoodie.datasource.hive_sync.table': tableName}
nonPartitionConfig = {'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor', 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator'}
config = {'hoodie.bulkinsert.shuffle.parallelism': 10, 'hoodie.datasource.write.operation': 'bulk_insert', 'hoodie.parquet.outputtimestamptype': 'TIMESTAMP_MICROS'}  # 'hoodie.datasource.write.row.writer.enable': 'false'

===Spark DF Output
| id         | creation_date              | last_updated               |
| 1340225    | 2017-01-24 00:02:10        | 2022-02-25 07:03:54.000853 |
| 722b232f-e | 2022-02-22 06:02:32.000481 | 2022-02-25 08:54:05.00042  |
| 53773de3-9 | 2022-02-25 07:21:06.37     | 2022-02-25 08:35:57.000877 |

===Hudi V0.10.1 Output
| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | creation_date | last_updated |
| 20220729201157281 | 20220729201157281_1_2 | 53773de3-9 | | 55f7c820-c289-4eb7-aabc-4f079bd44536-0_1-11-10_20220729201157281.parquet | 53773de3-9 | 2022-02-25 07:21:06 | 2022-02-25 08:35:57 |
| 20220729201157281 | 20220729201157281_2_3 | 722b232f-e | | 0dd8d6c2-9d64-40d7-a4db-bf7cf95bd02c-0_2-11-11_20220729201157281.parquet | 722b232f-e | 2022-02-22 06:02:32 | 2022-02-25 08:54:05 |
| 20220729201157281 | 20220729201157281_0_1 | 1340225 | | 2e0cf27b-999d-4d5e-9c4e-52d27c25294e-0_0-9-9_20220729201157281.parquet | 1340225 | 2017-01-24 00:02:10 | 2022-02-25 07:03:54 |
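For clarity on what "precision is getting truncated" means in the outputs above: `2022-02-25 07:03:54.000853` comes back as `2022-02-25 07:03:54`, i.e. the sub-second component is dropped. A minimal plain-Python illustration of the symptom (not Hudi code):

```python
from datetime import datetime

def truncate_to_seconds(ts: datetime) -> datetime:
    # Drop the sub-second component, mimicking the observed read-back values.
    return ts.replace(microsecond=0)

src = datetime(2022, 2, 25, 7, 3, 54, 853)  # one of the sample timestamps above
print(src)                       # 2022-02-25 07:03:54.000853
print(truncate_to_seconds(src))  # 2022-02-25 07:03:54
```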
[GitHub] [hudi] hudi-bot commented on pull request #6228: [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency
hudi-bot commented on PR #6228: URL: https://github.com/apache/hudi/pull/6228#issuecomment-1199889169 ## CI report: * 0cc2dbb39e432baf741bb3dd94c6d627cb250297 UNKNOWN * 6f055012562507406afe0ab0ec37e4a5388538f2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10462) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] xushiyan commented on a diff in pull request #6228: [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency
xushiyan commented on code in PR #6228: URL: https://github.com/apache/hudi/pull/6228#discussion_r933550437 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java: ## @@ -172,37 +179,46 @@ public Pair>, String> fetchNextBatch(Option lastCkpt String s3FS = props.getString(Config.S3_FS_PREFIX, "s3").toLowerCase(); String s3Prefix = s3FS + "://"; -// Extract distinct file keys from s3 meta hoodie table -final List cloudMetaDf = source +// Create S3 paths +final boolean checkExists = props.getBoolean(Config.ENABLE_EXISTS_CHECK, Config.DEFAULT_ENABLE_EXISTS_CHECK); +SerializableConfiguration serializableConfiguration = new SerializableConfiguration(sparkContext.hadoopConfiguration()); +List cloudFiles = source .filter(filter) .select("s3.bucket.name", "s3.object.key") .distinct() -.collectAsList(); -// Create S3 paths -final boolean checkExists = props.getBoolean(Config.ENABLE_EXISTS_CHECK, Config.DEFAULT_ENABLE_EXISTS_CHECK); -List cloudFiles = new ArrayList<>(); -for (Row row : cloudMetaDf) { - // construct file path, row index 0 refers to bucket and 1 refers to key - String bucket = row.getString(0); - String filePath = s3Prefix + bucket + "/" + row.getString(1); - if (checkExists) { -FileSystem fs = FSUtils.getFs(s3Prefix + bucket, sparkSession.sparkContext().hadoopConfiguration()); -try { - if (fs.exists(new Path(filePath))) { -cloudFiles.add(filePath); - } -} catch (IOException e) { - LOG.error(String.format("Error while checking path exists for %s ", filePath), e); -} - } else { -cloudFiles.add(filePath); - } -} +.mapPartitions((MapPartitionsFunction) fileListIterator -> { + List cloudFilesPerPartition = new ArrayList<>(); + final Configuration configuration = serializableConfiguration.newCopy(); + fileListIterator.forEachRemaining(row -> { +String bucket = row.getString(0); +String filePath = s3Prefix + bucket + "/" + row.getString(1); +String decodeUrl = null; +try { + decodeUrl = URLDecoder.decode(filePath, 
StandardCharsets.UTF_8.name()); + if (checkExists) { +FileSystem fs = FSUtils.getFs(s3Prefix + bucket, configuration); +if (fs.exists(new Path(decodeUrl))) { + cloudFilesPerPartition.add(decodeUrl); +} + } else { +cloudFilesPerPartition.add(decodeUrl); + } +} catch (IOException e) { + LOG.error(String.format("Error while checking path exists for %s ", decodeUrl), e); +} catch (Throwable e) { + LOG.warn("Failed to add cloud file ", e); Review Comment: Didn't realize this before: in the original logic, any exception other than IOException will fail the fetch, right? Here it'll silence it.
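Aside from the exception-handling question, the other change in the diff above is URL-decoding the object key before the existence check, since S3 event notifications percent-encode keys. A standalone sketch of that path construction (Python's `unquote_plus` standing in for Java's `URLDecoder.decode`, which likewise decodes `+` as a space; names are illustrative):

```python
from urllib.parse import unquote_plus

def to_s3_path(bucket: str, encoded_key: str, s3_prefix: str = "s3://") -> str:
    # Build the full path, then decode percent-escapes (and '+' as space),
    # matching the application/x-www-form-urlencoded semantics of URLDecoder.
    return unquote_plus(f"{s3_prefix}{bucket}/{encoded_key}")

print(to_s3_path("my-bucket", "logs/2022-07-29/file+name%3D1.parquet"))
# s3://my-bucket/logs/2022-07-29/file name=1.parquet
```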
[GitHub] [hudi] soma1712 opened a new issue, #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception
soma1712 opened a new issue, #6249: URL: https://github.com/apache/hudi/issues/6249 Detailed Notes - We have incoming delta transactions from an Oracle-based application that are being pushed into an S3 endpoint using AWS DMS services. These CDC records are applied as upserts onto an already existing Hudi table in a different S3 bucket (initial load data).
The UPSERTS are happening by running the below spark-submit:

spark-submit \
  --deploy-mode client \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.default.parallelism=500 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=3 \
  --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=90s \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.app.name= \
  --jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hive/lib/hbase-client.jar /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type MERGE_ON_READ \
  --op UPSERT \
  --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:1 \
  --source-ordering-field dms_seq_no \
  --props s3://bucket/cdc.properties \
  --hoodie-conf hoodie.datasource.hive_sync.database=glue_db \
  --target-base-path s3://bucket/table_1 \
  --target-table table_1 \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://bucket/ \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --enable-sync

This table will subsequently be read with Hudi options and joined with other Hudi tables to populate the final enriched layer. While reading a Hudi table we are facing the ArrayIndexOutOfBoundsException. Below are the Hudi props and spark-submit we execute to read and populate the downstream.
hoodie.datasource.write.partitionpath.field=
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.assume_date_partitioning=false
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
hoodie.parquet.small.file.limit=134217728
hoodie.parquet.max.file.size=1048576000
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=1
hoodie.deltastreamer.transformer.sql=select CASE WHEN Op='D' THEN TRUE ELSE FALSE END AS _hoodie_is_deleted,* from
hoodie.datasource.hive_sync.support_timestamp=true
hoodie.datasource.compaction.async.enable=true
hoodie.index.type=BLOOM
hoodie.compact.inline=true
hoodiecompactionconfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP=5
hoodie.metadata.compact.max.delta.commits=5
hoodie.clean.automatic=true
hoodie.clean.async=true
hoodie.datasource.hive_sync.table=table_1
hoodie.datasource.write.recordkey.field=table_1_ID

spark-submit --deploy-mode client \
  --conf spark.yarn.appMasterEnv.SPARK_HOME=/prod/null \
  --conf spark.executorEnv.SPARK_HOME=/prod/null \
  --conf spark.shuffle.service.enabled=true \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
  s3://pythonscripts/hudi_read.py

TaskSetManager: Lost task 32.2 in stage 6.0 (TID 253) on ip-172-31-16-236.ec2.internal, executor 1: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1]
22/07/21 15:50:26 INFO TaskSetManager: Starting task 32.3 in stage 6.0 (TID 296, ip-172-31-16-236.ec2.internal, executor 1, partition 32, PROCESS_LOCAL, 8887 bytes)
22/07/21 15:50:26 INFO TaskSetManager: Lost task 33.2 in stage 6.0 (TID 256) on ip-172-31-16-236.ec2.internal, executor 1: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 2]
[GitHub] [hudi] hudi-bot commented on pull request #6228: [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency
hudi-bot commented on PR #6228: URL: https://github.com/apache/hudi/pull/6228#issuecomment-1199833744 ## CI report: * 0cc2dbb39e432baf741bb3dd94c6d627cb250297 UNKNOWN * e14bff1ef93f0c1fbbacf384d4fcaa3ef314050c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10434) * 6f055012562507406afe0ab0ec37e4a5388538f2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10462) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] vamshigv commented on a diff in pull request #6228: [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency
vamshigv commented on code in PR #6228: URL: https://github.com/apache/hudi/pull/6228#discussion_r933517693 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java: ## @@ -172,37 +177,47 @@ public Pair>, String> fetchNextBatch(Option lastCkpt String s3FS = props.getString(Config.S3_FS_PREFIX, "s3").toLowerCase(); String s3Prefix = s3FS + "://"; -// Extract distinct file keys from s3 meta hoodie table -final List cloudMetaDf = source +// Create S3 paths +final boolean checkExists = props.getBoolean(Config.ENABLE_EXISTS_CHECK, Config.DEFAULT_ENABLE_EXISTS_CHECK); +SerializableConfiguration serializableConfiguration = new SerializableConfiguration(sparkContext.hadoopConfiguration()); +List cloudFiles = source .filter(filter) .select("s3.bucket.name", "s3.object.key") .distinct() -.collectAsList(); -// Create S3 paths -final boolean checkExists = props.getBoolean(Config.ENABLE_EXISTS_CHECK, Config.DEFAULT_ENABLE_EXISTS_CHECK); -List cloudFiles = new ArrayList<>(); -for (Row row : cloudMetaDf) { - // construct file path, row index 0 refers to bucket and 1 refers to key - String bucket = row.getString(0); - String filePath = s3Prefix + bucket + "/" + row.getString(1); - if (checkExists) { -FileSystem fs = FSUtils.getFs(s3Prefix + bucket, sparkSession.sparkContext().hadoopConfiguration()); -try { - if (fs.exists(new Path(filePath))) { -cloudFiles.add(filePath); - } -} catch (IOException e) { - LOG.error(String.format("Error while checking path exists for %s ", filePath), e); -} - } else { -cloudFiles.add(filePath); - } -} +.rdd().toJavaRDD().mapPartitions(fileListIterator -> { + List cloudFilesPerPartition = new ArrayList<>(); + fileListIterator.forEachRemaining(row -> { +final Configuration configuration = serializableConfiguration.newCopy(); +String bucket = row.getString(0); +String filePath = s3Prefix + bucket + "/" + row.getString(1); +try { + String decodeUrl = URLDecoder.decode(filePath, StandardCharsets.UTF_8.name()); + if 
(checkExists) { +FileSystem fs = FSUtils.getFs(s3Prefix + bucket, configuration); +try { Review Comment: @xushiyan Simplified this nesting now. PTAL.
[GitHub] [hudi] hudi-bot commented on pull request #6228: [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency
hudi-bot commented on PR #6228: URL: https://github.com/apache/hudi/pull/6228#issuecomment-1199828862 ## CI report: * 0cc2dbb39e432baf741bb3dd94c6d627cb250297 UNKNOWN * e14bff1ef93f0c1fbbacf384d4fcaa3ef314050c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10434) * 6f055012562507406afe0ab0ec37e4a5388538f2 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] vamshigv commented on a diff in pull request #6228: [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency
vamshigv commented on code in PR #6228: URL: https://github.com/apache/hudi/pull/6228#discussion_r933517139 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java: ## @@ -172,37 +177,47 @@ public Pair>, String> fetchNextBatch(Option lastCkpt String s3FS = props.getString(Config.S3_FS_PREFIX, "s3").toLowerCase(); String s3Prefix = s3FS + "://"; -// Extract distinct file keys from s3 meta hoodie table -final List cloudMetaDf = source +// Create S3 paths +final boolean checkExists = props.getBoolean(Config.ENABLE_EXISTS_CHECK, Config.DEFAULT_ENABLE_EXISTS_CHECK); +SerializableConfiguration serializableConfiguration = new SerializableConfiguration(sparkContext.hadoopConfiguration()); +List cloudFiles = source .filter(filter) .select("s3.bucket.name", "s3.object.key") .distinct() -.collectAsList(); -// Create S3 paths -final boolean checkExists = props.getBoolean(Config.ENABLE_EXISTS_CHECK, Config.DEFAULT_ENABLE_EXISTS_CHECK); -List cloudFiles = new ArrayList<>(); -for (Row row : cloudMetaDf) { - // construct file path, row index 0 refers to bucket and 1 refers to key - String bucket = row.getString(0); - String filePath = s3Prefix + bucket + "/" + row.getString(1); - if (checkExists) { -FileSystem fs = FSUtils.getFs(s3Prefix + bucket, sparkSession.sparkContext().hadoopConfiguration()); -try { - if (fs.exists(new Path(filePath))) { -cloudFiles.add(filePath); - } -} catch (IOException e) { - LOG.error(String.format("Error while checking path exists for %s ", filePath), e); -} - } else { -cloudFiles.add(filePath); - } -} +.rdd().toJavaRDD().mapPartitions(fileListIterator -> { Review Comment: @xushiyan removed conversion to JavaRDD here. Applied mapPartitions on dataset directly.
[GitHub] [hudi] leobiscassi commented on issue #6142: [SUPPORT] column ‘_hoodie_is_deleted’ query by presto exception
leobiscassi commented on issue #6142: URL: https://github.com/apache/hudi/issues/6142#issuecomment-1199808432 Hey @qianchutao, I was able to fix this on my side, and maybe the solution helps you too. Basically, this error happens because of a mismatch in column order between the schema declared inside the Parquet files and the table schema DDL on Athena/Presto. This normally works on Athena because Athena's default method of mapping columns is by name [1]; for Presto the default is by column index [2]. So when you have schema evolution, or the order of columns doesn't match between the Parquet files and the table schema for some reason, this starts to happen; nothing related to Hudi itself. To fix it, add the config `hive.parquet.use-column-names=true` under the EMR config tab or at startup time; this will update the config files and restart the Presto cluster. If you want to do this on a running cluster, you'll need to apply it on the master and worker nodes and restart Presto; without that, the config won't take effect. Let me know if this helps. [1] https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html [2] https://stackoverflow.com/questions/60183579/presto-fails-with-type-mismatch-errors
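The index-vs-name resolution mismatch described above can be reproduced without Presto at all; a toy sketch (the `_hoodie_is_deleted` column name comes from the issue, the rest is hypothetical):

```python
file_columns = ["id", "_hoodie_is_deleted", "name"]   # physical order in the parquet file
table_columns = ["id", "name", "_hoodie_is_deleted"]  # order declared in the table DDL
row = {"id": 1, "_hoodie_is_deleted": False, "name": "a"}  # one record as written

# Name-based resolution (Athena's default; Presto with hive.parquet.use-column-names=true):
by_name = [row[c] for c in table_columns]

# Index-based resolution (Presto's default): table column i reads file column i.
by_index = [row[file_columns[i]] for i in range(len(table_columns))]

print(by_name)   # [1, 'a', False]  -> correct values under each declared column
print(by_index)  # [1, False, 'a']  -> 'name' receives a boolean: type mismatch
```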
[GitHub] [hudi] neerajpadarthi commented on issue #6232: [SUPPORT] Hudi V0.9 truncating second precision for timestamp columns
neerajpadarthi commented on issue #6232: URL: https://github.com/apache/hudi/issues/6232#issuecomment-1199790011 @yihua - I will validate with 0.10.1. @YannByron - Thanks for checking. I tested by passing the below configs to the Spark session, but I still see the same issue. "spark.sql.parquet.outputTimestampType","TIMESTAMP_MICROS" "spark.sql.parquet.writeLegacyFormat", "true"
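For reference, the two settings quoted above can also be placed in `spark-defaults.conf` so they apply to every session; a minimal sketch (the values are the ones from this thread — `TIMESTAMP_MICROS` is one of the accepted values of `spark.sql.parquet.outputTimestampType`, alongside `INT96` and `TIMESTAMP_MILLIS`):

```properties
spark.sql.parquet.outputTimestampType  TIMESTAMP_MICROS
spark.sql.parquet.writeLegacyFormat    true
```

Note that these only affect how Spark writes parquet timestamps; if the truncation happens inside Hudi's write path, they would not be sufficient on their own.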
[GitHub] [hudi] hudi-bot commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
hudi-bot commented on PR #5629: URL: https://github.com/apache/hudi/pull/5629#issuecomment-1199619090 ## CI report: * d0f078159313f8b35a41b1d1e016583204811383 UNKNOWN * 8bd34a6bee3084bdc6029f3c0740cf06906acfd5 UNKNOWN * 279857485f18875cab94f72b5bf61522bdaecd31 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10458) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-4032) Remove double file-listing in Hudi Relations
[ https://issues.apache.org/jira/browse/HUDI-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573043#comment-17573043 ] Alexey Kudinkin commented on HUDI-4032: --- This has been addressed by: https://github.com/apache/hudi/pull/5722/files# > Remove double file-listing in Hudi Relations > > > Key: HUDI-4032 > URL: https://issues.apache.org/jira/browse/HUDI-4032 > Project: Apache Hudi > Issue Type: Task > Components: index >Reporter: Ethan Guo >Priority: Blocker > Fix For: 0.12.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-4032) Remove double file-listing in Hudi Relations
[ https://issues.apache.org/jira/browse/HUDI-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin closed HUDI-4032. - Resolution: Fixed > Remove double file-listing in Hudi Relations > > > Key: HUDI-4032 > URL: https://issues.apache.org/jira/browse/HUDI-4032 > Project: Apache Hudi > Issue Type: Task > Components: index > Reporter: Ethan Guo > Priority: Blocker > Fix For: 0.12.0
[jira] [Updated] (HUDI-4032) Remove double file-listing in BaseFileOnlyRelation
[ https://issues.apache.org/jira/browse/HUDI-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4032: -- Summary: Remove double file-listing in BaseFileOnlyRelation (was: Remove double file-listing in SparkHoodieFileIndex) > Remove double file-listing in BaseFileOnlyRelation > > > Key: HUDI-4032 > URL: https://issues.apache.org/jira/browse/HUDI-4032 > Project: Apache Hudi > Issue Type: Task > Components: index > Reporter: Ethan Guo > Priority: Blocker > Fix For: 0.13.0
[jira] [Updated] (HUDI-4032) Remove double file-listing in Hudi Relations
[ https://issues.apache.org/jira/browse/HUDI-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4032: -- Fix Version/s: 0.12.0 (was: 0.13.0) > Remove double file-listing in Hudi Relations > > > Key: HUDI-4032 > URL: https://issues.apache.org/jira/browse/HUDI-4032 > Project: Apache Hudi > Issue Type: Task > Components: index > Reporter: Ethan Guo > Priority: Blocker > Fix For: 0.12.0
[jira] [Updated] (HUDI-4032) Remove double file-listing in Hudi Relations
[ https://issues.apache.org/jira/browse/HUDI-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4032: -- Summary: Remove double file-listing in Hudi Relations (was: Remove double file-listing in BaseFileOnlyRelation) > Remove double file-listing in Hudi Relations > > > Key: HUDI-4032 > URL: https://issues.apache.org/jira/browse/HUDI-4032 > Project: Apache Hudi > Issue Type: Task > Components: index > Reporter: Ethan Guo > Priority: Blocker > Fix For: 0.13.0
[GitHub] [hudi] YannByron commented on pull request #6225: [HUDI-4487] support to create ro/rt table by spark sql
YannByron commented on PR #6225: URL: https://github.com/apache/hudi/pull/6225#issuecomment-1199575340 @xushiyan can you help trigger a new CI run? I executed `run azure` twice, but it just returns the first status.
[hudi] branch master updated: [HUDI-4505] Returns instead of throws if lock file exists for FileSystemBasedLockProvider (#6242)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
new e04b3188e2 [HUDI-4505] Returns instead of throws if lock file exists for FileSystemBasedLockProvider (#6242)

e04b3188e2 is described below

commit e04b3188e465eabed71ba19342cb92d10963
Author: Danny Chan
AuthorDate: Fri Jul 29 23:32:19 2022 +0800

    [HUDI-4505] Returns instead of throws if lock file exists for FileSystemBasedLockProvider (#6242)

    To avoid unnecessary exception throws
---
 .../transaction/lock/FileSystemBasedLockProvider.java | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java
index 96a42e8409..4135ef9acd 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java
@@ -54,8 +54,8 @@ public class FileSystemBasedLockProvider implements LockProvider<String>, Serializable {
   private static final String LOCK_FILE_NAME = "lock";
   private final int lockTimeoutMinutes;
-  private transient FileSystem fs;
-  private transient Path lockFile;
+  private final transient FileSystem fs;
+  private final transient Path lockFile;
   protected LockConfiguration lockConfiguration;

   public FileSystemBasedLockProvider(final LockConfiguration lockConfiguration, final Configuration configuration) {
@@ -87,8 +87,13 @@ public class FileSystemBasedLockProvider implements LockProvider<String>, Serializable {
     try {
       synchronized (LOCK_FILE_NAME) {
         // Check whether lock is already expired, if so try to delete lock file
-        if (fs.exists(this.lockFile) && checkIfExpired()) {
-          fs.delete(this.lockFile, true);
+        if (fs.exists(this.lockFile)) {
+          if (checkIfExpired()) {
+            fs.delete(this.lockFile, true);
+            LOG.warn("Delete expired lock file: " + this.lockFile);
+          } else {
+            return false;
+          }
         }
         acquireLock();
         return fs.exists(this.lockFile);
@@ -123,7 +128,7 @@ public class FileSystemBasedLockProvider implements LockProvider<String>, Serializable {
     }
     try {
       long modificationTime = fs.getFileStatus(this.lockFile).getModificationTime();
-      if (System.currentTimeMillis() - modificationTime > lockTimeoutMinutes * 60 * 1000) {
+      if (System.currentTimeMillis() - modificationTime > lockTimeoutMinutes * 60 * 1000L) {
        return true;
      }
    } catch (IOException | HoodieIOException e) {
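One detail in the last hunk of this commit is easy to miss: the timeout product became `lockTimeoutMinutes * 60 * 1000L`. Without the `L` suffix the whole multiplication happens in `int` arithmetic and can overflow before the result is widened to `long` for the comparison. A minimal standalone demonstration (the class and method names below are ours, not Hudi's):

```java
public class LockTimeoutOverflow {

    // Mirrors the pre-fix expression: all three operands are int, so the
    // product overflows before it is widened to long.
    static long wrongMillis(int lockTimeoutMinutes) {
        return lockTimeoutMinutes * 60 * 1000;
    }

    // Mirrors the fixed expression: the 1000L literal forces long arithmetic.
    static long fixedMillis(int lockTimeoutMinutes) {
        return lockTimeoutMinutes * 60 * 1000L;
    }

    public static void main(String[] args) {
        // 40,000 minutes (about 28 days) pushes the int product past 2^31 - 1.
        System.out.println(wrongMillis(40000)); // -1894967296 (overflowed)
        System.out.println(fixedMillis(40000)); // 2400000000
    }
}
```

With the overflowed (negative) value, `System.currentTimeMillis() - modificationTime > timeout` would be true immediately, so a long-lived lock configured with a very large timeout could be treated as expired right away.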
[GitHub] [hudi] danny0405 merged pull request #6242: [HUDI-4505] Returns instead of throws if lock file exists for FileSys…
danny0405 merged PR #6242: URL: https://github.com/apache/hudi/pull/6242
[GitHub] [hudi] hudi-bot commented on pull request #6248: [HUDI-4303] Adding 4 to 5 upgrade handler to check for old deprecated "default" partition value
hudi-bot commented on PR #6248: URL: https://github.com/apache/hudi/pull/6248#issuecomment-1199503241 ## CI report: * e74c3a80c33bfc25cc01514efebc3a2c8ba75eb9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10460)
[GitHub] [hudi] hudi-bot commented on pull request #6247: [MINOR] Add license header
hudi-bot commented on PR #6247: URL: https://github.com/apache/hudi/pull/6247#issuecomment-1199344177 ## CI report: * 1c96edb81b1623c50975c8d3fd81241a81e40445 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10457)
[GitHub] [hudi] hudi-bot commented on pull request #6248: [HUDI-4303] Adding 4 to 5 upgrade handler to check for old deprecated "default" partition value
hudi-bot commented on PR #6248: URL: https://github.com/apache/hudi/pull/6248#issuecomment-1199314567 ## CI report: * e74c3a80c33bfc25cc01514efebc3a2c8ba75eb9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10460)
[GitHub] [hudi] hudi-bot commented on pull request #6242: [HUDI-4505] Returns instead of throws if lock file exists for FileSys…
hudi-bot commented on PR #6242: URL: https://github.com/apache/hudi/pull/6242#issuecomment-1199314485 ## CI report: * 4887d5c40a6b62998d6f5e64e06e91a326129ff8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10450) * 0ba12547e49adc6b5c285a51b893242f4d1690f6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10459)
[GitHub] [hudi] hudi-bot commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
hudi-bot commented on PR #5629: URL: https://github.com/apache/hudi/pull/5629#issuecomment-1199313520 ## CI report: * d0f078159313f8b35a41b1d1e016583204811383 UNKNOWN * 8bd34a6bee3084bdc6029f3c0740cf06906acfd5 UNKNOWN * 652a0d666fe29487d3ce2c2ce1cef70dc443dd61 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10302) * 279857485f18875cab94f72b5bf61522bdaecd31 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10458)
[GitHub] [hudi] hudi-bot commented on pull request #6248: [HUDI-4303] Adding 4 to 5 upgrade handler to check for old deprecated "default" partition value
hudi-bot commented on PR #6248: URL: https://github.com/apache/hudi/pull/6248#issuecomment-1199309745 ## CI report: * e74c3a80c33bfc25cc01514efebc3a2c8ba75eb9 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6242: [HUDI-4505] Returns instead of throws if lock file exists for FileSys…
hudi-bot commented on PR #6242: URL: https://github.com/apache/hudi/pull/6242#issuecomment-1199309652 ## CI report: * 4887d5c40a6b62998d6f5e64e06e91a326129ff8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10450) * 0ba12547e49adc6b5c285a51b893242f4d1690f6 UNKNOWN
[GitHub] [hudi] nsivabalan opened a new pull request, #6248: [HUDI-4303] Adding 4 to 5 upgrade handler to check for old deprecated "default" partition value
nsivabalan opened a new pull request, #6248: URL: https://github.com/apache/hudi/pull/6248 ## What is the purpose of the pull request From 0.12, we are standardizing the default partition value for hudi to "__HIVE_DEFAULT_PARTITION__". Previously, hudi used "default" as the default value (i.e. if the partition column is null, this fallback value is used). The fix was put up so that query engines will not run into a class cast exception if the original partition path fields are non-string types. But after this fix, we might need to migrate older hudi tables, i.e. if a "default" partition exists, we have to rewrite it to "__HIVE_DEFAULT_PARTITION__". This patch adds an upgrade step, where we detect such hudi tables and fail the upgrade, along with instructions on what needs to be done before upgrading. ## Brief change log - Added FourToFiveUpgradeHandler to detect hudi tables with a "default" partition and throw an exception. ## Verify this pull request This change added tests and can be verified as follows: - TestUpgradeDowngrade#testUpgradeFourtoFive - TestUpgradeDowngrade#testUpgradeFourtoFiveWithDefaultPartition ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
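The detection step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the class, the `validatePartitions` method, and operating on an in-memory list of partition paths are ours for the demo — the real FourToFiveUpgradeHandler works against the table's metadata, not a plain list.

```java
import java.util.Arrays;
import java.util.List;

public class FourToFiveUpgradeCheck {

    static final String OLD_DEFAULT = "default";
    static final String NEW_DEFAULT = "__HIVE_DEFAULT_PARTITION__";

    // Hypothetical guard: fail the upgrade if the deprecated partition name
    // is present, mirroring the behavior the PR describes.
    static void validatePartitions(List<String> partitionPaths) {
        if (partitionPaths.contains(OLD_DEFAULT)) {
            throw new IllegalStateException(
                "Found deprecated partition '" + OLD_DEFAULT + "'; rewrite it to '"
                    + NEW_DEFAULT + "' before upgrading.");
        }
    }

    public static void main(String[] args) {
        // A table already using the new sentinel value passes the check.
        validatePartitions(Arrays.asList("2022/07/29", NEW_DEFAULT));

        // A table still holding the old "default" partition blocks the upgrade.
        try {
            validatePartitions(Arrays.asList("2022/07/29", OLD_DEFAULT));
        } catch (IllegalStateException e) {
            System.out.println("upgrade blocked: " + e.getMessage());
        }
    }
}
```

Failing fast here, rather than silently rewriting, matches the PR's stated design: the operator must migrate the "default" partition data before the upgrade proceeds.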
[GitHub] [hudi] pratyakshsharma commented on a diff in pull request #4718: [HUDI-3345][RFC-36] Proposal for hudi metastore server.
pratyakshsharma commented on code in PR #4718: URL: https://github.com/apache/hudi/pull/4718#discussion_r933224773 ## rfc/rfc-36/rfc-36.md: ## @@ -0,0 +1,605 @@ + +# RFC-36: Hudi Metastore Server + +## Proposers + +- @minihippo + +## Approvers + + +## Status + +JIRA: [HUDI-3345](https://issues.apache.org/jira/browse/HUDI-3345) + +> Please keep the status updated in `rfc/README.md`. + +# Hudi Metastore Server + +## Abstract + +Currently, Hudi is widely used as a table format in the data warehouse. There is a lack of central metastore server to manage the metadata of data lake table. Hive metastore as a commonly used catalog service in the data warehouse on Hadoop cannot store the unique metadata like timeline of the hudi table. + +The proposal is to implement an unified metadata management system called hudi metastore server to store the metadata of the hudi table, and be compatible with hive metastore so that other engines can access it without any changes. + +## Backgroud + +**How Hudi metadata is stored** + +The metadata of hudi are table location, configuration and schema, timeline generated by instants, metadata of each commit / instant, which records files created / updated, new records num and so on in this commit. Besides, the information of files in a hudi table is also a part of hudi metadata. Review Comment: This discussion brings me to a high level question. Today column stats are already stored at a file level in metadata table. So do we intend to completely replace metadata table with this new metastore server? Or do we intend to use metastore server only to store table level stats similar to how hive metastore does that? Another possibility I can think of is just exposing endpoints via metastore service to interact with different partitions of metadata table as Vinoth pointed out in another comment. @minihippo -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6247: [MINOR] Add license header
hudi-bot commented on PR #6247: URL: https://github.com/apache/hudi/pull/6247#issuecomment-1199242806 ## CI report: * 1c96edb81b1623c50975c8d3fd81241a81e40445 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10457)
[GitHub] [hudi] pratyakshsharma commented on a diff in pull request #4718: [HUDI-3345][RFC-36] Proposal for hudi metastore server.
pratyakshsharma commented on code in PR #4718: URL: https://github.com/apache/hudi/pull/4718#discussion_r933216702 ## rfc/rfc-36/rfc-36.md: ## @@ -0,0 +1,605 @@ + +# RFC-36: Hudi Metastore Server + +## Proposers + +- @minihippo + +## Approvers + + +## Status + +JIRA: [HUDI-3345](https://issues.apache.org/jira/browse/HUDI-3345) + +> Please keep the status updated in `rfc/README.md`. + +# Hudi Metastore Server + +## Abstract + +Currently, Hudi is widely used as a table format in the data warehouse. There is a lack of central metastore server to manage the metadata of data lake table. Hive metastore as a commonly used catalog service in the data warehouse on Hadoop cannot store the unique metadata like timeline of the hudi table. + +The proposal is to implement an unified metadata management system called hudi metastore server to store the metadata of the hudi table, and be compatible with hive metastore so that other engines can access it without any changes. + +## Backgroud + +**How Hudi metadata is stored** + +The metadata of hudi are table location, configuration and schema, timeline generated by instants, metadata of each commit / instant, which records files created / updated, new records num and so on in this commit. Besides, the information of files in a hudi table is also a part of hudi metadata. + +Different from instant or schema recorded by a separate file that is stored under `${tablelocation}/.hoodie` on the HDFS or object storage, files info are managed by the HDFS directly. Hudi gets all files of a table by file listing. File listing is a costly operation and its performance is limited by namenode. In addition, there will be a few invalid files on the file system, which are created by spark speculative tasks(for example) and are not deleted successfully. Getting files by listing will result in inconsistency, so hudi has to store the valid files from each commit metadata, the metadata about files is usually referred to snapshot. 
+ +RFC-15 metadata table is a proposal that can solve these problems. However, it only manages the metadata of one table. There is a lack of a unified view. + +**The integration of Hive metastore and Hudi metadata lacks a single source of truth.** + +Hive metastore server is widely used as a metadata center in the data warehouse on Hadoop. It stores the metadata for hive tables like their schema, location and partitions. Currently, almost all of the storage or computing engines support registering table information to it, discovering and retrieving metadata from it. Meanwhile, cloud service providers like AWS Glue, HUAWEI Cloud, Google Cloud Dataproc, Alibaba Cloud, ByteDance Volcano Engine all provide Apache Hive metastore compatible catalog. It seems that hive metastore has become a standard in the data warehouse. + +Different from the traditional table format like hive table, the data lake table not only has schema, partitions and other hive metadata, but also has timeline, snapshot which is unconventional. Hence, the metadata of data lake cannot be managed by HMS directly. + +Hudi just syncs the schema and partitions to HMS by now, and other metadata still stores on HDFS or object store. Metadata synchronization between different metadata management systems will result in inconsistency. + +## Overview + +![architecture](architecture.png) + +The hudi metastore server is for metadata management of the data lake table, to support metadata persistency, efficient metadata access and other extensions for data lake. The metadata server managed includes the information of databases and tables, partitions, schemas, instants, instants' meta and files' meta. + +The metastore server has two main components: service and storage. The storage is for metadata persistency and the service is to receive the get / put requests from client and return / store the processing result after doing some logical operations on metadata. 
+ +The hudi metastore server is / has + +- **A metastore server for data lake** +- Different from the traditional table format, the metadata of the data lake has timeline and snapshot concepts, in addition to schema and partitions. + +- The metastore server is a unified metadata management system for data lake tables. + +- **Pluggable storage** +- The storage is only responsible for metadata persistency. Therefore, it doesn't matter which storage engine is used to store the data; it can be an RDBMS, kv system or file system. + +- **Easy to be expanded** +- The service is stateless, so it can be scaled horizontally to support higher QPS. The storage can be split vertically to store more data. + +- **Compatible with multiple computing engines** +- The server has an adapter to be compatible with hive metastore server. + +## Design + +This part has four sections: what the service does, what metadata is stored and how, how the service interacts with the storage when reading
[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback
wzx140 commented on code in PR #6132: URL: https://github.com/apache/hudi/pull/6132#discussion_r933207193

## rfc/rfc-46/rfc-46.md:

@@ -84,59 +84,90 @@ is known to have poor performance (compared to non-reflection based instantiation)

 Record Merge API

-Stateless component interface providing for API Combining Records will look like following:
+CombineAndGetUpdateValue and Precombine will converge to one API. Stateless component interface providing for API Combining Records will look like following:

 ```java
-interface HoodieMerge {
-  HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
-
-  Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+interface HoodieRecordMerger {
+  // combineAndGetUpdateValue and precombine
+  Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+
+  // The record type handled by the current merger
+  // SPARK, AVRO, FLINK
+  HoodieRecordType getRecordType();
 }

-/**
- * Spark-specific implementation
- */
-class HoodieSparkRecordMerge implements HoodieMerge {
+/**
+ * Spark-specific implementation
+ */
+class HoodieSparkRecordMerger implements HoodieRecordMerger {

-  @Override
-  public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-    // HoodieSparkRecords preCombine
-  }
+  @Override
+  Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+    // HoodieSparkRecord precombine and combineAndGetUpdateValue
+  }

-  @Override
-  public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-    // HoodieSparkRecord combineAndGetUpdateValue
-  }
+  @Override
+  HoodieRecordType getRecordType() {
+    return HoodieRecordType.SPARK;
+  }
 }

-/**
- * Flink-specific implementation
- */
-class HoodieFlinkRecordMerge implements HoodieMerge {
-
-  @Override
-  public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-    // HoodieFlinkRecord preCombine
-  }
+/**
+ * Flink-specific implementation
+ */
+class HoodieFlinkRecordMerger implements HoodieRecordMerger {
+
+  @Override
+  Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+    // HoodieFlinkRecord precombine and combineAndGetUpdateValue
+  }

-  @Override
-  public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-    // HoodieFlinkRecord combineAndGetUpdateValue
-  }
+  @Override
+  HoodieRecordType getRecordType() {
+    return HoodieRecordType.FLINK;
+  }
 }
 ```

 Where user can provide their own subclass implementing such interface for the engines of interest.

- Migration from `HoodieRecordPayload` to `HoodieMerge`
+ Migration from `HoodieRecordPayload` to `HoodieRecordMerger`

 To warrant backward-compatibility (BWC) on the code-level with already created subclasses of `HoodieRecordPayload` currently
-already used in production by Hudi users, we will provide a BWC-bridge in the form of instance of `HoodieMerge`, that will
+already used in production by Hudi users, we will provide a BWC-bridge in the form of instance of `HoodieRecordMerger` called `HoodieAvroRecordMerger`, that will
 be using user-defined subclass of `HoodieRecordPayload` to combine the records.

 Leveraging such bridge will make provide for seamless BWC migration to the 0.11 release, however will be removing the performance
 benefit of this refactoring, since it would unavoidably have to perform conversion to intermediate representation (Avro). To realize
 full-suite of benefits of this refactoring, users will have to migrate their merging logic out of `HoodieRecordPayload` subclass and into
-new `HoodieMerge` implementation.
+new `HoodieRecordMerger` implementation.
+
+Precombine is used to merge records from logs or incoming records; CombineAndGetUpdateValue is used to merge record from log file and record from base file.
+These two merge logics are not exactly the same for some RecordPayload, such as OverwriteWithLatestAvroPayload.
+We add an Enum in HoodieRecord to mark where it comes from (BASE, LOG or WRITE). `HoodieAvroRecordMerger`'s API will look like following:

Review Comment: I think you're right. I've removed the mark (BASE, LOG or WRITE) in HoodieRecord and unified the logic of HoodieSparkRecord.
[GitHub] [hudi] hudi-bot commented on pull request #6247: [MINOR] Add license header
hudi-bot commented on PR #6247: URL: https://github.com/apache/hudi/pull/6247#issuecomment-1199234091 ## CI report: * 1c96edb81b1623c50975c8d3fd81241a81e40445 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6242: [HUDI-4505] Returns instead of throws if lock file exists for FileSys…
hudi-bot commented on PR #6242: URL: https://github.com/apache/hudi/pull/6242#issuecomment-1199234025 ## CI report: * 4887d5c40a6b62998d6f5e64e06e91a326129ff8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10450) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
hudi-bot commented on PR #5629: URL: https://github.com/apache/hudi/pull/5629#issuecomment-1199233077 ## CI report: * d0f078159313f8b35a41b1d1e016583204811383 UNKNOWN * 8bd34a6bee3084bdc6029f3c0740cf06906acfd5 UNKNOWN * 652a0d666fe29487d3ce2c2ce1cef70dc443dd61 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10302) * 279857485f18875cab94f72b5bf61522bdaecd31 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6246: Be able to disable precombine field when table schema contains a field named ts
hudi-bot commented on PR #6246: URL: https://github.com/apache/hudi/pull/6246#issuecomment-1199228989 ## CI report: * 0c6cfaaeb0512d426753e989b6fcc72c5d79293b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10455) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6245: [HUDI-4506] make BucketIndexPartitioner distribute data more balance
hudi-bot commented on PR #6245: URL: https://github.com/apache/hudi/pull/6245#issuecomment-1199228946 ## CI report: * 003df191ea86c299144f8a577ba817bb52ecd593 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10454) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6242: [HUDI-4505] Returns instead of throws if lock file exists for FileSys…
hudi-bot commented on PR #6242: URL: https://github.com/apache/hudi/pull/6242#issuecomment-1199228878 ## CI report: * 4887d5c40a6b62998d6f5e64e06e91a326129ff8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10450) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] haripriyarhp commented on issue #6166: [SUPPORT] Missing records when using Kafka Hudi sink to write to S3.
haripriyarhp commented on issue #6166: URL: https://github.com/apache/hudi/issues/6166#issuecomment-1199226564

@rmahindra123: Unfortunately, I am not able to share the .hoodie folder. Just to add, yesterday I tried it out again. I sent messages to a topic in batches. Below are the steps I followed:

1. Sent a batch of 100 records to Kafka. Ran compaction. No. of messages in Kafka and no. of records in Athena matched.
2. Sent a batch of another 100 records to Kafka -> compaction -> no. of messages in Kafka = no. of records in Athena.
3. Sent a batch of another 100 records (here there were some duplicates) -> compaction -> no. of messages in Kafka = no. of records in Athena.
4. Sent another batch of 98 records (some were duplicates) -> compaction -> no. of messages != no. of records in Athena. There were no more files to be compacted. About 24 records were missing.
5. Sent another 100 records -> compaction -> record count did not match; the same 24 were missing.

More or less, I followed the above steps several times before I raised the issue here. Each time, after a few runs the record count does not match even after running compaction.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codope opened a new pull request, #6247: [MINOR] Add license header
codope opened a new pull request, #6247: URL: https://github.com/apache/hudi/pull/6247 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #6242: [HUDI-4505] Returns instead of throws if lock file exists for FileSys…
danny0405 commented on PR #6242: URL: https://github.com/apache/hudi/pull/6242#issuecomment-1199208040 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Resolved] (HUDI-4504) Disable metadata table by default for flink
[ https://issues.apache.org/jira/browse/HUDI-4504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen resolved HUDI-4504.
------------------------------

> Disable metadata table by default for flink
> -------------------------------------------
>
>                 Key: HUDI-4504
>                 URL: https://issues.apache.org/jira/browse/HUDI-4504
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: flink
>            Reporter: Danny Chen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.12.0
>

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (HUDI-4504) Disable metadata table by default for flink
[ https://issues.apache.org/jira/browse/HUDI-4504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572944#comment-17572944 ]

Danny Chen commented on HUDI-4504:
----------------------------------

Fixed via master branch: a1cf401350ee7f8a66b4e927bce22b45a11260fc

> Disable metadata table by default for flink
> -------------------------------------------
>
>                 Key: HUDI-4504
>                 URL: https://issues.apache.org/jira/browse/HUDI-4504
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: flink
>            Reporter: Danny Chen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.12.0
>

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
[hudi] branch master updated: [HUDI-4504] Disable metadata table by default for flink (#6241)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new a1cf401350 [HUDI-4504] Disable metadata table by default for flink (#6241)
a1cf401350 is described below

commit a1cf401350ee7f8a66b4e927bce22b45a11260fc
Author: Danny Chan
AuthorDate: Fri Jul 29 20:06:24 2022 +0800

    [HUDI-4504] Disable metadata table by default for flink (#6241)
---
 .../src/main/java/org/apache/hudi/configuration/FlinkOptions.java | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
index 0984296ee5..933c112312 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
@@ -104,8 +104,8 @@ public class FlinkOptions extends HoodieConfig {
   public static final ConfigOption METADATA_ENABLED = ConfigOptions
       .key("metadata.enabled")
       .booleanType()
-      .defaultValue(true)
-      .withDescription("Enable the internal metadata table which serves table metadata like level file listings, default enabled");
+      .defaultValue(false)
+      .withDescription("Enable the internal metadata table which serves table metadata like level file listings, default disabled");

   public static final ConfigOption METADATA_COMPACTION_DELTA_COMMITS = ConfigOptions
       .key("metadata.compaction.delta_commits")
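The commit above flips the `metadata.enabled` default to `false` for Flink, so users who still want the metadata table must opt back in per table. A hedged sketch of what that could look like in Flink SQL: the table name, schema, and path below are made up for illustration; only the `'metadata.enabled'` option key comes from the diff above, and the `'connector'`/`'path'` options follow the usual hudi-flink table conventions.

```sql
-- Hypothetical table; only the 'metadata.enabled' key is taken from FlinkOptions above.
CREATE TABLE hudi_orders (
  order_id STRING PRIMARY KEY NOT ENFORCED,
  amount   DOUBLE,
  ts       TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_orders',
  'metadata.enabled' = 'true'   -- opt back in after the default change above
);
```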
[GitHub] [hudi] danny0405 merged pull request #6241: [HUDI-4504] Disable metadata table by default for flink
danny0405 merged PR #6241: URL: https://github.com/apache/hudi/pull/6241 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #6241: [HUDI-4504] Disable metadata table by default for flink
danny0405 commented on PR #6241: URL: https://github.com/apache/hudi/pull/6241#issuecomment-1199200078 https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10449&view=results The CI is actually green and I would merge it then ~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
svn commit: r56033 - in /dev/hudi/hudi-0.12.0-rc1: ./ hudi-0.12.0-rc1.src.tgz hudi-0.12.0-rc1.src.tgz.asc hudi-0.12.0-rc1.src.tgz.sha512
Author: codope Date: Fri Jul 29 11:50:52 2022 New Revision: 56033 Log: Add source distribution for 0.12.0-rc1 Added: dev/hudi/hudi-0.12.0-rc1/ dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz (with props) dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.asc dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.sha512 Added: dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz == Binary file - no diff available. Propchange: dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz -- svn:mime-type = application/octet-stream Added: dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.asc == --- dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.asc (added) +++ dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.asc Fri Jul 29 11:50:52 2022 @@ -0,0 +1,14 @@ +-BEGIN PGP SIGNATURE- + +iQGzBAABCAAdFiEE/SFTQuMZlBmt+/Qd1GI+OqFtdbAFAmLjyHUACgkQ1GI+OqFt +dbDUCAv+LzOubgFQQ3eDQtXZid+jPHbH1yLxLh9gLDkPRPE3eaUE9tMpl83d8zKU +eY4kmD4Byax4FzQnFbcSSdWniXXh2cj5GVLYGO3EQirQ+evkY+ZSIP5JK2mrlJ9B +ZlbPkC3S4egsxZVKE+ytMz4vvCvVgO3y19VfAmMvWyDq3st3aNDjmF4962RJUXoK +oCr/6/6A56/q94qniLJR4XOAK49VZdsuuumBi8ldoSU5KraNtuCs8MLd13EyxcW4 +gYGtLp1qmvt21NT9YG5NI4XKIT9+/LAoX9P7q9DSkib7iyFn3wnZsvwiVhIsh94A +UxNni1mjRGIerJkD3ZpHQZWdUsgpaqnQ9qROwIunsUMzr4stYOhEMzcq27Orl+uX +rasgdjCGD6MV/TWbGpmU2qrd1CO976BkCJC0o5+2rSrmCl68atlNGT6XnCzg4Mkg +zuGeSTNAyPNbDhhQHGHZyH9PV5HbMGq3Q4Vk4dk7Ke04AzT3Ru5NQ9kA8dlR4dYq +qJ89SlsH +=SXT4 +-END PGP SIGNATURE- Added: dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.sha512 == --- dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.sha512 (added) +++ dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.sha512 Fri Jul 29 11:50:52 2022 @@ -0,0 +1 @@ +ee289f7b0c26211e8b8d9d4f645c9ce01d8f4e75c71d998141626df2fd4adf8cef957a2959feaa06505a815ad582a039d7a67ccfff3b71a6c61733918c520486 hudi-0.12.0-rc1.src.tgz
[hudi] annotated tag release-0.12.0-rc1 updated (3383b2388b -> 170eb40a62)
This is an automated email from the ASF dual-hosted git repository. codope pushed a change to annotated tag release-0.12.0-rc1 in repository https://gitbox.apache.org/repos/asf/hudi.git *** WARNING: tag release-0.12.0-rc1 was modified! *** from 3383b2388b (commit) to 170eb40a62 (tag) tagging 3383b2388bd0b107646edf38e98bdb5ee88281bc (commit) replaces hoodie-0.4.7 by Sagar Sumit on Fri Jul 29 17:17:54 2022 +0530 - Log - 0.12.0 -BEGIN PGP SIGNATURE- iI0EABYKADUWIQQ7EyGPQog2tHUAvYP7nRNrCtiI9gUCYuPI6hccc2FnYXJzdW1p dDA5QGdtYWlsLmNvbQAKCRD7nRNrCtiI9pBQAP9uEKBKPDKzcNtC6qtGnNT08q0t eMT5jCwNffY9+ztiSQEA45MDutsbEhWq+MVbE3tF9W+KzEj/Um7t4vaJ9tV49A8= =4zJ1 -END PGP SIGNATURE- --- No new revisions were added by this update. Summary of changes:
[hudi] annotated tag release-0.12.0-rc1 updated (3383b2388b -> 26d7199091)
This is an automated email from the ASF dual-hosted git repository. codope pushed a change to annotated tag release-0.12.0-rc1 in repository https://gitbox.apache.org/repos/asf/hudi.git *** WARNING: tag release-0.12.0-rc1 was modified! *** from 3383b2388b (commit) to 26d7199091 (tag) tagging 3383b2388bd0b107646edf38e98bdb5ee88281bc (commit) replaces hoodie-0.4.7 by Sagar Sumit on Fri Jul 29 17:06:56 2022 +0530 - Log - 0.12.0 -BEGIN PGP SIGNATURE- iI0EABYKADUWIQQ7EyGPQog2tHUAvYP7nRNrCtiI9gUCYuPGWBccc2FnYXJzdW1p dDA5QGdtYWlsLmNvbQAKCRD7nRNrCtiI9tXnAP9gc/O/gDz0YztHMPtZ1wQDBZDG HrbYlapQE/AFa39ZwQD/ZzEZbrKd70JkUD+QEKW0Dt/NGDzyClTZ4Q9Ridr/dwA= =ipMK -END PGP SIGNATURE- --- No new revisions were added by this update. Summary of changes:
[hudi] tag release-0.12.0-rc1 created (now 3383b2388b)
This is an automated email from the ASF dual-hosted git repository. codope pushed a change to tag release-0.12.0-rc1 in repository https://gitbox.apache.org/repos/asf/hudi.git at 3383b2388b (commit) No new revisions were added by this update.
[GitHub] [hudi] hudi-bot commented on pull request #6246: Be able to disable precombine field when table schema contains a field named ts
hudi-bot commented on PR #6246: URL: https://github.com/apache/hudi/pull/6246#issuecomment-1199131031 ## CI report: * 0c6cfaaeb0512d426753e989b6fcc72c5d79293b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10455) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org