[GitHub] [hudi] hudi-bot commented on pull request #6677: [HUDI-4294][Stacked on 4293] Introduce build action to actually perform index data generation
hudi-bot commented on PR #6677: URL: https://github.com/apache/hudi/pull/6677#issuecomment-1247664423 ## CI report: * 59e2196397ef68d75697a35b1b91e661ef9d3aa4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11372) * 0ce0aee73e1641f071abdfc44d4f5473a425befb UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table
hudi-bot commented on PR #4676: URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247657488 ## CI report: * 5944f5cbe9ce73fe6b7e27a0d381eaeb80dead38 UNKNOWN * 4ef7b451c3dd795906f3f68571256baeb330a59f UNKNOWN * 6aeb3d0d8f09aeab2a5766cf9d25ecb30537 UNKNOWN * e3914eb7b48fc4c5e3bd6f0fd00888ac6da8fa21 UNKNOWN * 24747bb7e1f23d6db70672cab3795cb131ce8dcb Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11371) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hackergin commented on a diff in pull request #6628: [HUDI-4806] Use Avro version from the root pom for Flink bundle
hackergin commented on code in PR #6628: URL: https://github.com/apache/hudi/pull/6628#discussion_r971586922 ## packaging/hudi-flink-bundle/pom.xml: ## @@ -501,8 +501,7 @@ <groupId>org.apache.avro</groupId> <artifactId>avro</artifactId> - - <version>1.10.0</version> + <version>${avro.version}</version> Review Comment: > Yes, we've tested with Flink streamer loading data from Kafka datasource in Hudi format. And it works fine Hi @CTTY, I hit a java.lang.ClassNotFoundException when using the latest master code. The class org.apache.avro.LogicalTypes$LocalTimestampMillis seems to exist only in Avro 1.10 and later. Please help confirm this problem, and correct me if I am wrong. ```
Caused by: java.lang.ClassNotFoundException: org.apache.hudi.org.apache.avro.LogicalTypes$LocalTimestampMillis
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382) ~[?:1.8.0_202]
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424) ~[?:1.8.0_202]
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) ~[?:1.8.0_202]
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_202]
  at org.apache.hudi.table.HoodieTableFactory.inferAvroSchema(HoodieTableFactory.java:346) ~[hudi-flink1.14-bundle-0.13.0-SNAPSHOT.jar:0.13.0-SNAPSHOT]
``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
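For context on the failure above: `LogicalTypes.LocalTimestampMillis` was only added in Avro 1.10, so if `${avro.version}` resolves to an older Avro when the Flink bundle is built, the relocated class will simply be absent from the jar. Below is a minimal standalone probe (an illustration, not Hudi code; the class name is hypothetical) for checking what a given bundle actually contains:

```java
// Hypothetical probe: run with the bundle jar on the classpath, e.g.
//   java -cp hudi-flink1.14-bundle-0.13.0-SNAPSHOT.jar:. AvroClassProbe
public class AvroClassProbe {
  public static void main(String[] args) {
    String[] candidates = {
        // unshaded Avro class, introduced in Avro 1.10
        "org.apache.avro.LogicalTypes$LocalTimestampMillis",
        // the relocated copy the Flink bundle is expected to ship
        "org.apache.hudi.org.apache.avro.LogicalTypes$LocalTimestampMillis"
    };
    for (String name : candidates) {
      try {
        Class.forName(name);
        System.out.println("FOUND   " + name);
      } catch (ClassNotFoundException e) {
        System.out.println("MISSING " + name);
      }
    }
  }
}
```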
[GitHub] [hudi] fengjian428 commented on a diff in pull request #4676: [HUDI-3304] support partial update on mor table
fengjian428 commented on code in PR #4676: URL: https://github.com/apache/hudi/pull/4676#discussion_r971585340 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieWriteHelper.java: ## @@ -51,16 +55,20 @@ protected HoodieData<HoodieRecord<T>> tag(HoodieData<HoodieRecord<T>> dedupedRecords @Override public HoodieData<HoodieRecord<T>> deduplicateRecords( - HoodieData<HoodieRecord<T>> records, HoodieIndex<?, ?> index, int parallelism) { + HoodieData<HoodieRecord<T>> records, HoodieIndex<?, ?> index, int parallelism, String jsonSchema) { boolean isIndexingGlobal = index.isGlobal(); +final Schema[] schema = {null}; return records.mapToPair(record -> { HoodieKey hoodieKey = record.getKey(); // If index used is global, then records are expected to differ in their partitionPath Object key = isIndexingGlobal ? hoodieKey.getRecordKey() : hoodieKey; return Pair.of(key, record); }).reduceByKey((rec1, rec2) -> { + if (schema[0] == null) { +schema[0] = new Schema.Parser().parse(jsonSchema); Review Comment: How about passing a SerializableSchema to every executor? It seems it is not easy to get the Spark context here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
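The concern here is that Avro's `Schema` is not `Serializable`, so it cannot be captured directly in the `reduceByKey` closure, and re-parsing the JSON string on every merge is wasteful. A minimal sketch of the suggested direction (an illustration of the idea, not Hudi's actual `SerializableSchema` class): carry the schema as its JSON string, which serializes cleanly, and parse it at most once per executor.

```java
import java.io.Serializable;

import org.apache.avro.Schema;

// Serializable schema holder: the JSON string travels with the closure,
// and the Schema object is rebuilt lazily (once) on each executor.
public class LazySerializableSchema implements Serializable {
  private static final long serialVersionUID = 1L;

  private final String schemaJson;
  private transient volatile Schema parsed;

  public LazySerializableSchema(Schema schema) {
    this.schemaJson = schema.toString();
  }

  public Schema get() {
    if (parsed == null) {
      synchronized (this) {
        if (parsed == null) {
          parsed = new Schema.Parser().parse(schemaJson);
        }
      }
    }
    return parsed;
  }
}
```

The holder could then be captured in the closure, e.g. `reduceByKey((rec1, rec2) -> merge(rec1, rec2, schemaHolder.get()), parallelism)`, where `merge` stands in for the record-combining logic.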
[GitHub] [hudi] boundarymate commented on a diff in pull request #3951: [HUDI-2715] The BitCaskDiskMap iterator may cause memory leak
boundarymate commented on code in PR #3951: URL: https://github.com/apache/hudi/pull/3951#discussion_r971492356 ## hudi-common/src/main/java/org/apache/hudi/common/util/collection/BitCaskDiskMap.java: ## @@ -275,6 +282,7 @@ public void close() { } } writeOnlyFile.delete(); + this.iterators.forEach(ClosableIterator::close); Review Comment: Hi Danny, why not close these in a finally block? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
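For illustration, the pattern being asked about looks like the standalone sketch below (a hypothetical resource holder, not the actual `BitCaskDiskMap` code): the `finally` block guarantees that tracked iterators are closed even when the primary cleanup, such as deleting spill files, throws.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder that tracks open iterators and always releases them.
public class TrackedResources implements AutoCloseable {
  private final List<AutoCloseable> iterators = new ArrayList<>();

  public void track(AutoCloseable iterator) {
    iterators.add(iterator);
  }

  @Override
  public void close() throws Exception {
    try {
      // ... primary cleanup (e.g. deleting spill files), which may throw ...
    } finally {
      for (AutoCloseable it : iterators) {
        try {
          it.close();
        } catch (Exception e) {
          // best effort: keep closing the remaining iterators
        }
      }
    }
  }
}
```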
[GitHub] [hudi] hudi-bot commented on pull request #5933: [HUDI-4293] Implement Create/Drop/Show/Refresh Index Command for Secondary Index
hudi-bot commented on PR #5933: URL: https://github.com/apache/hudi/pull/5933#issuecomment-1247608019 ## CI report: * 65359879df848d75b6693f4c313dc9453d635edd Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11370) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6677: [HUDI-4294][Stacked on 4293] Introduce build action to actually perform index data generation
hudi-bot commented on PR #6677: URL: https://github.com/apache/hudi/pull/6677#issuecomment-1247605650 ## CI report: * 59e2196397ef68d75697a35b1b91e661ef9d3aa4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11372) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #6600: [RFC-62] Diagnostic Reporter
zhangyue19921010 commented on code in PR #6600: URL: https://github.com/apache/hudi/pull/6600#discussion_r971542395 ## rfc/rfc-62/rfc-62.md: ## @@ -0,0 +1,443 @@ + +# RFC-62: Diagnostic Reporter + + + +## Proposers + +- zhangyue19921...@163.com + +## Approvers + - @codope + - @xushiyan + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-4707 + +> Please keep the status updated in `rfc/README.md`. + +## Abstract + +With the development of Hudi, more and more users choose Hudi to build their own ingestion pipelines to support real-time or batch upsert requirements. +Subsequently, some of them may ask the community for help, for example how to improve the performance of their Hudi ingestion jobs, or why their Hudi jobs failed. + +When dealing with such issues, volunteers in the Hudi community may ask users to provide a list of information, including engine context, job configs, +data pattern, Spark UI, etc. Users need to spend extra effort to review their own jobs, collect metrics one by one according to the list, and give feedback to the volunteers. +Moreover, unexpected errors may occur while users are manually collecting this information. + +Obviously, there are relatively high communication costs for both volunteers and users. + +On the other hand, advanced users also need some way to efficiently understand the characteristics of their Hudi tables, including data volume, upsert pattern, and so on. + +## Background +As we know, Hudi already has its own unique metrics system and metadata framework. This information is very important for Hudi job tuning and troubleshooting. For example: + +1. Hudi records the complete timeline in the .hoodie directory, including the active timeline and the archived timeline. From this we can trace the historical state of the Hudi job. + +2. The Hudi metadata table records all the partitions, all the data files, etc. + +3. Each Hudi commit records various metadata information and runtime metrics for the current write, such as: +```json +{ +"partitionToWriteStats":{ +"20210623/0/20210825":[ +{ +"fileId":"4ae31921-eedd-4c56-8218-bb47849397a4-0", + "path":"20210623/0/20210825/4ae31921-eedd-4c56-8218-bb47849397a4-0_0-27-2006_20220818134233973.parquet", +"prevCommit":"null", +"numWrites":123352, +"numDeletes":0, +"numUpdateWrites":0, +"numInserts":123352, +"totalWriteBytes":4675371, +"totalWriteErrors":0, +"tempPath":null, +"partitionPath":"20210623/0/20210825", +"totalLogRecords":0, +"totalLogFilesCompacted":0, +"totalLogSizeCompacted":0, +"totalUpdatedRecordsCompacted":0, +"totalLogBlocks":0, +"totalCorruptLogBlock":0, +"totalRollbackBlocks":0, +"fileSizeInBytes":4675371, +"minEventTime":null, +"maxEventTime":null +} +] +}, +"compacted":false, +"extraMetadata":{ +"schema":"" +}, +"operationType":"UPSERT", +"totalRecordsDeleted":0, +"totalLogFilesSize":0, +"totalScanTime":0, +"totalCreateTime":21051, +"totalUpsertTime":0, +"minAndMaxEventTime":{ +"Optional.empty":{ +"val":null, +"present":false +} +}, +"writePartitionPaths":[ +"20210623/0/20210825" +], +"fileIdAndRelativePaths":{ + "c144908e-ca7d-401f-be1c-613de98d96a3-0":"20210623/0/20210825/c144908e-ca7d-401f-be1c-613de98d96a3-0_3-33-2009_20220818134233973.parquet" +}, +"totalLogRecordsCompacted":0, +"totalLogFilesCompacted":0, +"totalCompactedRecordsUpdated":0 +} +``` + +In order to expose the Hudi table context more efficiently, this RFC proposes a Diagnostic Reporter Tool.
+This tool can be turned on as the final stage in the ingestion job after commit; it will collect common troubleshooting information, including engine runtime information (taking Spark as the example here), and generate a diagnostic report JSON file. Review Comment: I think if users turn on this feature, Hudi is going to generate this report no matter whether the current commit is successful or not. The report will be created under `.hoodie/report/instant-time/HudiTableName + "_" + HudiVersion + "_" + appName + "_" + applicationId + "_" + applicationAttemptId + "_" + isLocal + format` Of course, some contents in the report JSON are pretty time-consuming to collect, like the zip file or `Data information`; these could default to false. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
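To make the proposed layout concrete, here is a small sketch (all values are hypothetical stand-ins for the fields named in the comment above) that assembles the report path:

```java
import java.nio.file.Paths;

// Illustration of the proposed report location; every value below is made up.
public class ReportPathExample {
  public static void main(String[] args) {
    String basePath = "/data/hudi_trips/.hoodie"; // table's .hoodie directory
    String instantTime = "20220915093000000";     // instant being reported on
    String fileName = String.join("_",
        "hudi_trips",      // HudiTableName
        "0.12.0",          // HudiVersion
        "ingest-job",      // appName
        "application_123", // applicationId
        "1",               // applicationAttemptId
        "false")           // isLocal
        + ".json";         // format
    // -> /data/hudi_trips/.hoodie/report/20220915093000000/
    //      hudi_trips_0.12.0_ingest-job_application_123_1_false.json
    System.out.println(Paths.get(basePath, "report", instantTime, fileName));
  }
}
```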
[hudi] branch master updated: [HUDI-4837] Stop sleeping where it is not necessary after the success (#6270)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 35d03e9a1b [HUDI-4837] Stop sleeping where it is not necessary after the success (#6270) 35d03e9a1b is described below commit 35d03e9a1bede05d10f10c6e4b57ffe66ca7f330 Author: Volodymyr Burenin AuthorDate: Thu Sep 15 00:11:34 2022 -0500 [HUDI-4837] Stop sleeping where it is not necessary after the success (#6270) Co-authored-by: Volodymyr Burenin Co-authored-by: Y Ethan Guo --- .../org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java | 4 +++- .../java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java | 4 ++-- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java index 1e78610ced..81f06a0f9f 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java @@ -315,7 +315,9 @@ public class KafkaOffsetGen { // TODO(HUDI-4625) cleanup, introduce retrying client partitionInfos = consumer.partitionsFor(topicName); try { -TimeUnit.SECONDS.sleep(10); +if (partitionInfos == null) { + TimeUnit.SECONDS.sleep(10); +} } catch (InterruptedException e) { LOG.error("Sleep failed while fetching partitions"); } diff --git a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java index 2b99f19b27..1147736143 100644 --- a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java +++ b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java @@ -249,7 +249,7 @@ abstract class BaseTestKafkaSource extends SparkClientFunctionalTestHarness { // create a topic with very short retention final String topic = TEST_TOPIC_PREFIX + "testFailOnDataLoss"; Properties topicConfig = new Properties(); -topicConfig.setProperty("retention.ms", "1"); +topicConfig.setProperty("retention.ms", "8000"); testUtils.createTopic(topic, 1, topicConfig); TypedProperties failOnDataLossProps = createPropsForKafkaSource(topic, null, "earliest"); @@ -261,7 +261,7 @@ abstract class BaseTestKafkaSource extends SparkClientFunctionalTestHarness { assertEquals(2, fetch1.getBatch().get().count()); // wait for the checkpoint to expire -Thread.sleep(10001); +Thread.sleep(3); Throwable t = assertThrows(HoodieDeltaStreamerException.class, () -> { kafkaSource.fetchNewDataInAvroFormat(Option.of(fetch1.getCheckpointForNextBatch()), Long.MAX_VALUE); });
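The fix above only sleeps when `partitionsFor` came back empty; the TODO in the diff (HUDI-4625) points toward a general retrying client. A generic sketch of that pattern, not the actual Hudi implementation:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Retry helper: sleep only between failed attempts, return as soon as a
// non-null result is available, and never sleep after the final attempt.
public final class RetryUtil {

  private RetryUtil() {
  }

  public static <T> T retryUntilPresent(Supplier<T> call, int maxAttempts, long delaySeconds)
      throws InterruptedException {
    T result = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      result = call.get();
      if (result != null) {
        break; // success: no pointless sleep
      }
      if (attempt < maxAttempts) {
        TimeUnit.SECONDS.sleep(delaySeconds); // back off only on failure
      }
    }
    return result;
  }
}
```

A caller would then wrap the Kafka lookup along the lines of `retryUntilPresent(() -> consumer.partitionsFor(topicName), 5, 10)`.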
[GitHub] [hudi] yihua merged pull request #6270: [HUDI-4837] Stop sleeping where it is not necessary after the success
yihua merged PR #6270: URL: https://github.com/apache/hudi/pull/6270 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua commented on pull request #6270: [HUDI-4837] Stop sleeping where it is not necessary after the success
yihua commented on PR #6270: URL: https://github.com/apache/hudi/pull/6270#issuecomment-1247585982 CI is green. https://user-images.githubusercontent.com/2497195/190318997-d8438ceb-c0c9-457b-9fbf-19c26fb01e0a.png -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6677: [HUDI-4294][Stacked on 4293] Introduce build action to actually perform index data generation
hudi-bot commented on PR #6677: URL: https://github.com/apache/hudi/pull/6677#issuecomment-1247572632 ## CI report: * 59e2196397ef68d75697a35b1b91e661ef9d3aa4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11372) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table
hudi-bot commented on PR #4676: URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247571390 ## CI report: * 5944f5cbe9ce73fe6b7e27a0d381eaeb80dead38 UNKNOWN * 4ef7b451c3dd795906f3f68571256baeb330a59f UNKNOWN * 6aeb3d0d8f09aeab2a5766cf9d25ecb30537 UNKNOWN * 94c9b7bdfd83828a9552dfeab418403a4594c649 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11368) * e3914eb7b48fc4c5e3bd6f0fd00888ac6da8fa21 UNKNOWN * 24747bb7e1f23d6db70672cab3795cb131ce8dcb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11371) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6677: [HUDI-4294][Stacked on 4293] Introduce build action to actually perform index data generation
hudi-bot commented on PR #6677: URL: https://github.com/apache/hudi/pull/6677#issuecomment-1247569658 ## CI report: * 59e2196397ef68d75697a35b1b91e661ef9d3aa4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6668: [HUDI-4839] Upgrade rocksdbjni for compatibility with Apple Silicon
hudi-bot commented on PR #6668: URL: https://github.com/apache/hudi/pull/6668#issuecomment-1247569608 ## CI report: * 2480ce4c97130601e2727ab82851c428ea7a84bf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11345) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table
hudi-bot commented on PR #4676: URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247568324 ## CI report: * 5944f5cbe9ce73fe6b7e27a0d381eaeb80dead38 UNKNOWN * 4ef7b451c3dd795906f3f68571256baeb330a59f UNKNOWN * 6aeb3d0d8f09aeab2a5766cf9d25ecb30537 UNKNOWN * 94c9b7bdfd83828a9552dfeab418403a4594c649 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11368) * e3914eb7b48fc4c5e3bd6f0fd00888ac6da8fa21 UNKNOWN * 24747bb7e1f23d6db70672cab3795cb131ce8dcb UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated (8c296e0356 -> 1f2e72e06e)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 8c296e0356 [HUDI-4691] Cleaning up duplicated classes in Spark 3.3 module (#6550) add 1f2e72e06e [HUDI-4752] Add dedup support for MOR table in cli (#6608) No new revisions were added by this update. Summary of changes: .../scala/org/apache/hudi/cli/DedupeSparkJob.scala | 4 +- .../hudi/cli/integ/ITTestRepairsCommand.java | 117 ++--- 2 files changed, 82 insertions(+), 39 deletions(-)
[GitHub] [hudi] yihua merged pull request #6608: [HUDI-4752] Add dedup support for MOR table in cli
yihua merged PR #6608: URL: https://github.com/apache/hudi/pull/6608 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6668: [HUDI-4839] Upgrade rocksdbjni for compatibility with Apple Silicon
hudi-bot commented on PR #6668: URL: https://github.com/apache/hudi/pull/6668#issuecomment-124750 ## CI report: * Unknown: [CANCELED](TBD) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5933: [HUDI-4293] Implement Create/Drop/Show/Refresh Index Command for Secondary Index
hudi-bot commented on PR #5933: URL: https://github.com/apache/hudi/pull/5933#issuecomment-1247565916 ## CI report: * 3d5de064b208083601499666d925df2ec151afd9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11353) * 65359879df848d75b6693f4c313dc9453d635edd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11370) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table
hudi-bot commented on PR #4676: URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247565279 ## CI report: * 5944f5cbe9ce73fe6b7e27a0d381eaeb80dead38 UNKNOWN * 4ef7b451c3dd795906f3f68571256baeb330a59f UNKNOWN * 6aeb3d0d8f09aeab2a5766cf9d25ecb30537 UNKNOWN * 94c9b7bdfd83828a9552dfeab418403a4594c649 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11368) * e3914eb7b48fc4c5e3bd6f0fd00888ac6da8fa21 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] Zouxxyy commented on pull request #6668: [HUDI-4839] Upgrade rocksdbjni for compatibility with Apple Silicon
Zouxxyy commented on PR #6668: URL: https://github.com/apache/hudi/pull/6668#issuecomment-1247552005 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] Zouxxyy commented on pull request #6668: [HUDI-4839] Upgrade rocksdbjni for compatibility with Apple Silicon
Zouxxyy commented on PR #6668: URL: https://github.com/apache/hudi/pull/6668#issuecomment-1247551903 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4294) Introduce build action to actually perform index data generation
[ https://issues.apache.org/jira/browse/HUDI-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shibei updated HUDI-4294: - Status: In Progress (was: Open) > Introduce build action to actually perform index data generation > > > Key: HUDI-4294 > URL: https://issues.apache.org/jira/browse/HUDI-4294 > Project: Apache Hudi > Issue Type: New Feature >Reporter: shibei >Assignee: shibei >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > > In this issue, we introduce a new action type called build to actually > perform index data generation. This action contains two steps, as the clustering > action does: > # Generate an action plan to clarify which files and which indexes need to be > built; > # Execute the index build according to the action plan generated by step one; > > A call procedure will be implemented as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4294) Introduce build action to actually perform index data generation
[ https://issues.apache.org/jira/browse/HUDI-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4294: - Labels: pull-request-available (was: ) > Introduce build action to actually perform index data generation > > > Key: HUDI-4294 > URL: https://issues.apache.org/jira/browse/HUDI-4294 > Project: Apache Hudi > Issue Type: New Feature >Reporter: shibei >Assignee: shibei >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > > In this issue, we introduce a new action type called build to actually > perform index data generation. This action contains two steps, as the clustering > action does: > # Generate an action plan to clarify which files and which indexes need to be > built; > # Execute the index build according to the action plan generated by step one; > > A call procedure will be implemented as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] huberylee opened a new pull request, #6677: [HUDI-4294][Stacked on 4293] Introduce build action to actually perform index data generation
huberylee opened a new pull request, #6677: URL: https://github.com/apache/hudi/pull/6677 ### Change Logs Introduce a new action type called build to actually perform index data generation. This action contains two steps, as the clustering action does: - Generate an action plan to clarify which files and which indexes need to be built; - Execute the index build according to the action plan generated by step one; A call procedure will be implemented as well to show or run the build action. Classes in package ``org.apache.hudi.secondary.index.lucene.hadoop`` were copied from package ``org.apache.solr.hdfs.store`` in the Apache Solr project. ### Impact Users can use ``call show_build(table=> '$table'[, path => $path], limit => $limit, show_involved_partition => [true/false])`` to list build commits, and ``call run_build(table => '$table'[, path => $path], predicate => '$predicate', show_involved_partition => [true/false])`` to trigger a new build action if conditions are satisfied. **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
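For illustration, this is how the proposed procedures could be invoked from Spark SQL once the PR lands; the table name and predicate are made-up, and the procedure signatures are taken from the description above:

```java
import org.apache.spark.sql.SparkSession;

// Hypothetical end-to-end usage of the proposed build procedures.
public class BuildActionExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("build-action-demo")
        // Hudi's SQL extension is required for `call` procedures.
        .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
        .getOrCreate();

    // Trigger a new build action for part of a hypothetical table.
    spark.sql("call run_build(table => 'hudi_trips', predicate => 'dt >= 20220915')").show();

    // List build commits, including the partitions each build touched.
    spark.sql("call show_build(table => 'hudi_trips', limit => 10, show_involved_partition => true)").show();
  }
}
```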
[GitHub] [hudi] zhengyuan-cn commented on issue #6596: [SUPPORT] with Impala 4.0 Records lost
zhengyuan-cn commented on issue #6596: URL: https://github.com/apache/hudi/issues/6596#issuecomment-1247545755 > > I replaced impala hudi dependency jar (hudi-common-0.5.0-incubating.jar, hudi-hadoop-mr-0.5.0-incubating.jar) with (hudi-common-0.12.0.jar, hudi-hadoop-mr-0.12.0.jar),issues still. > > > ENV: impala4.0+hive3.1.1 with hudi 0.11 is correct. > > @zhengyuan-cn do you mean you replaced `hudi-*-0.5.0` with `hudi-*-0.11.0` and it worked? No. In the env (Impala 4.0 + Hive 3.1.1 with Hudi 0.11) it worked, and the result is correct. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5933: [HUDI-4293] Implement Create/Drop/Show/Refresh Index Command for Secondary Index
hudi-bot commented on PR #5933: URL: https://github.com/apache/hudi/pull/5933#issuecomment-1247537120 ## CI report: * 3d5de064b208083601499666d925df2ec151afd9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11353) * 65359879df848d75b6693f4c313dc9453d635edd UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6668: [HUDI-4839] Upgrade rocksdbjni for compatibility with Apple Silicon
hudi-bot commented on PR #6668: URL: https://github.com/apache/hudi/pull/6668#issuecomment-1247534796 ## CI report: * 2480ce4c97130601e2727ab82851c428ea7a84bf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11345) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6668: [HUDI-4839] Upgrade rocksdbjni for compatibility with Apple Silicon
hudi-bot commented on PR #6668: URL: https://github.com/apache/hudi/pull/6668#issuecomment-1247532061 ## CI report: * 2480ce4c97130601e2727ab82851c428ea7a84bf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11345) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table
hudi-bot commented on PR #4676: URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247530990 ## CI report: * 5944f5cbe9ce73fe6b7e27a0d381eaeb80dead38 UNKNOWN * 4ef7b451c3dd795906f3f68571256baeb330a59f UNKNOWN * 6aeb3d0d8f09aeab2a5766cf9d25ecb30537 UNKNOWN * e5af3c2bc8310bf3d41560fed377bfdd078505be Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11362) * 94c9b7bdfd83828a9552dfeab418403a4594c649 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11368) * e3914eb7b48fc4c5e3bd6f0fd00888ac6da8fa21 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] Zouxxyy commented on pull request #6668: [HUDI-4839] Upgrade rocksdbjni for compatibility with Apple Silicon
Zouxxyy commented on PR #6668: URL: https://github.com/apache/hudi/pull/6668#issuecomment-1247530533 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] fengjian428 commented on pull request #4676: [HUDI-3304] support partial update on mor table
fengjian428 commented on PR #4676: URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247530055 > Thanks for the work, I have reviewed it and written a patch: [3304.patch.zip](https://github.com/apache/hudi/files/9571179/3304.patch.zip) Thanks @danny0405, I applied your patch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
hudi-bot commented on PR #6358: URL: https://github.com/apache/hudi/pull/6358#issuecomment-1247526048 ## CI report: * 288d166c49602a4593b1e97763a467811903737d UNKNOWN * 7d716951c917fb1e173da31798736adc172800c4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11369) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] TJX2014 commented on pull request #6634: [HUDI-4813] Fix infer keygen not work in sparksql side issue
TJX2014 commented on PR #6634: URL: https://github.com/apache/hudi/pull/6634#issuecomment-1247515534 @danny0405 Please help review; it looks OK in the last CI run. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] TJX2014 commented on pull request #6634: [HUDI-4813] Fix infer keygen not work in sparksql side issue
TJX2014 commented on PR #6634: URL: https://github.com/apache/hudi/pull/6634#issuecomment-1247514620 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] guanlisheng commented on issue #6659: [SUPPORT] query hudi table with Spark SQL on Hive return empty result
guanlisheng commented on issue #6659: URL: https://github.com/apache/hudi/issues/6659#issuecomment-1247494876 Hey @xushiyan, it is 0.9.0. After further debugging with the 0.9.0 bundle, I suspect it is related to #6007, so I am waiting for 0.11.0 on EMR 5.x and will also try a non-partitioned table. The workaround for now is to unset `spark.sql.sources.provider` with Hive SQL, as sketched below. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
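A minimal sketch of that workaround, assuming a Hive-synced table named `my_hudi_table` (the table name is illustrative). Unsetting the provider property makes Spark SQL stop treating the table as a datasource table and fall back to the Hive read path:

```sql
-- Run in Hive (e.g. beeline), not Spark SQL; the table name is a placeholder.
ALTER TABLE my_hudi_table UNSET TBLPROPERTIES ('spark.sql.sources.provider');
```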
[GitHub] [hudi] guanlisheng closed issue #6659: [SUPPORT] query hudi table with Spark SQL on Hive return empty result
guanlisheng closed issue #6659: [SUPPORT] query hudi table with Spark SQL on Hive return empty result URL: https://github.com/apache/hudi/issues/6659 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #4676: [HUDI-3304] support partial update on mor table
danny0405 commented on PR #4676: URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247491717 Thanks for the work, I have reviewed it and written a patch: [3304.patch.zip](https://github.com/apache/hudi/files/9571179/3304.patch.zip) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
hudi-bot commented on PR #6358: URL: https://github.com/apache/hudi/pull/6358#issuecomment-1247488723 ## CI report: * 8915ca346137d319276026dd7aa396a9c7bd2b29 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11128) * 288d166c49602a4593b1e97763a467811903737d UNKNOWN * 7d716951c917fb1e173da31798736adc172800c4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11369) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
hudi-bot commented on PR #6358: URL: https://github.com/apache/hudi/pull/6358#issuecomment-1247485971 ## CI report: * 8915ca346137d319276026dd7aa396a9c7bd2b29 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11128) * 288d166c49602a4593b1e97763a467811903737d UNKNOWN * 7d716951c917fb1e173da31798736adc172800c4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
hudi-bot commented on PR #6358: URL: https://github.com/apache/hudi/pull/6358#issuecomment-1247483231 ## CI report: * 8915ca346137d319276026dd7aa396a9c7bd2b29 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11128) * 288d166c49602a4593b1e97763a467811903737d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6673: [HUDI-4785] Fix partition discovery in bootstrap operation
hudi-bot commented on PR #6673: URL: https://github.com/apache/hudi/pull/6673#issuecomment-1247480519 ## CI report: * d549379aa13fdd32255ab4b47b184ae98014d44f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11366) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6672: [HUDI-4757] Create pyspark examples
hudi-bot commented on PR #6672: URL: https://github.com/apache/hudi/pull/6672#issuecomment-1247480503 ## CI report: * 25fad9af64012f22e0bb00d1a454026de0902f92 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11367) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6608: [HUDI-4752] Add dedup support for MOR table in cli
hudi-bot commented on PR #6608: URL: https://github.com/apache/hudi/pull/6608#issuecomment-1247480389 ## CI report: * 74ab4e17c851a4d7a910269b5e36e52880321ba8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11348) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
alexeykudinkin commented on code in PR #6358: URL: https://github.com/apache/hudi/pull/6358#discussion_r971445275 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -169,23 +179,107 @@ object HoodieSparkSqlWriter { } val commitActionType = CommitUtils.getCommitActionType(operation, tableConfig.getTableType) - val dropPartitionColumns = hoodieConfig.getBoolean(DataSourceWriteOptions.DROP_PARTITION_COLUMNS) + + // Register Avro classes ([[Schema]], [[GenericData]]) w/ Kryo + sparkContext.getConf.registerKryoClasses( +Array(classOf[org.apache.avro.generic.GenericData], + classOf[org.apache.avro.Schema])) + + val (structName, nameSpace) = AvroConversionUtils.getAvroRecordNameAndNamespace(tblName) + val reconcileSchema = parameters(DataSourceWriteOptions.RECONCILE_SCHEMA.key()).toBoolean + + val schemaEvolutionEnabled = parameters.getOrDefault(DataSourceReadOptions.SCHEMA_EVOLUTION_ENABLED.key(), "false").toBoolean + var internalSchemaOpt = getLatestTableInternalSchema(fs, basePath, sparkContext) + + val sourceSchema = AvroConversionUtils.convertStructTypeToAvroSchema(df.schema, structName, nameSpace) + val latestTableSchemaOpt = getLatestTableSchema(spark, basePath, tableIdentifier, sparkContext.hadoopConfiguration) + + val writerSchema: Schema = latestTableSchemaOpt match { +// In case table schema is empty we're just going to use the source schema as a +// writer's schema. No additional handling is required +case None => sourceSchema +// Otherwise, we need to make sure we reconcile incoming and latest table schemas +case Some(latestTableSchema) => + if (reconcileSchema) { +// In case we need to reconcile the schema and schema evolution is enabled, +// we will force-apply schema evolution to the writer's schema +if (schemaEvolutionEnabled && internalSchemaOpt.isEmpty) { + internalSchemaOpt = Some(AvroInternalSchemaConverter.convert(sourceSchema)) +} + +if (internalSchemaOpt.isDefined) { + // Apply schema evolution, by auto-merging write schema and read schema + val mergedInternalSchema = AvroSchemaEvolutionUtils.reconcileSchema(sourceSchema, internalSchemaOpt.get) + AvroInternalSchemaConverter.convert(mergedInternalSchema, latestTableSchema.getName) +} else if (TableSchemaResolver.isSchemaCompatible(sourceSchema, latestTableSchema)) { + // In case schema reconciliation is enabled and source and latest table schemas + // are compatible (as defined by [[TableSchemaResolver#isSchemaCompatible]]), then we + // will rebase incoming batch onto the table's latest schema (ie, reconcile them) + // + // NOTE: Since we'll be converting incoming batch from [[sourceSchema]] into [[latestTableSchema]] + // we're validating in that order (where [[sourceSchema]] is treated as a reader's schema, + // and [[latestTableSchema]] is treated as a writer's schema) + latestTableSchema +} else { + log.error( +s""" + |Failed to reconcile incoming batch schema with the table's one. 
+ |Incoming schema ${sourceSchema.toString(true)} + + |Table's schema ${latestTableSchema.toString(true)} + + |""".stripMargin) + throw new SchemaCompatibilityException("Failed to reconcile incoming schema with the table's one") +} + } else { +// Before validating whether schemas are compatible, we need to "canonicalize" source's schema +// relative to the table's one, by doing a (minor) reconciliation of the nullability constraints: +// for ex, if in incoming schema column A is designated as non-null, but it's designated as nullable +// in the table's one we want to proceed w/ such operation, simply relaxing such constraint in the +// source schema. +val canonicalizedSourceSchema = AvroSchemaEvolutionUtils.canonicalizeColumnNullability(sourceSchema, latestTableSchema) +// In case reconciliation is disabled, we have to validate that the source's schema +// is compatible w/ the table's latest schema, such that we're able to read existing table's +// records using [[sourceSchema]]. +if (TableSchemaResolver.isSchemaCompatible(latestTableSchema, canonicalizedSourceSchema)) { + canonicalizedSourceSchema +} else { + log.error( +s""" + |Incoming batch schema is not compatible with the table's one. + |Incoming schema ${canonicalizedSourceSc
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
alexeykudinkin commented on code in PR #6358: URL: https://github.com/apache/hudi/pull/6358#discussion_r971444542 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -169,23 +179,107 @@ object HoodieSparkSqlWriter { } val commitActionType = CommitUtils.getCommitActionType(operation, tableConfig.getTableType) - val dropPartitionColumns = hoodieConfig.getBoolean(DataSourceWriteOptions.DROP_PARTITION_COLUMNS) + + // Register Avro classes ([[Schema]], [[GenericData]]) w/ Kryo + sparkContext.getConf.registerKryoClasses( +Array(classOf[org.apache.avro.generic.GenericData], + classOf[org.apache.avro.Schema])) Review Comment: We always had that, this code just been moved from below to make sure we handle the schema in the same way for bulk-insert (w/ row-writing) as we do for any other operation -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
alexeykudinkin commented on code in PR #6358: URL: https://github.com/apache/hudi/pull/6358#discussion_r971444217 ## hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java: ## @@ -295,91 +295,19 @@ private MessageType convertAvroSchemaToParquet(Schema schema) { } /** - * HUDI specific validation of schema evolution. Ensures that a newer schema can be used for the dataset by - * checking if the data written using the old schema can be read using the new schema. + * Establishes whether {@code prevSchema} is compatible w/ {@code newSchema}, as + * defined by Avro's {@link SchemaCompatibility} * - * HUDI requires a Schema to be specified in HoodieWriteConfig and is used by the HoodieWriteClient to - * create the records. The schema is also saved in the data files (parquet format) and log files (avro format). - * Since a schema is required each time new data is ingested into a HUDI dataset, schema can be evolved over time. - * - * New Schema is compatible only if: - * A1. There is no change in schema - * A2. A field has been added and it has a default value specified - * - * New Schema is incompatible if: - * B1. A field has been deleted - * B2. A field has been renamed (treated as delete + add) - * B3. A field's type has changed to be incompatible with the older type - * - * Issue with org.apache.avro.SchemaCompatibility: - * org.apache.avro.SchemaCompatibility checks schema compatibility between a writer schema (which originally wrote - * the AVRO record) and a readerSchema (with which we are reading the record). It ONLY guarantees that that each - * field in the reader record can be populated from the writer record. Hence, if the reader schema is missing a - * field, it is still compatible with the writer schema. - * - * In other words, org.apache.avro.SchemaCompatibility was written to guarantee that we can read the data written - * earlier. It does not guarantee schema evolution for HUDI (B1 above). - * - * Implementation: This function implements specific HUDI specific checks (listed below) and defers the remaining - * checks to the org.apache.avro.SchemaCompatibility code. - * - * Checks: - * C1. If there is no change in schema: success - * C2. If a field has been deleted in new schema: failure - * C3. If a field has been added in new schema: it should have default value specified - * C4. If a field has been renamed(treated as delete + add): failure - * C5. If a field type has changed: failure - * - * @param oldSchema Older schema to check. - * @param newSchema Newer schema to check. 
- * @return True if the schema validation is successful - * - * TODO revisit this method: it's implemented incorrectly as it might be applying different criteria - * to top-level record and nested record (for ex, if that nested record is contained w/in an array) + * @param prevSchema previous instance of the schema + * @param newSchema new instance of the schema */ - public static boolean isSchemaCompatible(Schema oldSchema, Schema newSchema) { -if (oldSchema.getType() == newSchema.getType() && newSchema.getType() == Schema.Type.RECORD) { - // record names must match: - if (!SchemaCompatibility.schemaNameEquals(newSchema, oldSchema)) { -return false; - } - - // Check that each field in the oldSchema can populated the newSchema - for (final Field oldSchemaField : oldSchema.getFields()) { -final Field newSchemaField = SchemaCompatibility.lookupWriterField(newSchema, oldSchemaField); -if (newSchemaField == null) { - // C4 or C2: newSchema does not correspond to any field in the oldSchema - return false; -} else { - if (!isSchemaCompatible(oldSchemaField.schema(), newSchemaField.schema())) { -// C5: The fields do not have a compatible type -return false; - } -} - } - - // Check that new fields added in newSchema have default values as they will not be - // present in oldSchema and hence cannot be populated on reading records from existing data. - for (final Field newSchemaField : newSchema.getFields()) { -final Field oldSchemaField = SchemaCompatibility.lookupWriterField(oldSchema, newSchemaField); -if (oldSchemaField == null) { - if (newSchemaField.defaultVal() == null) { -// C3: newly added field in newSchema does not have a default value -return false; - } -} - } - - // All fields in the newSchema record can be populated from the oldSchema record - return true; -} else { - // Use the checks implemented by Avro - // newSchema is the schema which will be used to read the records written earlier using oldSchema. Hence, in the - // check below, use newSchema as the reader schema and oldSchema as the wr
[GitHub] [hudi] xushiyan commented on issue #6610: [QUESTION] Faster approach for backfilling older data without stopping realtime pipelines
xushiyan commented on issue #6610: URL: https://github.com/apache/hudi/issues/6610#issuecomment-1247472941 Is this a Spark Streaming job you're running? Does it scale accordingly when the backfill traffic spikes up? The OOM also hints that you may need to tune the Spark configs properly, like Spark memory and `spark.memory.storageFraction`, to give more execution memory (see the sketch below). It looks like the order of records does not matter here since you pump them into the same topic, so why not start a batch job just for the backfill? That's how backfill jobs are usually run. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
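A hedged illustration of that tuning direction (the values are assumptions to be sized against the real workload; lowering `spark.memory.storageFraction` leaves more of the unified memory region for execution):

```
# Illustrative values only; tune against the actual heap profile.
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.memory.fraction=0.6 \
  --conf spark.memory.storageFraction=0.3 \
  ...
```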
[jira] [Updated] (HUDI-4453) Support partition pruning for tables Bootstrapped from Source Hive Style partitioned tables
[ https://issues.apache.org/jira/browse/HUDI-4453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4453: - Labels: pull-request-available (was: ) > Support partition pruning for tables Bootstrapped from Source Hive Style > partitioned tables > --- > > Key: HUDI-4453 > URL: https://issues.apache.org/jira/browse/HUDI-4453 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Udit Mehrotra >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 0.13.0 > > > As of now the *Bootstrap* feature determines the source schema by reading it > from the source parquet files => > [https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/bootstrap/ParquetBootstrapMetadataHandler.java#L61] > This does not consider parquet tables which might be Hive style partitioned. > Thus, from the source schema partition columns would be missed and not > written to the target Hudi table either. Also because of this partition > pruning does not work, as we are unable to prune out source partitions. We > should improve this logic to determine partition schema correctly from the > partition paths in case of hive style partitioned tables and write the > partition column values correctly in the target Hudi table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] yihua opened a new pull request, #6676: [HUDI-4453] Fix schema to include partition columns in bootstrap operation
yihua opened a new pull request, #6676: URL: https://github.com/apache/hudi/pull/6676 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan closed issue #6579: [SUPPORT] How to participate in HUDI code contribution
xushiyan closed issue #6579: [SUPPORT] How to participate in HUDI code contribution URL: https://github.com/apache/hudi/issues/6579 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on issue #6579: [SUPPORT] How to participate in HUDI code contribution
xushiyan commented on issue #6579: URL: https://github.com/apache/hudi/issues/6579#issuecomment-1247465847 Hi @azhsmesos, thanks for your interest in contributing! Please check out https://hudi.apache.org/docs/quick-start-guide for quick start examples (both Spark and Flink) and many more guides under https://hudi.apache.org/docs/overview There is a `new-to-hudi` label you can [filter on from jira](https://issues.apache.org/jira/browse/HUDI-4752?jql=project%20%3D%20HUDI%20and%20labels%20%3D%20new-to-hudi%20and%20statusCategory%20%20!%3D%20done). It's not fully up to date, though, as we have not deliberately gone through all issues to add this label, but it can be a place to start. I'd also suggest going through https://hudi.apache.org/contribute/how-to-contribute and the other related pages. Please provide feedback if you have further questions on these guides. cc @bhasudha -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on issue #6596: [SUPPORT] with Impala 4.0 Records lost
xushiyan commented on issue #6596: URL: https://github.com/apache/hudi/issues/6596#issuecomment-1247459150 > I replaced impala hudi dependency jar (hudi-common-0.5.0-incubating.jar, hudi-hadoop-mr-0.5.0-incubating.jar) with (hudi-common-0.12.0.jar, hudi-hadoop-mr-0.12.0.jar),issues still. > ENV: impala4.0+hive3.1.1 with hudi 0.11 is correct. @zhengyuan-cn do you mean you replaced `hudi-*-0.5.0` with `hudi-*-0.11.0` and it worked? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on issue #6618: Caused by: org.apache.http.NoHttpResponseException: xxxxxx:34812 failed to respond[SUPPORT]
xushiyan commented on issue #6618: URL: https://github.com/apache/hudi/issues/6618#issuecomment-1247453345 @Aload can you verify whether the patch is included in your version of Hudi, and whether you are still having the problem? > I have encountered this problem, this PR may solve your problem: #6393 In order to help diagnose, we need more info to reproduce it, like configs and a code snippet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on issue #6626: [SUPPORT] HUDI merge into via spark sql not working
xushiyan commented on issue #6626: URL: https://github.com/apache/hudi/issues/6626#issuecomment-1247449888 @arunb2w noticed that you're on Hudi 0.10. Would you also verify whether 0.12 has the same behavior? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on issue #6644: Hudi Multi Writer DynamoDBBasedLocking issue
xushiyan commented on issue #6644: URL: https://github.com/apache/hudi/issues/6644#issuecomment-1247444513 > Is it mandatory to set AWS_ACCESS_KEY,AWS_SECRET_KEY ? No, you should not need to. In an AWS environment you'll just rely on whatever roles let your service access another service; a config sketch is shown below. Please raise a support case with AWS and get help configuring the roles properly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
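A hedged sketch of the DynamoDB lock provider configs under that role-based setup (the table name, partition key, and region values are illustrative placeholders; the point is that no static key configs are needed when the default AWS credentials chain can resolve the attached role):

```
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.write.lock.dynamodb.table=hudi-locks
hoodie.write.lock.dynamodb.partition_key=my_table
hoodie.write.lock.dynamodb.region=us-east-1
# No hoodie.aws.access.key / hoodie.aws.secret.key set here: credentials
# are resolved from the instance profile / execution role.
```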
[GitHub] [hudi] Xiaohan-Shen closed issue #6653: [SUPPORT] Hudi table COW taking up significant space for a small table
Xiaohan-Shen closed issue #6653: [SUPPORT] Hudi table COW taking up significant space for a small table URL: https://github.com/apache/hudi/issues/6653 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] Xiaohan-Shen commented on issue #6653: [SUPPORT] Hudi table COW taking up significant space for a small table
Xiaohan-Shen commented on issue #6653: URL: https://github.com/apache/hudi/issues/6653#issuecomment-1247442099 I just figured out the problem: I used the primary key field for partitioning, so it was creating one partition for every row. My bad. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
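For reference, a hedged sketch of the setup that avoids this pitfall (field and table names are illustrative, not the reporter's actual schema): the record key should be the unique id, while the partition path should be a coarse column so each partition holds many rows:

```scala
// Illustrative sketch only; field/table names and path are placeholders.
val basePath = "s3://my-bucket/path/to/table"

df.write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "uuid").      // unique per row
  option("hoodie.datasource.write.partitionpath.field", "date").  // coarse: many rows per value
  option("hoodie.table.name", "my_table").
  mode("append").
  save(basePath)
```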
[GitHub] [hudi] xushiyan commented on issue #6653: [SUPPORT] Hudi table COW taking up significant space for a small table
xushiyan commented on issue #6653: URL: https://github.com/apache/hudi/issues/6653#issuecomment-1247438706 Likely a lot of small files were created. @Xiaohan-Shen how many files were created in S3, and what do the file sizes look like? (One quick way to check is sketched below.) cc @zhangyue19921010 @yihua this could be a good data point for the diagnostic reporter to capture, i.e. what the file size distribution looks like, to help diagnose parquet size setting issues for example. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
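A simple way to gather those numbers from the AWS CLI (bucket and table path are placeholders):

```
# Lists every object under the table path, then prints object count and total size.
aws s3 ls s3://my-bucket/path/to/table/ --recursive --summarize --human-readable
```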
[GitHub] [hudi] xushiyan commented on issue #6655: [SUPPORT] tryComposeIndexFilterExpr in dataskip util could support InSet expression of spark?
xushiyan commented on issue #6655: URL: https://github.com/apache/hudi/issues/6655#issuecomment-1247434901 @alexeykudinkin can you take a look pls? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-3780) improve drop partitions
[ https://issues.apache.org/jira/browse/HUDI-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan closed HUDI-3780. - Resolution: Fixed > improve drop partitions > --- > > Key: HUDI-3780 > URL: https://issues.apache.org/jira/browse/HUDI-3780 > Project: Apache Hudi > Issue Type: Improvement > Components: spark-sql >Reporter: Forward Xu >Assignee: sivabalan narayanan >Priority: Critical > Labels: pull-request-available > Fix For: 0.12.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance
boneanxs commented on code in PR #6046: URL: https://github.com/apache/hudi/pull/6046#discussion_r971405570 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java: ## @@ -273,6 +330,60 @@ private HoodieData> readRecordsForGroupBaseFiles(JavaSparkContex .map(record -> transform(record, writeConfig))); } + /** + * Get dataset of all records for the group. This includes all records from file slice (Apply updates from log files, if any). + */ + private Dataset readRecordsForGroupAsRow(JavaSparkContext jsc, + HoodieClusteringGroup clusteringGroup, + String instantTime) { +List clusteringOps = clusteringGroup.getSlices().stream() +.map(ClusteringOperation::create).collect(Collectors.toList()); +boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0); +SQLContext sqlContext = new SQLContext(jsc.sc()); + +String[] baseFilePaths = clusteringOps +.stream() +.map(op -> { + ArrayList readPaths = new ArrayList<>(); + if (op.getBootstrapFilePath() != null) { +readPaths.add(op.getBootstrapFilePath()); + } + if (op.getDataFilePath() != null) { +readPaths.add(op.getDataFilePath()); + } + return readPaths; +}) +.flatMap(Collection::stream) +.filter(path -> !path.isEmpty()) +.toArray(String[]::new); +String[] deltaPaths = clusteringOps +.stream() +.filter(op -> !op.getDeltaFilePaths().isEmpty()) +.flatMap(op -> op.getDeltaFilePaths().stream()) +.toArray(String[]::new); + +Dataset inputRecords; +if (hasLogFiles) { + String compactionFractor = Option.ofNullable(getWriteConfig().getString("compaction.memory.fraction")) + .orElse("0.75"); + String[] paths = CollectionUtils.combine(baseFilePaths, deltaPaths); + inputRecords = sqlContext.read() Review Comment: Good idea! I'll give it a try. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-3953) Flink Hudi module should support low-level read and write APIs
[ https://issues.apache.org/jira/browse/HUDI-3953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17605011#comment-17605011 ] Kenneth William Krugler commented on HUDI-3953: --- I was initially wondering why Hudi didn't have a regular Flink sink. But after having implemented code to write Pinot segments, I can see advantages to having control over partitioning, which isn't possible at the sink level. > Flink Hudi module should support low-level read and write APIs > --- > > Key: HUDI-3953 > URL: https://issues.apache.org/jira/browse/HUDI-3953 > Project: Apache Hudi > Issue Type: Improvement >Reporter: yuemeng >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > > Currently. Flink Hudi Module only supports SQL APIs. People who want to use > low-level APIs such used for operating Flink state or another purpose don't > have a friendly way. > It can be provided a low-level APIs for users to write/read hoodie data > The API design and main change will be: > # add sink and source API in Pipelines > # getSinkRuntimeProvider in HoodieTableSink call Pipelines.sink(...) to > return DataStreamSink > # getScanRuntimeProvider in HoodieTableSource call Pipelines.source() to > return DataStream > # move some common methods such as getInputFormat in util class > # low-level API such as read and write just call Pipelines.sink(...) and > Pipelines.source() -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi
alexeykudinkin commented on code in PR #6476: URL: https://github.com/apache/hudi/pull/6476#discussion_r971364028 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java: ## @@ -0,0 +1,229 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.io; + +import org.apache.avro.generic.GenericData; +import org.apache.avro.generic.GenericRecord; +import org.apache.avro.generic.IndexedRecord; + +import org.apache.hudi.avro.HoodieAvroUtils; +import org.apache.hudi.avro.SerializableRecord; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordPayload; +import org.apache.hudi.common.table.HoodieTableConfig; +import org.apache.hudi.common.table.cdc.HoodieCDCOperation; +import org.apache.hudi.common.table.cdc.HoodieCDCUtils; +import org.apache.hudi.common.table.log.AppendResult; +import org.apache.hudi.common.table.log.HoodieLogFormat; +import org.apache.hudi.common.table.log.block.HoodieCDCDataBlock; +import org.apache.hudi.common.table.log.block.HoodieLogBlock; +import org.apache.hudi.common.util.DefaultSizeEstimator; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.StringUtils; +import org.apache.hudi.common.util.collection.ExternalSpillableMap; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.exception.HoodieUpsertException; + +import java.io.Closeable; +import java.io.IOException; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.concurrent.atomic.AtomicLong; +import java.util.function.Function; +import java.util.stream.Collectors; + +public class HoodieCDCLogger implements Closeable { + + private final String partitionPath; + + private final String fileName; + + private final String commitTime; + + private final List keyFields; + + private final int taskPartitionId; + + private final boolean populateMetaFields; + + // writer for cdc data + private final HoodieLogFormat.Writer cdcWriter; + + private final boolean cdcEnabled; + + private final String cdcSupplementalLoggingMode; + + // the cdc data + private final Map cdcData; + + private final Function rewriteRecordFunc; + + // the count of records currently being written, used to generate the same seqno for the cdc data + private final AtomicLong writtenRecordCount = new AtomicLong(-1); + + public HoodieCDCLogger( + String partitionPath, + String fileName, + String commitTime, + HoodieWriteConfig config, + List keyFields, + int taskPartitionId, + HoodieLogFormat.Writer cdcWriter, + long maxInMemorySizeInBytes, + Function rewriteRecordFunc) { +try { + 
this.partitionPath = partitionPath; + this.fileName = fileName; + this.commitTime = commitTime; + this.keyFields = keyFields; + this.taskPartitionId = taskPartitionId; + this.populateMetaFields = config.populateMetaFields(); + this.cdcWriter = cdcWriter; + this.rewriteRecordFunc = rewriteRecordFunc; + + this.cdcEnabled = config.getBooleanOrDefault(HoodieTableConfig.CDC_ENABLED); + this.cdcSupplementalLoggingMode = config.getStringOrDefault(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE); + this.cdcData = new ExternalSpillableMap<>( + maxInMemorySizeInBytes, + config.getSpillableMapBasePath(), + new DefaultSizeEstimator<>(), + new DefaultSizeEstimator<>(), + config.getCommonConfig().getSpillableDiskMapType(), + config.getCommonConfig().isBitCaskDiskMapCompressionEnabled() + ); +} catch (IOException e) { + throw new HoodieUpsertException("Failed to initialize HoodieCDCLogger", e); +} + } + + public void put(HoodieRecord hoodieRecord, GenericRecord oldRecord, Option indexedRecord) { +if (cdcEnabled) { + String recordKey; + if (oldRecord == null) { +recordKey = hoodieRecord.getRecordKey(); + } else { +recordKey = StringUtils.join( Review Comment: Please check my previous commen
[GitHub] [hudi] xushiyan commented on issue #6659: [SUPPORT] query hudi table with Spark SQL on Hive return empty result
xushiyan commented on issue #6659: URL: https://github.com/apache/hudi/issues/6659#issuecomment-1247408240 From what you listed, `org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4`, this is Hudi 0.5.3. Can you confirm which Hudi version you have the problem with? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi
hudi-bot commented on PR #6476: URL: https://github.com/apache/hudi/pull/6476#issuecomment-1247402607 ## CI report: * 34088aeee92daffe28ef3a17c04bb8e000f233e7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11363) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi
alexeykudinkin commented on code in PR #6476: URL: https://github.com/apache/hudi/pull/6476#discussion_r971363419 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java: ## @@ -273,6 +283,33 @@ protected HoodieFileWriter createNewFileWriter(String instantTime, Path path, Ho return HoodieFileWriterFactory.getFileWriter(instantTime, path, hoodieTable, config, schema, taskContextSupplier); } + protected HoodieLogFormat.Writer createLogWriter( + Option fileSlice, String baseCommitTime) throws IOException { +int logVersion = HoodieLogFile.LOGFILE_BASE_VERSION; +long logFileSize = 0L; +String logWriteToken = writeToken; +if (fileSlice.isPresent()) { + Option latestLogFileOpt = fileSlice.get().getLatestLogFile(); + if (latestLogFileOpt.isPresent()) { +HoodieLogFile latestLogFile = latestLogFileOpt.get(); +logVersion = latestLogFile.getLogVersion(); +logFileSize = latestLogFile.getFileSize(); +logWriteToken = FSUtils.getWriteTokenFromLogPath(latestLogFile.getPath()); + } +} +return HoodieLogFormat.newWriterBuilder() + .onParentPath(FSUtils.getPartitionPath(hoodieTable.getMetaClient().getBasePath(), partitionPath)) +.withFileId(fileId) +.overBaseCommit(baseCommitTime) +.withLogVersion(logVersion) +.withFileSize(logFileSize) +.withSizeThreshold(config.getLogFileMaxSize()) +.withFs(fs) +.withRolloverLogWriteToken(writeToken) +.withLogWriteToken(logWriteToken) +.withFileExtension(HoodieLogFile.DELTA_EXTENSION).build(); Review Comment: So one of the pre-requisites of the CDC is: - When we're issuing normal Data query (and not a CDC one), there should be **no performance impact** to it Moreover, we should clearly disambiguate the CDC infra from the Data infra w/o the need to even fetch the first block of the file (we can still use the same Log format, but we should definitely create separate naming scheme for CDC Log files to not mix these up w/ the Data Delta Log files) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi
alexeykudinkin commented on code in PR #6476: URL: https://github.com/apache/hudi/pull/6476#discussion_r971358810 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java: ## @@ -399,9 +453,65 @@ protected void writeIncomingRecords() throws IOException { } } + protected SerializableRecord createCDCRecord(HoodieCDCOperation operation, String recordKey, String partitionPath, + GenericRecord oldRecord, GenericRecord newRecord) { +GenericData.Record record; +if (cdcSupplementalLoggingMode.equals(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE_WITH_BEFORE_AFTER)) { + record = HoodieCDCUtils.cdcRecord(operation.getValue(), instantTime, + oldRecord, addCommitMetadata(newRecord, recordKey, partitionPath)); +} else if (cdcSupplementalLoggingMode.equals(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE_WITH_BEFORE)) { + record = HoodieCDCUtils.cdcRecord(operation.getValue(), recordKey, oldRecord); +} else { + record = HoodieCDCUtils.cdcRecord(operation.getValue(), recordKey); +} +return new SerializableRecord(record); + } + + protected GenericRecord addCommitMetadata(GenericRecord record, String recordKey, String partitionPath) { Review Comment: Meta fields carry purely semantical information related to their _persistence_ by Hudi. These aren't the part of the record's payload and we shouldn't be carrying them w/in CDC payload. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi
alexeykudinkin commented on code in PR #6476: URL: https://github.com/apache/hudi/pull/6476#discussion_r971357775 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java: ## @@ -102,6 +118,15 @@ protected Map> keyToNewRecords; protected Set writtenRecordKeys; protected HoodieFileWriter fileWriter; + // a flag that indicate whether allow the change data to write out a cdc log file. + protected boolean cdcEnabled = false; + protected String cdcSupplementalLoggingMode; Review Comment: Was reviewing this before I caught up w/ an updated version of the RFC, so I got confused. Yeah, let's use an enum for this one. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on issue #6341: [SUPPORT] Hudi delete not working via spark apis
nsivabalan commented on issue #6341: URL: https://github.com/apache/hudi/issues/6341#issuecomment-1247373856 sure. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on issue #6101: [SUPPORT] Hudi Delete Not working with EMR, AWS Glue & S3
nsivabalan commented on issue #6101: URL: https://github.com/apache/hudi/issues/6101#issuecomment-1247372251 I assume you are referring to delete_partitions, right? How are you triggering delete_partition: are you passing in a regular dataframe as you would for other write operations, or are you setting the config https://hudi.apache.org/docs/configurations#hoodiedatasourcewritepartitionstodelete ? With that config you can set a comma-separated list of partition values that need to be deleted (see the sketch below). I might need to reproduce your exact scenario and go from there; in the meantime, if you have a reproducible script, let me know. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
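A minimal sketch of the config-driven path (base path, table name, and partition values are assumed placeholders; the table's usual key/precombine options would be carried over as well):

```scala
// Illustrative sketch: delete two partitions by value via the Spark datasource.
spark.emptyDataFrame.write.format("hudi").
  option("hoodie.datasource.write.operation", "delete_partition").
  option("hoodie.datasource.write.partitions.to.delete", "2022/01/01,2022/01/02").
  option("hoodie.table.name", "my_table").
  mode("append").
  save("s3://my-bucket/path/to/table")
```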
[GitHub] [hudi] bhasudha commented on pull request #6674: [DOCS] Standardize blog images sizes
bhasudha commented on PR #6674: URL: https://github.com/apache/hudi/pull/6674#issuecomment-1247371882 > Done. @yihua redrawing the entire image might be a bigger effort. I changed the images so their aspect ratio does not change and they are closer to 1200×600 in either width or height. Please take a look and flag any image that comes out odd. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on issue #6463: [SUPPORT]Caused by: java.lang.IllegalArgumentException at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)
nsivabalan commented on issue #6463: URL: https://github.com/apache/hudi/issues/6463#issuecomment-1247370168 Let us know if you are still looking for any assistance; if not, we can go ahead and close out the issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on issue #6463: [SUPPORT]Caused by: java.lang.IllegalArgumentException at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)
nsivabalan commented on issue #6463: URL: https://github.com/apache/hudi/issues/6463#issuecomment-1247369694 Common configs required for any lock provider:
```
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=
```
Configs for the ZooKeeper-based lock:
```
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url
hoodie.write.lock.zookeeper.port
hoodie.write.lock.zookeeper.lock_key
hoodie.write.lock.zookeeper.base_path
```
Configs for the Hive-metastore-based lock:
```
hoodie.write.lock.provider=org.apache.hudi.hive.HiveMetastoreBasedLockProvider
hoodie.write.lock.hivemetastore.database
hoodie.write.lock.hivemetastore.table
```
Configs for the DynamoDB-based lock:
```
hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.write.lock.dynamodb.table
hoodie.write.lock.dynamodb.partition_key
hoodie.write.lock.dynamodb.region
hoodie.write.lock.dynamodb.endpoint_url
hoodie.write.lock.dynamodb.billing_mode
```
Plus the AWS credential configs:
```
hoodie.aws.access.key
hoodie.aws.secret.key
hoodie.aws.session.token
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6672: [HUDI-4757] Create pyspark examples
hudi-bot commented on PR #6672: URL: https://github.com/apache/hudi/pull/6672#issuecomment-1247369500 ## CI report: * 7864afbc773d4dde0fca7fad439d2da39cfa8c78 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11364) * 25fad9af64012f22e0bb00d1a454026de0902f92 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11367) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on issue #6463: [SUPPORT]Caused by: java.lang.IllegalArgumentException at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)
nsivabalan commented on issue #6463: URL: https://github.com/apache/hudi/issues/6463#issuecomment-1247368648 Yes, you can find the configs that need to be set for the ZooKeeper-based lock or the Hive-metastore-based lock here: https://hudi.apache.org/docs/concurrency_control We also have a DynamoDB-based lock if you are interested. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on issue #6591: [SUPPORT]Duplicate records in MOR
nsivabalan commented on issue #6591: URL: https://github.com/apache/hudi/issues/6591#issuecomment-1247367471 Yes, I get it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table
hudi-bot commented on PR #4676: URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247360153 ## CI report: * 5944f5cbe9ce73fe6b7e27a0d381eaeb80dead38 UNKNOWN * 4ef7b451c3dd795906f3f68571256baeb330a59f UNKNOWN * 6aeb3d0d8f09aeab2a5766cf9d25ecb30537 UNKNOWN * e5af3c2bc8310bf3d41560fed377bfdd078505be Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11362) * 94c9b7bdfd83828a9552dfeab418403a4594c649 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11368) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6516: [HUDI-4729] Fix fq can not be queried in pending compaction when query ro table with spark
hudi-bot commented on PR #6516: URL: https://github.com/apache/hudi/pull/6516#issuecomment-1247357228 ## CI report: * 8b06e2b181eb0d913a3d9a465e06082cd040bfec Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11361) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4759) Fix website Quick start guide to add validations
[ https://issues.apache.org/jira/browse/HUDI-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4759: - Labels: pull-request-available (was: ) > Fix website Quick start guide to add validations > > > Key: HUDI-4759 > URL: https://issues.apache.org/jira/browse/HUDI-4759 > Project: Apache Hudi > Issue Type: Bug >Reporter: sivabalan narayanan >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 0.12.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] jonvex opened a new pull request, #6675: [HUDI-4759] added validations and some pyspark edits to the quick start guide
jonvex opened a new pull request, #6675: URL: https://github.com/apache/hudi/pull/6675 ### Change Logs Added pyspark and scala validations to the quickstart. Added pyspark insert overwrite example. Fixed some errors with the existing pyspark examples ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none ** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table
hudi-bot commented on PR #4676: URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247355671

## CI report:

* 5944f5cbe9ce73fe6b7e27a0d381eaeb80dead38 UNKNOWN
* 4ef7b451c3dd795906f3f68571256baeb330a59f UNKNOWN
* 6aeb3d0d8f09aeab2a5766cf9d25ecb30537 UNKNOWN
* b0c4d706cad14fba7cd31f3f22090f3867fbd2a7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11360)
* e5af3c2bc8310bf3d41560fed377bfdd078505be Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11362)
* 94c9b7bdfd83828a9552dfeab418403a4594c649 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11368)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua commented on pull request #6674: [DOCS] Standardize blog images sizes
yihua commented on PR #6674: URL: https://github.com/apache/hudi/pull/6674#issuecomment-1247343367

> I definitely agree. But a question or a thought is should it be okay to decouple that and fix individual images in a second PR. I want to do the bulk change quickly and go from there. But open to ideas. What do you think?

Given this is affecting how the website is visualized, let's have the changes to make images look better in one PR.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] bhasudha commented on pull request #6674: [DOCS] Standardize blog images sizes
bhasudha commented on PR #6674: URL: https://github.com/apache/hudi/pull/6674#issuecomment-1247338781

> I definitely agree. But a question or a thought is should it be okay to decouple that and fix individual images in a second PR. I want to do the bulk change quickly and go from there. But open to ideas. What do you think?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-1275) Incremental Timeline Syncing causes compaction to fail with FileNotFound exception
[ https://issues.apache.org/jira/browse/HUDI-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin closed HUDI-1275.
---------------------------------
    Resolution: Incomplete

> Incremental Timeline Syncing causes compaction to fail with FileNotFound exception
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-1275
>                 URL: https://issues.apache.org/jira/browse/HUDI-1275
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: writer-core
>    Affects Versions: 0.9.0
>            Reporter: Balaji Varadarajan
>            Assignee: Alexey Kudinkin
>            Priority: Critical
>             Fix For: 0.12.1
>
> Context: [https://github.com/apache/hudi/issues/2020]

```
20/08/25 07:17:13 WARN TaskSetManager: Lost task 3.0 in stage 41.0 (TID 2540, ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal, executor 1): org.apache.hudi.exception.HoodieException: java.io.FileNotFoundException: No such file or directory 's3://myBucket/absolute_path_to/daas_date=2020/56be5da5-f5f3-4675-8dec-433f3656f839-0_3-816-50630_20200825065331.parquet'
	at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:207)
	at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:190)
	at org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.compact(HoodieMergeOnReadTableCompactor.java:139)
	at org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.lambda$compact$644ebad7$1(HoodieMergeOnReadTableCompactor.java:98)
	at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1040)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1182)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: No such file or directory 's3://myBucket/absolute_path_to/daas_date=2020/56be5da5-f5f3-4675-8dec-433f3656f839-0_3-816-50630_20200825065331.parquet'
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:617)
	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:553)
	at org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:300)
	at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:202)
	... 26 more
```

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-1275) Incremental Timeline Syncing causes compaction to fail with FileNotFound exception
[ https://issues.apache.org/jira/browse/HUDI-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17604989#comment-17604989 ]

Alexey Kudinkin commented on HUDI-1275:
---------------------------------------

It seems the original issue was filed against Hudi 0.5.3. At this point I don't think we have captured enough context to even try to reproduce this issue, so unfortunately we will have to close it without resolution.

> Incremental Timeline Syncing causes compaction to fail with FileNotFound exception
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-1275
>                 URL: https://issues.apache.org/jira/browse/HUDI-1275
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: writer-core
>    Affects Versions: 0.9.0
>            Reporter: Balaji Varadarajan
>            Assignee: Alexey Kudinkin
>            Priority: Critical
>             Fix For: 0.12.1
>
> Context: [https://github.com/apache/hudi/issues/2020]

-- This message was sent by Atlassian Jira (v8.20.10#820010)
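For anyone probing similar symptoms, the incremental file-system-view sync that the ticket title refers to is controlled by a writer-side config in recent Hudi releases; the sketch below shows it being disabled as a diagnostic step. This is a hypothetical mitigation, not the resolution of this ticket, the table name and path are placeholders, and the config key should be verified against your Hudi release.

```python
# Hypothetical diagnostic sketch: disable incremental timeline sync on the
# writer so the file-system view is rebuilt from the full timeline. This is
# a way to probe issues like the above, not a confirmed fix for HUDI-1275.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-timeline-sync-sketch").getOrCreate()
df = spark.createDataFrame([("k1", "2020", 1.0)], ["key", "daas_date", "value"])

hudi_options = {
    "hoodie.table.name": "my_table",                              # placeholder
    "hoodie.datasource.write.recordkey.field": "key",
    "hoodie.datasource.write.precombine.field": "value",
    "hoodie.datasource.write.partitionpath.field": "daas_date",
    # Incremental timeline sync of the writer's file-system view:
    "hoodie.filesystem.view.incr.timeline.sync.enable": "false",
}

df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/my_table")
```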
[GitHub] [hudi] bhasudha opened a new pull request, #6674: [DOCS] Standardize blog images sizes
bhasudha opened a new pull request, #6674: URL: https://github.com/apache/hudi/pull/6674

### Change Logs

Changed the image size to a standard size of 1200 × 600 for most images, for better rendering of the blogs landing page.

### Impact

_Describe any public API or user-facing feature change or any performance impact._

**Risk level: none | low | medium | high**

_Choose one. If medium or high, explain what verification was done to mitigate the risks._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
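As an illustration only (the PR itself edits the image assets directly), a bulk resize to the 1200 × 600 target could be scripted along these lines; the directory path and file glob are assumptions, not the repository's actual layout.

```python
# Illustrative sketch, not part of the PR: bulk-resize blog images to the
# 1200x600 standard described above using Pillow. Note this naive resize
# ignores aspect ratio, so some images would be distorted.
from pathlib import Path
from PIL import Image

ASSETS_DIR = Path("website/static/assets/images/blog")  # hypothetical layout
TARGET = (1200, 600)

for path in sorted(ASSETS_DIR.glob("*.png")):  # hypothetical glob
    with Image.open(path) as img:
        if img.size != TARGET:
            img.resize(TARGET, Image.LANCZOS).save(path)
```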
[jira] [Updated] (HUDI-3915) Error upserting bucketType UPDATE for partition :0
[ https://issues.apache.org/jira/browse/HUDI-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-3915:
----------------------------------
    Status: Open  (was: In Progress)

> Error upserting bucketType UPDATE for partition :0
> --------------------------------------------------
>
>                 Key: HUDI-3915
>                 URL: https://issues.apache.org/jira/browse/HUDI-3915
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: deltastreamer
>            Reporter: Neetu Gupta
>            Assignee: Alexey Kudinkin
>            Priority: Critical
>             Fix For: 0.12.1
>
> I updated the Hudi partition columns from 'year,month' to 'year', then ran the process in overwrite mode. The process executed successfully and the Hudi table was created.
> However, when the process was triggered in append mode, I started getting the error below:
>
> Task 0 in stage 32.0 failed 4 times; aborting job java.lang.Exception: Job aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most recent failure: Lost task 0.3 in stage 32.0 (TID 1207, ip-10-73-110-184.ec2.internal, executor 6): org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0 at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:305)
>
> I then reverted the partition columns back to 'year,month' but still got the same error. However, when writing the data to a different folder in append mode, the script ran fine and I could see the Hudi table.
> In short, the process does not work when I try to append data to the same path. Can you please look into this? This is critical to us because the jobs are stuck.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
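To make the reported sequence concrete, the write pattern described above (a full overwrite that recreates the table, followed by an append upsert to the same base path) looks roughly like this in PySpark; the table name, path, schema, and record key are hypothetical stand-ins for the reporter's job.

```python
# Hypothetical sketch of the reported write sequence. All names below are
# placeholders; only the overwrite-then-append pattern mirrors the report.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-3915-sketch").getOrCreate()
df = spark.createDataFrame([("id-1", "2022", "04", 1.0)],
                           ["id", "year", "month", "value"])

hudi_options = {
    "hoodie.table.name": "my_table",                        # placeholder
    "hoodie.datasource.write.recordkey.field": "id",        # placeholder
    "hoodie.datasource.write.precombine.field": "value",    # placeholder
    "hoodie.datasource.write.partitionpath.field": "year",  # changed from "year,month"
    "hoodie.datasource.write.operation": "upsert",
}
base_path = "/tmp/hudi_3915_table"  # placeholder for the reporter's path

# First run: overwrite mode recreates the table with the new partitioning.
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(base_path)

# Later runs: append mode upserts into the same path -- the step where the
# HoodieUpsertException above was reported.
df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```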
[jira] [Updated] (HUDI-3915) Error upserting bucketType UPDATE for partition :0
[ https://issues.apache.org/jira/browse/HUDI-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-3915:
----------------------------------
    Sprint: 2022/09/19

> Error upserting bucketType UPDATE for partition :0

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3915) Error upserting bucketType UPDATE for partition :0
[ https://issues.apache.org/jira/browse/HUDI-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-3915:
----------------------------------
    Sprint:   (was: 2022/09/05)

> Error upserting bucketType UPDATE for partition :0

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-3915) Error upserting bucketType UPDATE for partition :0
[ https://issues.apache.org/jira/browse/HUDI-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17604987#comment-17604987 ]

Alexey Kudinkin commented on HUDI-3915:
---------------------------------------

[~ngupta2206] can you please provide the full stack trace? Also, which Spark and Hudi versions are you using?

> Error upserting bucketType UPDATE for partition :0

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3915) Error upserting bucketType UPDATE for partition :0
[ https://issues.apache.org/jira/browse/HUDI-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-3915:
----------------------------------
    Status: In Progress  (was: Open)

> Error upserting bucketType UPDATE for partition :0

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4363) Support Clustering row writer to improve performance
[ https://issues.apache.org/jira/browse/HUDI-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4363:
----------------------------------
    Status: In Progress  (was: Open)

> Support Clustering row writer to improve performance
> -----------------------------------------------------
>
>                 Key: HUDI-4363
>                 URL: https://issues.apache.org/jira/browse/HUDI-4363
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: performance, writer-core
>            Reporter: Hui An
>            Assignee: Hui An
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>         Attachments: Screen Shot 2022-07-05 at 17.25.13.png
>
> 1. Integrate clustering with the datasource read and write API. This way we can:
>    - enable clustering to use the Dataset API
>    - unify the read and write operations, so that any improvement to the read/write logic (such as vectorized reads) also benefits clustering
> 2. Use {{hoodie.datasource.read.paths}} to pass paths for each clustering operation.
> 3. Introduce {{HoodieInternalWriteStatusCoordinator}} to persist the {{InternalWriteStatus}} of a clustering action, since we cannot get it when using the Spark datasource.
> 4. Add new configuration options to control this behavior.
>
> h4. Test performance
> A test table has 21 columns and 710,716 rows; raw data size is 929 GB (in Spark memory), 38.3 GB after compression.
> Executor memory: 50 GB, 20 instances, with global_sort enabled.
> Without clustering as row: 32 min 12 s
> With clustering as row: 9 min 51 s
> The performance improvement can also be seen in the tests {{TestHoodieSparkMergeOnReadTableClustering}} and {{testLayoutOptimizationFunctional}}.
> !Screen Shot 2022-07-05 at 17.25.13.png!

-- This message was sent by Atlassian Jira (v8.20.10#820010)
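For context on how clustering is driven from write configs today, a standard inline-clustering setup looks roughly like the sketch below. The row-writer toggle this ticket adds comes in as new options in the linked PR and is deliberately not guessed at here; the table name, sort column, and path are hypothetical.

```python
# Sketch of standard inline clustering via pre-existing Hudi write options.
# The new row-writer configs introduced by HUDI-4363 are not reproduced here.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-clustering-sketch").getOrCreate()
df = spark.createDataFrame([("id-1", 1694500000, 1.0)], ["id", "ts", "value"])

clustering_options = {
    "hoodie.table.name": "my_table",                       # placeholder
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    # Schedule and execute clustering inline every few commits, sorted by ts:
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clustering.plan.strategy.sort.columns": "ts",  # hypothetical column
}

df.write.format("hudi").options(**clustering_options).mode("append").save("/tmp/my_table")
```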