[GitHub] [hudi] hudi-bot commented on pull request #4548: [HUDI-3184] hudi-flink support timestamp-micros
hudi-bot commented on pull request #4548:
URL: https://github.com/apache/hudi/pull/4548#issuecomment-1008613232

## CI report:

* afe7fac6c45a7ee1f0935e17896d0616f124fca3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5047)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4548: [HUDI-3184] hudi-flink support timestamp-micros
hudi-bot removed a comment on pull request #4548:
URL: https://github.com/apache/hudi/pull/4548#issuecomment-1008586981

## CI report:

* afe7fac6c45a7ee1f0935e17896d0616f124fca3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5047)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot removed a comment on pull request #4540: [HUDI-3194][WIP] fix MOR snapshot query (HIVE) during compaction
hudi-bot removed a comment on pull request #4540:
URL: https://github.com/apache/hudi/pull/4540#issuecomment-1008603238

## CI report:

* c3295aa79ecd15281ffc573c86e73a2637f3533f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5041)
* 52cad3508ddf12c73f1c5c60180fe1137232192d UNKNOWN

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] guoch commented on issue #4545: [SUPPORT] Hudi(0.10.0) backward compatibility for Flink 1.11/1.12 version
guoch commented on issue #4545:
URL: https://github.com/apache/hudi/issues/4545#issuecomment-1008604526

> Got it. Thanks for the info.
[GitHub] [hudi] hudi-bot commented on pull request #4540: [HUDI-3194][WIP] fix MOR snapshot query (HIVE) during compaction
hudi-bot commented on pull request #4540:
URL: https://github.com/apache/hudi/pull/4540#issuecomment-1008604559

## CI report:

* c3295aa79ecd15281ffc573c86e73a2637f3533f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5041)
* 52cad3508ddf12c73f1c5c60180fe1137232192d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5048)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] guoch closed issue #4545: [SUPPORT] Hudi(0.10.0) backward compatibility for Flink 1.11/1.12 version
guoch closed issue #4545:
URL: https://github.com/apache/hudi/issues/4545
[GitHub] [hudi] hudi-bot removed a comment on pull request #4540: [HUDI-3194][WIP] fix MOR snapshot query (HIVE) during compaction
hudi-bot removed a comment on pull request #4540:
URL: https://github.com/apache/hudi/pull/4540#issuecomment-1008532182

## CI report:

* c3295aa79ecd15281ffc573c86e73a2637f3533f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5041)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4540: [HUDI-3194][WIP] fix MOR snapshot query (HIVE) during compaction
hudi-bot commented on pull request #4540:
URL: https://github.com/apache/hudi/pull/4540#issuecomment-1008603238

## CI report:

* c3295aa79ecd15281ffc573c86e73a2637f3533f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5041)
* 52cad3508ddf12c73f1c5c60180fe1137232192d UNKNOWN

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4548: [HUDI-3184] hudi-flink support timestamp-micros
hudi-bot commented on pull request #4548:
URL: https://github.com/apache/hudi/pull/4548#issuecomment-1008586981

## CI report:

* afe7fac6c45a7ee1f0935e17896d0616f124fca3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5047)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot removed a comment on pull request #4548: [HUDI-3184] hudi-flink support timestamp-micros
hudi-bot removed a comment on pull request #4548:
URL: https://github.com/apache/hudi/pull/4548#issuecomment-1008585610

## CI report:

* afe7fac6c45a7ee1f0935e17896d0616f124fca3 UNKNOWN

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4548: [HUDI-3184] hudi-flink support timestamp-micros
hudi-bot commented on pull request #4548:
URL: https://github.com/apache/hudi/pull/4548#issuecomment-1008585610

## CI report:

* afe7fac6c45a7ee1f0935e17896d0616f124fca3 UNKNOWN

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] AirToSupply opened a new pull request #4548: [HUDI-3184] hudi-flink support timestamp-micros
AirToSupply opened a new pull request #4548:
URL: https://github.com/apache/hudi/pull/4548

## *Tips*

- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*

## What is the purpose of the pull request

hudi-flink module support timestamp-micros. [(HUDI-3184)](https://issues.apache.org/jira/browse/HUDI-3184)

## Brief change log

*(for example:)*

- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

(or)

This change added tests and can be verified as follows:

*(example:)*

- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] hudi-bot removed a comment on pull request #4546: [MINOR] Fix port number in setupKafka.sh
hudi-bot removed a comment on pull request #4546:
URL: https://github.com/apache/hudi/pull/4546#issuecomment-1008560857

## CI report:

* d494dc6ad14f71036c0d939f588313adc84dcf8f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5046)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4546: [MINOR] Fix port number in setupKafka.sh
hudi-bot commented on pull request #4546:
URL: https://github.com/apache/hudi/pull/4546#issuecomment-1008583528

## CI report:

* d494dc6ad14f71036c0d939f588313adc84dcf8f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5046)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot removed a comment on pull request #4535: [WIP][HUDI-3161] Add Call Produce Command for spark sql
hudi-bot removed a comment on pull request #4535:
URL: https://github.com/apache/hudi/pull/4535#issuecomment-1008559321

## CI report:

* 49b18f6d40a8b859927dcc9d606d40fd4162f0b1 UNKNOWN
* 450ccaa4c73197ad56f26c37260f66fc27873f36 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5032)
* a39a6cda867038f96d379ff17b7e1216fa2326fb UNKNOWN
* f56b53b80f3cfc8949eb2f4d14ee2a8a762252da Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5045)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4535: [WIP][HUDI-3161] Add Call Produce Command for spark sql
hudi-bot commented on pull request #4535:
URL: https://github.com/apache/hudi/pull/4535#issuecomment-1008582494

## CI report:

* 49b18f6d40a8b859927dcc9d606d40fd4162f0b1 UNKNOWN
* a39a6cda867038f96d379ff17b7e1216fa2326fb UNKNOWN
* f56b53b80f3cfc8949eb2f4d14ee2a8a762252da Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5045)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] arpanrkl7 commented on issue #2509: [SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt
arpanrkl7 commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-1008567437

When I try to read using spark-sql, I get the error below, which is the same one mentioned by @zuyanton:

java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
[GitHub] [hudi] yihua commented on a change in pull request #3588: [MINOR] Fix wording and table in the marker blog
yihua commented on a change in pull request #3588:
URL: https://github.com/apache/hudi/pull/3588#discussion_r780906284

## File path: website/blog/2021-08-18-improving-marker-mechanism.md

## @@ -47,26 +47,26 @@

Note that the worker thread always checks whether the marker has already been cr

 ## Marker-related write options

-We introduce the following new marker-related write options in `0.9.0` release, to configure the marker mechanism.
+We introduce the following new marker-related write options in `0.9.0` release, to configure the marker mechanism. Note that the timeline-server-based marker mechanism is not yet supported for HDFS in `0.9.0` release, and we plan to support the timeline-server-based marker mechanism for HDFS in the future.

 | Property Name | Default | Meaning|
 | - | --- | :-:|
-| `hoodie.write.markers.type` | direct | Marker type to use. Two modes are supported: (1) `direct`: individual marker file corresponding to each data file is directly created by the writer; (2) `timeline_server_based`: marker operations are all handled at the timeline service which serves as a proxy. New marker entries are batch processed and stored in a limited number of underlying files for efficiency. |
+| `hoodie.write.markers.type` | direct | Marker type to use. Two modes are supported: (1) `direct`: individual marker file corresponding to each data file is directly created by the executor; (2) `timeline_server_based`: marker operations are all handled at the timeline service which serves as a proxy. New marker entries are batch processed and stored in a limited number of underlying files for efficiency. |
 | `hoodie.markers.timeline_server_based.batch.num_threads` | 20 | Number of threads to use for batch processing marker creation requests at the timeline server. |
 | `hoodie.markers.timeline_server_based.batch.interval_ms` | 50 | The batch interval in milliseconds for marker creation batch processing. |

 ## Performance

-We evaluate the write performance over both direct and timeline-server-based marker mechanisms by bulk-inserting a large dataset using Amazon EMR with Spark and S3. The input data is around 100GB. We configure the write operation to generate a large number of data files concurrently by setting the max parquet file size to be 1MB and parallelism to be 240. As we noted before, while the latency of direct marker mechanism is acceptable for incremental writes with smaller number of data files written, it increases dramatically for large bulk inserts/writes which produce much more data files.
+We evaluate the write performance over both direct and timeline-server-based marker mechanisms by bulk-inserting a large dataset using Amazon EMR with Spark and S3. The input data is around 100GB. We configure the write operation to generate a large number of data files concurrently by setting the max parquet file size to be 1MB and parallelism to be 240. Note that it is unlikely to set max parquet file size to 1MB in production and such a setup is only to evaluate the performance regarding the marker mechanisms. As we noted before, while the latency of direct marker mechanism is acceptable for incremental writes with smaller number of data files written, it increases dramatically for large bulk inserts/writes which produce much more data files.

-As shown below, the timeline-server-based marker mechanism generates much fewer files storing markers because of the batch processing, leading to much less time on marker-related I/O operations, thus achieving 31% lower write completion time compared to the direct marker file mechanism.
+As shown below, direct marker mechanism works really well, when a part of the table is written, e.g., 1K out of 165K data files. However, the time of direct marker operations is non-trivial when we need to write significant number of data files. Compared to the direct marker mechanism, the timeline-server-based marker mechanism generates much fewer files storing markers because of the batch processing, leading to much less time on marker-related I/O operations, thus achieving 31% lower write completion time compared to the direct marker file mechanism.

-| Marker Type | Total Files | Num data files written | Files created for markers | Marker deletion time | Bulk Insert Time (including marker deletion) |
+| Marker Type | Input data size | Num data files written | Files created for markers | Marker deletion time | Bulk Insert Time (including marker deletion) |
 | --- | - | :-: | :-: | :-: | :-: |
-| Direct | 165K | 1k | 165k | 5.4secs | - |
-| Direct | 165K | 165k | 165k | 15min | 55min |
-| Timeline-server-based | 165K | 165k | 20 | ~3s | 38min |
+| Direct | 600MB | 1k | 1k | 5.4secs | - |

Review comment:

Somehow missed the comment. I put a PR to fix that: #454
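The review above edits the blog's table of marker-related write options. As a quick illustration of how those options are passed to a write, here is a minimal sketch: the property names and defaults come from the table, while the dict name and the commented-out DataFrame write call are assumptions for illustration, not code from the PR.

```python
# Hedged sketch: marker-related Hudi write options from the table above.
# Values mirror the documented defaults, except the marker type, which is
# switched to the timeline-server-based mechanism that the blog benchmarks.
marker_options = {
    "hoodie.write.markers.type": "timeline_server_based",
    "hoodie.markers.timeline_server_based.batch.num_threads": "20",
    "hoodie.markers.timeline_server_based.batch.interval_ms": "50",
}

# Hypothetical usage with a Spark DataFrame `df` and table path `base_path`
# (both assumed, not defined here):
# df.write.format("hudi").options(**marker_options).mode("append").save(base_path)
```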
[GitHub] [hudi] yihua opened a new pull request #4547: [MINOR] Fix performance table in marker blog
yihua opened a new pull request #4547:
URL: https://github.com/apache/hudi/pull/4547

## What is the purpose of the pull request

Fix performance table content in marker blog.

## Verify this pull request

The site can build and launch.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] hudi-bot removed a comment on pull request #4546: [MINOR] Fix port number in setupKafka.sh
hudi-bot removed a comment on pull request #4546:
URL: https://github.com/apache/hudi/pull/4546#issuecomment-1008560091

## CI report:

* d494dc6ad14f71036c0d939f588313adc84dcf8f UNKNOWN

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4546: [MINOR] Fix port number in setupKafka.sh
hudi-bot commented on pull request #4546:
URL: https://github.com/apache/hudi/pull/4546#issuecomment-1008560857

## CI report:

* d494dc6ad14f71036c0d939f588313adc84dcf8f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5046)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4546: [MINOR] Fix port number in setupKafka.sh
hudi-bot commented on pull request #4546:
URL: https://github.com/apache/hudi/pull/4546#issuecomment-1008560091

## CI report:

* d494dc6ad14f71036c0d939f588313adc84dcf8f UNKNOWN

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] danny0405 commented on issue #4545: [SUPPORT] Hudi(0.10.0) backward compatibility for Flink 1.11/1.12 version
danny0405 commented on issue #4545:
URL: https://github.com/apache/hudi/issues/4545#issuecomment-1008559861

I think we can once the Flink version is stable, e.g. Flink 1.14.x.
[GitHub] [hudi] yihua opened a new pull request #4546: [MINOR] Fix port number in setupKafka.sh
yihua opened a new pull request #4546:
URL: https://github.com/apache/hudi/pull/4546

## What is the purpose of the pull request

This PR fixes port number in `setupKafka.sh`.

## Verify this pull request

Run through the Quick Start Guide of Kafka Connect Sink for Hudi to make sure the script does not throw errors anymore.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] hudi-bot commented on pull request #4535: [WIP][HUDI-3161] Add Call Produce Command for spark sql
hudi-bot commented on pull request #4535:
URL: https://github.com/apache/hudi/pull/4535#issuecomment-1008559321

## CI report:

* 49b18f6d40a8b859927dcc9d606d40fd4162f0b1 UNKNOWN
* 450ccaa4c73197ad56f26c37260f66fc27873f36 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5032)
* a39a6cda867038f96d379ff17b7e1216fa2326fb UNKNOWN
* f56b53b80f3cfc8949eb2f4d14ee2a8a762252da Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5045)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot removed a comment on pull request #4535: [WIP][HUDI-3161] Add Call Produce Command for spark sql
hudi-bot removed a comment on pull request #4535:
URL: https://github.com/apache/hudi/pull/4535#issuecomment-1008556417

## CI report:

* 49b18f6d40a8b859927dcc9d606d40fd4162f0b1 UNKNOWN
* 450ccaa4c73197ad56f26c37260f66fc27873f36 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5032)
* a39a6cda867038f96d379ff17b7e1216fa2326fb UNKNOWN
* f56b53b80f3cfc8949eb2f4d14ee2a8a762252da UNKNOWN

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot commented on pull request #4441:
URL: https://github.com/apache/hudi/pull/4441#issuecomment-1008557130

## CI report:

* cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5044)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot removed a comment on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot removed a comment on pull request #4441:
URL: https://github.com/apache/hudi/pull/4441#issuecomment-1008529426

## CI report:

* cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5044)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot removed a comment on pull request #4535: [WIP][HUDI-3161] Add Call Produce Command for spark sql
hudi-bot removed a comment on pull request #4535:
URL: https://github.com/apache/hudi/pull/4535#issuecomment-1008547036

## CI report:

* 49b18f6d40a8b859927dcc9d606d40fd4162f0b1 UNKNOWN
* 450ccaa4c73197ad56f26c37260f66fc27873f36 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5032)
* a39a6cda867038f96d379ff17b7e1216fa2326fb UNKNOWN

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4535: [WIP][HUDI-3161] Add Call Produce Command for spark sql
hudi-bot commented on pull request #4535:
URL: https://github.com/apache/hudi/pull/4535#issuecomment-1008556417

## CI report:

* 49b18f6d40a8b859927dcc9d606d40fd4162f0b1 UNKNOWN
* 450ccaa4c73197ad56f26c37260f66fc27873f36 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5032)
* a39a6cda867038f96d379ff17b7e1216fa2326fb UNKNOWN
* f56b53b80f3cfc8949eb2f4d14ee2a8a762252da UNKNOWN

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] Gatsby-Lee commented on issue #2509: [SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt
Gatsby-Lee commented on issue #2509:
URL: https://github.com/apache/hudi/issues/2509#issuecomment-1008552876

@nsivabalan After I got your message, I queried the RT table. It still fails. I heard from AWS that the fix will be shipped out at the end of January 2022.
[GitHub] [hudi] waywtdcc closed issue #4305: [SUPPORT] Duplicate Flink write record
waywtdcc closed issue #4305:
URL: https://github.com/apache/hudi/issues/4305
[GitHub] [hudi] waywtdcc closed issue #4508: [SUPPORT]Duplicate Flink Hudi data
waywtdcc closed issue #4508:
URL: https://github.com/apache/hudi/issues/4508
[GitHub] [hudi] nsivabalan commented on issue #3533: [SUPPORT]How to use MOR Table to Merge small file?
nsivabalan commented on issue #3533: URL: https://github.com/apache/hudi/issues/3533#issuecomment-1008552108

@aresa7796: will go ahead and close due to inactivity. Feel free to reopen if need be; happy to help.
[GitHub] [hudi] nsivabalan closed issue #3533: [SUPPORT]How to use MOR Table to Merge small file?
nsivabalan closed issue #3533: URL: https://github.com/apache/hudi/issues/3533
[GitHub] [hudi] nsivabalan commented on issue #2509: [SUPPORT] Hudi Spark DataSource saves TimestampType as bigInt
nsivabalan commented on issue #2509: URL: https://github.com/apache/hudi/issues/2509#issuecomment-1008551882

@umehrot2 @zhedoubushishi: do you folks have any pointers on this? @Gatsby-Lee: I guess Athena added support for real-time queries in one of the latest versions. Did you try using the latest Athena?
[GitHub] [hudi] nsivabalan commented on issue #2936: [SUPPORT] OverwriteNonDefaultsWithLatestAvroPayload not work in mor table
nsivabalan commented on issue #2936: URL: https://github.com/apache/hudi/issues/2936#issuecomment-1008551168

@shenbinglife: let us know if you are looking for any more help. Or feel free to close the issue if you got it resolved.
[GitHub] [hudi] nsivabalan commented on issue #3478: [SUPPORT] Unexpected Hive behaviour
nsivabalan commented on issue #3478: URL: https://github.com/apache/hudi/issues/3478#issuecomment-1008550588

@affei: hey, any updates for us on this, please?
[GitHub] [hudi] nsivabalan commented on issue #3713: [SUPPORT] Cannot read from Hudi table created by same Spark job
nsivabalan commented on issue #3713: URL: https://github.com/apache/hudi/issues/3713#issuecomment-1008550188

Closing this due to inactivity. Feel free to re-open if need be; would be happy to help.
[GitHub] [hudi] nsivabalan closed issue #3713: [SUPPORT] Cannot read from Hudi table created by same Spark job
nsivabalan closed issue #3713: URL: https://github.com/apache/hudi/issues/3713
[GitHub] [hudi] nsivabalan commented on issue #3731: [SUPPORT] Concurrent write (OCC) on distinct partitions random errors
nsivabalan commented on issue #3731: URL: https://github.com/apache/hudi/issues/3731#issuecomment-1008549934

What kind of lock provider are you using?
[GitHub] [hudi] nsivabalan commented on issue #4082: [SUPPORT] How to write multiple HUDi tables simultaneously in a Spark Streaming task?
nsivabalan commented on issue #4082: URL: https://github.com/apache/hudi/issues/4082#issuecomment-1008548662

@xuranyang: are you referring to MultiTableDeltastreamer? I don't think we have any such functionality for now to stream from multiple sources and write to different Hudi tables; it has to be done manually at the application layer by the user. If you can build a simple framework for this, please consider upstreaming the functionality to benefit others in the community. Thanks!
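The application-layer fan-out nsivabalan describes can be sketched in miniature. This is an illustrative, pure-Python model only — the `write_batch` stand-in and table names are hypothetical; in a real Spark Streaming job each per-table slice of the micro-batch would go through a separate Hudi write.

```python
# Illustrative sketch: route one micro-batch of records to several
# per-table writers, the way an application-level multi-table sink would.
from collections import defaultdict


def route_batch(records, table_of):
    """Group a micro-batch by target table using a caller-supplied router."""
    per_table = defaultdict(list)
    for rec in records:
        per_table[table_of(rec)].append(rec)
    return dict(per_table)


def write_batch(table, records):
    # Hypothetical stand-in: a real job would issue a Hudi upsert here
    # for each table's slice of the batch.
    return f"wrote {len(records)} records to {table}"


batch = [{"event": "a", "tbl": "orders"},
         {"event": "b", "tbl": "users"},
         {"event": "c", "tbl": "orders"}]
routed = route_batch(batch, lambda r: r["tbl"])
results = [write_batch(t, rs) for t, rs in sorted(routed.items())]
```

One design note: routing first and writing per table keeps each table's commit independent, so a failure on one table does not poison the others' writes.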
[jira] [Updated] (HUDI-3163) Validate/certify hudi against diff spark 3 versions
[ https://issues.apache.org/jira/browse/HUDI-3163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-3163:
-----------------------------
    Status: In Progress  (was: Open)

> Validate/certify hudi against diff spark 3 versions
> ---------------------------------------------------
>
>                 Key: HUDI-3163
>                 URL: https://issues.apache.org/jira/browse/HUDI-3163
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: Spark Integration
>            Reporter: sivabalan narayanan
>            Assignee: Raymond Xu
>            Priority: Major
>              Labels: user-support-issues
>             Fix For: 0.10.1
>
> We have diff spark3 versions. Let's validate/certify the diff spark3 versions against 0.10.0 and master.
>
> I do see this in our github readme. If it's already certified, feel free to close it out (link to the original ticket where verifications are documented).
> {code:java}
> # Build against Spark 3.2.0 (default build shipped with the public jars)
> mvn clean package -DskipTests -Dspark3
>
> # Build against Spark 3.1.2
> mvn clean package -DskipTests -Dspark3.1.x
>
> # Build against Spark 3.0.3
> mvn clean package -DskipTests -Dspark3.0.x
> {code}

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4535: [WIP][HUDI-3161] Add Call Produce Command for spark sql
hudi-bot removed a comment on pull request #4535: URL: https://github.com/apache/hudi/pull/4535#issuecomment-1008329300

## CI report:

* 49b18f6d40a8b859927dcc9d606d40fd4162f0b1 UNKNOWN
* 450ccaa4c73197ad56f26c37260f66fc27873f36 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5032)
[GitHub] [hudi] hudi-bot commented on pull request #4535: [WIP][HUDI-3161] Add Call Produce Command for spark sql
hudi-bot commented on pull request #4535: URL: https://github.com/apache/hudi/pull/4535#issuecomment-1008547036

## CI report:

* 49b18f6d40a8b859927dcc9d606d40fd4162f0b1 UNKNOWN
* 450ccaa4c73197ad56f26c37260f66fc27873f36 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5032)
* a39a6cda867038f96d379ff17b7e1216fa2326fb UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4544: [HUDI-2735] Allow empty commits in Kafka Connect Sink for Hudi
hudi-bot commented on pull request #4544: URL: https://github.com/apache/hudi/pull/4544#issuecomment-1008543620

## CI report:

* 8ca9f2823977584fb07efc737ccc175a6e33f115 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5043)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4544: [HUDI-2735] Allow empty commits in Kafka Connect Sink for Hudi
hudi-bot removed a comment on pull request #4544: URL: https://github.com/apache/hudi/pull/4544#issuecomment-1008512050

## CI report:

* 8ca9f2823977584fb07efc737ccc175a6e33f115 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5043)
[GitHub] [hudi] nsivabalan commented on pull request #4540: [HUDI-3194][WIP] fix MOR snapshot query (HIVE) during compaction
nsivabalan commented on pull request #4540: URL: https://github.com/apache/hudi/pull/4540#issuecomment-1008535743

@xiarixiaoyao: hey, can you review this patch please? It touches part of the code you authored.
[GitHub] [hudi] hudi-bot removed a comment on pull request #4540: [HUDI-3194][WIP] fix MOR snapshot query (HIVE) during compaction
hudi-bot removed a comment on pull request #4540: URL: https://github.com/apache/hudi/pull/4540#issuecomment-1008501965

## CI report:

* dc6e817b518774152944d658e4c239cfcce30c9f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5016)
* c3295aa79ecd15281ffc573c86e73a2637f3533f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5041)
[GitHub] [hudi] hudi-bot commented on pull request #4540: [HUDI-3194][WIP] fix MOR snapshot query (HIVE) during compaction
hudi-bot commented on pull request #4540: URL: https://github.com/apache/hudi/pull/4540#issuecomment-1008532182

## CI report:

* c3295aa79ecd15281ffc573c86e73a2637f3533f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5041)
[GitHub] [hudi] danny0405 commented on a change in pull request #4446: [HUDI-2917] rollback insert data appended to log file when using Hbase Index
danny0405 commented on a change in pull request #4446: URL: https://github.com/apache/hudi/pull/4446#discussion_r780733934

## File path: hudi-client/hudi-java-client/src/main/java/org/apache/hudi/table/action/commit/BaseJavaCommitActionExecutor.java

## @@ -90,27 +90,29 @@ public BaseJavaCommitActionExecutor(HoodieEngineContext context,
   public HoodieWriteMetadata> execute(List> inputRecords) {
     HoodieWriteMetadata> result = new HoodieWriteMetadata<>();
-    WorkloadProfile profile = null;
+    WorkloadProfile inputProfile = null;
     if (isWorkloadProfileNeeded()) {
-      profile = new WorkloadProfile(buildProfile(inputRecords));
-      LOG.info("Workload profile :" + profile);
+      inputProfile = new WorkloadProfile(buildProfile(inputRecords));
+      LOG.info("Input workload profile :" + inputProfile);
+    }
+
+    final Partitioner partitioner = getPartitioner(inputProfile);
+    try {
+      WorkloadProfile executionProfile = partitioner.getExecutionWorkloadProfile();
+      LOG.info("Execution workload profile :" + inputProfile);
+      saveWorkloadProfileMetadataToInflight(executionProfile, instantTime);

Review comment: And why must we use the execution profile here? I know the original profile also works (only the bloom-filter index is affected), but we should fix the profile building instead of fetching it from the partitioner, if we have a way to distinguish between `INSERT`s and `UPDATE`s before the write.
[GitHub] [hudi] hudi-bot removed a comment on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot removed a comment on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1008523582

## CI report:

* cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040)
[GitHub] [hudi] hudi-bot commented on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot commented on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1008529426

## CI report:

* cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5044)
[GitHub] [hudi] YuweiXiao commented on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
YuweiXiao commented on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1008528378

@hudi-bot run azure
[GitHub] [hudi] hudi-bot commented on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot commented on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1008523582

## CI report:

* cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot removed a comment on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1008500469

## CI report:

* 1277b45508e2b713a3c8416a87893b1d059c375a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5037)
* cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040)
[GitHub] [hudi] guanziyue commented on a change in pull request #4446: [HUDI-2917] rollback insert data appended to log file when using Hbase Index
guanziyue commented on a change in pull request #4446: URL: https://github.com/apache/hudi/pull/4446#discussion_r780881267

## File path: hudi-client/hudi-java-client/src/main/java/org/apache/hudi/table/action/commit/BaseJavaCommitActionExecutor.java

## @@ -90,27 +90,29 @@ public BaseJavaCommitActionExecutor(HoodieEngineContext context,
   public HoodieWriteMetadata> execute(List> inputRecords) {
     HoodieWriteMetadata> result = new HoodieWriteMetadata<>();
-    WorkloadProfile profile = null;
+    WorkloadProfile inputProfile = null;
     if (isWorkloadProfileNeeded()) {
-      profile = new WorkloadProfile(buildProfile(inputRecords));
-      LOG.info("Workload profile :" + profile);
+      inputProfile = new WorkloadProfile(buildProfile(inputRecords));
+      LOG.info("Input workload profile :" + inputProfile);
+    }
+
+    final Partitioner partitioner = getPartitioner(inputProfile);
+    try {
+      WorkloadProfile executionProfile = partitioner.getExecutionWorkloadProfile();
+      LOG.info("Execution workload profile :" + inputProfile);
+      saveWorkloadProfileMetadataToInflight(executionProfile, instantTime);

Review comment:
> I do this because the logic assign records to log file is covered by partitioner and minimize the change to existing code.

We could move all assignment logic for insert records from the partitioner to profile generation. I will modify this part.
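The point being negotiated above — classifying records as `INSERT` vs `UPDATE` while the profile is built, instead of recovering that split from the partitioner afterwards — can be sketched in a toy form. All names here are hypothetical simplifications; Hudi's real `WorkloadProfile` tracks per-partition statistics, and "the index" stands in for whatever record-level index (e.g. HBase) answers key-existence lookups.

```python
# Toy workload-profile builder: if the index already knows a record key,
# the record is an UPDATE; otherwise it is an INSERT. Building this split
# up front means a rollback can later target exactly the files that
# received inserts, without asking the partitioner.
def build_profile(record_keys, indexed_keys):
    profile = {"inserts": 0, "updates": 0}
    for key in record_keys:
        if key in indexed_keys:
            profile["updates"] += 1
        else:
            profile["inserts"] += 1
    return profile
```

The design trade-off mirrors the review thread: deriving the split during profile generation keeps the partitioner a pure placement decision, at the cost of requiring an index lookup before the write.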
[GitHub] [hudi] boneanxs commented on a change in pull request #4350: [HUDI-3047] Basic Implementation of Spark Datasource V2
boneanxs commented on a change in pull request #4350: URL: https://github.com/apache/hudi/pull/4350#discussion_r780880750

## File path: hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala

## @@ -92,4 +95,31 @@ trait SparkAdapter extends Serializable {
    * ParserInterface#parseMultipartIdentifier is supported since spark3, for spark2 this should not be called.
    */
   def parseMultipartIdentifier(parser: ParserInterface, sqlText: String): Seq[String]
+
+  def isHoodieTable(table: LogicalPlan, spark: SparkSession): Boolean = {

Review comment: Is there any difference from `hoodieSqlCommonUtils.isHoodieTable`? I see we sometimes use `adapter.isHoodieTable` and sometimes `hoodieSqlCommonUtils.isHoodieTable`.
[GitHub] [hudi] guanziyue commented on a change in pull request #4446: [HUDI-2917] rollback insert data appended to log file when using Hbase Index
guanziyue commented on a change in pull request #4446: URL: https://github.com/apache/hudi/pull/4446#discussion_r780880230

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java

## @@ -182,14 +182,28 @@ public abstract void preCompact(
         .withOperationField(config.allowOperationMetadataField())
         .withPartition(operation.getPartitionPath())
         .build();
-    if (!scanner.iterator().hasNext()) {
-      scanner.close();
-      return new ArrayList<>();
-    }
     Option oldDataFileOpt = operation.getBaseFile(metaClient.getBasePath(), operation.getPartitionPath());
+    // Considering following scenario: if all log blocks in this fileSlice is rollback, it returns an empty scanner.
+    // But in this case, we need to give it a base file. Otherwise, it will lose base file in following fileSlice.
+    if (!scanner.iterator().hasNext()) {
+      if (!oldDataFileOpt.isPresent()) {
+        scanner.close();
+        return new ArrayList<>();
+      } else {
+        // TODO: we may directly rename original parquet file if there is not evolution/devolution of schema

Review comment:
> If the file slice only has parquet files, why we still trigger compaction ?

Before we actually do compaction, it is quite difficult to know that the new fileSlice only has a parquet file. One or more log files do exist, but they have no valid log blocks in them.
[GitHub] [hudi] guanziyue commented on a change in pull request #4446: [HUDI-2917] rollback insert data appended to log file when using Hbase Index
guanziyue commented on a change in pull request #4446: URL: https://github.com/apache/hudi/pull/4446#discussion_r780879568

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java

## @@ -182,14 +182,28 @@ public abstract void preCompact(
         .withOperationField(config.allowOperationMetadataField())
         .withPartition(operation.getPartitionPath())
         .build();
-    if (!scanner.iterator().hasNext()) {
-      scanner.close();
-      return new ArrayList<>();
-    }
     Option oldDataFileOpt = operation.getBaseFile(metaClient.getBasePath(), operation.getPartitionPath());
+    // Considering following scenario: if all log blocks in this fileSlice is rollback, it returns an empty scanner.
+    // But in this case, we need to give it a base file. Otherwise, it will lose base file in following fileSlice.
+    if (!scanner.iterator().hasNext()) {
+      if (!oldDataFileOpt.isPresent()) {
+        scanner.close();
+        return new ArrayList<>();
+      } else {
+        // TODO: we may directly rename original parquet file if there is not evolution/devolution of schema

Review comment: Correct me if I misunderstand your question. The reason we try to generate a new base file here, rather than end this compaction operation, is that any upsert occurring after the compaction plan is generated will use the compaction commit time as the new log file's base commit time. Such a fileSlice is comprised of the new log file and the base file generated by compaction. If HoodieCompactor didn't generate a base file for this fileSlice, the file group would lose all the base file's data in the new and following fileSlices.
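The control flow being debated in this thread reduces to a small decision table. The sketch below is purely illustrative (it is not Hudi's API; the function and return strings are hypothetical): when the log scanner comes back empty but a base file exists, the compactor still has to carry that base file into the new file slice, otherwise later slices lose its data.

```python
def plan_compaction(has_log_records, has_base_file):
    """Decide what a compactor should emit for one file slice (illustrative)."""
    if has_log_records:
        # Normal compaction paths.
        return "merge logs with base file" if has_base_file else "write new base file from logs"
    if has_base_file:
        # Empty scanner (e.g. every log block was rolled back) but a base
        # file exists: rewrite or rename it into the new slice, so upserts
        # keyed on the compaction instant still see its data.
        return "carry base file into new slice"
    # Nothing survived and no base file exists: safe to skip.
    return "skip"
```

The pre-patch behavior corresponds to returning "skip" whenever the scanner is empty, which is exactly the case the patch fixes.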
[GitHub] [hudi] yihua commented on pull request #3420: [HUDI-2283] Support Clustering Command For Spark Sql
yihua commented on pull request #3420: URL: https://github.com/apache/hudi/pull/3420#issuecomment-1008519568

> @nsivabalan @yihua If no one take this up, i am glad to.

@YannByron Feel free to take a stab at this PR.
[GitHub] [hudi] hudi-bot removed a comment on pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field
hudi-bot removed a comment on pull request #4449: URL: https://github.com/apache/hudi/pull/4449#issuecomment-1008497224

## CI report:

* dc9fe1b878dc47eaed13911fc5ca7eaffb80fb2f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4753)
* ce8a8d9547819b23368115ba640caed1cb385213 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5039)
[GitHub] [hudi] hudi-bot commented on pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field
hudi-bot commented on pull request #4449: URL: https://github.com/apache/hudi/pull/4449#issuecomment-1008518397

## CI report:

* ce8a8d9547819b23368115ba640caed1cb385213 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5039)
[GitHub] [hudi] guoch opened a new issue #4545: [SUPPORT] Hudi(0.10.0) backward compatibility for Flink 1.11/1.12 version
guoch opened a new issue #4545: URL: https://github.com/apache/hudi/issues/4545

**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? Yes
- Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org.
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

The hudi-flink bundle at version 0.10.0 and on master cannot run on Flink 1.11, while Flink 1.11 is still widely used, and Hudi keeps doing a great job of supporting Flink better in newer versions. According to the discussion in https://github.com/apache/hudi/pull/3291, the community is unlikely to support old Flink versions backward. (The pairing has roughly been: Hudi 0.8 - Flink 1.11, Hudi 0.9 - Flink 1.12, Hudi 0.10 - Flink 1.13; a new Hudi version cannot work on an old Flink.) Old Spark versions (2.4/3.1) are always supported with different Maven compile profiles — is there any possibility of retaining support for old Flink versions using a similar trick?

**Expected behavior**

New versions can be backward compatible with old Flink versions using different profiles.

**Environment Description**

* Hudi version : 0.10.0 and master branch
* Spark version : 3.2.0
* Hive version : 3.1.2
* Hadoop version : 3.3.1
* Flink version : 1.11.3
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no
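The "similar trick" the reporter asks about — Hudi's Spark-version switches like `-Dspark3.1.x` — is implemented with Maven profiles that pin a dependency set per engine version. A Flink analogue could look like the fragment below; this is purely an illustrative sketch, and the profile id and property names are hypothetical, not actual Hudi build options.

```xml
<!-- Hypothetical profile: select Flink 1.11 artifacts at build time,
     e.g. `mvn clean package -DskipTests -Pflink1.11` -->
<profile>
  <id>flink1.11</id>
  <properties>
    <flink.version>1.11.3</flink.version>
    <flink.bundle.suffix>1.11</flink.bundle.suffix>
  </properties>
</profile>
```

The practical obstacle, as the linked PR discussion suggests, is that a profile only swaps versions; it cannot paper over Flink API changes between 1.11 and 1.13 without shim modules per version.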
[GitHub] [hudi] hudi-bot commented on pull request #4544: [HUDI-2735] Allow empty commits in Kafka Connect Sink for Hudi
hudi-bot commented on pull request #4544: URL: https://github.com/apache/hudi/pull/4544#issuecomment-1008512050

## CI report:

* 8ca9f2823977584fb07efc737ccc175a6e33f115 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5043)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4544: [HUDI-2735] Allow empty commits in Kafka Connect Sink for Hudi
hudi-bot removed a comment on pull request #4544: URL: https://github.com/apache/hudi/pull/4544#issuecomment-1008510590

## CI report:

* 8ca9f2823977584fb07efc737ccc175a6e33f115 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4544: [HUDI-2735] Allow empty commits in Kafka Connect Sink for Hudi
hudi-bot commented on pull request #4544: URL: https://github.com/apache/hudi/pull/4544#issuecomment-1008510590 ## CI report: * 8ca9f2823977584fb07efc737ccc175a6e33f115 UNKNOWN
[GitHub] [hudi] yihua opened a new pull request #4544: [HUDI-2735] Allow empty commits in Kafka Connect Sink for Hudi
yihua opened a new pull request #4544: URL: https://github.com/apache/hudi/pull/4544 ## What is the purpose of the pull request This PR makes the Kafka Connect Sink for Hudi write empty commits when there are no new messages from the Kafka topic. This avoids constant rollbacks when the topic is idle. Regardless of whether there are new messages, the write commit logic, including archival, is always executed, which also resolves the problem of rollbacks never being archived when there are no new messages. ## Brief change log - Removes the check on the size of the write status list from all participants in `ConnectTransactionCoordinator`. - Adds a new test for an empty status list. ## Verify this pull request This change added tests and can be verified as follows: - Run the Kafka Connect Sink for Hudi using the Quick Start Guide - Publish some messages to the Kafka topic: `bash setupKafka.sh -n 100 -b 6` - Wait for some time so the Sink ingests all messages and writes empty commits - Publish more messages to the topic: `bash setupKafka.sh -n 100 -b 6 -o 600 -t` - Verify the table timeline using hudi-cli: ```
hudi:hudi-test-topic->commits show
| CommitTime        | Total Bytes Written | Total Files Added | Total Files Updated | Total Partitions Written | Total Records Written | Total Update Records Written | Total Errors |
| 20220109184255282 | 76.1 KB             | 0                 | 20                  | 5                        | 300                   | 300                          | 0            |
| 20220109184129070 | 75.7 KB             | 0                 | 20                  | 5                        | 300                   | 300                          | 0            |
| 20220109183955630 | 0.0 B               | 0                 | 0                   | 0                        | 0                     | 0                            | 0            |
| 20220109183755160 | 0.0 B               | 0                 | 0                   | 0                        | 0                     | 0                            | 0            |
| 20220109183554995 | 0.0 B               | 0                 | 0                   | 0                        | 0                     | 0                            | 0            |
| 20220109183354904 | 0.0 B               | 0                 | 0                   | 0                        | 0                     | 0                            | 0            |
| 20220109183225656 | 75.7 KB             | 0                 | 20                  | 5                        | 300                   | 300                          | 0            |
| 20220109183055068 | 71.8 KB             | 0                 | 16                  | 5                        | 300                   | 300                          | 0            |
``` ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] N
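The commit decision described in the change log above can be sketched as follows. This is a simplified, hypothetical illustration — the real `ConnectTransactionCoordinator` internals differ, and the class and method names here are illustrative only:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the coordinator's commit decision for a batch.
// Names are illustrative; the real coordinator logic is more involved.
class EmptyCommitSketch {

    // Before the change: an empty write-status list aborted the commit,
    // triggering a rollback on every idle batch (and those rollbacks were
    // never archived, because the commit path never ran).
    static boolean shouldCommitBefore(List<String> writeStatuses) {
        return !writeStatuses.isEmpty();
    }

    // After the change: commit unconditionally, so archival and the rest of
    // the commit path run even when no new Kafka messages arrived.
    static boolean shouldCommitAfter(List<String> writeStatuses) {
        return true;
    }

    public static void main(String[] args) {
        List<String> emptyBatch = Collections.emptyList();
        System.out.println("before: " + shouldCommitBefore(emptyBatch)); // before: false
        System.out.println("after: " + shouldCommitAfter(emptyBatch));   // after: true
    }
}
```

With the old behavior an idle topic produced a rollback per batch; with the new behavior it produces the `0.0 B` empty commits visible in the timeline above.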
[hudi] branch master updated (56f93f4 -> 251d4eb)
This is an automated email from the ASF dual-hosted git repository. codope pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git. from 56f93f4 Removing rollbacks instants from timeline for restore operation (#4518) add 251d4eb [HUDI-3030] InProcessLockPovider as default when any async servcies enabled with no lock provider override (#4406) No new revisions were added by this update. Summary of changes: .../hudi/client/AbstractHoodieWriteClient.java | 2 +- .../hudi/client/transaction/lock/LockManager.java | 15 ++- .../org/apache/hudi/config/HoodieWriteConfig.java | 34 ++- .../apache/hudi/config/TestHoodieWriteConfig.java | 102 + .../apache/hudi/common/config/HoodieConfig.java| 6 +- 5 files changed, 149 insertions(+), 10 deletions(-)
[GitHub] [hudi] codope merged pull request #4406: [HUDI-3030] InProcessLockPovider as default when any async servcies enabled with no lock provider override
codope merged pull request #4406: URL: https://github.com/apache/hudi/pull/4406
[GitHub] [hudi] hudi-bot removed a comment on pull request #4540: [HUDI-3194][WIP] fix MOR snapshot query (HIVE) during compaction
hudi-bot removed a comment on pull request #4540: URL: https://github.com/apache/hudi/pull/4540#issuecomment-1008501296 ## CI report: * dc6e817b518774152944d658e4c239cfcce30c9f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5016) * c3295aa79ecd15281ffc573c86e73a2637f3533f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4540: [HUDI-3194][WIP] fix MOR snapshot query (HIVE) during compaction
hudi-bot commented on pull request #4540: URL: https://github.com/apache/hudi/pull/4540#issuecomment-1008501965 ## CI report: * dc6e817b518774152944d658e4c239cfcce30c9f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5016) * c3295aa79ecd15281ffc573c86e73a2637f3533f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5041)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4540: [HUDI-3194][WIP] fix MOR snapshot query (HIVE) during compaction
hudi-bot removed a comment on pull request #4540: URL: https://github.com/apache/hudi/pull/4540#issuecomment-1008006702 ## CI report: * dc6e817b518774152944d658e4c239cfcce30c9f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5016)
[GitHub] [hudi] hudi-bot commented on pull request #4540: [HUDI-3194][WIP] fix MOR snapshot query (HIVE) during compaction
hudi-bot commented on pull request #4540: URL: https://github.com/apache/hudi/pull/4540#issuecomment-1008501296 ## CI report: * dc6e817b518774152944d658e4c239cfcce30c9f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5016) * c3295aa79ecd15281ffc573c86e73a2637f3533f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot commented on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1008500469 ## CI report: * 1277b45508e2b713a3c8416a87893b1d059c375a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5037) * cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot removed a comment on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1008483743 ## CI report: * 1277b45508e2b713a3c8416a87893b1d059c375a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5037) * cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d UNKNOWN
[jira] [Closed] (HUDI-2779) Cache BaseDir if HudiTableNotFound Exception thrown
[ https://issues.apache.org/jira/browse/HUDI-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hui An closed HUDI-2779. > Cache BaseDir if HudiTableNotFound Exception thrown > --- > > Key: HUDI-2779 > URL: https://issues.apache.org/jira/browse/HUDI-2779 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Hui An >Assignee: Hui An >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0, 0.10.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] manojpec commented on a change in pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field
manojpec commented on a change in pull request #4449: URL: https://github.com/apache/hudi/pull/4449#discussion_r780869585 ## File path: hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java ## @@ -62,6 +64,7 @@ // Scanner used to read individual keys. This is cached to prevent the overhead of opening the scanner for each // key retrieval. private HFileScanner keyScanner; + private final String keyField = HoodieMetadataPayload.SCHEMA_FIELD_ID_KEY; Review comment: Unlike the HFile writer, readers don't take an HFile config or any other writer config; callers use the factory's static methods to construct the reader. The factory and the reader live in the hudi-common package, so they cannot use the hudi-client storage configs where the new HFile properties are available. The factory could pass the key schema field as an extra argument, but that doesn't cover all cases: some callers instantiate an HFileReader from serialized contents, and they are also at the hudi-common package level with no access to the new storage configs. In https://github.com/apache/hudi/pull/4447, I made all the HFileReader callers pass in the key schema field, and that's what made the patch touch so many places.
[GitHub] [hudi] manojpec commented on a change in pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field
manojpec commented on a change in pull request #4449: URL: https://github.com/apache/hudi/pull/4449#discussion_r780868566 ## File path: hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java ## @@ -151,15 +154,15 @@ public BloomFilter readBloomFilter() { } public List> readAllRecords(Schema writerSchema, Schema readerSchema) throws IOException { +final Option keySchemaField = Option.ofNullable(readerSchema.getField(keyField)); List> recordList = new LinkedList<>(); try { final HFileScanner scanner = reader.getScanner(false, false); if (scanner.seekTo()) { do { Cell c = scanner.getKeyValue(); - byte[] keyBytes = Arrays.copyOfRange(c.getRowArray(), c.getRowOffset(), c.getRowOffset() + c.getRowLength()); - R record = getRecordFromCell(c, writerSchema, readerSchema); - recordList.add(new Pair<>(new String(keyBytes), record)); + final Pair keyAndRecordPair = getRecordFromCell(c, writerSchema, readerSchema, keySchemaField); + recordList.add(new Pair<>(keyAndRecordPair.getFirst(), keyAndRecordPair.getSecond())); Review comment: fixed it.
[GitHub] [hudi] manojpec commented on a change in pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field
manojpec commented on a change in pull request #4449: URL: https://github.com/apache/hudi/pull/4449#discussion_r780868540 ## File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieBackedMetadata.java ## @@ -507,6 +519,255 @@ public void testMetadataTableWithPendingCompaction(boolean simulateFailedCompact } } + /** + * Test arguments - Table type, populate meta fields, exclude key from payload. + */ + public static List testMetadataRecordKeyExcludeFromPayloadArgs() { +return asList( +Arguments.of(COPY_ON_WRITE, true), +Arguments.of(COPY_ON_WRITE, false), +Arguments.of(MERGE_ON_READ, true), +Arguments.of(MERGE_ON_READ, false) +); + } + + /** Review comment: I initially had the testing at the HFile writer and reader level, but it did not cover the compaction use case for the metadata table. The test here combines everything and checks exactly what is needed for the metadata table records.
[GitHub] [hudi] hudi-bot commented on pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field
hudi-bot commented on pull request #4449: URL: https://github.com/apache/hudi/pull/4449#issuecomment-1008497224 ## CI report: * dc9fe1b878dc47eaed13911fc5ca7eaffb80fb2f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4753) * ce8a8d9547819b23368115ba640caed1cb385213 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5039)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field
hudi-bot removed a comment on pull request #4449: URL: https://github.com/apache/hudi/pull/4449#issuecomment-1008496412 ## CI report: * dc9fe1b878dc47eaed13911fc5ca7eaffb80fb2f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4753) * ce8a8d9547819b23368115ba640caed1cb385213 UNKNOWN
[GitHub] [hudi] manojpec commented on a change in pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field
manojpec commented on a change in pull request #4449: URL: https://github.com/apache/hudi/pull/4449#discussion_r780868368 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java ## @@ -122,7 +128,13 @@ public boolean canWrite() { @Override public void writeAvro(String recordKey, IndexedRecord object) throws IOException { -byte[] value = HoodieAvroUtils.avroToBytes((GenericRecord)object); +byte[] value = HoodieAvroUtils.avroToBytes((GenericRecord) object); Review comment: We should not empty/change the passed in record object 'key' field, else the caller will have the in-memory copy of the record object with key missing and affects all users of it. So, i need a copy of the record object, where i can empty the key field and then save to disk. The second de-serialization back to a new record object where i can change the field is needed. If there are any other better ways to doing this, happy to change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
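The copy-before-mutate concern discussed above can be illustrated in plain Java. This sketch uses a simple map-backed record so it is self-contained; for real Avro records, `GenericData.get().deepCopy(schema, record)` plays the role of the copy constructor here, and all names in this sketch are illustrative rather than taken from the Hudi codebase:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of why the writer must copy a record before blanking
// its key field: mutating the caller's object in place would leave the
// in-memory record with its key missing for every other user of it.
class KeyFieldCopySketch {

    // Copy first, then blank the key only on the copy that goes to disk.
    static Map<String, Object> writeWithoutKey(Map<String, Object> record, String keyField) {
        Map<String, Object> copy = new HashMap<>(record);
        copy.put(keyField, "");
        return copy;
    }

    public static void main(String[] args) {
        Map<String, Object> record = new HashMap<>();
        record.put("key", "partition/file-001");
        record.put("payload", "column stats");

        Map<String, Object> serialized = writeWithoutKey(record, "key");

        // The caller's record still carries its key; only the copy was blanked.
        System.out.println(record.get("key"));      // prints partition/file-001
        System.out.println(serialized.get("key")); // prints an empty string
    }
}
```

The cost of this safety is the extra copy (in the PR, an extra deserialization), which is the trade-off the review comment is weighing.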
[GitHub] [hudi] hudi-bot commented on pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field
hudi-bot commented on pull request #4449: URL: https://github.com/apache/hudi/pull/4449#issuecomment-1008496412 ## CI report: * dc9fe1b878dc47eaed13911fc5ca7eaffb80fb2f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4753) * ce8a8d9547819b23368115ba640caed1cb385213 UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field
hudi-bot removed a comment on pull request #4449: URL: https://github.com/apache/hudi/pull/4449#issuecomment-1001797582 ## CI report: * dc9fe1b878dc47eaed13911fc5ca7eaffb80fb2f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4753)
[GitHub] [hudi] manojpec commented on a change in pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field
manojpec commented on a change in pull request #4449: URL: https://github.com/apache/hudi/pull/4449#discussion_r780867818 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java ## @@ -162,6 +158,20 @@ protected void createRecordsFromContentBytes() throws IOException { return records; } + /** + * Serialize the record to byte buffer. + * + * @param record - Record to serialize + * @param schemaKeyField - Key field in the schema + * @return Serialized byte buffer for the record + */ + private byte[] serializeRecord(final IndexedRecord record, final Option schemaKeyField) { +if (schemaKeyField.isPresent()) { + record.put(schemaKeyField.get().pos(), ""); Review comment: fixed.
[GitHub] [hudi] manojpec commented on a change in pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field
manojpec commented on a change in pull request #4449: URL: https://github.com/apache/hudi/pull/4449#discussion_r780867793 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java ## @@ -77,6 +81,8 @@ public HoodieHFileWriter(String instantTime, Path file, HoodieHFileConfig hfileC this.file = HoodieWrapperFileSystem.convertToHoodiePath(file, conf); this.fs = (HoodieWrapperFileSystem) this.file.getFileSystem(conf); this.hfileConfig = hfileConfig; +this.schema = schema; +this.schemaRecordKeyField = Option.ofNullable(schema.getField(HoodieMetadataPayload.SCHEMA_FIELD_ID_KEY)); Review comment: Incorporated Vinoth's suggestion of using the storage config property, letting the HFileWriter learn about the key field from the config.
[GitHub] [hudi] Guanpx commented on issue #4539: [SUPPORT] spark 2.4.0 write data to hudi ERROR (0.10.0)
Guanpx commented on issue #4539: URL: https://github.com/apache/hudi/issues/4539#issuecomment-1008493909 > 2.4.0 is not supported. Can you try with 2.4.3 or higher spark versions. We cannot upgrade our Spark, so if we replace `SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY` in the Hudi source code with the string "__partition_columns", or delete that code, will it impact other functions?
[hudi] branch master updated (e9a7f49 -> 56f93f4)
This is an automated email from the ASF dual-hosted git repository. codope pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git. from e9a7f49 [HUDI-3112] Fix KafkaConnect cannot sync to Hive Problem (#4458) add 56f93f4 Removing rollbacks instants from timeline for restore operation (#4518) No new revisions were added by this update. Summary of changes: .../hudi/table/action/restore/BaseRestoreActionExecutor.java | 10 ++ .../functional/TestHoodieClientOnCopyOnWriteStorage.java | 2 ++ 2 files changed, 12 insertions(+)
[GitHub] [hudi] codope merged pull request #4518: [HUDI-2477] Removing rollbacks instants from timeline for restore operation
codope merged pull request #4518: URL: https://github.com/apache/hudi/pull/4518
[jira] [Closed] (HUDI-3065) spark auto partition discovery does not work from 0.9.0
[ https://issues.apache.org/jira/browse/HUDI-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-3065. Reviewers: Forward Xu, Raymond Xu (was: Raymond Xu) Resolution: Won't Fix > spark auto partition discovery does not work from 0.9.0 > --- > > Key: HUDI-3065 > URL: https://issues.apache.org/jira/browse/HUDI-3065 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: sivabalan narayanan >Assignee: Yann Byron >Priority: Major > Labels: core-flow-ds, sev:critical, spark > Fix For: 0.10.1 > > > with 0.8.0, if partition is of the format "/partitionKey=partitionValue", > Spark auto partition discovery will kick in. we can see explicit fields in > hudi's table schema. > But with 0.9.0, it does not happen. > // launch spark shell with 0.8.0 > {code:scala} > import org.apache.hudi.QuickstartUtils._ > import scala.collection.JavaConversions._ > import org.apache.spark.sql.SaveMode._ > import org.apache.hudi.DataSourceReadOptions._ > import org.apache.hudi.DataSourceWriteOptions._ > import org.apache.hudi.config.HoodieWriteConfig._ > val tableName = "hudi_trips_cow" > val basePath = "file:///tmp/hudi_trips_cow" > val dataGen = new DataGenerator > val inserts = convertToStringList(dataGen.generateInserts(10)) > val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) > val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath", > "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city=")) > newDf.write.format("hudi"). > options(getQuickstartWriteConfigs). > option(PRECOMBINE_FIELD_OPT_KEY, "ts"). > option(RECORDKEY_FIELD_OPT_KEY, "uuid"). > option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). > option(TABLE_NAME, tableName). > mode(Overwrite).save(basePath) > val tripsSnapshotDF = spark. > read. > format("hudi"). > load(basePath) > tripsSnapshotDF.printSchema > {code} > // output : check for continent, country, city in the end. 
> {code} > |-- _hoodie_commit_time: string (nullable = true) > |-- _hoodie_commit_seqno: string (nullable = true) > |-- _hoodie_record_key: string (nullable = true) > |-- _hoodie_partition_path: string (nullable = true) > |-- _hoodie_file_name: string (nullable = true) > |-- begin_lat: double (nullable = true) > |-- begin_lon: double (nullable = true) > |-- driver: string (nullable = true) > |-- end_lat: double (nullable = true) > |-- end_lon: double (nullable = true) > |-- fare: double (nullable = true) > |-- partitionpath: string (nullable = true) > |-- rider: string (nullable = true) > |-- ts: long (nullable = true) > |-- uuid: string (nullable = true) > |-- continent: string (nullable = true) > |-- country: string (nullable = true) > |-- city: string (nullable = true) > {code} > > Let's run this with 0.9.0. > {code:scala} > import org.apache.hudi.QuickstartUtils._ > import scala.collection.JavaConversions._ > import org.apache.spark.sql.SaveMode._ > import org.apache.hudi.DataSourceReadOptions._ > import org.apache.hudi.DataSourceWriteOptions._ > import org.apache.hudi.config.HoodieWriteConfig._ > val tableName = "hudi_trips_cow" > val basePath = "file:///tmp/hudi_trips_cow" > val dataGen = new DataGenerator > val inserts = convertToStringList(dataGen.generateInserts(10)) > val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) > val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath", > "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city=")) > newDf.write.format("hudi"). > options(getQuickstartWriteConfigs). > option(PRECOMBINE_FIELD_OPT_KEY, "ts"). > option(RECORDKEY_FIELD_OPT_KEY, "uuid"). > option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). > option(TABLE_NAME, tableName). > mode(Overwrite).save(basePath) > val tripsSnapshotDF = spark. > | read. > | format("hudi"). > | load(basePath ) > tripsSnapshotDF.printSchema > {code} > // output: continent, country, city are missing. 
> {code} > root > |-- _hoodie_commit_time: string (nullable = true) > |-- _hoodie_commit_seqno: string (nullable = true) > |-- _hoodie_record_key: string (nullable = true) > |-- _hoodie_partition_path: string (nullable = true) > |-- _hoodie_file_name: string (nullable = true) > |-- begin_lat: double (nullable = true) > |-- begin_lon: double (nullable = true) > |-- driver: string (nullable = true) > |-- end_lat: double (nullable = true) > |-- end_lon: double (nullable = true) > |-- fare: double (nullable = true) > |-- rider: string (nullable = true) > |-- ts: long (nullable = true) > |-- uuid: string (nullable = true) > |-- partitionpath: string (nullable = true) > {code} > Ref issue: [https://github.com/apache/hudi/issues/3984] > > > > --
[jira] [Commented] (HUDI-3065) spark auto partition discovery does not work from 0.9.0
[ https://issues.apache.org/jira/browse/HUDI-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471637#comment-17471637 ] Raymond Xu commented on HUDI-3065: -- After discussion with [~x1q1j1] [~biyan900...@gmail.com], we think that the auto partition discovery behavior should be addressed separately. In the end state, we should have a keygen or a flag to help users enable partition discovery. Without the keygen or partition discovery flag, we respect the user's settings and take partition paths as is, i.e., no partition auto discovery. Will close this as won't fix; the next steps are recorded in the linked tickets. cc @ > spark auto partition discovery does not work from 0.9.0 > --- > > Key: HUDI-3065 > URL: https://issues.apache.org/jira/browse/HUDI-3065 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: sivabalan narayanan >Assignee: Yann Byron >Priority: Major > Labels: core-flow-ds, sev:critical, spark > Fix For: 0.10.1 > > > with 0.8.0, if the partition is of the format "/partitionKey=partitionValue", > Spark auto partition discovery will kick in. We can see explicit fields in > hudi's table schema. > But with 0.9.0, it does not happen. > // launch spark shell with 0.8.0 > {code:scala} > import org.apache.hudi.QuickstartUtils._ > import scala.collection.JavaConversions._ > import org.apache.spark.sql.SaveMode._ > import org.apache.hudi.DataSourceReadOptions._ > import org.apache.hudi.DataSourceWriteOptions._ > import org.apache.hudi.config.HoodieWriteConfig._ > val tableName = "hudi_trips_cow" > val basePath = "file:///tmp/hudi_trips_cow" > val dataGen = new DataGenerator > val inserts = convertToStringList(dataGen.generateInserts(10)) > val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) > val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath", > "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city=")) > newDf.write.format("hudi"). 
> options(getQuickstartWriteConfigs). > option(PRECOMBINE_FIELD_OPT_KEY, "ts"). > option(RECORDKEY_FIELD_OPT_KEY, "uuid"). > option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). > option(TABLE_NAME, tableName). > mode(Overwrite).save(basePath) > val tripsSnapshotDF = spark. > read. > format("hudi"). > load(basePath) > tripsSnapshotDF.printSchema > {code} > // output : check for continent, country, city in the end. > {code} > |-- _hoodie_commit_time: string (nullable = true) > |-- _hoodie_commit_seqno: string (nullable = true) > |-- _hoodie_record_key: string (nullable = true) > |-- _hoodie_partition_path: string (nullable = true) > |-- _hoodie_file_name: string (nullable = true) > |-- begin_lat: double (nullable = true) > |-- begin_lon: double (nullable = true) > |-- driver: string (nullable = true) > |-- end_lat: double (nullable = true) > |-- end_lon: double (nullable = true) > |-- fare: double (nullable = true) > |-- partitionpath: string (nullable = true) > |-- rider: string (nullable = true) > |-- ts: long (nullable = true) > |-- uuid: string (nullable = true) > |-- continent: string (nullable = true) > |-- country: string (nullable = true) > |-- city: string (nullable = true) > {code} > > Let's run this with 0.9.0. > {code:scala} > import org.apache.hudi.QuickstartUtils._ > import scala.collection.JavaConversions._ > import org.apache.spark.sql.SaveMode._ > import org.apache.hudi.DataSourceReadOptions._ > import org.apache.hudi.DataSourceWriteOptions._ > import org.apache.hudi.config.HoodieWriteConfig._ > val tableName = "hudi_trips_cow" > val basePath = "file:///tmp/hudi_trips_cow" > val dataGen = new DataGenerator > val inserts = convertToStringList(dataGen.generateInserts(10)) > val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) > val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath", > "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city=")) > newDf.write.format("hudi"). > options(getQuickstartWriteConfigs). 
> option(PRECOMBINE_FIELD_OPT_KEY, "ts"). > option(RECORDKEY_FIELD_OPT_KEY, "uuid"). > option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). > option(TABLE_NAME, tableName). > mode(Overwrite). save(basePath) > val tripsSnapshotDF = spark. > | read. > | format("hudi"). > | load(basePath ) > tripsSnapshotDF.printSchema > {code} > //output: continent, country, city is missing. > {code} > root > |-- _hoodie_commit_time: string (nullable = true) > |-- _hoodie_commit_seqno: string (nullable = true) > |-- _hoodie_record_key: string (nullable = true) > |-- _hoodie_partition_path: string (nullable = true) > |-- _hoodie_file_name: string (nullable = true) > |-- begin_lat: double (nullable = true) > |-- begin
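The `regexp_replace` call in the reproduction above is dense, so here is a minimal standalone sketch of what it does (plain Java `replaceAll`, which shares the same regex semantics as Spark's `regexp_replace`; the sample path follows the quickstart `DataGenerator` partition values). The greedy groups split the path on its two slashes, and the replacement splices `key=` prefixes in front of each segment:

```java
public class PartitionPathRegexDemo {
    public static void main(String[] args) {
        // A quickstart-style partition path: continent/country/city.
        String path = "americas/united_states/san_francisco";

        // Same pattern and replacement as the regexp_replace call above.
        // $1/$3 capture the first two segments; $2/$4 are the slashes;
        // the trailing "city=" is completed by the unmatched remainder.
        String hiveStyle = path.replaceAll(
                "(.*)(\\/){1}(.*)(\\/){1}",
                "continent=$1$2country=$3$4city=");

        System.out.println(hiveStyle);
        // continent=americas/country=united_states/city=san_francisco
    }
}
```

With 0.8.0, writing to paths shaped like this is what let Spark surface `continent`, `country`, and `city` as schema columns.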
[jira] [Comment Edited] (HUDI-3065) spark auto partition discovery does not work from 0.9.0
[ https://issues.apache.org/jira/browse/HUDI-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471637#comment-17471637 ]

Raymond Xu edited comment on HUDI-3065 at 1/10/22, 2:04 AM:
-----------------------------------------------------------

After discussion with [~x1q1j1] [~biyan900...@gmail.com], we think that the auto partition discovery behavior should be addressed separately. In the end state, we should have a keygen or a flag to help users enable partition discovery. Without the keygen or partition discovery flag, we respect the user's setting and take partition paths as is, i.e., no partition auto discovery. Will close this as won't fix; the next steps are recorded in the linked tickets. cc [~shivnarayan]

was (Author: xushiyan):
After discussion with [~x1q1j1] [~biyan900...@gmail.com], we think that the auto partition discovery behavior should be addressed separately. In the end state, we should have a keygen or a flag to help users enable partition discovery. Without the keygen or partition discovery flag, we respect the user's setting and take partition paths as is, i.e., no partition auto discovery. Will close this as won't fix; the next steps are recorded in the linked tickets. cc @
[jira] [Updated] (HUDI-3200) File Index config affects partition fields shown in printSchema results
[ https://issues.apache.org/jira/browse/HUDI-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-3200:
-----------------------------
    Description: 
Discovered in HUDI-3065: disabling the file index config should not affect the partition fields shown in printSchema.

It looks like since 0.9.0:
- file index = true: enables partition auto discovery
- file index = false: disables partition auto discovery

> File Index config affects partition fields shown in printSchema results
> -----------------------------------------------------------------------
>
>                 Key: HUDI-3200
>                 URL: https://issues.apache.org/jira/browse/HUDI-3200
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Raymond Xu
>            Priority: Major
>             Fix For: 0.11.0
>
> Discovered in HUDI-3065: disabling the file index config should not affect the partition fields shown in printSchema.
> It looks like since 0.9.0:
> - file index = true: enables partition auto discovery
> - file index = false: disables partition auto discovery

-- 
This message was sent by Atlassian Jira (v8.20.1#820001)
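For intuition, "partition auto discovery" here means deriving partition columns from Hive-style `key=value` directory names. A minimal standalone sketch of that derivation (illustrative only; `discover` is a hypothetical helper, not Spark's or Hudi's actual implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionDiscoverySketch {
    // Splits a Hive-style partition path into column-name -> value pairs,
    // mimicking the schema fields that partition discovery surfaces.
    static Map<String, String> discover(String partitionPath) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (String segment : partitionPath.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                // Left of '=' becomes the column name, right becomes the value.
                columns.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(
                discover("continent=americas/country=united_states/city=san_francisco"));
        // {continent=americas, country=united_states, city=san_francisco}
    }
}
```

Segments without an `=` (plain partition paths) yield no columns, which matches the reported symptom: when discovery is skipped, `continent`, `country`, and `city` never appear in printSchema.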
[jira] [Updated] (HUDI-3202) Add keygen to support partition discovery
[ https://issues.apache.org/jira/browse/HUDI-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3202: - Reviewers: Forward Xu, Raymond Xu, Yann Byron > Add keygen to support partition discovery > - > > Key: HUDI-3202 > URL: https://issues.apache.org/jira/browse/HUDI-3202 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Raymond Xu >Priority: Major > Fix For: 0.11.0 > >
[jira] [Updated] (HUDI-3201) Make partition auto discovery configurable
[ https://issues.apache.org/jira/browse/HUDI-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3201: - Reviewers: Forward Xu, Raymond Xu, Yann Byron > Make partition auto discovery configurable > -- > > Key: HUDI-3201 > URL: https://issues.apache.org/jira/browse/HUDI-3201 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Raymond Xu >Priority: Major > Fix For: 0.11.0 > >
[jira] [Updated] (HUDI-3200) File Index config affects partition fields shown in printSchema results
[ https://issues.apache.org/jira/browse/HUDI-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3200: - Reviewers: Forward Xu, Raymond Xu, Yann Byron > File Index config affects partition fields shown in printSchema results > --- > > Key: HUDI-3200 > URL: https://issues.apache.org/jira/browse/HUDI-3200 > Project: Apache Hudi > Issue Type: Bug >Reporter: Raymond Xu >Priority: Major > Fix For: 0.11.0 > >
[jira] [Created] (HUDI-3202) Add keygen to support partition discovery
Raymond Xu created HUDI-3202: Summary: Add keygen to support partition discovery Key: HUDI-3202 URL: https://issues.apache.org/jira/browse/HUDI-3202 Project: Apache Hudi Issue Type: Improvement Reporter: Raymond Xu Fix For: 0.11.0
[jira] [Created] (HUDI-3201) Make partition auto discovery configurable
Raymond Xu created HUDI-3201: Summary: Make partition auto discovery configurable Key: HUDI-3201 URL: https://issues.apache.org/jira/browse/HUDI-3201 Project: Apache Hudi Issue Type: Improvement Reporter: Raymond Xu Fix For: 0.11.0
[jira] [Created] (HUDI-3200) File Index config affects partition fields shown in printSchema results
Raymond Xu created HUDI-3200: Summary: File Index config affects partition fields shown in printSchema results Key: HUDI-3200 URL: https://issues.apache.org/jira/browse/HUDI-3200 Project: Apache Hudi Issue Type: Bug Reporter: Raymond Xu Fix For: 0.11.0