[GitHub] [hudi] xuzifu666 commented on a change in pull request #4245: [MINOR] remove unuse construction method
xuzifu666 commented on a change in pull request #4245: URL: https://github.com/apache/hudi/pull/4245#discussion_r772902282 ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java ## @@ -579,10 +579,6 @@ public RecordReader getRecordReader(InputSplit split, JobConf job, Reporter repo protected CombineFileSplit inputSplitShim; private Map pathToPartitionInfo; -public CombineHiveInputSplit() throws IOException { Review comment: no, this would not be called by serialization code, for example. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4364: [HUDI-3060] drop table for spark sql
hudi-bot commented on pull request #4364: URL: https://github.com/apache/hudi/pull/4364#issuecomment-998550027 ## CI report: * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4627) * edb0803691023e502011e270ee83b69bc87c1ffb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4630) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot removed a comment on pull request #4364: [HUDI-3060] drop table for spark sql
hudi-bot removed a comment on pull request #4364: URL: https://github.com/apache/hudi/pull/4364#issuecomment-998548499 ## CI report: * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4627) * edb0803691023e502011e270ee83b69bc87c1ffb UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields
hudi-bot commented on pull request #4308: URL: https://github.com/apache/hudi/pull/4308#issuecomment-998549910 ## CI report: * 7d046f914a059b2623d7f2a7627c44b15ccc0ddb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4628)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields
hudi-bot removed a comment on pull request #4308: URL: https://github.com/apache/hudi/pull/4308#issuecomment-998510982 ## CI report: * 168fb8f7ef94fceb84c0d4b867e74cca9db908b5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4598) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4620) * 7d046f914a059b2623d7f2a7627c44b15ccc0ddb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4628)
[GitHub] [hudi] hudi-bot commented on pull request #4364: [HUDI-3060] drop table for spark sql
hudi-bot commented on pull request #4364: URL: https://github.com/apache/hudi/pull/4364#issuecomment-998548499 ## CI report: * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4627) * edb0803691023e502011e270ee83b69bc87c1ffb UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4364: [HUDI-3060] drop table for spark sql
hudi-bot removed a comment on pull request #4364: URL: https://github.com/apache/hudi/pull/4364#issuecomment-998529708 ## CI report: * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4627)
[GitHub] [hudi] pratyakshsharma commented on pull request #3929: [HUDI-1881] Make multi table delta streamer to use thread pool for table sync asynchronously.
pratyakshsharma commented on pull request #3929: URL: https://github.com/apache/hudi/pull/3929#issuecomment-998543878 @jadireddi Were you able to test it out?
[jira] [Created] (HUDI-3085) Refactor fileId & writeHandler logic into partitioner for bulk_insert
Yuwei Xiao created HUDI-3085: Summary: Refactor fileId & writeHandler logic into partitioner for bulk_insert Key: HUDI-3085 URL: https://issues.apache.org/jira/browse/HUDI-3085 Project: Apache Hudi Issue Type: Improvement Reporter: Yuwei Xiao a better partitioner abstraction for bulk_insert -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (HUDI-2998) Claim RFC number for RFC for Consistent Hashing Index
[ https://issues.apache.org/jira/browse/HUDI-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuwei Xiao resolved HUDI-2998. -- > Claim RFC number for RFC for Consistent Hashing Index > - > > Key: HUDI-2998 > URL: https://issues.apache.org/jira/browse/HUDI-2998 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Yuwei Xiao >Priority: Major > Labels: pull-request-available >
[GitHub] [hudi] xushiyan edited a comment on pull request #4270: [HUDI-2811] Support Spark 3.2
xushiyan edited a comment on pull request #4270: URL: https://github.com/apache/hudi/pull/4270#issuecomment-998529462 @leesf my main concern is that cherry-picking some spark sql fixes won't work after this lands in master. That's why I suggested this be a feature branch that keeps rebasing on master. Any new feature that depends on spark 3.2 support should not be blocked, as those shall be merged into this feature branch. cc @YannByron @nsivabalan Alternatively, we can finalize all spark sql related fixes for 0.10.1 in the next few days and land those in master. Then we can land this, knowing those fixes can be cherry-picked later. Sounds better?
[GitHub] [hudi] hudi-bot commented on pull request #4364: [HUDI-3060] drop table for spark sql
hudi-bot commented on pull request #4364: URL: https://github.com/apache/hudi/pull/4364#issuecomment-998529708 ## CI report: * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4627)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4364: [HUDI-3060] drop table for spark sql
hudi-bot removed a comment on pull request #4364: URL: https://github.com/apache/hudi/pull/4364#issuecomment-998484771 ## CI report: * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4627)
[GitHub] [hudi] xushiyan commented on pull request #4270: [HUDI-2811] Support Spark 3.2
xushiyan commented on pull request #4270: URL: https://github.com/apache/hudi/pull/4270#issuecomment-998529462 @leesf my main concern is that cherry-picking some spark sql fixes won't work after this lands in master. That's why I suggested this be a feature branch that keeps rebasing on master. Any new feature that depends on spark 3.2 support should not be blocked, as those shall be merged into this feature branch. cc @YannByron @nsivabalan
[GitHub] [hudi] leesf commented on pull request #4270: [HUDI-2811] Support Spark 3.2
leesf commented on pull request #4270: URL: https://github.com/apache/hudi/pull/4270#issuecomment-998522239 > @leesf @YannByron shall we keep this open until 0.10.1 is cut? given this won't be included in 0.10.1 and any bug fix PR on spark sql may have major conflicts with this change. I suggest we keep this as a feature branch and keep updating it and merge after 0.10.1. WDYT? @xushiyan Agreed that it should go into 0.11.0, but since 0.10.1 is not going to be released in the coming days, should we wait for the cut? It would block the development of new features, so I think it should be merged into the master branch, and then we can just cherry-pick some bug fixes from master into 0.10.1. CC @nsivabalan
[GitHub] [hudi] hudi-bot commented on pull request #4342: [HUDI-735] Fixing error messages on record key not found
hudi-bot commented on pull request #4342: URL: https://github.com/apache/hudi/pull/4342#issuecomment-998521581 ## CI report: * 4ece718901801f81a97ebd4667e72a81e39b18e9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4626)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4342: [HUDI-735] Fixing error messages on record key not found
hudi-bot removed a comment on pull request #4342: URL: https://github.com/apache/hudi/pull/4342#issuecomment-998484733 ## CI report: * 0638721fb418fc2e8d2fff47657617ea1d203b6d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4597) * 4ece718901801f81a97ebd4667e72a81e39b18e9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4626)
[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve
scxwhite commented on a change in pull request #4400: URL: https://github.com/apache/hudi/pull/4400#discussion_r772869290 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java ## @@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan( .getLatestFileSlices(partitionPath) .filter(slice -> !fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId())) .map(s -> { + // We can think that the latest data is in the latest delta log file, so we sort it from large Review comment: > Have a clarification on the first fix. Could you add some UTs for this? OK, I'll try to add some UTs
[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve
scxwhite commented on a change in pull request #4400: URL: https://github.com/apache/hudi/pull/4400#discussion_r772868883 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java ## @@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan( .getLatestFileSlices(partitionPath) .filter(slice -> !fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId())) .map(s -> { + // We can think that the latest data is in the latest delta log file, so we sort it from large Review comment: You're right, but in most cases the new data is in the latest delta log, so we sort the log files from large to small according to the instant time. The program then avoids updating data in the ExternalSpillableMap, which saves compaction time. What do you think?
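The ordering idea in the comment above can be sketched in plain Java (class names and record shapes here are illustrative, not Hudi's actual API): if delta log blocks are scanned newest-first, the first record seen for each key is already the latest version, so the merge map can use putIfAbsent and never pays the cost of overwriting stale entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of newest-first log merging; not Hudi's real classes.
public class NewestFirstMerge {

    // Each entry: {instantTime, recordKey, value}; a larger instantTime means a newer log file.
    static Map<String, String> merge(List<String[]> logRecords) {
        // Sort records by instant time descending, so the newest version of a key comes first.
        logRecords.sort((a, b) -> b[0].compareTo(a[0]));
        Map<String, String> merged = new HashMap<>();
        for (String[] rec : logRecords) {
            // First occurrence wins; stale versions of the same key are simply skipped.
            merged.putIfAbsent(rec[1], rec[2]);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>(Arrays.asList(
            new String[]{"20211220", "key1", "old"},
            new String[]{"20211221", "key1", "new"}));
        System.out.println(merge(records).get("key1")); // prints "new"
    }
}
```

Whether this actually saves time depends on how often keys are updated across log files, which is the trade-off being discussed in the thread.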
[GitHub] [hudi] hudi-bot removed a comment on pull request #4408: [MINOR] unused method in HoodieColumnProjectionUtils removed
hudi-bot removed a comment on pull request #4408: URL: https://github.com/apache/hudi/pull/4408#issuecomment-998514524 ## CI report: * 60e07eccbc1750c6e7fe4275274d48e7095ff407 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4408: [MINOR] unused method in HoodieColumnProjectionUtils removed
hudi-bot commented on pull request #4408: URL: https://github.com/apache/hudi/pull/4408#issuecomment-998515702 ## CI report: * 60e07eccbc1750c6e7fe4275274d48e7095ff407 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4629)
[GitHub] [hudi] hudi-bot commented on pull request #4408: [MINOR] unused method in HoodieColumnProjectionUtils removed
hudi-bot commented on pull request #4408: URL: https://github.com/apache/hudi/pull/4408#issuecomment-998514524 ## CI report: * 60e07eccbc1750c6e7fe4275274d48e7095ff407 UNKNOWN
[GitHub] [hudi] xuzifu666 opened a new pull request #4408: [MINOR] unused method in HoodieColumnProjectionUtils removed
xuzifu666 opened a new pull request #4408: URL: https://github.com/apache/hudi/pull/4408 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] mtami commented on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part
mtami commented on issue #3429: URL: https://github.com/apache/hudi/issues/3429#issuecomment-998512708 Hi @nsivabalan It's a timestamp string, I cast it to timestamp. `input_df = input_df.withColumn('updated', f.to_timestamp(f.col('updated')))`
[GitHub] [hudi] waywtdcc commented on issue #4305: [SUPPORT] Duplicate Flink write record
waywtdcc commented on issue #4305: URL: https://github.com/apache/hudi/issues/4305#issuecomment-998512387 Streaming. It is OK once 'index.global.enabled' is set to 'true'.
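For readers following this thread, the option mentioned above is a Flink SQL table option; a minimal table definition setting it might look like the sketch below (table name, schema, and path are illustrative, and whether global indexing is appropriate depends on the workload):

```sql
CREATE TABLE hudi_sink (
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(20),
  `partition` VARCHAR(20)
) PARTITIONED BY (`partition`) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_table',      -- illustrative path
  'table.type' = 'MERGE_ON_READ',
  'index.global.enabled' = 'true'         -- the option referenced above
);
```

With global indexing, record-key lookups span partitions, which is what avoids the duplicate records the issue reports when a key's partition changes.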
[GitHub] [hudi] hudi-bot commented on pull request #4336: [HUDI-3032] Do not clean the log files right after compaction for met…
hudi-bot commented on pull request #4336: URL: https://github.com/apache/hudi/pull/4336#issuecomment-998511039 ## CI report: * 8f454b734d8848aee4cb6883999a658a7f007fc2 UNKNOWN * 6d39c38f416f2e0f8249f8bcff2434c07fe929aa UNKNOWN * 6e8aba03f04cc266b5e14f7f991770f1e024 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4367) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4625)
[GitHub] [hudi] hudi-bot commented on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields
hudi-bot commented on pull request #4308: URL: https://github.com/apache/hudi/pull/4308#issuecomment-998510982 ## CI report: * 168fb8f7ef94fceb84c0d4b867e74cca9db908b5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4598) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4620) * 7d046f914a059b2623d7f2a7627c44b15ccc0ddb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4628)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4336: [HUDI-3032] Do not clean the log files right after compaction for met…
hudi-bot removed a comment on pull request #4336: URL: https://github.com/apache/hudi/pull/4336#issuecomment-998483702 ## CI report: * 8f454b734d8848aee4cb6883999a658a7f007fc2 UNKNOWN * 6d39c38f416f2e0f8249f8bcff2434c07fe929aa UNKNOWN * 6e8aba03f04cc266b5e14f7f991770f1e024 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4367) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4625)
[GitHub] [hudi] harsh1231 commented on a change in pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields
harsh1231 commented on a change in pull request #4308: URL: https://github.com/apache/hudi/pull/4308#discussion_r772862716 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala ## @@ -123,6 +121,25 @@ case class HoodieFileIndex( } } + /** + * This method traverses StructType recursively to build map of columnName -> StructField + * Note : If there is nesting of columns like ["a.b.c.d", "a.b.c.e"] -> final map will have keys corresponding + * only to ["a.b.c.d", "a.b.c.e"] and not for subsets like ["a.b.c", "a.b"] + * @param structField + * @return map of ( columns names -> StructField ) + */ + private def generateNameFieldMap(structField : Either[StructField, StructType]) : Map[String, StructField] = { +structField match { Review comment: Done, thanks for pointing out the Scala code style.
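The leaf-only traversal documented in the `generateNameFieldMap` Javadoc above can be illustrated with plain Java maps standing in for Spark's StructType (class and method names here are hypothetical, for illustration only): only dot-joined leaf paths such as "a.b.c.d" land in the result, never intermediate prefixes like "a.b".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch; nested Maps stand in for nested StructTypes, and
// String leaf values stand in for StructFields.
public class LeafPathFlatten {

    @SuppressWarnings("unchecked")
    static Map<String, Object> flatten(String prefix, Map<String, Object> struct) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : struct.entrySet()) {
            String path = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map) {
                // Recurse into the nested struct; only leaves reach the result map,
                // so prefixes like "a.b" are never emitted as keys.
                out.putAll(flatten(path, (Map<String, Object>) e.getValue()));
            } else {
                out.put(path, e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> c = new LinkedHashMap<>();
        c.put("d", "long");
        c.put("e", "string");
        Map<String, Object> b = new LinkedHashMap<>();
        b.put("c", c);
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("a", b);
        System.out.println(flatten("", root).keySet()); // prints [a.c.d, a.c.e]
    }
}
```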
[GitHub] [hudi] hudi-bot removed a comment on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields
hudi-bot removed a comment on pull request #4308: URL: https://github.com/apache/hudi/pull/4308#issuecomment-998509790 ## CI report: * 168fb8f7ef94fceb84c0d4b867e74cca9db908b5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4598) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4620) * 7d046f914a059b2623d7f2a7627c44b15ccc0ddb UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields
hudi-bot commented on pull request #4308: URL: https://github.com/apache/hudi/pull/4308#issuecomment-998509790 ## CI report: * 168fb8f7ef94fceb84c0d4b867e74cca9db908b5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4598) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4620) * 7d046f914a059b2623d7f2a7627c44b15ccc0ddb UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields
hudi-bot removed a comment on pull request #4308: URL: https://github.com/apache/hudi/pull/4308#issuecomment-998465363 ## CI report: * 168fb8f7ef94fceb84c0d4b867e74cca9db908b5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4598) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4620)
[GitHub] [hudi] xuzifu666 commented on pull request #4245: [MINOR] remove unuse construction method
xuzifu666 commented on pull request #4245: URL: https://github.com/apache/hudi/pull/4245#issuecomment-998509370 @yanghua please have a review, thanks!
[GitHub] [hudi] YannByron commented on pull request #4270: [HUDI-2811] Support Spark 3.2
YannByron commented on pull request #4270: URL: https://github.com/apache/hudi/pull/4270#issuecomment-998503222 > WDYT I agree. This is a feature that should go in a major version.
[GitHub] [hudi] RocMarshal commented on pull request #3813: [HUDI-2563][hudi-client] Refactor CompactionTriggerStrategy.
RocMarshal commented on pull request #3813: URL: https://github.com/apache/hudi/pull/3813#issuecomment-998502857 @vinothchandar I made some change based on your comments. PTAL. Thanks.
[jira] [Assigned] (HUDI-3083) Support component data types for flink bulk_insert
[ https://issues.apache.org/jira/browse/HUDI-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dalongliu reassigned HUDI-3083: Assignee: dalongliu

> Support component data types for flink bulk_insert
> Key: HUDI-3083
> URL: https://issues.apache.org/jira/browse/HUDI-3083
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Flink Integration
> Reporter: Danny Chen
> Assignee: dalongliu
> Priority: Major
> Fix For: 0.11.0

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] prashantwason commented on a change in pull request #4336: [HUDI-3032] Do not clean the log files right after compaction for met…
prashantwason commented on a change in pull request #4336: URL: https://github.com/apache/hudi/pull/4336#discussion_r772843537 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java ## @@ -706,7 +706,20 @@ protected void compactIfNecessary(AbstractHoodieWriteClient writeClient, String } } - protected void doClean(AbstractHoodieWriteClient writeClient, String instantTime) { + protected void cleanIfNecessary(AbstractHoodieWriteClient writeClient, String instantTime) { +Option lastCompletedCompactionInstant = metadataMetaClient.reloadActiveTimeline() +.getCommitTimeline().filterCompletedInstants().lastInstant(); +if (lastCompletedCompactionInstant.isPresent() +&& metadataMetaClient.getActiveTimeline().filterCompletedInstants() + .findInstantsAfter(lastCompletedCompactionInstant.get().getTimestamp()).countInstants() < 3) { + // do not clean the log files immediately after compaction to give some buffer time for metadata table reader, Review comment: So this problem should also exist in the MOR table data path? Is there any solution there? 
## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java ## @@ -154,7 +154,7 @@ protected void commit(HoodieData hoodieDataRecords, String partiti metadataMetaClient.reloadActiveTimeline(); Review comment: reloadActiveTimeline is called here, so it is not necessary in cleanIfNecessary.

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java ## @@ -706,7 +706,20 @@ protected void compactIfNecessary(AbstractHoodieWriteClient writeClient, String } } - protected void doClean(AbstractHoodieWriteClient writeClient, String instantTime) { + protected void cleanIfNecessary(AbstractHoodieWriteClient writeClient, String instantTime) { +Option lastCompletedCompactionInstant = metadataMetaClient.reloadActiveTimeline() Review comment: is reloadActiveTimeline() necessary here?

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java ## @@ -706,7 +706,20 @@ protected void compactIfNecessary(AbstractHoodieWriteClient writeClient, String } } - protected void doClean(AbstractHoodieWriteClient writeClient, String instantTime) { + protected void cleanIfNecessary(AbstractHoodieWriteClient writeClient, String instantTime) { +Option lastCompletedCompactionInstant = metadataMetaClient.reloadActiveTimeline() Review comment: Also, can you check if there is already a metadata table function to get the last compaction timestamp? I guess there are other code paths where this is required, so it would be a good idea to create a utility function if one does not exist.
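The buffering rule under discussion in the diff above — do not clean until a few commits have completed after the last compaction, so metadata-table readers have time to move off the compacted log files — can be sketched generically. This is an illustrative model only, not Hudi's actual API; the class name, method names, and the `MIN_COMMITS_AFTER_COMPACTION` constant are invented for the example.

```java
import java.util.List;
import java.util.Optional;

public class CleanBufferRule {
    // Hypothetical buffer size; the diff above uses 3 instants as the threshold.
    static final int MIN_COMMITS_AFTER_COMPACTION = 3;

    /**
     * Decide whether cleaning may run. Instants are modeled as sortable
     * timestamp strings; compactionTs is the last completed compaction, if any.
     */
    public static boolean shouldClean(List<String> completedInstants,
                                      Optional<String> compactionTs) {
        if (!compactionTs.isPresent()) {
            // No compaction has happened yet, so there are no freshly
            // compacted log files to protect; cleaning is safe.
            return true;
        }
        // Count completed instants strictly after the last compaction.
        long commitsAfter = completedInstants.stream()
                .filter(ts -> ts.compareTo(compactionTs.get()) > 0)
                .count();
        // Clean only once the reader buffer has been exceeded.
        return commitsAfter >= MIN_COMMITS_AFTER_COMPACTION;
    }
}
```

Under this sketch, with a compaction at instant "001" and only one later commit, cleaning is deferred; once three or more commits land after the compaction, it proceeds.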
[jira] [Created] (HUDI-3084) Fix the link of flink guide page
Danny Chen created HUDI-3084:

Summary: Fix the link of flink guide page
Key: HUDI-3084
URL: https://issues.apache.org/jira/browse/HUDI-3084
Project: Apache Hudi
Issue Type: Bug
Components: Docs
Reporter: Danny Chen
Fix For: 0.11.0
[jira] [Created] (HUDI-3083) Support component data types for flink bulk_insert
Danny Chen created HUDI-3083:

Summary: Support component data types for flink bulk_insert
Key: HUDI-3083
URL: https://issues.apache.org/jira/browse/HUDI-3083
Project: Apache Hudi
Issue Type: Improvement
Components: Flink Integration
Reporter: Danny Chen
Fix For: 0.11.0
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462993#comment-17462993 ] Harsha Teja Kanna commented on HUDI-3066: - {*}Note{*}: I ran the recent query from 'master' as I needed a fix of running clustering in parallel from master. > Very slow file listing after enabling metadata for existing tables in 0.10.0 > release > > > Key: HUDI-3066 > URL: https://issues.apache.org/jira/browse/HUDI-3066 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.10.0 > Environment: EMR 6.4.0 > Hudi version : 0.10.0 >Reporter: Harsha Teja Kanna >Assignee: Manoj Govindassamy >Priority: Blocker > Labels: performance, pull-request-available > Fix For: 0.11.0 > > Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot > 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, > metadata_timeline.txt, metadata_timeline_archived.txt, stderr_part1.txt, > stderr_part2.txt, timeline.txt > > > After 'metadata table' is enabled, File listing takes long time. > If metadata is enabled on Reader side(as shown below), it is taking even more > time per file listing task > {code:java} > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.common.config.HoodieMetadataConfig > val hadoopConf = spark.conf > hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true") > val basePath = "s3a://datalake-hudi" > val sessions = spark > .read > .format("org.apache.hudi") > .option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL) > .option(DataSourceReadOptions.READ_PATHS.key(), > s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*") > .load() > sessions.createOrReplaceTempView("sessions") {code} > Existing tables (COW) have inline clustering on and have many replace commits. 
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView > resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata > Also many log messages in AbstractHoodieLogRecordReader > > 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms > to read 136 instants, 9731 replaced file groups > 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515 > at instant 20211217035105329 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', > fileLen=0} > 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377 > at instant 20211217022049877 > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362', > fileLen=0} > 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > 
HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590', > fileLen=0} > 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613 > at instant 20211216183448389 > 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18
[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462993#comment-17462993 ] Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 5:29 AM: {*}Note{*}: I ran the recent query using 'master' as I needed a fix of running clustering in parallel from master. was (Author: h7kanna): {*}Note{*}: I ran the recent query from 'master' as I needed a fix of running clustering in parallel from master.

> Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[GitHub] [hudi] hudi-bot commented on pull request #4342: [HUDI-735] Fixing error messages on record key not found
hudi-bot commented on pull request #4342: URL: https://github.com/apache/hudi/pull/4342#issuecomment-998484733

## CI report:

* 0638721fb418fc2e8d2fff47657617ea1d203b6d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4597)
* 4ece718901801f81a97ebd4667e72a81e39b18e9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4626)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4342: [HUDI-735] Fixing error messages on record key not found
hudi-bot removed a comment on pull request #4342: URL: https://github.com/apache/hudi/pull/4342#issuecomment-998483731

## CI report:

* 0638721fb418fc2e8d2fff47657617ea1d203b6d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4597)
* 4ece718901801f81a97ebd4667e72a81e39b18e9 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4364: [HUDI-3060] drop table for spark sql
hudi-bot commented on pull request #4364: URL: https://github.com/apache/hudi/pull/4364#issuecomment-998484771

## CI report:

* b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4627)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4364: [HUDI-3060] drop table for spark sql
hudi-bot removed a comment on pull request #4364: URL: https://github.com/apache/hudi/pull/4364#issuecomment-998466293

## CI report:

* b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619)
[GitHub] [hudi] XuQianJin-Stars commented on pull request #4364: [HUDI-3060] drop table for spark sql
XuQianJin-Stars commented on pull request #4364: URL: https://github.com/apache/hudi/pull/4364#issuecomment-998484586 @hudi-bot run azure
[GitHub] [hudi] hudi-bot commented on pull request #4342: [HUDI-735] Fixing error messages on record key not found
hudi-bot commented on pull request #4342: URL: https://github.com/apache/hudi/pull/4342#issuecomment-998483731

## CI report:

* 0638721fb418fc2e8d2fff47657617ea1d203b6d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4597)
* 4ece718901801f81a97ebd4667e72a81e39b18e9 UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4342: [HUDI-735] Fixing error messages on record key not found
hudi-bot removed a comment on pull request #4342: URL: https://github.com/apache/hudi/pull/4342#issuecomment-997972323

## CI report:

* 0638721fb418fc2e8d2fff47657617ea1d203b6d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4597)
[GitHub] [hudi] hudi-bot commented on pull request #4336: [HUDI-3032] Do not clean the log files right after compaction for met…
hudi-bot commented on pull request #4336: URL: https://github.com/apache/hudi/pull/4336#issuecomment-998483702

## CI report:

* 8f454b734d8848aee4cb6883999a658a7f007fc2 UNKNOWN
* 6d39c38f416f2e0f8249f8bcff2434c07fe929aa UNKNOWN
* 6e8aba03f04cc266b5e14f7f991770f1e024 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4367) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4625)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4336: [HUDI-3032] Do not clean the log files right after compaction for met…
hudi-bot removed a comment on pull request #4336: URL: https://github.com/apache/hudi/pull/4336#issuecomment-995532732

## CI report:

* 8f454b734d8848aee4cb6883999a658a7f007fc2 UNKNOWN
* 6d39c38f416f2e0f8249f8bcff2434c07fe929aa UNKNOWN
* 6e8aba03f04cc266b5e14f7f991770f1e024 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4367)
[GitHub] [hudi] danny0405 commented on pull request #4336: [HUDI-3032] Do not clean the log files right after compaction for met…
danny0405 commented on pull request #4336: URL: https://github.com/apache/hudi/pull/4336#issuecomment-998483112 @hudi-bot run azure
[GitHub] [hudi] harsh1231 commented on a change in pull request #4342: [HUDI-735] Fixing error messages on record key not found
harsh1231 commented on a change in pull request #4342: URL: https://github.com/apache/hudi/pull/4342#discussion_r772836433 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala ## @@ -229,7 +229,11 @@ object HoodieSparkSqlWriter { } sparkContext.getConf.registerAvroSchemas(schema) log.info(s"Registered avro schema : ${schema.toString(true)}") - +val columnSet = df.columns.toSet +keyGenerator.getRecordKeyFieldNames.foreach(fieldName => if(!columnSet.contains(fieldName)) { + throw new Exception(s"record key '$fieldName' does not exist in existing table schema " + Review comment: Done
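The validation in the diff above — fail fast with a descriptive message when a configured record key field is missing from the incoming schema — can be rendered as a self-contained sketch. This is illustrative only: the actual fix lives in `HoodieSparkSqlWriter.scala`, and the class and method names here are invented for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RecordKeyCheck {
    /**
     * Throw a descriptive error if any configured record key field is
     * absent from the incoming columns, mirroring the check in the diff.
     */
    public static void validate(List<String> recordKeyFields, List<String> columns) {
        // Set lookup avoids a linear scan per key field.
        Set<String> columnSet = new HashSet<>(columns);
        for (String field : recordKeyFields) {
            if (!columnSet.contains(field)) {
                throw new IllegalArgumentException(
                        "record key '" + field
                        + "' does not exist in the incoming schema, columns: " + columns);
            }
        }
    }
}
```

The point of the review fix is the error message: instead of a generic key-generation failure deep in the write path, the user sees which field is missing and which columns were actually supplied.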
[GitHub] [hudi] pratyakshsharma commented on pull request #2768: [HUDI-485]: corrected the check for incremental sql
pratyakshsharma commented on pull request #2768: URL: https://github.com/apache/hudi/pull/2768#issuecomment-998481547 Ack. Let me close this in a day or two.
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: (was: Screen Shot 2021-12-20 at 10.17.44 PM-1.png)

> Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: (was: stderr_part2-1.txt)

> Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: (was: timeline-1.txt)
[GitHub] [hudi] hudi-bot commented on pull request #3813: [HUDI-2563][hudi-client] Refactor CompactionTriggerStrategy.
hudi-bot commented on pull request #3813: URL: https://github.com/apache/hudi/pull/3813#issuecomment-998478837 ## CI report: * f05b7dc67c6fb1d6b9bf75fac0c37b42925bfa23 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4621) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #3813: [HUDI-2563][hudi-client] Refactor CompactionTriggerStrategy.
hudi-bot removed a comment on pull request #3813: URL: https://github.com/apache/hudi/pull/3813#issuecomment-998444978 ## CI report: * fb97a5759a60ffa76ae776ed1c53f9c33f8eb81b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3628) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3622) * f05b7dc67c6fb1d6b9bf75fac0c37b42925bfa23 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4621)
[jira] [Closed] (HUDI-2970) Archival fails with Delete_partition commits
[ https://issues.apache.org/jira/browse/HUDI-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-2970. > Archival fails with Delete_partition commits > > > Key: HUDI-2970 > URL: https://issues.apache.org/jira/browse/HUDI-2970 > Project: Apache Hudi > Issue Type: Bug > Components: Writer Core >Reporter: sivabalan narayanan >Assignee: Raymond Xu >Priority: Blocker > Labels: pull-request-available, sev:critical > Fix For: 0.11.0, 0.10.1 > > > We need to fix the archival in data table which has delete partition > operations. archival does not sit well with replace commit files created for > "delete partition" operation. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[hudi] branch master updated: [HUDI-2970] Add test for archiving replace commit (#4345)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 32a44bb [HUDI-2970] Add test for archiving replace commit (#4345) 32a44bb is described below commit 32a44bbe062c997b5a41266290fbe34d6323bfa6 Author: Raymond Xu <2701446+xushi...@users.noreply.github.com> AuthorDate: Mon Dec 20 21:01:59 2021 -0800 [HUDI-2970] Add test for archiving replace commit (#4345) --- ...dieSparkCopyOnWriteTableArchiveWithReplace.java | 103 + .../TestHoodieSparkMergeOnReadTableClustering.java | 12 +-- ...HoodieSparkMergeOnReadTableIncrementalRead.java | 6 +- ...dieSparkMergeOnReadTableInsertUpdateDelete.java | 4 +- .../SparkClientFunctionalTestHarness.java | 4 +- .../common/testutils/HoodieTestDataGenerator.java | 3 +- 6 files changed, 118 insertions(+), 14 deletions(-) diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkCopyOnWriteTableArchiveWithReplace.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkCopyOnWriteTableArchiveWithReplace.java new file mode 100644 index 000..1c66023 --- /dev/null +++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkCopyOnWriteTableArchiveWithReplace.java @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.table.functional; + +import org.apache.hudi.client.SparkRDDWriteClient; +import org.apache.hudi.common.config.HoodieMetadataConfig; +import org.apache.hudi.common.model.HoodieTableType; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieTimeline; +import org.apache.hudi.common.testutils.HoodieTestDataGenerator; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.config.HoodieCompactionConfig; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.testutils.SparkClientFunctionalTestHarness; + +import org.junit.jupiter.api.Tag; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.ValueSource; + +import java.io.IOException; +import java.util.Arrays; + +import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH; +import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS; +import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.DEFAULT_SECOND_PARTITION_PATH; +import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.DEFAULT_THIRD_PARTITION_PATH; +import static org.apache.hudi.testutils.HoodieClientTestUtils.countRecordsOptionallySince; +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +@Tag("functional") +public class 
TestHoodieSparkCopyOnWriteTableArchiveWithReplace extends SparkClientFunctionalTestHarness { + + @ParameterizedTest + @ValueSource(booleans = {false, true}) + public void testDeletePartitionAndArchive(boolean metadataEnabled) throws IOException { +HoodieTableMetaClient metaClient = getHoodieMetaClient(HoodieTableType.COPY_ON_WRITE); +HoodieWriteConfig writeConfig = getConfigBuilder(true) + .withCompactionConfig(HoodieCompactionConfig.newBuilder().archiveCommitsWith(2, 3).retainCommits(1).build()) + .withMetadataConfig(HoodieMetadataConfig.newBuilder().enable(metadataEnabled).build()) +.build(); +try (SparkRDDWriteClient client = getHoodieWriteClient(writeConfig); + HoodieTestDataGenerator dataGen = new HoodieTestDataGenerator(DEFAULT_PARTITION_PATHS)) { + + // 1st write batch; 3 commits for 3 partitions + String instantTime1 = HoodieActiveTimeline.createNewInstantTime(1000); + client.startCommitWithTime(instantTime1); + client.insert(jsc().parallelize(dataGen.generateInsertsForPartition(instantTime1, 10, DEFAULT_FIRST_PARTITION_PA
[GitHub] [hudi] nsivabalan merged pull request #4345: [HUDI-2970] Add test for archiving replace commit
nsivabalan merged pull request #4345: URL: https://github.com/apache/hudi/pull/4345
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Govindassamy updated HUDI-3066: - Status: In Progress (was: Open)
[jira] [Commented] (HUDI-2834) Validate against supported hive versions
[ https://issues.apache.org/jira/browse/HUDI-2834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462988#comment-17462988 ] Raymond Xu commented on HUDI-2834: -- [~codope] [~shivnarayan] This was deprioritized from 0.10.0 release. How do you think we should handle hive versions in 0.11.0? > Validate against supported hive versions > > > Key: HUDI-2834 > URL: https://issues.apache.org/jira/browse/HUDI-2834 > Project: Apache Hudi > Issue Type: Improvement > Components: Hive Integration >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Critical > Labels: pull-request-available >
[GitHub] [hudi] xushiyan opened a new pull request #3744: [HUDI-2108] Fix flakiness in TestHoodieBackedMetadata
xushiyan opened a new pull request #3744: URL: https://github.com/apache/hudi/pull/3744 ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] xushiyan closed pull request #3744: [HUDI-2108] Fix flakiness in TestHoodieBackedMetadata
xushiyan closed pull request #3744: URL: https://github.com/apache/hudi/pull/3744
[GitHub] [hudi] xushiyan closed pull request #4138: [HUDI-2781] Set spark3 in azure pipelines
xushiyan closed pull request #4138: URL: https://github.com/apache/hudi/pull/4138
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-3066: - Priority: Blocker (was: Major)
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-3066: - Fix Version/s: 0.11.0
[GitHub] [hudi] dongkelun commented on pull request #4016: [HUDI-2675] Fix the exception 'Not an Avro data file' when archive and clean
dongkelun commented on pull request #4016: URL: https://github.com/apache/hudi/pull/4016#issuecomment-998466970 > sure thanks @dongkelun . Looks like there is a write conflict. Can you rebase with latest master. OK, I'll submit it with the test case later
[GitHub] [hudi] hudi-bot commented on pull request #4364: [HUDI-3060] drop table for spark sql
hudi-bot commented on pull request #4364: URL: https://github.com/apache/hudi/pull/4364#issuecomment-998466293 ## CI report: * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4364: [HUDI-3060] drop table for spark sql
hudi-bot removed a comment on pull request #4364: URL: https://github.com/apache/hudi/pull/4364#issuecomment-998443381 ## CI report: * b7eb121dc6ec52b4ca0e55e6db862a4c7e948004 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4616) * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619)
[jira] [Updated] (HUDI-1185) KeyGenerator class/interfaces need to be decoupled from Spark
[ https://issues.apache.org/jira/browse/HUDI-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-1185: -- Status: Open (was: In Progress) > KeyGenerator class/interfaces need to be decoupled from Spark > - > > Key: HUDI-1185 > URL: https://issues.apache.org/jira/browse/HUDI-1185 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Affects Versions: 0.9.0 >Reporter: Vinoth Chandar >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.11.0 > > > https://github.com/apache/hudi/pull/1834#discussion_r466386893 has the context -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462986#comment-17462986 ] Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 4:37 AM: Complete log files for Slow run (Metadata reader on) [^stderr_part1.txt] [^stderr_part2.txt] was (Author: h7kanna): Complete log files [^stderr_part1.txt] [^stderr_part2.txt] > Very slow file listing after enabling metadata for existing tables in 0.10.0 > release > > > Key: HUDI-3066 > URL: https://issues.apache.org/jira/browse/HUDI-3066 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.10.0 > Environment: EMR 6.4.0 > Hudi version : 0.10.0 >Reporter: Harsha Teja Kanna >Assignee: Manoj Govindassamy >Priority: Major > Labels: performance, pull-request-available > Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot > 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM-1.png, > Screen Shot 2021-12-20 at 10.17.44 PM.png, metadata_timeline.txt, > metadata_timeline_archived.txt, stderr_part1.txt, stderr_part2-1.txt, > stderr_part2.txt, timeline-1.txt, timeline.txt > > > After 'metadata table' is enabled, File listing takes long time. > If metadata is enabled on Reader side(as shown below), it is taking even more > time per file listing task > {code:java} > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.common.config.HoodieMetadataConfig > val hadoopConf = spark.conf > hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true") > val basePath = "s3a://datalake-hudi" > val sessions = spark > .read > .format("org.apache.hudi") > .option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL) > .option(DataSourceReadOptions.READ_PATHS.key(), > s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*") > .load() > sessions.createOrReplaceTempView("sessions") {code} > Existing tables (COW) have inline clustering on and have many replace commits. 
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView > resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata > Also many log messages in AbstractHoodieLogRecordReader > > 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms > to read 136 instants, 9731 replaced file groups > 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515 > at instant 20211217035105329 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', > fileLen=0} > 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377 > at instant 20211217022049877 > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362', > fileLen=0} > 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > 
HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590', > fileLen=0} > 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.lo
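The log excerpt above shows the metadata reader iterating many log files that each contribute a single block ("Number of remaining logblocks to merge 1"), switching the S3 seek policy for each one. The shape of that slowdown can be sketched with an illustrative cost model (the constants below are made up, and this is not Hudi code): per-file open/seek overhead dominates when the same number of blocks is spread across many tiny files.

```java
// Illustrative cost model (made-up constants, not Hudi code): scanning the
// metadata table's log files costs a per-file open/seek plus a per-block read.
// With one block per file -- as the "remaining logblocks to merge 1" lines
// suggest -- the per-file overhead dominates.
public class LogScanCostModel {
    public static long scanMillis(int files, int blocksPerFile, long openMs, long blockMs) {
        return (long) files * openMs + (long) files * blocksPerFile * blockMs;
    }

    public static void main(String[] args) {
        // ~120 log blocks spread over 121 tiny files vs. packed into 4 files.
        long manySmallFiles = scanMillis(121, 1, 50, 2);  // 121*50 + 121*2 = 6292 ms
        long fewLargeFiles = scanMillis(4, 30, 50, 2);    // 4*50 + 120*2 = 440 ms
        System.out.println(manySmallFiles + " ms vs " + fewLargeFiles + " ms");
    }
}
```

Under these assumed constants the many-small-files layout is over an order of magnitude slower, which is consistent with compaction of the metadata table's log files helping listing latency.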
[jira] [Updated] (HUDI-2235) [UMBRELLA] Add virtual key support to Hudi
[ https://issues.apache.org/jira/browse/HUDI-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-2235: -- Status: Open (was: In Progress) > [UMBRELLA] Add virtual key support to Hudi > -- > > Key: HUDI-2235 > URL: https://issues.apache.org/jira/browse/HUDI-2235 > Project: Apache Hudi > Issue Type: New Feature > Components: Writer Core >Reporter: sivabalan narayanan >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: hudi-umbrellas > Fix For: 0.11.0 > > > Add virtual key support to Hudi > > meta fields should not be persisted and existing columns should be leveraged.
[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462986#comment-17462986 ] Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 4:36 AM: Complete log files [^stderr_part1.txt] [^stderr_part2.txt] was (Author: h7kanna): Complete log files [^stderr_part1.txt] [^stderr_part1.txt][^stderr_part2.txt]
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: stderr_part2-1.txt
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066: Attachment: stderr_part1.txt stderr_part2.txt
[jira] [Updated] (HUDI-3035) Unify Parquet writers
[ https://issues.apache.org/jira/browse/HUDI-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3035: -- Priority: Critical (was: Blocker) > Unify Parquet writers > - > > Key: HUDI-3035 > URL: https://issues.apache.org/jira/browse/HUDI-3035 > Project: Apache Hudi > Issue Type: Bug >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Critical > Fix For: 0.11.0 > > > Currently we have at least 3 implementations of the ParquetWriters (which is > 3x more than we actually need): > [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieParquetWriter.java] > [https://github.com/apache/hudi/blob/master/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowDataParquetWriter.java] > [https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieInternalRowParquetWriter.java] > > Implementations (while identical in principle) have diverged, essentially > living their own lifecycle.
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462986#comment-17462986 ] Harsha Teja Kanna commented on HUDI-3066: - Complete log files [^stderr_part1.txt] [^stderr_part2.txt]
[GitHub] [hudi] hudi-bot removed a comment on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields
hudi-bot removed a comment on pull request #4308: URL: https://github.com/apache/hudi/pull/4308#issuecomment-998443321 ## CI report: * 168fb8f7ef94fceb84c0d4b867e74cca9db908b5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4598) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4620)
[GitHub] [hudi] hudi-bot commented on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields
hudi-bot commented on pull request #4308: URL: https://github.com/apache/hudi/pull/4308#issuecomment-998465363 ## CI report: * 168fb8f7ef94fceb84c0d4b867e74cca9db908b5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4598) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4620)
[jira] [Updated] (HUDI-2989) Hive sync to Glue tables not updating S3 location
[ https://issues.apache.org/jira/browse/HUDI-2989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-2989: - Status: In Progress (was: Open) > Hive sync to Glue tables not updating S3 location > - > > Key: HUDI-2989 > URL: https://issues.apache.org/jira/browse/HUDI-2989 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.11.0, 0.10.1 >
[jira] [Updated] (HUDI-3082) [Phase 1] Unify MOR table access across Spark, Hive
[ https://issues.apache.org/jira/browse/HUDI-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3082: -- Status: In Progress (was: Open) > [Phase 1] Unify MOR table access across Spark, Hive > --- > > Key: HUDI-3082 > URL: https://issues.apache.org/jira/browse/HUDI-3082 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.11.0 > > > This is Phase 1 of what is outlined in HUDI-3081 > > The goal is > * Unify Hive’s RecordReaders (`RealtimeCompactedRecordReader`, > {{{}RealtimeUnmergedRecordReader{}}}) > ** _These Readers should only differ in the way they handle the payload, > everything else should remain constant_ > * Abstract w/in common component (name TBD) > ** Listing current file-slices at the requested instant (handling the > timeline) > ** Creating Record Iterator for the provided file-slice
[jira] [Assigned] (HUDI-3082) [Phase 1] Unify MOR table access across Spark, Hive
[ https://issues.apache.org/jira/browse/HUDI-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin reassigned HUDI-3082: - Assignee: Alexey Kudinkin
[jira] [Updated] (HUDI-3082) [Phase 1] Unify MOR table access across Spark, Hive
[ https://issues.apache.org/jira/browse/HUDI-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3082:

Priority: Blocker (was: Major)
[jira] [Created] (HUDI-3082) [Phase 1] Unify MOR table access across Spark, Hive
Alexey Kudinkin created HUDI-3082:

Summary: [Phase 1] Unify MOR table access across Spark, Hive
Key: HUDI-3082
URL: https://issues.apache.org/jira/browse/HUDI-3082
Project: Apache Hudi
Issue Type: Bug
Reporter: Alexey Kudinkin

This is Phase 1 of what is outlined in HUDI-3081.

The goal is to:
* Unify Hive’s RecordReaders (`RealtimeCompactedRecordReader`, `RealtimeUnmergedRecordReader`)
** _These Readers should only differ in the way they handle the payload; everything else should remain constant_
* Abstract within a common component (name TBD):
** Listing current file-slices at the requested instant (handling the timeline)
** Creating a Record Iterator for the provided file-slice
[jira] [Updated] (HUDI-3082) [Phase 1] Unify MOR table access across Spark, Hive
[ https://issues.apache.org/jira/browse/HUDI-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3082:

Issue Type: Improvement (was: Bug)
[jira] [Updated] (HUDI-3082) [Phase 1] Unify MOR table access across Spark, Hive
[ https://issues.apache.org/jira/browse/HUDI-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3082:

Fix Version/s: 0.11.0
[GitHub] [hudi] xushiyan closed pull request #4113: [HUDI-2735] Fix clean rollback archiving logic
xushiyan closed pull request #4113: URL: https://github.com/apache/hudi/pull/4113 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-3081) [UMBRELLA] Revisiting Read Path Infra across Query Engines
[ https://issues.apache.org/jira/browse/HUDI-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3081:

Status: In Progress (was: Open)
[jira] [Assigned] (HUDI-3081) [UMBRELLA] Revisiting Read Path Infra across Query Engines
[ https://issues.apache.org/jira/browse/HUDI-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin reassigned HUDI-3081:

Assignee: Alexey Kudinkin
[jira] [Updated] (HUDI-3070) Improve Test
[ https://issues.apache.org/jira/browse/HUDI-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3070:

Component/s: Testing

> Improve Test
>
> Key: HUDI-3070
> URL: https://issues.apache.org/jira/browse/HUDI-3070
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Testing
> Reporter: Yue Zhang
> Assignee: Yue Zhang
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0, 0.10.1
>
> Improve robustness and stability.
[jira] [Updated] (HUDI-3081) [UMBRELLA] Revisiting Read Path Infra across Query Engines
[ https://issues.apache.org/jira/browse/HUDI-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3081:

Priority: Blocker (was: Major)
[jira] [Created] (HUDI-3081) [UMBRELLA] Revisiting Read Path Infra across Query Engines
Alexey Kudinkin created HUDI-3081:

Summary: [UMBRELLA] Revisiting Read Path Infra across Query Engines
Key: HUDI-3081
URL: https://issues.apache.org/jira/browse/HUDI-3081
Project: Apache Hudi
Issue Type: Bug
Reporter: Alexey Kudinkin

Currently, our read-path infrastructure is mostly disparate for each individual query engine, with the same flow replicated multiple times:
* Hive leverages a hierarchy based off the `InputFormat` class
* Spark leverages a hierarchy based off `SnapshotRelation`

This leads to substantial duplication, with virtually the same flows replicated multiple times and unfortunately now diverging due to out-of-sync lifecycles (bug-fixes, etc).

h3. Proposal

*Phase 1: Abstracting Common Functionality*

_T-shirt_: 1-1.5 weeks
_Goal_: Abstract the following common items to avoid duplicating these complex sequences across engines:
* Unify Hive’s RecordReaders (`RealtimeCompactedRecordReader`, `RealtimeUnmergedRecordReader`)
** _These Readers should only differ in the way they handle the payload; everything else should remain constant_
* Abstract within a common component (name TBD):
** Listing current file-slices at the requested instant (handling the timeline)
** Creating a Record Iterator for the provided file-slice

*Phase 2: Revisiting Record Handling*

_T-shirt_: 1-1.5 weeks
_Goal_: Avoid tight coupling with a particular record representation on the read path (currently Avro) and enable:
* A common record handling API for combining records (Merge API)
* Avoiding unnecessary serde by abstracting away standardized record access routines (getting the key, merging, etc)
** Behind the interface we'd rely on an engine-specific representation to carry the payload (`InternalRow` for Spark, `ArrayWritable` for Hive, etc)
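The Phase 1 idea — a single reader scaffold whose variants differ only in how they handle the payload — can be sketched roughly as follows. This is an illustrative sketch only, not Hudi's actual API; the names (`RecordMerger`, `CompactedMerger`, `UnmergedMerger`, `UnifiedReaderSketch`) are hypothetical:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the merge policy is the only varying piece, mirroring
// how RealtimeCompactedRecordReader vs RealtimeUnmergedRecordReader should
// differ only in payload handling.
interface RecordMerger {
    // Fold an incoming (key, payload) pair into the accumulated view.
    void accept(Map<String, List<String>> view, String key, String payload);
}

// "Compacted" behaviour: the latest payload per key wins.
class CompactedMerger implements RecordMerger {
    public void accept(Map<String, List<String>> view, String key, String payload) {
        view.put(key, List.of(payload));
    }
}

// "Unmerged" behaviour: every payload version is kept.
class UnmergedMerger implements RecordMerger {
    public void accept(Map<String, List<String>> view, String key, String payload) {
        view.computeIfAbsent(key, k -> new ArrayList<>()).add(payload);
    }
}

public class UnifiedReaderSketch {
    // The shared scaffolding (timeline handling, file-slice listing, record
    // iteration) would live here once; only the merger is swapped per reader.
    static Map<String, List<String>> read(List<String[]> logRecords, RecordMerger merger) {
        Map<String, List<String>> view = new LinkedHashMap<>();
        for (String[] rec : logRecords) {
            merger.accept(view, rec[0], rec[1]);
        }
        return view;
    }

    public static void main(String[] args) {
        List<String[]> records = List.of(
            new String[]{"key1", "v1"},
            new String[]{"key1", "v2"},
            new String[]{"key2", "v1"});

        System.out.println(read(records, new CompactedMerger()).get("key1")); // latest version only
        System.out.println(read(records, new UnmergedMerger()).get("key1"));  // all versions
    }
}
```

The point of the sketch is the inversion: today the scaffolding is duplicated per reader class, whereas here the scaffolding exists once and only the payload-handling strategy varies.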
[jira] [Updated] (HUDI-3070) Improve Test
[ https://issues.apache.org/jira/browse/HUDI-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3070:

Fix Version/s: 0.11.0, 0.10.1
[jira] [Closed] (HUDI-3070) Improve Test
[ https://issues.apache.org/jira/browse/HUDI-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-3070.

Assignee: Yue Zhang
Resolution: Done
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462981#comment-17462981 ] Harsha Teja Kanna commented on HUDI-3066:

Metadata on reader side disabled: !Screen Shot 2021-12-20 at 10.17.44 PM.png!

> Very slow file listing after enabling metadata for existing tables in 0.10.0 release
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: EMR 6.4.0, Hudi version: 0.10.0
> Reporter: Harsha Teja Kanna
> Assignee: Manoj Govindassamy
> Priority: Major
> Labels: performance, pull-request-available
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM-1.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, metadata_timeline.txt, metadata_timeline_archived.txt, timeline-1.txt, timeline.txt
>
> After the 'metadata table' is enabled, file listing takes a long time.
> If metadata is also enabled on the reader side (as shown below), it takes even more time per file-listing task:
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
>
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
>
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
>   .read
>   .format("org.apache.hudi")
>   .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
>   .option(DataSourceReadOptions.READ_PATHS.key(), s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
>   .load()
>
> sessions.createOrReplaceTempView("sessions")
> {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView > resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata > Also many log messages in AbstractHoodieLogRecordReader > > 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms > to read 136 instants, 9731 replaced file groups > 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515 > at instant 20211217035105329 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', > fileLen=0} > 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377 > at instant 20211217022049877 > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362', > fileLen=0} > 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > 
HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590', > fileLen=0} > 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613 > at instant 20211216183448389 > 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,118 INFO log.HoodieLogFormatReader
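One rough way to narrow down where the listing time goes is to diff the timestamps of consecutive INFO lines like the ones above; large gaps point at the slow phase (e.g. resetFileGroupsReplaced vs metadata log-block scanning). A minimal helper, assuming only the `yyyy-MM-dd HH:mm:ss,SSS` log prefix seen in this excerpt (the class and method names here are hypothetical):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;

// Rough diagnostic helper: given log lines with the "yyyy-MM-dd HH:mm:ss,SSS"
// prefix seen above, report the gap (in ms) between consecutive lines so the
// slow phase stands out.
public class LogGapFinder {
    static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    static long gapMillis(String earlier, String later) {
        // The timestamp prefix is exactly the first 23 characters of each line.
        LocalDateTime a = LocalDateTime.parse(earlier.substring(0, 23), FMT);
        LocalDateTime b = LocalDateTime.parse(later.substring(0, 23), FMT);
        return Duration.between(a, b).toMillis();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms to read 136 instants, 9731 replaced file groups",
            "2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1");
        for (int i = 1; i < lines.size(); i++) {
            System.out.printf("gap=%d ms before: %s%n",
                gapMillis(lines.get(i - 1), lines.get(i)),
                lines.get(i).substring(24));
        }
    }
}
```

Run against the full driver log rather than this two-line sample, it would show whether the wall-clock time sits in the file-system-view reset or in the per-log-file scanning loop.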
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066:

Attachment: Screen Shot 2021-12-20 at 10.17.44 PM-1.png
[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsha Teja Kanna updated HUDI-3066:

Attachment: Screen Shot 2021-12-20 at 10.17.44 PM.png
[jira] [Updated] (HUDI-735) Improve deltastreamer error message when case mismatch of commandline arguments.
[ https://issues.apache.org/jira/browse/HUDI-735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harshal Patil updated HUDI-735:

Status: Patch Available (was: In Progress)

> Improve deltastreamer error message when case mismatch of commandline arguments
>
> Key: HUDI-735
> URL: https://issues.apache.org/jira/browse/HUDI-735
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Code Cleanup, DeltaStreamer, Usability
> Reporter: Vinoth Chandar
> Assignee: Harshal Patil
> Priority: Major
> Labels: core-flow-ds, pull-request-available, sev:normal, user-support-issues
>
> Team,
> When following the blog "Change Capture Using AWS Database Migration Service and Hudi" with my own data set, the initial load works perfectly. When issuing the command with the DMS CDC files on S3, I get the following error:
> {code}
> 20/03/24 17:56:28 ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down
> org.apache.hudi.exception.HoodieException: Please provide a valid schema provider class!
>     at org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
>     at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
>     at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
> {code}
> I tried using --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider.Source and provided the schema. The error no longer occurs, but there are no writes to Hudi.
> I am not performing any transformations (other than the DMS transform) and am using the default record key strategy.
> If the team has any pointers, please let me know. Thank you!
> ---
> Thank you Vinoth. I was able to find the issue. All my column names were in upper case. I switched column names and table names to lower case and it works perfectly.
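The root cause above — upper-case column names against a lower-case schema — is exactly the kind of mismatch an improved error message would need to detect. As an illustration only (a hypothetical helper, not Hudi's actual code), such a check could report columns that match the schema only when case is ignored:

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Illustrative sketch of a case-insensitive column check, the kind an
// improved DeltaStreamer error message could perform. Hypothetical code,
// not part of Hudi.
public class CaseMismatchCheck {
    // Return data columns that are absent from the schema as-is but present
    // once case is ignored, i.e. columns that differ only by case.
    static List<String> caseOnlyMismatches(List<String> schemaCols, List<String> dataCols) {
        List<String> schemaLower = schemaCols.stream()
            .map(c -> c.toLowerCase(Locale.ROOT))
            .collect(Collectors.toList());
        return dataCols.stream()
            .filter(c -> !schemaCols.contains(c)
                      && schemaLower.contains(c.toLowerCase(Locale.ROOT)))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The reporter's situation: data columns in upper case, schema in lower case.
        List<String> mismatches =
            caseOnlyMismatches(List.of("id", "entry_date"), List.of("ID", "ENTRY_DATE"));
        if (!mismatches.isEmpty()) {
            System.out.println("Columns differ from the schema only by case "
                + "(consider lower-casing them): " + mismatches);
        }
    }
}
```

Surfacing a message like this instead of the generic "Please provide a valid schema provider class!" would have saved the reporter the debugging round-trip described above.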