[GitHub] [hudi] hudi-bot commented on pull request #7241: [HUDI-5241] Optimize HoodieDefaultTimeline API
hudi-bot commented on PR #7241: URL: https://github.com/apache/hudi/pull/7241#issuecomment-1319652082 ## CI report: * 3045f14ac99e049be4b40d14906b8aef0f3ed34d UNKNOWN * e9344436bb6ece2731b3a97ce13d3764686609ed UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #7220: [HUDI-5230] Lazy init secondaryView in PriorityBasedFileSystemView
hudi-bot commented on PR #7220: URL: https://github.com/apache/hudi/pull/7220#issuecomment-1319651958 ## CI report: * b5bb91f69dffcbb35b2ae69925e8aa2354832925 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13053) * 163886f2adf52086b859d0a6fb7c4cfe34d8aec2 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #7241: [HUDI-5241] Optimize HoodieDefaultTimeline API
hudi-bot commented on PR #7241: URL: https://github.com/apache/hudi/pull/7241#issuecomment-1319648240 ## CI report: * 3045f14ac99e049be4b40d14906b8aef0f3ed34d UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #7240: [HUDI-5239] support HoodieJavaWriteClient compact
hudi-bot commented on PR #7240: URL: https://github.com/apache/hudi/pull/7240#issuecomment-1319648171 ## CI report: * 94ecac2f8d8bd18d080a7d6b03fa498f812705f7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13104)
[GitHub] [hudi] hudi-bot commented on pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query
hudi-bot commented on PR #7138: URL: https://github.com/apache/hudi/pull/7138#issuecomment-1319647857 ## CI report: * 1c66c4283d9daf64548806289c4ccb0467976d21 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13097)
[GitHub] [hudi] nsivabalan commented on a diff in pull request #7212: [HUDI-5179] Optimized release guide document
nsivabalan commented on code in PR #7212: URL: https://github.com/apache/hudi/pull/7212#discussion_r1026100732 ## release/release_guide.md: ## @@ -0,0 +1,678 @@ + + +# Introduction + +This release process document is based on [Apache Beam Release Guide](https://beam.apache.org/contribute/release-guide/) +and [Apache Flink Release Guide](https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release). + +The Apache Hudi project periodically declares and publishes releases. A release is one or more packages of the project +artifact(s) that are approved for general public distribution and use. They may come with various degrees of caveat +regarding their perceived quality and potential for change, such as “alpha”, “beta”, “stable”, etc. + +Hudi community treats releases with great importance. They are a public face of the project and most users interact with +the project only through the releases. Releases are signed off by the entire Hudi community in a public vote. + +Each release is executed by a Release Manager, who is selected among the Hudi PMC members. This document describes the +process that the Release Manager follows to perform a release. Any changes to this process should be discussed and +adopted on the dev@ mailing list. + +Please remember that publishing software has legal consequences. This guide complements the +foundation-wide [Product Release Policy](http://www.apache.org/dev/release.html) +and [Release Distribution Policy](http://www.apache.org/dev/release-distribution). + +# Overview + +![](release_guide_overview.jpg) + +The release process consists of several steps: + +1. Decide to release +2. Prepare for the release +3. Build a release candidate +4. Vote on the release candidate +5. During vote process, run validation tests +6. If necessary, fix any issues and go back to step 3. +7. Finalize the release +8. 
Promote the release + +# Decide to release + +Deciding to release and selecting a Release Manager is the first step of the release process. This is a consensus-based +decision of the entire community. + +Anybody can propose a release on the dev@ mailing list, giving a solid argument and nominating a committer as the +Release Manager (including themselves). There’s no formal process, no vote requirements, and no timing requirements. Any +objections should be resolved by consensus before starting the release. + +In general, the community prefers to have a rotating set of 3-5 Release Managers. Keeping a small core set of managers +allows enough people to build expertise in this area and improve processes over time, without Release Managers needing +to re-learn the processes for each release. That said, if you are a committer interested in serving the community in +this way, please reach out to the community on the dev@ mailing list. + +## Checklist to proceed to the next step + +1. Community agrees to release +2. Community selects a Release Manager + +# Prepare for the release + +As a release manager, you should create a private Slack channel, named `hudi-_release_work` (e.g. +hudi-0_12_0_release_work) in Apache Hudi Slack for coordination. Invite all committers to the channel. + +Before your first release, you should perform one-time configuration steps. This will set up your security keys for +signing the release and access to various release repositories. + +To prepare for each release, you should audit the project status in the JIRA issue tracker, and do the necessary +bookkeeping. Finally, you should create a release branch from which individual release candidates will be built. 
+ +**NOTE**: If you are +using [GitHub two-factor authentication](https://help.github.com/articles/securing-your-account-with-two-factor-authentication-2fa/) +and haven’t configured HTTPS access, please +follow [the guide](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/) to configure +command line access. + +## One-time Setup Instructions + +You need to have a GPG key to sign the release artifacts. Please be aware of the +ASF-wide [release signing guidelines](https://www.apache.org/dev/release-signing.html). If you don’t have a GPG key +associated with your Apache account, please follow the section below. + +### For Linux users + +There are two ways to configure your GPG key for release: using the release automation script (recommended), or +running all commands manually. If using a Mac, please see below to handle known issues. + + Use preparation_before_release.sh to set up GPG + +- Script: preparation_before_release.sh +- Usage: ./hudi/scripts/release/preparation_before_release.sh +- Tasks included +1. Help you create a new GPG key if you want. +2. Configure git user.signingkey with the chosen pubkey. +3. Add the chosen pubkey into dev KEYS and release KEYS. **NOTES**: Only PMC members can write to the release repo. +4. Start GPG agents. + + Run all commands manually + +- Get more entropy for c
[jira] [Updated] (HUDI-5241) Optimize HoodieDefaultTimeline API
[ https://issues.apache.org/jira/browse/HUDI-5241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-5241: - Labels: pull-request-available (was: ) > Optimize HoodieDefaultTimeline API > -- > > Key: HUDI-5241 > URL: https://issues.apache.org/jira/browse/HUDI-5241 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Yann Byron >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] YannByron opened a new pull request, #7241: [HUDI-5241] Optimize HoodieDefaultTimeline API
YannByron opened a new pull request, #7241: URL: https://github.com/apache/hudi/pull/7241 ### Change Logs - rename the original `getInstants` to `getInstantsAsStream`. - add a new `getInstants` that returns a list. - ensure that only the `getInstants` interface is used when accessing `this.instants`. ### Impact LOW ### Risk level (write none, low medium or high below) LOW ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
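The rename described in the change logs above can be sketched as follows. This is an illustrative sketch only (class and field names are hypothetical, not Hudi's actual `HoodieDefaultTimeline` code): the stream-returning accessor gets an explicit name, while the new `getInstants` returns a materialized, read-only list so callers that iterate more than once need not re-stream.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical sketch of the API change: the old stream-returning getInstants
// is renamed to getInstantsAsStream, and a list-returning getInstants is added.
class TimelineSketch {
    private final List<String> instants = new ArrayList<>();

    TimelineSketch(List<String> instants) {
        this.instants.addAll(instants);
    }

    // Formerly named getInstants; renamed so the return type is explicit.
    public Stream<String> getInstantsAsStream() {
        return instants.stream();
    }

    // New accessor returning a read-only List, avoiding repeated streaming
    // for callers that need the instants more than once.
    public List<String> getInstants() {
        return Collections.unmodifiableList(instants);
    }
}
```

Internal code would then consistently go through `getInstants` when it needs `this.instants` as a collection, matching the PR's third bullet.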
[jira] [Updated] (HUDI-5239) support HoodieJavaWriteClient compact
[ https://issues.apache.org/jira/browse/HUDI-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-5239: - Labels: pull-request-available (was: ) > support HoodieJavaWriteClient compact > - > > Key: HUDI-5239 > URL: https://issues.apache.org/jira/browse/HUDI-5239 > Project: Apache Hudi > Issue Type: Improvement >Reporter: zhaoyangming >Priority: Major > Labels: pull-request-available > > support HoodieJavaWriteClient compact
[GitHub] [hudi] hudi-bot commented on pull request #7240: [HUDI-5239] support HoodieJavaWriteClient compact
hudi-bot commented on PR #7240: URL: https://github.com/apache/hudi/pull/7240#issuecomment-1319643797 ## CI report: * 94ecac2f8d8bd18d080a7d6b03fa498f812705f7 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #7239: [HUDI-4442] add in field sanitization and use of aliases
hudi-bot commented on PR #7239: URL: https://github.com/apache/hudi/pull/7239#issuecomment-1319643757 ## CI report: * 43b69a7a0fa1a6ca57f651d61bfca3113ffcf47d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13103)
[jira] [Created] (HUDI-5241) Optimize HoodieDefaultTimeline API
Yann Byron created HUDI-5241: Summary: Optimize HoodieDefaultTimeline API Key: HUDI-5241 URL: https://issues.apache.org/jira/browse/HUDI-5241 Project: Apache Hudi Issue Type: Improvement Components: core Reporter: Yann Byron
[jira] [Updated] (HUDI-4442) Converting from json to avro does not sanitize field names
[ https://issues.apache.org/jira/browse/HUDI-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4442: - Labels: pull-request-available (was: ) > Converting from json to avro does not sanitize field names > -- > > Key: HUDI-4442 > URL: https://issues.apache.org/jira/browse/HUDI-4442 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Minor > Labels: pull-request-available > Fix For: 0.13.0 > > > There are cases where a source of json data will have `$` and other illegal > characters in the field name. If the user provides a valid schema with those > chars sanitized in the field name, the MercifulJsonConverter should be able > to translate the json into those sanitized field names.
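The sanitization described in HUDI-4442 could follow Avro's naming rule, under which a field name must match `[A-Za-z_][A-Za-z0-9_]*`. The sketch below shows one plausible approach (replace illegal characters such as `$` with `_`); it is an assumption for illustration, not the actual MercifulJsonConverter logic:

```java
import java.util.regex.Pattern;

// Hypothetical sketch: map a raw JSON field name to an Avro-legal name by
// replacing characters outside [A-Za-z0-9_] with '_', and prefixing '_' when
// the result would start with a digit (Avro names may not begin with a digit).
class FieldNameSanitizer {
    private static final Pattern ILLEGAL = Pattern.compile("[^A-Za-z0-9_]");

    static String sanitize(String jsonFieldName) {
        String cleaned = ILLEGAL.matcher(jsonFieldName).replaceAll("_");
        return Character.isDigit(cleaned.charAt(0)) ? "_" + cleaned : cleaned;
    }
}
```

With a rule like this, a JSON field `$amount` would map onto a schema field declared as `_amount`, which is the kind of translation the ticket asks the converter to support.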
[GitHub] [hudi] hudi-bot commented on pull request #7239: [HUDI-4442] add in field sanitization and use of aliases
hudi-bot commented on PR #7239: URL: https://github.com/apache/hudi/pull/7239#issuecomment-1319639798 ## CI report: * 43b69a7a0fa1a6ca57f651d61bfca3113ffcf47d UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6725: [HUDI-4881] Push down filters if possible when syncing partitions to Hive
hudi-bot commented on PR #6725: URL: https://github.com/apache/hudi/pull/6725#issuecomment-1319638954 ## CI report: * ce84f60bac968a89090d4091845c4dd15ea70ee4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13096)
[GitHub] [hudi] Zouxxyy commented on pull request #7140: [HUDI-5163] Fixing failure handling with spark datasource write
Zouxxyy commented on PR #7140: URL: https://github.com/apache/hudi/pull/7140#issuecomment-1319635657 @nsivabalan I made a [fix](https://github.com/nsivabalan/hudi/pull/12) based on the comments; hope that helps, but sorry, I couldn't find a concrete test case.
[jira] [Updated] (HUDI-5240) Clean content when recursive Invocation inflate
[ https://issues.apache.org/jira/browse/HUDI-5240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loukey_j updated HUDI-5240: --- Attachment: image-2022-11-18-14-57-06-393.png Description: !image-2022-11-18-14-57-06-393.png! > Clean content when recursive Invocation inflate > --- > > Key: HUDI-5240 > URL: https://issues.apache.org/jira/browse/HUDI-5240 > Project: Apache Hudi > Issue Type: Bug >Reporter: loukey_j >Assignee: loukey_j >Priority: Major > Attachments: image-2022-11-18-14-57-06-393.png > > > !image-2022-11-18-14-57-06-393.png!
[jira] [Assigned] (HUDI-5240) Clean content when recursive Invocation inflate
[ https://issues.apache.org/jira/browse/HUDI-5240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loukey_j reassigned HUDI-5240: -- Assignee: loukey_j > Clean content when recursive Invocation inflate > --- > > Key: HUDI-5240 > URL: https://issues.apache.org/jira/browse/HUDI-5240 > Project: Apache Hudi > Issue Type: Bug >Reporter: loukey_j >Assignee: loukey_j >Priority: Major >
[jira] [Created] (HUDI-5240) Clean content when recursive Invocation inflate
loukey_j created HUDI-5240: -- Summary: Clean content when recursive Invocation inflate Key: HUDI-5240 URL: https://issues.apache.org/jira/browse/HUDI-5240 Project: Apache Hudi Issue Type: Bug Reporter: loukey_j
[GitHub] [hudi] zhangyue19921010 commented on pull request #7238: [HUDI-3963] Cleaning up `QueueBasedExecutor` impls
zhangyue19921010 commented on PR #7238: URL: https://github.com/apache/hudi/pull/7238#issuecomment-1319611318 Ack. Will finish my review this week.
[jira] [Commented] (HUDI-5239) support HoodieJavaWriteClient compact
[ https://issues.apache.org/jira/browse/HUDI-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635684#comment-17635684 ] zhaoyangming commented on HUDI-5239: https://github.com/apache/hudi/pull/7240 > support HoodieJavaWriteClient compact > - > > Key: HUDI-5239 > URL: https://issues.apache.org/jira/browse/HUDI-5239 > Project: Apache Hudi > Issue Type: Improvement >Reporter: zhaoyangming >Priority: Major > > support HoodieJavaWriteClient compact
[GitHub] [hudi] leesf commented on a diff in pull request #7235: [HUDI-5148][RFC-63] RFC for Index Function
leesf commented on code in PR #7235: URL: https://github.com/apache/hudi/pull/7235#discussion_r1026069202 ## rfc/rfc-63/rfc-63.md: ## @@ -0,0 +1,370 @@ + + +# RFC-63: Index Function for Optimizing Query Performance + +## Proposers + +- @yihua +- @alexeykudinkin + +## Approvers + +- @vinothchandar +- @xushiyan +- @nsivabalan + +## Status + +JIRA: [HUDI-512](https://issues.apache.org/jira/browse/HUDI-512) + +## Abstract + +In this RFC, we address the problem of accelerating queries containing predicates based on functions defined on a +column, by introducing **Index Function**, a new indexing capability for efficient file pruning. + +## Background + +To make the queries finish faster, one major optimization technique is to scan less data by pruning rows that are not +needed by the query. This is usually done in two ways: + +- **Partition pruning**: The partition pruning relies on a table with physical partitioning, such as Hive partitioning. + A partitioned table uses a chosen column such as the date of `timestamp` and stores the rows with the same date to the + files under the same folder or physical partition, such as `date=2022-10-01/`. When the predicate in a query + references the partition column of the physical partitioning, the files in the partitions not matching the predicate + are filtered out, without scanning. For example, for the predicate `date between '2022-10-01' and '2022-10-02'`, the + partition pruning only returns the files from two partitions, `2022-10-01` and `2022-10-02`, for further processing. + The granularity of the pruning is at the partition level. + + +- **File pruning**: The file pruning carries out the pruning of the data at the file level, with the help of file-level + or record-level index. For example, with column stats index containing minimum and maximum values of a column for each + file, the files falling out of the range of the values compared to the predicate can be pruned. 
For a predicate + with `age < 20`, the file pruning filters out a file with columns stats of `[30, 40]` as the minimum and maximum + values of the column `age`. + +While Apache Hudi already supports partition pruning and file pruning with data skipping for different query engines, we +recognize that the following use cases need better query performance and usability: + +- File pruning based on functions defined on a column +- Efficient file pruning for files without physical partitioning +- Effective file pruning after partition evolution, without rewriting data + +Next, we explain these use cases in detail. + +### Use Case 1: Pruning files based on functions defined on a column + +Let's consider a non-partitioned table containing the events with a `timestamp` column. The events with naturally +increasing time are ingested into the table with bulk inserts every hour. In this case, assume that each file should +contain rows for a particular hour: + +| File Name | Min of `timestamp` | Max of `timestamp` | Note | +|-|||| +| base_file_1.parquet | 1664582400 | 1664586000 | 2022-10-01 12-1 AM | +| base_file_2.parquet | 1664586000 | 1664589600 | 2022-10-01 1-2 AM | +| ... | ...| ...| ... | +| base_file_13.parquet | 1664625600 | 1664629200 | 2022-10-01 12-1 PM | +| base_file_14.parquet | 1664629200 | 1664632800 | 2022-10-01 1-2 PM | +| ... | ...| ...| ... | +| base_file_37.parquet | 1664712000 | 1664715600 | 2022-10-02 12-1 PM | +| base_file_38.parquet | 1664715600 | 1664719200 | 2022-10-02 1-2 PM | + +For a query to get the number of events between 12PM and 2PM each day in a month for time-of-day analysis, the +predicates look like `DATE_FORMAT(timestamp, '%Y-%m-%d') between '2022-10-01' and '2022-10-31'` +and `DATE_FORMAT(timestamp, '%H') between '12' and '13'`. 
If the data is in a good layout as above, we only need to scan +two files (instead of 24 files) for each day of data, e.g., `base_file_13.parquet` and `base_file_14.parquet` containing +the data for 2022-10-01 12-2 PM. + +Currently, such a fine-grained file pruning based on a function on a column cannot be achieved in Hudi, because +transforming the `timestamp` to the hour of day is not order-preserving, thus the file pruning cannot directly leverage +the file-level column stats of the original column of `timestamp`. In this case, Hudi has to scan all the files for a +day and push the predicate down when reading parquet files, increasing the amount of data to be scanned. + +### Use Case 2: Efficient file pruning for files without physical partitioning + +Let's consider the same non-partitioned table as in the Use Case 1, containing the events with a `timest
[GitHub] [hudi] shengchiqu commented on issue #7229: [SUPPORT] flink connector sink Update the partition value, the old data is still there
shengchiqu commented on issue #7229: URL: https://github.com/apache/hudi/issues/7229#issuecomment-1319602176 I tried setting changelog.enabled=false and the problem was solved. Is this because the changelog mode does not support global indexes?
[GitHub] [hudi] ymZhao1001 opened a new pull request, #7240: support HoodieJavaWriteClient compact
ymZhao1001 opened a new pull request, #7240: URL: https://github.com/apache/hudi/pull/7240 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Created] (HUDI-5239) support HoodieJavaWriteClient compact
zhaoyangming created HUDI-5239: -- Summary: support HoodieJavaWriteClient compact Key: HUDI-5239 URL: https://issues.apache.org/jira/browse/HUDI-5239 Project: Apache Hudi Issue Type: Bug Reporter: zhaoyangming support HoodieJavaWriteClient compact
[jira] [Updated] (HUDI-5239) support HoodieJavaWriteClient compact
[ https://issues.apache.org/jira/browse/HUDI-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyangming updated HUDI-5239: --- Issue Type: Improvement (was: Bug) > support HoodieJavaWriteClient compact > - > > Key: HUDI-5239 > URL: https://issues.apache.org/jira/browse/HUDI-5239 > Project: Apache Hudi > Issue Type: Improvement >Reporter: zhaoyangming >Priority: Major > > support HoodieJavaWriteClient compact
[GitHub] [hudi] the-other-tim-brown opened a new pull request, #7239: add in field sanitization and use of aliases
the-other-tim-brown opened a new pull request, #7239: URL: https://github.com/apache/hudi/pull/7239 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] leesf commented on a diff in pull request #7235: [HUDI-5148][RFC-63] RFC for Index Function
leesf commented on code in PR #7235: URL: https://github.com/apache/hudi/pull/7235#discussion_r1026062508 ## rfc/rfc-63/rfc-63.md: ## @@ -0,0 +1,370 @@ + + +# RFC-63: Index Function for Optimizing Query Performance + +## Proposers + +- @yihua +- @alexeykudinkin + +## Approvers + +- @vinothchandar +- @xushiyan +- @nsivabalan + +## Status + +JIRA: [HUDI-512](https://issues.apache.org/jira/browse/HUDI-512) + +## Abstract + +In this RFC, we address the problem of accelerating queries containing predicates based on functions defined on a +column, by introducing **Index Function**, a new indexing capability for efficient file pruning. + +## Background + +To make the queries finish faster, one major optimization technique is to scan less data by pruning rows that are not +needed by the query. This is usually done in two ways: + +- **Partition pruning**: The partition pruning relies on a table with physical partitioning, such as Hive partitioning. + A partitioned table uses a chosen column such as the date of `timestamp` and stores the rows with the same date to the + files under the same folder or physical partition, such as `date=2022-10-01/`. When the predicate in a query + references the partition column of the physical partitioning, the files in the partitions not matching the predicate + are filtered out, without scanning. For example, for the predicate `date between '2022-10-01' and '2022-10-02'`, the + partition pruning only returns the files from two partitions, `2022-10-01` and `2022-10-02`, for further processing. + The granularity of the pruning is at the partition level. + + +- **File pruning**: The file pruning carries out the pruning of the data at the file level, with the help of file-level + or record-level index. For example, with column stats index containing minimum and maximum values of a column for each + file, the files falling out of the range of the values compared to the predicate can be pruned. 
For a predicate + with `age < 20`, the file pruning filters out a file with columns stats of `[30, 40]` as the minimum and maximum + values of the column `age`. + +While Apache Hudi already supports partition pruning and file pruning with data skipping for different query engines, we +recognize that the following use cases need better query performance and usability: + +- File pruning based on functions defined on a column +- Efficient file pruning for files without physical partitioning +- Effective file pruning after partition evolution, without rewriting data + +Next, we explain these use cases in detail. + +### Use Case 1: Pruning files based on functions defined on a column + +Let's consider a non-partitioned table containing the events with a `timestamp` column. The events with naturally +increasing time are ingested into the table with bulk inserts every hour. In this case, assume that each file should +contain rows for a particular hour: + +| File Name | Min of `timestamp` | Max of `timestamp` | Note | +|-|||| +| base_file_1.parquet | 1664582400 | 1664586000 | 2022-10-01 12-1 AM | +| base_file_2.parquet | 1664586000 | 1664589600 | 2022-10-01 1-2 AM | +| ... | ...| ...| ... | +| base_file_13.parquet | 1664625600 | 1664629200 | 2022-10-01 12-1 PM | +| base_file_14.parquet | 1664629200 | 1664632800 | 2022-10-01 1-2 PM | +| ... | ...| ...| ... | +| base_file_37.parquet | 1664712000 | 1664715600 | 2022-10-02 12-1 PM | +| base_file_38.parquet | 1664715600 | 1664719200 | 2022-10-02 1-2 PM | + +For a query to get the number of events between 12PM and 2PM each day in a month for time-of-day analysis, the +predicates look like `DATE_FORMAT(timestamp, '%Y-%m-%d') between '2022-10-01' and '2022-10-31'` +and `DATE_FORMAT(timestamp, '%H') between '12' and '13'`. 
If the data is in a good layout as above, we only need to scan
+two files (instead of 24 files) for each day of data, e.g., `base_file_13.parquet` and `base_file_14.parquet` containing
+the data for 2022-10-01 12-2 PM.
+
+Currently, such fine-grained file pruning based on a function on a column cannot be achieved in Hudi, because
+transforming the `timestamp` to the hour of day is not order-preserving, thus the file pruning cannot directly leverage Review Comment: So here, will we use Spark-defined transformers first? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
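The min/max-based file pruning discussed in this thread can be sketched as a simple range-overlap check. The class and method names below are hypothetical illustrations, not Hudi's actual column stats index API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of file pruning via per-file column stats (min/max of one
// column, e.g. `timestamp`); not the actual Hudi ColumnStatsIndex API.
public class FilePruningSketch {

  // Per-file min/max stats for a single column.
  static final class FileStats {
    final String fileName;
    final long min;
    final long max;

    FileStats(String fileName, long min, long max) {
      this.fileName = fileName;
      this.min = min;
      this.max = max;
    }
  }

  // Keep only files whose [min, max] range overlaps the predicate range
  // [predLow, predHigh]; all other files are pruned without scanning.
  static List<String> pruneFiles(List<FileStats> stats, long predLow, long predHigh) {
    List<String> candidates = new ArrayList<>();
    for (FileStats fs : stats) {
      if (fs.max >= predLow && fs.min <= predHigh) {
        candidates.add(fs.fileName);
      }
    }
    return candidates;
  }
}
```

Note this check only works when the predicate is on the raw column whose min/max are bounded per file; that is exactly what breaks down for a non-order-preserving transform such as hour-of-day, which motivates the index function proposed in the RFC.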
[GitHub] [hudi] hudi-bot commented on pull request #7021: [Minor] fix multi deser avro payload
hudi-bot commented on PR #7021: URL: https://github.com/apache/hudi/pull/7021#issuecomment-1319595538 ## CI report: * 06cbb491c812065b5078d4fcc02415af561928e2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13034) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13043) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13050) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13056) * f634430fecf9464d734dc6b5abfec8461ec59866 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13102) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] leesf commented on a diff in pull request #7235: [HUDI-5148][RFC-63] RFC for Index Function
leesf commented on code in PR #7235: URL: https://github.com/apache/hudi/pull/7235#discussion_r1026061223 ## rfc/rfc-63/rfc-63.md: ## @@ -0,0 +1,370 @@ + + +# RFC-63: Index Function for Optimizing Query Performance + +## Proposers + +- @yihua +- @alexeykudinkin + +## Approvers + +- @vinothchandar +- @xushiyan +- @nsivabalan + +## Status + +JIRA: [HUDI-512](https://issues.apache.org/jira/browse/HUDI-512) + +## Abstract + +In this RFC, we address the problem of accelerating queries containing predicates based on functions defined on a +column, by introducing **Index Function**, a new indexing capability for efficient file pruning. + +## Background + +To make the queries finish faster, one major optimization technique is to scan less data by pruning rows that are not +needed by the query. This is usually done in two ways: + +- **Partition pruning**: The partition pruning relies on a table with physical partitioning, such as Hive partitioning. + A partitioned table uses a chosen column such as the date of `timestamp` and stores the rows with the same date to the + files under the same folder or physical partition, such as `date=2022-10-01/`. When the predicate in a query + references the partition column of the physical partitioning, the files in the partitions not matching the predicate + are filtered out, without scanning. For example, for the predicate `date between '2022-10-01' and '2022-10-02'`, the + partition pruning only returns the files from two partitions, `2022-10-01` and `2022-10-02`, for further processing. + The granularity of the pruning is at the partition level. + + +- **File pruning**: The file pruning carries out the pruning of the data at the file level, with the help of file-level + or record-level index. For example, with column stats index containing minimum and maximum values of a column for each + file, the files falling out of the range of the values compared to the predicate can be pruned. 
For a predicate
+with `age < 20`, the file pruning filters out a file with column stats of `[30, 40]` as the minimum and maximum
+values of the column `age`.
+
+While Apache Hudi already supports partition pruning and file pruning with data skipping for different query engines, we
+recognize that the following use cases need better query performance and usability:
+
+- File pruning based on functions defined on a column
+- Efficient file pruning for files without physical partitioning
+- Effective file pruning after partition evolution, without rewriting data

Review Comment: Does "partition evolution" here mean changing the partition column, or something else? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] leesf commented on a diff in pull request #7235: [HUDI-5148][RFC-63] RFC for Index Function
leesf commented on code in PR #7235: URL: https://github.com/apache/hudi/pull/7235#discussion_r1026060609 ## rfc/rfc-63/rfc-63.md: ## @@ -0,0 +1,370 @@ + + +# RFC-63: Index Function for Optimizing Query Performance + +## Proposers + +- @yihua +- @alexeykudinkin + +## Approvers + +- @vinothchandar +- @xushiyan +- @nsivabalan + +## Status + +JIRA: [HUDI-512](https://issues.apache.org/jira/browse/HUDI-512) + +## Abstract + +In this RFC, we address the problem of accelerating queries containing predicates based on functions defined on a +column, by introducing **Index Function**, a new indexing capability for efficient file pruning. + +## Background + +To make the queries finish faster, one major optimization technique is to scan less data by pruning rows that are not +needed by the query. This is usually done in two ways: + +- **Partition pruning**: The partition pruning relies on a table with physical partitioning, such as Hive partitioning. + A partitioned table uses a chosen column such as the date of `timestamp` and stores the rows with the same date to the + files under the same folder or physical partition, such as `date=2022-10-01/`. When the predicate in a query + references the partition column of the physical partitioning, the files in the partitions not matching the predicate + are filtered out, without scanning. For example, for the predicate `date between '2022-10-01' and '2022-10-02'`, the + partition pruning only returns the files from two partitions, `2022-10-01` and `2022-10-02`, for further processing. + The granularity of the pruning is at the partition level. + + +- **File pruning**: The file pruning carries out the pruning of the data at the file level, with the help of file-level + or record-level index. For example, with column stats index containing minimum and maximum values of a column for each + file, the files falling out of the range of the values compared to the predicate can be pruned. 
For a predicate
+with `age < 20`, the file pruning filters out a file with column stats of `[30, 40]` as the minimum and maximum
+values of the column `age`.
+
+While Apache Hudi already supports partition pruning and file pruning with data skipping for different query engines, we
+recognize that the following use cases need better query performance and usability:
+
+- File pruning based on functions defined on a column

Review Comment: Which functions are we going to support? years/months/days/hours as defined in Spark? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7174: [HUDI-5190] Consuming records from Iterator directly instead of using inner message queue
alexeykudinkin commented on code in PR #7174: URL: https://github.com/apache/hudi/pull/7174#discussion_r1026059829 ## hudi-common/src/main/java/org/apache/hudi/common/util/queue/SimpleHoodieExecutor.java: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.util.queue; + +import org.apache.hudi.common.util.Option; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Iterator; +import java.util.concurrent.CompletableFuture; +import java.util.function.Function; + +/** + * Single-writer, single-reader mode. This SimpleHoodieExecutor has no inner message queue and no inner lock; + * it consumes and writes records from the iterator directly. + * + * Compared with the queue-based executor: + * Advantages: no additional memory or CPU overhead from locking or multithreading. + * Disadvantages: loses some benefits such as rate limiting, and may have lower throughput.
+ */ +public class SimpleHoodieExecutor extends HoodieExecutorBase { Review Comment: Let's actually simplify this even further and just inherit from `HoodieExecutor` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
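The trade-off under review (no inner queue or lock, records consumed straight from the producer iterator on a single thread) can be sketched as follows. This is a hypothetical simplification, not the actual `SimpleHoodieExecutor` code:

```java
import java.util.Iterator;
import java.util.function.Function;

// Hypothetical sketch of an iterator-direct executor: single writer, single
// reader, no inner message queue and no lock. Records flow from the source
// iterator into the consumer function on the calling thread.
public class SimpleExecutorSketch<I, O> {
  private final Iterator<I> source;
  private final Function<I, O> consumer;
  private O lastResult;

  public SimpleExecutorSketch(Iterator<I> source, Function<I, O> consumer) {
    this.source = source;
    this.consumer = consumer;
  }

  // Drains the iterator synchronously: no thread handoff, no buffering,
  // and therefore no rate limiting -- the trade-off noted in the javadoc above.
  public O execute() {
    while (source.hasNext()) {
      lastResult = consumer.apply(source.next());
    }
    return lastResult;
  }
}
```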
[GitHub] [hudi] hudi-bot commented on pull request #7236: [MINOR] Fix the npe caused by alter table add column.
hudi-bot commented on PR #7236: URL: https://github.com/apache/hudi/pull/7236#issuecomment-1319592122 ## CI report: * 80ffbed9a906d526cdf712942cb2cd52309e1f17 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13098) * 29bb43cc6562348d81a80210d65e49a81e03a2e0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13101) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #7230: core flow tests working, but issues still to tackle and documentation…
hudi-bot commented on PR #7230: URL: https://github.com/apache/hudi/pull/7230#issuecomment-1319592054 ## CI report: * b52fb7392f7257ab5ef1d6dd35f6dbdfffc0a4f1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13093) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #7021: [Minor] fix multi deser avro payload
hudi-bot commented on PR #7021: URL: https://github.com/apache/hudi/pull/7021#issuecomment-1319591609 ## CI report: * 06cbb491c812065b5078d4fcc02415af561928e2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13034) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13043) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13050) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13056) * f634430fecf9464d734dc6b5abfec8461ec59866 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6782: [HUDI-4911][HUDI-3301] Fixing `HoodieMetadataLogRecordReader` to avoid flushing cache for every lookup
alexeykudinkin commented on code in PR #6782: URL: https://github.com/apache/hudi/pull/6782#discussion_r1026051823 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java: ## @@ -18,37 +18,32 @@ package org.apache.hudi.common.table.log; -import org.apache.hudi.common.model.DeleteRecord; +import org.apache.avro.Schema; Review Comment: It's done automatically by the IDEA whenever it cleans up dead imports. Let me see if i can adjust it to respect checkstyle instead. ## hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java: ## @@ -188,40 +179,41 @@ protected AbstractHoodieLogRecordReader(FileSystem fs, String basePath, List close() { throw new HoodieUpsertException("Failed to close UpdateHandle", e); } } + newRecordKeysSorted.clear(); Review Comment: It's final unfortunately, and there's not a lot of value in setting it null (since handles goes out of scope anyway) ## hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java: ## @@ -734,15 +696,22 @@ private void processQueuedBlocksForInstant(Deque logBlocks, int progress = (numLogFilesSeen - 1) / logFilePaths.size(); Review Comment: We wouldn't get to this method if it would be empty ## hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java: ## @@ -106,30 +109,85 @@ protected HoodieMergedLogRecordScanner(FileSystem fs, String basePath, List keys) { +if (forceFullScan) { + return; // no-op +} + +List missingKeys = keys.stream() +.filter(key -> !records.containsKey(key)) +.collect(Collectors.toList()); + +if (missingKeys.isEmpty()) { + // All the required records are already fetched, no-op + return; +} + +scanInternal(Option.of(KeySpec.fullKeySpec(missingKeys)), false); + } + + /** + * Provides incremental scanning capability where only keys matching provided key-prefixes + * will be looked up in the delta-log files, scanned and subsequently materialized into + * the 
internal cache + * + * @param keyPrefixes to be looked up + */ + public void scanByKeyPrefixes(List keyPrefixes) { +// TODO add caching for queried prefixes Review Comment: I think i'll actually address it in this PR ## hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java: ## @@ -330,6 +390,16 @@ public Builder withUseScanV2(boolean useScanV2) { return this; } +public Builder withKeyFiledOverride(String keyFieldOverride) { Review Comment: How did you make such suggestion change? It's pretty cool ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java: ## @@ -240,38 +239,35 @@ public List>>> getRecord return result; } - private Map>> readLogRecords(HoodieMetadataMergedLogRecordReader logRecordScanner, + private Map>> readLogRecords(HoodieMetadataLogRecordReader logRecordReader, List keys, boolean fullKey, List timings) { HoodieTimer timer = HoodieTimer.start(); -if (logRecordScanner == null) { +if (logRecordReader == null) { timings.add(timer.endTimer()); return Collections.emptyMap(); } -String partitionName = logRecordScanner.getPartitionName().get(); +Map>> logRecords = new HashMap<>(keys.size()); -Map>> logRecords = new HashMap<>(); -if (isFullScanAllowedForPartition(partitionName)) { - checkArgument(fullKey, "If full-scan is required, only full keys could be used!"); - // Path which does full scan of log files - for (String key : keys) { -logRecords.put(key, logRecordScanner.getRecordByKey(key).get(0).getValue()); - } -} else { - // This path will do seeks pertaining to the keys passed in - List>>> logRecordsList = - fullKey ? 
logRecordScanner.getRecordsByKeys(keys) - : logRecordScanner.getRecordsByKeyPrefixes(keys) - .stream() - .map(record -> Pair.of(record.getRecordKey(), Option.of(record))) - .collect(Collectors.toList()); - - for (Pair>> entry : logRecordsList) { -logRecords.put(entry.getKey(), entry.getValue()); - } +// First, fetch the keys being looked up +List>>> logRecordsList = Review Comment: So this PR makes sure that we're not flushing the records cache w/in the Scanner whenever we do `getRecord*` (previously batch APIs, were always flushing it). As such, there's now essentially no difference b/w these 2 branches. #
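The "fetch only the missing keys" behavior described in the quoted diff (avoid flushing the cache and re-scan only for keys not yet materialized) can be sketched like this, with hypothetical stand-in types for the scanner's internal cache:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the incremental-lookup pattern in
// HoodieMergedLogRecordScanner.scanByFullKeys: keys already materialized in
// the cache are not re-scanned from the delta-log files.
public class IncrementalKeyLookupSketch {
  private final Map<String, String> cache = new HashMap<>();
  private int scanCount = 0; // how many times the "log files" were actually scanned

  // Stand-in for a scan of delta-log files that materializes requested keys.
  private void scanLogFilesForKeys(List<String> keys) {
    scanCount++;
    for (String k : keys) {
      cache.put(k, "value-of-" + k);
    }
  }

  public void scanByFullKeys(List<String> keys) {
    List<String> missing = keys.stream()
        .filter(k -> !cache.containsKey(k))
        .collect(Collectors.toList());
    if (missing.isEmpty()) {
      return; // all required records already fetched, no-op
    }
    scanLogFilesForKeys(missing);
  }

  public int getScanCount() {
    return scanCount;
  }

  public Map<String, String> getCache() {
    return cache;
  }
}
```

Repeated lookups over an already-cached key set become no-ops, which is exactly the cache-flushing cost the PR removes.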
[GitHub] [hudi] hudi-bot commented on pull request #7236: [MINOR] Fix the npe caused by alter table add column.
hudi-bot commented on PR #7236: URL: https://github.com/apache/hudi/pull/7236#issuecomment-1319587735 ## CI report: * 80ffbed9a906d526cdf712942cb2cd52309e1f17 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13098) * 29bb43cc6562348d81a80210d65e49a81e03a2e0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4904) Handle Recursive Proto Schemas in ProtoClassBasedSchemaProvider
[ https://issues.apache.org/jira/browse/HUDI-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown updated HUDI-4904: Status: In Progress (was: Open) > Handle Recursive Proto Schemas in ProtoClassBasedSchemaProvider > --- > > Key: HUDI-4904 > URL: https://issues.apache.org/jira/browse/HUDI-4904 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > Labels: pull-request-available > > In proto we can have a schema that is recursive. We should limit the > "unraveling" of a schema to N levels and let the user specify that amount of > levels as a config. After hitting depth N in the recursion, we will create a > Record with a byte array and string. The remaining data for that branch of > the recursion will be written out as a proto byte array and we record the > descriptor string for context of what is in the byte array. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-4904) Handle Recursive Proto Schemas in ProtoClassBasedSchemaProvider
[ https://issues.apache.org/jira/browse/HUDI-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown resolved HUDI-4904. - > Handle Recursive Proto Schemas in ProtoClassBasedSchemaProvider > --- > > Key: HUDI-4904 > URL: https://issues.apache.org/jira/browse/HUDI-4904 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > Labels: pull-request-available > > In proto we can have a schema that is recursive. We should limit the > "unraveling" of a schema to N levels and let the user specify that amount of > levels as a config. After hitting depth N in the recursion, we will create a > Record with a byte array and string. The remaining data for that branch of > the recursion will be written out as a proto byte array and we record the > descriptor string for context of what is in the byte array. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-4905) Protobuf type handling improvements
[ https://issues.apache.org/jira/browse/HUDI-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown resolved HUDI-4905. - > Protobuf type handling improvements > --- > > Key: HUDI-4905 > URL: https://issues.apache.org/jira/browse/HUDI-4905 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > Labels: pull-request-available > > Two improvements have come out of discussions with others trying to use > protobuf and Hudi. > > # We can support uint64 as a decimal without losing precision and > representing the value in the lake as a positive value > # Proto Timestamps can be converted to long with LogicalType timestamp-micros > # Treat elements within a `oneof` as nullable -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS
[ https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-5238: -- Sprint: 2022/11/15 > Hudi throwing "PipeBroken" exception during Merging on GCS > -- > > Key: HUDI-5238 > URL: https://issues.apache.org/jira/browse/HUDI-5238 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.1 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.13.0 > > > Originally reported at [https://github.com/apache/hudi/issues/7234] > --- > > Root-cause: > Basically, the reason it’s failing is following: # GCS uses > PipeInputStream/PipeOutputStream comprising reading/writing ends of the > “pipe” it’s using for unidirectional comm b/w Threads > # PipeInputStream (for whatever reason) remembers the thread that actually > wrote into the pipe > # In BoundedInMemoryQueue we’re bootstrapping new executors (read, threads) > for reading and _writing_ (it’s only used in HoodieMergeHandle, and in > bulk-insert) > # When we’re done writing in HoodieMergeHelper, we’re shutting down *first* > BIMQ, then the HoodieMergeHandle, and that’s exactly the reason why it’s > failing > > Issue has been introduced at [https://github.com/apache/hudi/pull/4264/files] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] alexeykudinkin commented on issue #7234: [SUPPORT] Upsert to Hudi table w/ 0.11.1 or above fails w/ Pipe broken
alexeykudinkin commented on issue #7234: URL: https://github.com/apache/hudi/issues/7234#issuecomment-1319578145 Created HUDI-5238 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin closed issue #7234: [SUPPORT] Upsert to Hudi table w/ 0.11.1 or above fails w/ Pipe broken
alexeykudinkin closed issue #7234: [SUPPORT] Upsert to Hudi table w/ 0.11.1 or above fails w/ Pipe broken URL: https://github.com/apache/hudi/issues/7234 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS
[ https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin reassigned HUDI-5238: - Assignee: Alexey Kudinkin > Hudi throwing "PipeBroken" exception during Merging on GCS > -- > > Key: HUDI-5238 > URL: https://issues.apache.org/jira/browse/HUDI-5238 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.1 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.13.0 > > > Originally reported at [https://github.com/apache/hudi/issues/7234] > --- > > Root-cause: > Basically, the reason it’s failing is following: # GCS uses > PipeInputStream/PipeOutputStream comprising reading/writing ends of the > “pipe” it’s using for unidirectional comm b/w Threads > # PipeInputStream (for whatever reason) remembers the thread that actually > wrote into the pipe > # In BoundedInMemoryQueue we’re bootstrapping new executors (read, threads) > for reading and _writing_ (it’s only used in HoodieMergeHandle, and in > bulk-insert) > # When we’re done writing in HoodieMergeHelper, we’re shutting down *first* > BIMQ, then the HoodieMergeHandle, and that’s exactly the reason why it’s > failing > > Issue has been introduced at [https://github.com/apache/hudi/pull/4264/files] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS
[ https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-5238: -- Story Points: 4 > Hudi throwing "PipeBroken" exception during Merging on GCS > -- > > Key: HUDI-5238 > URL: https://issues.apache.org/jira/browse/HUDI-5238 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.1 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.13.0 > > > Originally reported at [https://github.com/apache/hudi/issues/7234] > --- > > Root-cause: > Basically, the reason it’s failing is following: # GCS uses > PipeInputStream/PipeOutputStream comprising reading/writing ends of the > “pipe” it’s using for unidirectional comm b/w Threads > # PipeInputStream (for whatever reason) remembers the thread that actually > wrote into the pipe > # In BoundedInMemoryQueue we’re bootstrapping new executors (read, threads) > for reading and _writing_ (it’s only used in HoodieMergeHandle, and in > bulk-insert) > # When we’re done writing in HoodieMergeHelper, we’re shutting down *first* > BIMQ, then the HoodieMergeHandle, and that’s exactly the reason why it’s > failing > > Issue has been introduced at [https://github.com/apache/hudi/pull/4264/files] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS
[ https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-5238: -- Description: Originally reported at [https://github.com/apache/hudi/issues/7234] --- Root-cause: Basically, the reason it’s failing is following: # GCS uses PipeInputStream/PipeOutputStream comprising reading/writing ends of the “pipe” it’s using for unidirectional comm b/w Threads # PipeInputStream (for whatever reason) remembers the thread that actually wrote into the pipe # In BoundedInMemoryQueue we’re bootstrapping new executors (read, threads) for reading and _writing_ (it’s only used in HoodieMergeHandle, and in bulk-insert) # When we’re done writing in HoodieMergeHelper, we’re shutting down *first* BIMQ, then the HoodieMergeHandle, and that’s exactly the reason why it’s failing Issue has been introduced at [https://github.com/apache/hudi/pull/4264/files] > Hudi throwing "PipeBroken" exception during Merging on GCS > -- > > Key: HUDI-5238 > URL: https://issues.apache.org/jira/browse/HUDI-5238 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.1 >Reporter: Alexey Kudinkin >Priority: Blocker > Fix For: 0.13.0 > > > Originally reported at [https://github.com/apache/hudi/issues/7234] > --- > > Root-cause: > Basically, the reason it’s failing is following: # GCS uses > PipeInputStream/PipeOutputStream comprising reading/writing ends of the > “pipe” it’s using for unidirectional comm b/w Threads > # PipeInputStream (for whatever reason) remembers the thread that actually > wrote into the pipe > # In BoundedInMemoryQueue we’re bootstrapping new executors (read, threads) > for reading and _writing_ (it’s only used in HoodieMergeHandle, and in > bulk-insert) > # When we’re done writing in HoodieMergeHelper, we’re shutting down *first* > BIMQ, then the HoodieMergeHandle, and that’s exactly the reason why it’s > failing > > Issue has been introduced at [https://github.com/apache/hudi/pull/4264/files] 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
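The failure mode described above can be reproduced with plain `java.io` pipes, independent of Hudi or GCS: `PipedInputStream` tracks the thread that last wrote into the pipe, and once that thread dies, a read on an empty pipe fails with `IOException: Pipe broken`. A minimal sketch (the method name is illustrative, not Hudi code):

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;

public class Main {
    // Reproduce the hazard: the thread that wrote into a PipedOutputStream dies,
    // and a later read on the now-empty PipedInputStream fails.
    static boolean brokenAfterWriterDies() throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);

        Thread writer = new Thread(() -> {
            try {
                out.write(new byte[]{1, 2, 3});
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        writer.start();
        writer.join(); // the writer thread is now dead; the stream was never closed

        // Bytes already buffered in the pipe are still readable...
        byte[] buf = new byte[3];
        if (in.read(buf) != 3) {
            return false;
        }

        // ...but the next read finds an empty pipe whose writer thread is dead.
        try {
            in.read();
            return false;
        } catch (IOException e) {
            return true; // java.io.IOException: Pipe broken
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(brokenAfterWriterDies());
    }
}
```

This mirrors the shutdown-ordering issue: tearing down the BIMQ writer threads before the merge handle closes its output stream leaves the pipe's reader with a dead write side.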
[jira] [Created] (HUDI-5238) Hudi throws PipeBroken
Alexey Kudinkin created HUDI-5238: - Summary: Hudi throws PipeBroken Key: HUDI-5238 URL: https://issues.apache.org/jira/browse/HUDI-5238 Project: Apache Hudi Issue Type: Bug Reporter: Alexey Kudinkin -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS
[ https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-5238: Fix Version/s: 0.13.0 (was: 0.12.1)

> Hudi throwing "PipeBroken" exception during Merging on GCS
> ----------------------------------------------------------
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.13.0

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS
[ https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-5238: Fix Version/s: 0.12.1

> Hudi throwing "PipeBroken" exception during Merging on GCS
> ----------------------------------------------------------
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.12.1

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS
[ https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-5238: Priority: Blocker (was: Major)

> Hudi throwing "PipeBroken" exception during Merging on GCS
> ----------------------------------------------------------
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Alexey Kudinkin
> Priority: Blocker

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS
[ https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-5238: Summary: Hudi throwing "PipeBroken" exception during Merging on GCS (was: Hudi throws PipeBroken)

> Hudi throwing "PipeBroken" exception during Merging on GCS
> ----------------------------------------------------------
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Alexey Kudinkin
> Priority: Major

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS
[ https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-5238: Affects Version/s: 0.12.1

> Hudi throwing "PipeBroken" exception during Merging on GCS
> ----------------------------------------------------------
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.12.1
> Reporter: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.13.0

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7021: [Minor] fix multi deser avro payload
alexeykudinkin commented on code in PR #7021: URL: https://github.com/apache/hudi/pull/7021#discussion_r1025888585

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:

@@ -364,7 +364,8 @@ private void processAppendResult(AppendResult result, List recordL
   updateWriteStatus(stat, result);
  }
- if (config.isMetadataColumnStatsIndexEnabled()) {
+ // TODO MetadataColumnStatsIndex for spark record
+ if (config.isMetadataColumnStatsIndexEnabled() && recordMerger.getRecordType() == HoodieRecordType.AVRO) {

Review Comment: Let's create a ticket for this. We need to fix this before 0.13

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:

@@ -215,18 +216,16 @@ private Option prepareRecord(HoodieRecord hoodieRecord) {
  // If the format can not record the operation field, nullify the DELETE payload manually.
  boolean nullifyPayload = HoodieOperation.isDelete(hoodieRecord.getOperation()) && !config.allowOperationMetadataField();
  recordProperties.put(HoodiePayloadProps.PAYLOAD_IS_UPDATE_RECORD_FOR_MOR, String.valueOf(isUpdateRecord));
- Option finalRecord = Option.empty();
- if (!nullifyPayload && !hoodieRecord.isDelete(tableSchema, recordProperties)) {
-  if (hoodieRecord.shouldIgnore(tableSchema, recordProperties)) {
-   return Option.of(hoodieRecord);
+ Option finalRecord = nullifyPayload ? Option.empty() : Option.of(hoodieRecord.deserialization(tableSchema, recordProperties));
+ // Check for delete
+ if (finalRecord.isPresent() && !finalRecord.get().isDelete(tableSchema, recordProperties)) {
+  // Check for ignore ExpressionPayload
+  if (finalRecord.get().shouldIgnore(tableSchema, recordProperties)) {
+   return finalRecord;

Review Comment: This is actually incorrect -- this will delete the record

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on pull request #7021: [Minor] fix multi deser avro payload
alexeykudinkin commented on PR #7021: URL: https://github.com/apache/hudi/pull/7021#issuecomment-1319572883 @wzx140 I pushed some changes to handle deleted/ignored records w/o the need to deserialize the payload. With these changes we don't actually need a separate materialization step of `deserialization` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
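The idea of skipping payload materialization for records that will be dropped anyway can be sketched generically. The `LazyRecord` and `prepareRecord` names below are hypothetical and only illustrate the deferred-deserialization pattern, not Hudi's actual API:

```java
import java.util.Optional;
import java.util.function.Supplier;

public class Main {
    // Hypothetical record whose payload is expensive to deserialize.
    static class LazyRecord {
        private final boolean deleteFlag;        // cheap metadata, no payload access needed
        private final Supplier<String> payload;  // expensive materialization, deferred
        private int materializations = 0;

        LazyRecord(boolean deleteFlag, Supplier<String> payload) {
            this.deleteFlag = deleteFlag;
            this.payload = payload;
        }

        boolean isDelete() { return deleteFlag; }  // answered without touching the payload

        String materialize() { materializations++; return payload.get(); }
        int materializationCount() { return materializations; }
    }

    // Only records that survive the cheap check pay the deserialization cost.
    static Optional<String> prepareRecord(LazyRecord r) {
        if (r.isDelete()) {
            return Optional.empty();           // dropped without deserializing
        }
        return Optional.of(r.materialize());   // materialized exactly once
    }

    public static void main(String[] args) {
        LazyRecord deleted = new LazyRecord(true, () -> "payload-a");
        LazyRecord live = new LazyRecord(false, () -> "payload-b");

        System.out.println(prepareRecord(deleted).isPresent()); // false
        System.out.println(prepareRecord(live).orElse(""));     // payload-b
        System.out.println(deleted.materializationCount());     // 0
    }
}
```

The design point is that delete/ignore checks run against cheap metadata first, so a deleted record never pays for payload deserialization.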
[GitHub] [hudi] codope commented on a diff in pull request #6782: [HUDI-4911][HUDI-3301] Fixing `HoodieMetadataLogRecordReader` to avoid flushing cache for every lookup
codope commented on code in PR #6782: URL: https://github.com/apache/hudi/pull/6782#discussion_r1025952971

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java:

@@ -127,8 +127,8 @@ public List close() {
    throw new HoodieUpsertException("Failed to close UpdateHandle", e);
   }
  }
+ newRecordKeysSorted.clear();

Review Comment: Should we set it to null as the collection is already empty at this point?

## hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:

@@ -124,7 +123,7 @@
  // Total log files read - for metrics
  private AtomicLong totalLogFiles = new AtomicLong(0);
  // Internal schema, used to support full schema evolution.
- private InternalSchema internalSchema;
+ private final InternalSchema internalSchema;
  // Hoodie table path.
  private final String path;

Review Comment: Looks like `path` is not used anywhere.

## hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:

@@ -18,37 +18,32 @@
 package org.apache.hudi.common.table.log;
-import org.apache.hudi.common.model.DeleteRecord;
+import org.apache.avro.Schema;

Review Comment: Let's try to remove unrelated changes like reordering of imports. Our checkstyle puts hudi imports above others.
## hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:

@@ -188,40 +179,41 @@ protected AbstractHoodieLogRecordReader(FileSystem fs, String basePath, List keys) {
+  if (forceFullScan) {
+   return; // no-op
+  }
+
+  List missingKeys = keys.stream()
+    .filter(key -> !records.containsKey(key))
+    .collect(Collectors.toList());
+
+  if (missingKeys.isEmpty()) {
+   // All the required records are already fetched, no-op
+   return;
+  }
+
+  scanInternal(Option.of(KeySpec.fullKeySpec(missingKeys)), false);
+ }
+
+ /**
+  * Provides incremental scanning capability where only keys matching provided key-prefixes
+  * will be looked up in the delta-log files, scanned and subsequently materialized into
+  * the internal cache
+  *
+  * @param keyPrefixes to be looked up
+  */
+ public void scanByKeyPrefixes(List keyPrefixes) {
+  // TODO add caching for queried prefixes

Review Comment: Let's add a JIRA to track this.

## hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:

@@ -734,15 +696,22 @@ private void processQueuedBlocksForInstant(Deque logBlocks, int
   progress = (numLogFilesSeen - 1) / logFilePaths.size();

Review Comment: Should we guard against the empty `logFilePaths` list?
## hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:

@@ -188,40 +179,41 @@ protected AbstractHoodieLogRecordReader(FileSystem fs, String basePath, List>>> getRecord
   return result;
  }

- private Map>> readLogRecords(HoodieMetadataMergedLogRecordReader logRecordScanner,
+ private Map>> readLogRecords(HoodieMetadataLogRecordReader logRecordReader,
    List keys, boolean fullKey, List timings) {
   HoodieTimer timer = HoodieTimer.start();
-  if (logRecordScanner == null) {
+  if (logRecordReader == null) {
    timings.add(timer.endTimer());
    return Collections.emptyMap();
   }
-  String partitionName = logRecordScanner.getPartitionName().get();
+  Map>> logRecords = new HashMap<>(keys.size());
-  Map>> logRecords = new HashMap<>();
-  if (isFullScanAllowedForPartition(partitionName)) {
-   checkArgument(fullKey, "If full-scan is required, only full keys could be used!");
-   // Path which does full scan of log files
-   for (String key : keys) {
-    logRecords.put(key, logRecordScanner.getRecordByKey(key).get(0).getValue());
-   }
-  } else {
-   // This path will do seeks pertaining to the keys passed in
-   List>>> logRecordsList =
-     fullKey ? logRecordScanner.getRecordsByKeys(keys)
-       : logRecordScanner.getRecordsByKeyPrefixes(keys)
-         .stream()
-         .map(record -> Pair.of(record.getRecordKey(), Option.of(record)))
-         .collect(Collectors.toList());
-   for (Pair>> entry : logRecordsList) {
-    logRecords.put(entry.getKey(), entry.getValue());
-   }
+  // First, fetch the keys being looked up
+  List>>> logRecordsList =

Review Comment: Shouldn't we still check for `if (isFullScanAllowedForPartition(partitionName))`? What if full scan is not enabled for
[GitHub] [hudi] hudi-bot commented on pull request #7238: [HUDI-3963] Cleaning up `QueueBasedExecutor` impls
hudi-bot commented on PR #7238: URL: https://github.com/apache/hudi/pull/7238#issuecomment-1319547843 ## CI report: * 6241a829bc4fbcbff3b3cbcf2a8efddcdb667344 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13100) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cravib4u commented on issue #7234: [SUPPORT] Upsert to Hudi table w/ 0.11.1 or above fails w/ Pipe broken
cravib4u commented on issue #7234: URL: https://github.com/apache/hudi/issues/7234#issuecomment-1319546216 We also tried Hudi 0.12.1 with Spark 2.4.8 and see the same issue while writing data. We tried both SQL MERGE and the DataSource upsert write operation, with the same failure. However, the DataSource insert_overwrite operation works fine. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #7238: [HUDI-3963] Cleaning up `QueueBasedExecutor` impls
hudi-bot commented on PR #7238: URL: https://github.com/apache/hudi/pull/7238#issuecomment-1319545281 ## CI report: * 6241a829bc4fbcbff3b3cbcf2a8efddcdb667344 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #7236: [MINOR] Fix the npe caused by alter table add column.
hudi-bot commented on PR #7236: URL: https://github.com/apache/hudi/pull/7236#issuecomment-1319545214 ## CI report: * 80ffbed9a906d526cdf712942cb2cd52309e1f17 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13098) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #7237: [HUDI-5237] Support for HoodieUnMergedLogRecordScanner with InternalSchema
hudi-bot commented on PR #7237: URL: https://github.com/apache/hudi/pull/7237#issuecomment-1319545247 ## CI report: * 5c08745b59494bfafa9a8591576290aaed317059 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13099) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query
hudi-bot commented on PR #7138: URL: https://github.com/apache/hudi/pull/7138#issuecomment-1319545082 ## CI report: * c3171ba5115240fd5e00a0d47cf4f9b0c182b8e6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13074) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13094) * 1c66c4283d9daf64548806289c4ccb0467976d21 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13097) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6725: [HUDI-4881] Push down filters if possible when syncing partitions to Hive
hudi-bot commented on PR #6725: URL: https://github.com/apache/hudi/pull/6725#issuecomment-1319544395 ## CI report: * f3b7a61c226d136cd232a062a0f28e085f060035 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13095) * ce84f60bac968a89090d4091845c4dd15ea70ee4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13096) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #7235: [HUDI-5148][RFC-63] RFC for Index Function
xiarixiaoyao commented on code in PR #7235: URL: https://github.com/apache/hudi/pull/7235#discussion_r1025986523

## rfc/rfc-63/rfc-63.md:

@@ -0,0 +1,370 @@
+# RFC-63: Index Function for Optimizing Query Performance
+
+## Proposers
+
+- @yihua
+- @alexeykudinkin
+
+## Approvers
+
+- @vinothchandar
+- @xushiyan
+- @nsivabalan
+
+## Status
+
+JIRA: [HUDI-512](https://issues.apache.org/jira/browse/HUDI-512)
+
+## Abstract
+
+In this RFC, we address the problem of accelerating queries containing predicates based on functions defined on a
+column, by introducing **Index Function**, a new indexing capability for efficient file pruning.
+
+## Background
+
+To make queries finish faster, one major optimization technique is to scan less data by pruning rows that are not
+needed by the query. This is usually done in two ways:
+
+- **Partition pruning**: Partition pruning relies on a table with physical partitioning, such as Hive partitioning.
+  A partitioned table uses a chosen column, such as the date of `timestamp`, and stores the rows with the same date
+  in files under the same folder or physical partition, such as `date=2022-10-01/`. When a predicate in a query
+  references the partition column, the files in partitions not matching the predicate are filtered out without
+  scanning. For example, for the predicate `date between '2022-10-01' and '2022-10-02'`, partition pruning only
+  returns the files from the two partitions `2022-10-01` and `2022-10-02` for further processing. The granularity
+  of the pruning is at the partition level.
+
+- **File pruning**: File pruning prunes data at the file level, with the help of a file-level or record-level index.
+  For example, with a column stats index containing the minimum and maximum values of a column for each file, the
+  files falling outside the range of values in the predicate can be pruned. For a predicate of `age < 20`, file
+  pruning filters out a file whose column stats are `[30, 40]` for the minimum and maximum values of the column `age`.
+
+While Apache Hudi already supports partition pruning and file pruning with data skipping for different query engines, we
+recognize that the following use cases need better query performance and usability:
+
+- File pruning based on functions defined on a column
+- Efficient file pruning for files without physical partitioning
+- Effective file pruning after partition evolution, without rewriting data
+
+Next, we explain these use cases in detail.
+
+### Use Case 1: Pruning files based on functions defined on a column
+
+Let's consider a non-partitioned table containing events with a `timestamp` column. Events with naturally
+increasing time are ingested into the table with bulk inserts every hour. In this case, assume that each file
+contains rows for a particular hour:
+
+| File Name | Min of `timestamp` | Max of `timestamp` | Note |
+|-----------|--------------------|--------------------|------|
+| base_file_1.parquet | 1664582400 | 1664586000 | 2022-10-01 12-1 AM |
+| base_file_2.parquet | 1664586000 | 1664589600 | 2022-10-01 1-2 AM |
+| ... | ... | ... | ... |
+| base_file_13.parquet | 1664625600 | 1664629200 | 2022-10-01 12-1 PM |
+| base_file_14.parquet | 1664629200 | 1664632800 | 2022-10-01 1-2 PM |
+| ... | ... | ... | ... |
+| base_file_37.parquet | 1664712000 | 1664715600 | 2022-10-02 12-1 PM |
+| base_file_38.parquet | 1664715600 | 1664719200 | 2022-10-02 1-2 PM |
+
+For a query to get the number of events between 12 PM and 2 PM each day in a month for time-of-day analysis, the
+predicates look like `DATE_FORMAT(timestamp, '%Y-%m-%d') between '2022-10-01' and '2022-10-31'`
+and `DATE_FORMAT(timestamp, '%H') between '12' and '13'`. If the data is in a good layout as above, we only need to
+scan two files (instead of 24 files) for each day of data, e.g., `base_file_13.parquet` and `base_file_14.parquet`
+containing the data for 2022-10-01 12-2 PM.
+
+Currently, such fine-grained file pruning based on a function on a column cannot be achieved in Hudi, because
+transforming the `timestamp` to the hour of day is not order-preserving, so file pruning cannot directly leverage
+the file-level column stats of the original `timestamp` column. In this case, Hudi has to scan all the files for a
+day and push the predicate down when reading parquet files, increasing the amount of data to be scanned.
+
+### Use Case 2: Efficient file pruning for files without physical partitioning
+
+Let's consider the same non-partitioned table as in Use Case 1, containing the events with a
[GitHub] [hudi] hudi-bot commented on pull request #7236: [MINOR] Fix the npe caused by alter table add column.
hudi-bot commented on PR #7236: URL: https://github.com/apache/hudi/pull/7236#issuecomment-1319538703 ## CI report: * 80ffbed9a906d526cdf712942cb2cd52309e1f17 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #7237: [HUDI-5237] Support for HoodieUnMergedLogRecordScanner with InternalSchema
hudi-bot commented on PR #7237: URL: https://github.com/apache/hudi/pull/7237#issuecomment-1319538717 ## CI report: * 5c08745b59494bfafa9a8591576290aaed317059 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query
hudi-bot commented on PR #7138: URL: https://github.com/apache/hudi/pull/7138#issuecomment-1319538553 ## CI report: * c3171ba5115240fd5e00a0d47cf4f9b0c182b8e6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13074) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13094) * 1c66c4283d9daf64548806289c4ccb0467976d21 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6725: [HUDI-4881] Push down filters if possible when syncing partitions to Hive
hudi-bot commented on PR #6725: URL: https://github.com/apache/hudi/pull/6725#issuecomment-1319538301 ## CI report: * 82ab5bfa3d2246beae8835178671840e4e1b77b9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13073) * f3b7a61c226d136cd232a062a0f28e085f060035 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13095) * ce84f60bac968a89090d4091845c4dd15ea70ee4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin opened a new pull request, #7238: [HUDI-3963] Cleaning up `QueueBasedExecutor` impls
alexeykudinkin opened a new pull request, #7238: URL: https://github.com/apache/hudi/pull/7238 ### Change Logs This is a follow-up PR after https://github.com/apache/hudi/pull/5416, further cleaning up some of the historically inherited artifacts. ### Impact No impact ### Risk level (write none, low medium or high below) Low ### Documentation Update N/A ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #7235: [HUDI-5148][RFC-63] RFC for Index Function
xiarixiaoyao commented on code in PR #7235: URL: https://github.com/apache/hudi/pull/7235#discussion_r1025981630

## rfc/rfc-63/rfc-63.md: @@ -0,0 +1,370 @@ (same RFC-63 hunk as quoted in the previous comment)

Review Comment: Looking forward to fine-grained pruning

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
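The min/max-based file pruning discussed in the RFC can be sketched concretely. The file stats mirror the RFC's example table; `pruneByRange` is an illustrative helper, not Hudi code:

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    // File-level column stats: {min, max} of the raw `timestamp` column,
    // taken from the RFC's example (base_file_13, base_file_14, base_file_37).
    static final long[][] FILE_STATS = {
        {1664625600L, 1664629200L}, // 2022-10-01 12-1 PM
        {1664629200L, 1664632800L}, // 2022-10-01 1-2 PM
        {1664712000L, 1664715600L}, // 2022-10-02 12-1 PM
    };

    // Keep a file only if its [min, max] range can intersect the predicate range [lo, hi].
    static List<Integer> pruneByRange(long lo, long hi) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < FILE_STATS.length; i++) {
            if (FILE_STATS[i][1] >= lo && FILE_STATS[i][0] <= hi) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A range predicate on the raw column prunes cleanly: only the two
        // 2022-10-01 files overlap [1664625600, 1664632800].
        System.out.println(pruneByRange(1664625600L, 1664632800L)); // [0, 1]

        // HOUR(timestamp) is not order-preserving, so min/max of the raw column
        // cannot evaluate `HOUR(timestamp) between 12 and 13`; the RFC's index
        // function would instead maintain stats of HOUR(timestamp) per file.
    }
}
```

This illustrates why the RFC proposes storing stats of the *transformed* column: interval-overlap pruning only works against stats of the expression actually used in the predicate.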
[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #7235: [HUDI-5148][RFC-63] RFC for Index Function
xiarixiaoyao commented on code in PR #7235: URL: https://github.com/apache/hudi/pull/7235#discussion_r1025980898

## rfc/rfc-63/rfc-63.md:

## @@ -0,0 +1,370 @@

# RFC-63: Index Function for Optimizing Query Performance

## Proposers

- @yihua
- @alexeykudinkin

## Approvers

- @vinothchandar
- @xushiyan
- @nsivabalan

## Status

JIRA: [HUDI-512](https://issues.apache.org/jira/browse/HUDI-512)

## Abstract

In this RFC, we address the problem of accelerating queries containing predicates based on functions defined on a column, by introducing the **Index Function**, a new indexing capability for efficient file pruning.

## Background

To make queries finish faster, one major optimization technique is to scan less data by pruning rows that are not needed by the query. This is usually done in two ways:

- **Partition pruning**: Partition pruning relies on a table with physical partitioning, such as Hive partitioning. A partitioned table uses a chosen column, such as the date of a `timestamp`, and stores the rows with the same date in files under the same folder or physical partition, such as `date=2022-10-01/`. When a predicate in a query references the partition column of the physical partitioning, the files in the partitions not matching the predicate are filtered out without scanning. For example, for the predicate `date between '2022-10-01' and '2022-10-02'`, partition pruning only returns the files from two partitions, `2022-10-01` and `2022-10-02`, for further processing. The granularity of the pruning is at the partition level.

- **File pruning**: File pruning carries out the pruning of data at the file level, with the help of a file-level or record-level index. For example, with a column stats index containing the minimum and maximum values of a column for each file, the files falling outside the range of values in the predicate can be pruned. For a predicate of `age < 20`, file pruning filters out a file whose column stats for `age` show `[30, 40]` as the minimum and maximum values.

While Apache Hudi already supports partition pruning and file pruning with data skipping for different query engines, we recognize that the following use cases need better query performance and usability:

- File pruning based on functions defined on a column
- Efficient file pruning for files without physical partitioning
- Effective file pruning after partition evolution, without rewriting data

Next, we explain these use cases in detail.

### Use Case 1: Pruning files based on functions defined on a column

Let's consider a non-partitioned table containing events with a `timestamp` column. The events, with naturally increasing time, are ingested into the table with bulk inserts every hour. In this case, assume that each file contains rows for a particular hour:

| File Name | Min of `timestamp` | Max of `timestamp` | Note |
|---|---|---|---|
| base_file_1.parquet | 1664582400 | 1664586000 | 2022-10-01 12-1 AM |
| base_file_2.parquet | 1664586000 | 1664589600 | 2022-10-01 1-2 AM |
| ... | ... | ... | ... |
| base_file_13.parquet | 1664625600 | 1664629200 | 2022-10-01 12-1 PM |
| base_file_14.parquet | 1664629200 | 1664632800 | 2022-10-01 1-2 PM |
| ... | ... | ... | ... |
| base_file_37.parquet | 1664712000 | 1664715600 | 2022-10-02 12-1 PM |
| base_file_38.parquet | 1664715600 | 1664719200 | 2022-10-02 1-2 PM |

For a query to get the number of events between 12 PM and 2 PM each day in a month for time-of-day analysis, the predicates look like `DATE_FORMAT(timestamp, '%Y-%m-%d') between '2022-10-01' and '2022-10-31'` and `DATE_FORMAT(timestamp, '%H') between '12' and '13'`. If the data is laid out as above, we only need to scan two files (instead of 24 files) for each day of data, e.g., `base_file_13.parquet` and `base_file_14.parquet` containing the data for 2022-10-01 12-2 PM.

Currently, such fine-grained file pruning based on a function on a column cannot be achieved in Hudi, because transforming the `timestamp` to the hour of day is not order-preserving, so the file pruning cannot directly leverage the file-level column stats of the original `timestamp` column. In this case, Hudi has to scan all the files for a day and push the predicate down when reading the parquet files, increasing the amount of data to be scanned.

### Use Case 2: Efficient file pruning for files without physical partitioning

Let's consider the same non-partitioned table as in Use Case 1, containing the events with a
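To make the pruning mechanics above concrete, here is a small illustrative sketch (plain Python, not Hudi code; the file names and stats are the hypothetical ones from the table in the RFC excerpt) showing how min/max column stats prune files for a predicate on the raw column, and why an hour-of-day predicate defeats them:

```python
from datetime import datetime, timezone

# Hypothetical per-file column stats (min/max of `timestamp`), from the table above.
file_stats = {
    "base_file_13.parquet": (1664625600, 1664629200),  # 2022-10-01 12-1 PM UTC
    "base_file_14.parquet": (1664629200, 1664632800),  # 2022-10-01 1-2 PM UTC
    "base_file_37.parquet": (1664712000, 1664715600),  # 2022-10-02 12-1 PM UTC
}

def prune_by_range(stats, lo, hi):
    """Keep files whose [min, max] stats range overlaps the predicate range [lo, hi]."""
    return [f for f, (mn, mx) in stats.items() if mx >= lo and mn <= hi]

# Predicate on the raw column: column stats prune correctly.
assert prune_by_range(file_stats, 1664625600, 1664632800) == [
    "base_file_13.parquet", "base_file_14.parquet"]

def hour_of_day(ts):
    return datetime.fromtimestamp(ts, tz=timezone.utc).hour

# hour_of_day() is NOT order-preserving: a larger timestamp can map to a
# smaller hour, so min/max stats of the raw column say nothing about the
# min/max of the transformed value.
assert hour_of_day(1664632800 - 1) == 13   # just before 2 PM on 2022-10-01
assert hour_of_day(1664668800) == 0        # later timestamp, smaller hour (midnight Oct 2)
```

This is exactly why the RFC proposes indexing the *function's* values: with min/max stats of `hour_of_day(timestamp)` per file, the same overlap check would apply to the transformed predicate.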
[jira] [Updated] (HUDI-5237) Support for HoodieUnMergedLogRecordScanner with InternalSchema
[ https://issues.apache.org/jira/browse/HUDI-5237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-5237:
    Labels: pull-request-available  (was: )

> Support for HoodieUnMergedLogRecordScanner with InternalSchema
>
>                 Key: HUDI-5237
>                 URL: https://issues.apache.org/jira/browse/HUDI-5237
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: Alexander Trushev
>            Assignee: Alexander Trushev
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, only HoodieMergedLogRecordScanner supports InternalSchema.
> Implementing schema evolution in Flink requires support for
> HoodieUnMergedLogRecordScanner with InternalSchema as well.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] trushev opened a new pull request, #7237: [HUDI-5237] Support for HoodieUnMergedLogRecordScanner with InternalSchema
trushev opened a new pull request, #7237: URL: https://github.com/apache/hudi/pull/7237 ### Change Logs Currently, only `HoodieMergedLogRecordScanner` supports InternalSchema. Implementing schema evolution in Flink (https://github.com/apache/hudi/pull/5830) requires support for `HoodieUnMergedLogRecordScanner` with InternalSchema as well. ### Impact Support for HoodieUnMergedLogRecordScanner with InternalSchema. ### Risk level (write none, low medium or high below) Low ### Documentation Update None ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-5237) Support for HoodieUnMergedLogRecordScanner with InternalSchema
Alexander Trushev created HUDI-5237: --- Summary: Support for HoodieUnMergedLogRecordScanner with InternalSchema Key: HUDI-5237 URL: https://issues.apache.org/jira/browse/HUDI-5237 Project: Apache Hudi Issue Type: Sub-task Reporter: Alexander Trushev Assignee: Alexander Trushev Currently, only HoodieMergedLogRecordScanner supports InternalSchema. Implementing schema evolution in Flink requires support for HoodieUnMergedLogRecordScanner with InternalSchema as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] danny0405 commented on a diff in pull request #7231: [HUDI-5234] streaming read skip clustering
danny0405 commented on code in PR #7231: URL: https://github.com/apache/hudi/pull/7231#discussion_r1025961537

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java:

## @@ -482,13 +487,12 @@ private List filterInstantsWithRange(
       final String endCommit = this.conf.get(FlinkOptions.READ_END_COMMIT);
       instantStream = instantStream.filter(s -> HoodieTimeline.compareTimestamps(s.getTimestamp(), LESSER_THAN_OR_EQUALS, endCommit));
     }
-    return maySkipCompaction(instantStream).collect(Collectors.toList());
+    return maySkipOverwriteInstants(instantStream).collect(Collectors.toList());
   }

-  private Stream maySkipCompaction(Stream instants) {
-    return this.skipCompaction
-        ? instants.filter(instant -> !instant.getAction().equals(HoodieTimeline.COMMIT_ACTION))
-        : instants;
+  private Stream maySkipOverwriteInstants(Stream instants) {
+    return instants.filter(instant -> !this.skipCompaction || !instant.getAction().equals(HoodieTimeline.COMPACTION_ACTION))
+        .filter(instant -> !this.skipClustering|| !instant.getAction().equals(HoodieTimeline.REPLACE_COMMIT_ACTION));
   }

Review Comment: `!this.skipClustering||` -> `!this.skipClustering ||`

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] scxwhite opened a new pull request, #7236: [MINOR] Fix the npe caused by alter table add column.
scxwhite opened a new pull request, #7236: URL: https://github.com/apache/hudi/pull/7236 ### Change Logs We put the public configuration in the file /etc/hudi/conf/hudi-defaults.conf. When optimistic locking is enabled, the `alter table add column` command causes an NPE. ![image](https://user-images.githubusercontent.com/23207189/202611981-784680e5-1b33-42a3-9afb-74a13807b222.png) I found that because BaseHoodieWriteClient#preWrite was not executed in the AlterHoodieTableAddColumnsCommand class, pendingInflightAndRequestedInstants was not initialized. This PR fixes that. ![image](https://user-images.githubusercontent.com/23207189/202612457-cb94c86a-4ae0-4a89-a124-0faae0dccb1a.png) ### Impact `alter table add/change column` when optimistic locking is enabled. ### Risk level (write none, low medium or high below) low ### Documentation Update ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-5233) Fix bug when InternalSchemaUtils.collectTypeChangedCols returns all columns
[ https://issues.apache.org/jira/browse/HUDI-5233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Trushev closed HUDI-5233. --- Resolution: Fixed Fixed via master branch: e4e28836c235f96edf4c38a75dd6e95beeaecb27 > Fix bug when InternalSchemaUtils.collectTypeChangedCols returns all columns > --- > > Key: HUDI-5233 > URL: https://issues.apache.org/jira/browse/HUDI-5233 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Alexander Trushev >Assignee: Alexander Trushev >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > > InternalSchemaUtils.collectTypeChangedCols returns all columns instead of > changed ones -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] danny0405 commented on a diff in pull request #7232: [HUDI-5235] clustering target size should larger than small file limit
danny0405 commented on code in PR #7232: URL: https://github.com/apache/hudi/pull/7232#discussion_r1025957414

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/ClusteringOperator.java:

## @@ -136,7 +136,13 @@ public ClusteringOperator(Configuration conf, RowType rowType) {
     // override max parquet file size in conf
     this.conf.setLong(HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key(),
-        this.conf.getLong(FlinkOptions.CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES));
+        Integer.MAX_VALUE);
+
+    // target size should larger than small file limit
+    this.conf.setLong(FlinkOptions.CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT.key(),
+        this.conf.getLong(FlinkOptions.CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES) > this.conf.getLong(FlinkOptions.CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT) ?
+        this.conf.getLong(FlinkOptions.CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT) :

Review Comment: Seems reasonable.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
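The invariant the patch discussed above enforces — the small-file limit used in clustering planning must stay below the target file size, otherwise every file counts as "small" and gets endlessly re-clustered — can be sketched in illustrative Python (not the Flink operator code; the fallback value when the configured limit exceeds the target is an assumption, since the diff is truncated):

```python
def effective_small_file_limit(target_file_max_bytes, small_file_limit):
    """Clamp the small-file limit so it never reaches the clustering target size.

    If the configured limit already sits below the target, keep it; otherwise
    fall back to half the target (assumed fallback, not from the patch).
    """
    if small_file_limit < target_file_max_bytes:
        return small_file_limit
    return target_file_max_bytes // 2  # assumed fallback, not from the patch

MB = 1024 * 1024

# A 120 MB small-file limit with a 1 GB target is kept as-is.
assert effective_small_file_limit(1024 * MB, 120 * MB) == 120 * MB

# A limit at or above the target would make every output file "small" again,
# so it is clamped below the target.
assert effective_small_file_limit(128 * MB, 256 * MB) == 64 * MB
```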
[GitHub] [hudi] danny0405 commented on a diff in pull request #7232: [HUDI-5235] clustering target size should larger than small file limit
danny0405 commented on code in PR #7232: URL: https://github.com/apache/hudi/pull/7232#discussion_r1025957285

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/ClusteringOperator.java:

## @@ -136,7 +136,13 @@ public ClusteringOperator(Configuration conf, RowType rowType) {
     // override max parquet file size in conf
     this.conf.setLong(HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key(),
-        this.conf.getLong(FlinkOptions.CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES));
+        Integer.MAX_VALUE);

Review Comment: Don't think it is right yet.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-5233) Fix bug when InternalSchemaUtils.collectTypeChangedCols returns all columns
[ https://issues.apache.org/jira/browse/HUDI-5233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Trushev updated HUDI-5233: Fix Version/s: 0.13.0 > Fix bug when InternalSchemaUtils.collectTypeChangedCols returns all columns > --- > > Key: HUDI-5233 > URL: https://issues.apache.org/jira/browse/HUDI-5233 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Alexander Trushev >Assignee: Alexander Trushev >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > > InternalSchemaUtils.collectTypeChangedCols returns all columns instead of > changed ones -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #7222: [MINOR] fixed Flink's DataStream does not support creating managed table
hudi-bot commented on PR #7222: URL: https://github.com/apache/hudi/pull/7222#issuecomment-1319501614 ## CI report: * 17aef066b20b39a90f6d22f243fd3cbb58004e68 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13078) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13081) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13079) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13091) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6725: [HUDI-4881] Push down filters if possible when syncing partitions to Hive
hudi-bot commented on PR #6725: URL: https://github.com/apache/hudi/pull/6725#issuecomment-1319498561 ## CI report: * 82ab5bfa3d2246beae8835178671840e4e1b77b9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13073) * f3b7a61c226d136cd232a062a0f28e085f060035 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13095) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #7159: [HUDI-5173]Skip if there is only one file in clusteringGroup
danny0405 commented on PR #7159: URL: https://github.com/apache/hudi/pull/7159#issuecomment-1319497037 Thanks for the contribution, I have reviewed and created a patch here: [5173.zip](https://github.com/apache/hudi/files/10037492/5173.zip) You can apply the patch with cmd: `git apply xxx.patch`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6725: [HUDI-4881] Push down filters if possible when syncing partitions to Hive
hudi-bot commented on PR #6725: URL: https://github.com/apache/hudi/pull/6725#issuecomment-1319495871 ## CI report: * 82ab5bfa3d2246beae8835178671840e4e1b77b9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13073) * f3b7a61c226d136cd232a062a0f28e085f060035 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] YannByron commented on pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query
YannByron commented on PR #7138: URL: https://github.com/apache/hudi/pull/7138#issuecomment-1319486636 Nice work. Looks good; just left two comments to resolve. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] YannByron commented on a diff in pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query
YannByron commented on code in PR #7138: URL: https://github.com/apache/hudi/pull/7138#discussion_r1025940652

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:

## @@ -18,7 +18,6 @@ package org.apache.hudi
 import org.apache.hadoop.fs.Path
-

Review Comment: please keep the import code style that separates the different packages with a blank line.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] liaooo commented on pull request #3771: [HUDI-2402] Add Kerberos configuration options to Hive Sync
liaooo commented on PR #3771: URL: https://github.com/apache/hudi/pull/3771#issuecomment-1319480279 So does HiveSync support Kerberos now or not? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-5023) Evaluate removing Queueing in the write path
[ https://issues.apache.org/jira/browse/HUDI-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-5023:
    Sprint: 2022/11/15  (was: 2022/11/29)

> Evaluate removing Queueing in the write path
>
>                 Key: HUDI-5023
>                 URL: https://issues.apache.org/jira/browse/HUDI-5023
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: writer-core
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
> We should evaluate removing _any queueing_ (BoundedInMemoryQueue, DisruptorQueue) on the write path, for multiple reasons:
> *It breaks up the vertical chain of transformations applied to data*
> Spark (as well as other engines) relies on the notion of _Iteration_ to vertically compose all transformations applied to a single record, allowing for effective _stream_ processing, where all transformations are applied to an _Iterator, yielding records_ from the source. That way:
> # The chain of transformations* is applied to every record one by one, which effectively limits the amount of memory used to the number of records being read and processed simultaneously (if the reading is not batched, it'd be just a single record), which in turn allows
> # Limiting the # of memory allocations required to process a single record.
> Consider the opposite: if we did it breadth-wise, applying the first transformation to _all_ of the records, we would have to store all of the transformed records in memory, which is costly from both GC-overhead and pure object-churn perspectives.
>
> Enqueueing essentially violates both of these invariants, breaking up the {_}stream{_}-like processing model and forcing records to be kept in memory for no good reason.
>
> * This chain is broken up at shuffling points (the collection of tasks executed b/w these shuffling points are called stages in Spark)
>
> *It requires data to be allocated on the heap*
> As was called out in the previous paragraph, enqueueing raw data read from the source breaks up the _stream_ processing paradigm and forces records to be persisted on the heap.
> Consider the following example: the plain ParquetReader from Spark actually uses a *mutable* `ColumnarBatchRow` providing a row-based view into the batch of data being read from the file.
> Now, since it's a mutable object we can use it to _iterate_ over all of the records (while doing stream-processing), ultimately producing some "output" (either writing into another file, a shuffle block, etc.), but we +can't keep a reference on it+ (for ex, by +enqueueing+ it) – since the object is mutable. Instead we are forced to make a *copy* of it, which will obviously require us to allocate it on the heap.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
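The stream-vs-queue distinction the ticket describes can be illustrated with a small sketch (plain Python, not Hudi/Spark code): a lazy iterator pipeline keeps at most one record alive at a time, while enqueueing between stages forces every intermediate record to be materialized on the heap:

```python
def records(n):
    # Simulate lazily yielding records from a source file.
    for i in range(n):
        yield {"id": i}

def transform(rec):
    # One step of a vertically composed transformation chain.
    return {**rec, "doubled": rec["id"] * 2}

# Stream style: transformations compose over an iterator; pulling one record
# applies the whole chain to it without touching the other 999,999.
streamed = (transform(r) for r in records(1_000_000))
assert next(streamed) == {"id": 0, "doubled": 0}

# Queue style: materializing the intermediate collection (which a bounded
# queue between stages approximates) holds every transformed record in
# memory at once — the breadth-wise processing the ticket warns against.
queued = [transform(r) for r in records(1_000)]
assert len(queued) == 1000
```

The same reason explains the mutable-buffer point: an iterator can reuse one mutable row view safely, but enqueueing a reference to it would require a defensive heap copy per record.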
[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers
[ https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4937: -- Sprint: 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15 (was: 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/29) > Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT > readers > - > > Key: HUDI-4937 > URL: https://issues.apache.org/jira/browse/HUDI-4937 > Project: Apache Hudi > Issue Type: Bug > Components: reader-core, writer-core >Affects Versions: 0.12.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > Fix For: 0.13.0 > > > Currently, `HoodieTable` is holding a `HoodieBackedTableMetadata` that is set up > not to reuse the actual LogScanner and HFileReader used to read the MT itself. > This is proving to be wasteful on a number of occasions already, including > (not an exhaustive list): > https://github.com/apache/hudi/issues/6373 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4911) Make sure LogRecordReader doesn't flush the cache before each lookup
[ https://issues.apache.org/jira/browse/HUDI-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4911: -- Sprint: 2022/11/15 (was: 2022/11/29) > Make sure LogRecordReader doesn't flush the cache before each lookup > > > Key: HUDI-4911 > URL: https://issues.apache.org/jira/browse/HUDI-4911 > Project: Apache Hudi > Issue Type: Bug > Components: reader-core >Affects Versions: 0.12.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Critical > Labels: pull-request-available > Fix For: 0.13.0 > > > Currently {{HoodieMetadataMergedLogRecordReader}} will flush its internal record > cache before each lookup, which makes every lookup essentially > re-process the whole log-blocks stack again. > We should avoid that and only do the re-parsing incrementally (for the keys > that aren't already cached) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] trushev commented on a diff in pull request #6358: [HUDI-4588][HUDI-4472] Addressing schema handling issues in the write path
trushev commented on code in PR #6358: URL: https://github.com/apache/hudi/pull/6358#discussion_r1025925085

## hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/table/HoodieFlinkCopyOnWriteTable.java:

## @@ -378,7 +377,7 @@ protected Iterator> handleUpdateInternal(HoodieMergeHandle

Review Comment: https://github.com/apache/hudi/pull/5830. The day will come when one of us will have to resolve a merge conflict with the master branch. I'd prefer to do it ASAP because no `FlinkMergeHelper` means no conflict:)

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (HUDI-4919) Sql MERGE INTO incurs too much memory overhead
[ https://issues.apache.org/jira/browse/HUDI-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin reassigned HUDI-4919: - Assignee: sivabalan narayanan (was: Alexey Kudinkin) > Sql MERGE INTO incurs too much memory overhead > -- > > Key: HUDI-4919 > URL: https://issues.apache.org/jira/browse/HUDI-4919 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Fix For: 0.13.0 > > > When using spark-sql MERGE INTO, the memory requirement shoots up. To merge new > incoming data for a 120MB parquet file, the memory requirement shoots up beyond 10GB. > > from user: > We are trying to process some input data of 5 GB (Parquet, snappy > compression), and this will insert/update the Hudi table for 4 days (day is the > partition). > Our data size in the Hudi target table for each partition is around 3.5GB to > 10GB. We are trying to process the data and our process keeps failing with > OOM (java.lang.OutOfMemoryError: GC overhead limit exceeded). > We have tried with 32GB and 64GB of executor memory as well, with 3 cores. > Our process runs fine when we have fewer updates and more inserts. > > > Got a brief of the issue again: > It's a partitioned dataset, and each partition roughly has 3.5 to 10GB of > data. Max parquet file size is the default, so 120MB files max. > The input batch is spread across the last 3 to 4 partitions. > Incoming data is 5GB parquet compressed. > User tried giving close to 20GB per task (64 GB executor w/ 3 cores) and > still hit memory issues and failed. > If the incoming batch has fewer updates, it works; else it fails w/ OOM. > Tried w/ both BLOOM and simple, but it did not work. > Similar incremental ingestion works w/o any issues w/ spark-ds writes. The issue > is only w/ MERGE INTO w/ spark-sql.
> > Specifics about the table schema: > Table has around 50 columns and there are no nested fields > All data types are generic once like String,Timestamp,Decimal > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4261: -- Sprint: 2022/08/22, 2022/09/05, 2022/10/18, 2022/11/29 (was: 2022/08/22, 2022/09/05, 2022/10/18) > OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of > partitions > - > > Key: HUDI-4261 > URL: https://issues.apache.org/jira/browse/HUDI-4261 > Project: Apache Hudi > Issue Type: Bug >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.13.0 > > Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png > > > While experimenting w/ bulk-inserting i've stumbled upon an OOM failure when > you do bulk-insert w/ sort-mode "NONE" for the table w/ large number of > partitions (> 1000). > > This happens for the same reasons as HUDI-3883: every logical partition > (let's say we have N of these, equal to shuffling-parallelism in Hudi) > handled by Spark, (since no re-partitioning is done to align with the actual > partition-column) will likely have a record from every physical partition on > disk (let's say we have M of these). B/c of that every logical partition will > be writing into every physical one. > This will eventually produce > # M * N files in the table > # For every file in the table while writing Hudi will keep a "handle" in > memory which in turn will hold full buffer worth of Parquet data (until > flushed). > This ultimately leads to an OOM. > > !Screen Shot 2022-06-15 at 6.06.06 PM.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
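The M * N blow-up the ticket describes can be checked with quick back-of-the-envelope arithmetic (illustrative numbers, not measurements from the report; the per-handle buffer size and executor count are assumptions):

```python
# With "NONE" sort mode and no repartitioning by the partition column, each of
# the N logical (shuffle) partitions likely receives records for all M
# physical partitions, so every logical partition writes into every physical one.
n_logical = 200                   # shuffle parallelism (assumed)
m_physical = 1000                 # physical partitions on disk (">1000" per the ticket)
buffer_bytes = 16 * 1024 * 1024   # per-handle Parquet write buffer (assumed)

open_handles = n_logical * m_physical   # files (and in-memory handles) produced
assert open_handles == 200_000

# Even split across a modest cluster still leaves an implausible heap per executor.
n_executors = 10                        # assumed
handles_per_executor = open_handles // n_executors
heap_needed = handles_per_executor * buffer_bytes

assert handles_per_executor == 20_000
# 20,000 handles x 16 MiB of buffered Parquet data: hundreds of GB per
# executor before anything is flushed — an OOM is guaranteed.
```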
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4261: -- Sprint: 2022/08/22, 2022/09/05, 2022/10/18 (was: 2022/08/22, 2022/09/05, 2022/10/18, 2022/11/15) > OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of > partitions > - > > Key: HUDI-4261 > URL: https://issues.apache.org/jira/browse/HUDI-4261 > Project: Apache Hudi > Issue Type: Bug >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.13.0 > > Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png > > > While experimenting w/ bulk-inserting i've stumbled upon an OOM failure when > you do bulk-insert w/ sort-mode "NONE" for the table w/ large number of > partitions (> 1000). > > This happens for the same reasons as HUDI-3883: every logical partition > (let's say we have N of these, equal to shuffling-parallelism in Hudi) > handled by Spark, (since no re-partitioning is done to align with the actual > partition-column) will likely have a record from every physical partition on > disk (let's say we have M of these). B/c of that every logical partition will > be writing into every physical one. > This will eventually produce > # M * N files in the table > # For every file in the table while writing Hudi will keep a "handle" in > memory which in turn will hold full buffer worth of Parquet data (until > flushed). > This ultimately leads to an OOM. > > !Screen Shot 2022-06-15 at 6.06.06 PM.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] YannByron commented on a diff in pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query
YannByron commented on code in PR #7138: URL: https://github.com/apache/hudi/pull/7138#discussion_r1025925016 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/streaming/HoodieStreamSource.scala: ## @@ -72,57 +68,21 @@ class HoodieStreamSource( parameters.get(DataSourceReadOptions.QUERY_TYPE.key).contains(DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL) && parameters.get(DataSourceReadOptions.INCREMENTAL_FORMAT.key).contains(DataSourceReadOptions.INCREMENTAL_FORMAT_CDC_VAL) - @transient private var lastOffset: HoodieSourceOffset = _ - @transient private lazy val initialOffsets = { -val metadataLog = - new HDFSMetadataLog[HoodieSourceOffset](sqlContext.sparkSession, metadataPath) { -override def serialize(metadata: HoodieSourceOffset, out: OutputStream): Unit = { - val writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8)) - writer.write("v" + VERSION + "\n") - writer.write(metadata.json) - writer.flush() -} - -/** - * Deserialize the init offset from the metadata file. 
- * The format in the metadata file is like this: - * -- - * v1 -- The version info in the first line - * offsetJson -- The json string of HoodieSourceOffset in the rest of the file - * --- - * @param in - * @return - */ -override def deserialize(in: InputStream): HoodieSourceOffset = { - val content = FileIOUtils.readAsUTFString(in) - // Get version from the first line - val firstLineEnd = content.indexOf("\n") - if (firstLineEnd > 0) { -val version = getVersion(content.substring(0, firstLineEnd)) -if (version > VERSION) { - throw new IllegalStateException(s"UnSupportVersion: max support version is: $VERSION" + -s" current version is: $version") -} -// Get offset from the rest line in the file -HoodieSourceOffset.fromJson(content.substring(firstLineEnd + 1)) - } else { -throw new IllegalStateException(s"Bad metadata format, failed to find the version line.") - } -} - } +val metadataLog = new HoodieMetadataLog(sqlContext.sparkSession, metadataPath) metadataLog.get(0).getOrElse { - metadataLog.add(0, INIT_OFFSET) - INIT_OFFSET -} - } - - private def getVersion(versionLine: String): Int = { -if (versionLine.startsWith("v")) { - versionLine.substring(1).toInt -} else { - throw new IllegalStateException(s"Illegal version line: $versionLine " + -s"in the streaming metadata path") + val offset = offsetRangeLimit match { +case HoodieEarliestOffsetRangeLimit => + INIT_OFFSET +case HoodieLatestOffsetRangeLimit => + getLatestOffset.getOrElse(throw new HoodieException("Cannot fetch latest offset from table, " + Review Comment: can we use INIT_OFFSET when `getLatestOffset` is empty ? I mean `getLatestOffset.getOrElse(INIT_OFFSET)`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-5148) Write RFC for index function
[ https://issues.apache.org/jira/browse/HUDI-5148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-5148:
Status: Patch Available (was: In Progress)

> Write RFC for index function
> ----------------------------
>
>         Key: HUDI-5148
>         URL: https://issues.apache.org/jira/browse/HUDI-5148
>     Project: Apache Hudi
>  Issue Type: Task
>    Reporter: Ethan Guo
>    Assignee: Ethan Guo
>    Priority: Blocker
>      Labels: pull-request-available
>     Fix For: 0.13.0

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5210) End-to-end PoC of index function
[ https://issues.apache.org/jira/browse/HUDI-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-5210:
Status: In Progress (was: Open)

> End-to-end PoC of index function
> --------------------------------
>
>         Key: HUDI-5210
>         URL: https://issues.apache.org/jira/browse/HUDI-5210
>     Project: Apache Hudi
>  Issue Type: Task
>    Reporter: Ethan Guo
>    Assignee: Ethan Guo
>    Priority: Major
[jira] [Closed] (HUDI-4812) Lazy partition listing and file groups fetching in Spark Query
[ https://issues.apache.org/jira/browse/HUDI-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu closed HUDI-4812.
Resolution: Done

> Lazy partition listing and file groups fetching in Spark Query
> --------------------------------------------------------------
>
>         Key: HUDI-4812
>         URL: https://issues.apache.org/jira/browse/HUDI-4812
>     Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>    Reporter: Yuwei Xiao
>    Assignee: Yuwei Xiao
>    Priority: Blocker
>      Labels: pull-request-available
>     Fix For: 0.13.0
>
> In the current Spark query implementation, the FileIndex refreshes and loads
> all file groups into its cache in order to serve subsequent queries.
>
> For a large table with many partitions, this may introduce significant
> overhead during initialization. Meanwhile, the query itself may come with a
> partition filter, making the eager loading of all file groups unnecessary.
>
> To optimize this, the whole refresh logic becomes lazy: the actual work is
> carried out only after the partition filter has been applied.
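The lazy refresh described in HUDI-4812 amounts to memoizing the expensive per-partition file-group listing and doing it only for partitions that survive the query's partition filter. A hand-rolled illustration of that pattern follows — class and method names are hypothetical, not Hudi's actual FileIndex code.

```python
# Sketch of lazy per-partition file-group loading (illustrative, not Hudi's
# FileIndex): partition paths are known up front, but file groups are
# fetched only for partitions that pass the query's partition filter.
class LazyFileIndex:
    def __init__(self, partitions, list_files):
        self._partitions = partitions      # cheap: partition paths only
        self._list_files = list_files      # expensive per-partition listing
        self._cache = {}                   # partition -> file groups
        self.listings = 0                  # count of actual listing calls

    def files_for(self, partition_filter):
        out = []
        for p in self._partitions:
            if not partition_filter(p):
                continue                   # filtered partitions are never listed
            if p not in self._cache:       # list each partition at most once
                self._cache[p] = self._list_files(p)
                self.listings += 1
            out.extend(self._cache[p])
        return out

index = LazyFileIndex(
    ["dt=2022-11-16", "dt=2022-11-17"],
    lambda p: [f"{p}/file-0.parquet"],
)
files = index.files_for(lambda p: p.endswith("11-17"))
assert files == ["dt=2022-11-17/file-0.parquet"]
assert index.listings == 1  # the filtered-out partition was never listed
```

The win is proportional to the filter's selectivity: a query touching one partition of a thousand pays for one listing instead of a thousand at index-initialization time.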
[GitHub] [hudi] hudi-bot commented on pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query
hudi-bot commented on PR #7138: URL: https://github.com/apache/hudi/pull/7138#issuecomment-1319460803

## CI report:

* c3171ba5115240fd5e00a0d47cf4f9b0c182b8e6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13074) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13094)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] boneanxs commented on pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query
boneanxs commented on PR #7138: URL: https://github.com/apache/hudi/pull/7138#issuecomment-1319460578

@hudi-bot run azure