Re: [PR] [HUDI-7980] Optimize the configuration content when performing clustering with row writer [hudi]
hudi-bot commented on PR #11614: URL: https://github.com/apache/hudi/pull/11614#issuecomment-169886 ## CI report: * 305cfba4c163a2d70bcbeff8029c9f2a2d205a3c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24819) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7980] Optimize the configuration content when performing clustering with row writer [hudi]
hudi-bot commented on PR #11614: URL: https://github.com/apache/hudi/pull/11614#issuecomment-154563 ## CI report: * 305cfba4c163a2d70bcbeff8029c9f2a2d205a3c UNKNOWN
Re: [PR] [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class [hudi]
hudi-bot commented on PR #11612: URL: https://github.com/apache/hudi/pull/11612#issuecomment-143878 ## CI report: * 8669c1c9afa99b08f866c97ed18eac0446cb1b36 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24816)
Re: [I] Setup Python release pipeline [hudi-rs]
xushiyan closed issue #42: Setup Python release pipeline URL: https://github.com/apache/hudi-rs/issues/42
[jira] [Updated] (HUDI-7980) Optimize the configuration content when performing clustering with row writer
[ https://issues.apache.org/jira/browse/HUDI-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7980: - Labels: pull-request-available (was: ) > Optimize the configuration content when performing clustering with row writer > - > > Key: HUDI-7980 > URL: https://issues.apache.org/jira/browse/HUDI-7980 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ma Jian >Priority: Major > Labels: pull-request-available > > Currently, the row writer defaults to snapshot reads for all tables. However, > this method is relatively inefficient for MOR (Merge on Read) tables when > there are no logs. Therefore, we should optimize this part of the > configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7980] Optimize the configuration content when performing clustering with row writer [hudi]
majian1998 opened a new pull request, #11614: URL: https://github.com/apache/hudi/pull/11614 Currently, the row writer defaults to snapshot reads for all tables. However, this method is relatively inefficient for MOR tables when there are no logs. Additionally, we have already configured the glob path for queries. We need to read all the files from the glob path without requiring additional configurations for time travel queries. Therefore, we should optimize this part of the configuration. ### Change Logs None ### Impact None ### Risk level (write none, low medium or high below) low ### Documentation Update None ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
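The optimization described above hinges on one observation: a MOR file slice with no log files can be read like a plain columnar table, so the more expensive snapshot (merge) read is only needed when log files exist. A minimal sketch of that decision in Python (the helper and mode names are illustrative, not Hudi's actual classes or configuration keys):

```python
# Illustrative sketch only: pick a cheaper read path for clustering input
# when a MOR file slice carries no log files. Names are hypothetical and
# do not correspond to Hudi's actual code or configs.

def choose_query_type(table_type: str, log_files: list) -> str:
    """Return the read mode a row-writer clustering job could use."""
    if table_type == "MERGE_ON_READ" and log_files:
        # Log files present: records must be merged, so a snapshot read is required.
        return "snapshot"
    # COW tables, or MOR slices with base files only: a plain columnar scan suffices.
    return "read_optimized"

print(choose_query_type("MERGE_ON_READ", []))          # read_optimized
print(choose_query_type("MERGE_ON_READ", ["f.log.1"]))  # snapshot
print(choose_query_type("COPY_ON_WRITE", []))           # read_optimized
```

The point of the PR is exactly this branch: skipping the merge path for log-free slices avoids unnecessary overhead during clustering.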
Re: [I] [SUPPORT]Exception when executing log compaction : Unsupported Operation Exception [hudi]
xuzifu666 commented on issue #10982: URL: https://github.com/apache/hudi/issues/10982#issuecomment-107992 This issue can be closed now. @xushiyan @danny0405
[jira] [Created] (HUDI-7980) Optimize the configuration content when performing clustering with row writer
Ma Jian created HUDI-7980: - Summary: Optimize the configuration content when performing clustering with row writer Key: HUDI-7980 URL: https://issues.apache.org/jira/browse/HUDI-7980 Project: Apache Hudi Issue Type: Improvement Reporter: Ma Jian Currently, the row writer defaults to snapshot reads for all tables. However, this method is relatively inefficient for MOR (Merge on Read) tables when there are no logs. Therefore, we should optimize this part of the configuration.
(hudi-rs) branch main updated: build: add release workflow (#63)
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/hudi-rs.git

The following commit(s) were added to refs/heads/main by this push:
     new cb9f4d8  build: add release workflow (#63)
cb9f4d8 is described below

commit cb9f4d8a64fd6e8c2e06fc2c79a6c6740eeadba4
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Thu Jul 11 00:52:47 2024 -0500

    build: add release workflow (#63)

    - Add `release.yml` workflow in github to publish artifacts to crates.io and pypi.org
    - Add various metadata to populate artifact repository main pages
    - Fix python artifacts issues wrt macos aarch64

    Closes #41 #42
---
 .github/workflows/release.yml | 146 ++
 README.md                     |  34 +-
 crates/core/Cargo.toml        |   5 ++
 crates/datafusion/Cargo.toml  |   7 +-
 crates/hudi/Cargo.toml        |  12 +++-
 crates/tests/Cargo.toml       |   5 ++
 python/.cargo/config.toml     |   1 +
 python/Cargo.toml             |  12 ++--
 python/README.md              |  46 -
 python/pyproject.toml         |   7 +-
 10 files changed, 199 insertions(+), 76 deletions(-)

diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
new file mode 100644
index 000..e7aa911
--- /dev/null
+++ b/.github/workflows/release.yml
@@ -0,0 +1,146 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Publish artifacts
+
+on:
+  push:
+    tags:
+      - 'release-[0-9]+.[0-9]+.[0-9]+**'
+
+jobs:
+  validate-release-tag:
+    name: Validate git tag
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: compare git tag with cargo metadata
+        run: |
+          CURR_VER=$( grep version Cargo.toml | head -n 1 | awk '{print $3}' | tr -d '"' )
+          if [[ "${GITHUB_REF_NAME}" != "release-${CURR_VER}" ]]; then
+            echo "Pushed tag ${GITHUB_REF_NAME} does not match with the Cargo package version ${CURR_VER}."
+            exit 1
+          fi
+
+  release-crates:
+    name: Release to crates.io
+    needs: validate-release-tag
+    runs-on: ubuntu-latest
+    strategy:
+      max-parallel: 1
+      matrix:
+        # order matters here as later crates depend on previous ones
+        package:
+          - "hudi-core"
+          - "hudi-datafusion"
+          - "hudi"
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions-rs/toolchain@v1
+        with:
+          profile: minimal
+          toolchain: stable
+          override: true
+
+      - name: cargo publish
+        uses: actions-rs/cargo@v1
+        env:
+          CARGO_REGISTRY_TOKEN: ${{ secrets.CARGO_REGISTRY_TOKEN }}
+        with:
+          command: publish
+          args: -p ${{ matrix.package }} --all-features
+
+  release-pypi-mac:
+    name: PyPI release on Mac
+    needs: validate-release-tag
+    strategy:
+      fail-fast: false
+      matrix:
+        target: [ x86_64-apple-darwin, aarch64-apple-darwin ]
+    runs-on: macos-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.12'
+
+      - name: Publish to pypi (without sdist)
+        uses: PyO3/maturin-action@v1
+        env:
+          MATURIN_PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }}
+          MATURIN_REPOSITORY: pypi
+        with:
+          target: ${{ matrix.target }}
+          command: publish
+          args: --skip-existing -m python/Cargo.toml --no-sdist
+
+  release-pypi-windows:
+    name: PyPI release on Windows
+    needs: validate-release-tag
+    runs-on: windows-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.12'
+
+      - name: Publish to pypi (without sdist)
+        uses: PyO3/maturin-action@v1
+        env:
+          MATURIN_PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }}
+          MATURIN_REPOSITORY: pypi
+        with:
+          target: x86_64-pc-windows-msvc
+          command: publish
+          args: --skip-existing -m python/Cargo.toml --no-sdist
+
+  release-pypi-manylinux:
+    name: PyPI rele
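The `validate-release-tag` job above derives the package version from `Cargo.toml` and requires the pushed tag to equal `release-<version>`. The same check can be sketched outside the workflow; this is a simplified stand-in for the grep/awk pipeline that only matches an explicit `version = "..."` line:

```python
import re

def cargo_version(cargo_toml: str) -> str:
    """First `version = "..."` value in Cargo.toml text (what the workflow greps for)."""
    match = re.search(r'^version\s*=\s*"([^"]+)"', cargo_toml, re.MULTILINE)
    if match is None:
        raise ValueError("no version field found")
    return match.group(1)

def tag_matches(tag: str, cargo_toml: str) -> bool:
    """True when the pushed tag equals release-<package version>."""
    return tag == f"release-{cargo_version(cargo_toml)}"

toml = '[package]\nname = "hudi"\nversion = "0.1.0"\n'
print(tag_matches("release-0.1.0", toml))  # True
print(tag_matches("release-0.2.0", toml))  # False
```

Failing this check aborts the whole pipeline early, so a mistyped tag never publishes to crates.io or PyPI.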
Re: [I] Setup Rust release pipeline [hudi-rs]
xushiyan closed issue #41: Setup Rust release pipeline URL: https://github.com/apache/hudi-rs/issues/41
Re: [PR] build: add release workflow [hudi-rs]
xushiyan merged PR #63: URL: https://github.com/apache/hudi-rs/pull/63
Re: [PR] build: add release workflow [hudi-rs]
codecov[bot] commented on PR #63: URL: https://github.com/apache/hudi-rs/pull/63#issuecomment-080421

## [Codecov](https://app.codecov.io/gh/apache/hudi-rs/pull/63?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report
All modified and coverable lines are covered by tests :white_check_mark:
> Project coverage is 87.19%. Comparing base [(`2bb004b`)](https://app.codecov.io/gh/apache/hudi-rs/commit/2bb004b48efb5624813671e38c890c6abff01712?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`b7e85d2`)](https://app.codecov.io/gh/apache/hudi-rs/commit/b7e85d285b458562063b26a7b4759619a7e8cfce?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).

Additional details and impacted files
```diff
@@           Coverage Diff           @@
##             main      #63   +/-  ##
======================================
  Coverage   87.19%   87.19%
======================================
  Files          13       13
  Lines         687      687
======================================
  Hits          599      599
  Misses         88       88
```

[:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/hudi-rs/pull/63?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
[PR] build: add release workflow [hudi-rs]
xushiyan opened a new pull request, #63: URL: https://github.com/apache/hudi-rs/pull/63 - Add `release.yml` workflow in github to publish artifacts to crates.io and pypi.org - Add various metadata to populate artifact repository main pages - Fix python artifacts issues wrt macos aarch64 Closes #41 #42
Re: [PR] [HUDI-304] enable spotless for hudi [hudi]
HuangZhenQiu closed pull request #11613: [HUDI-304] enable spotless for hudi URL: https://github.com/apache/hudi/pull/11613
Re: [PR] [HUDI-304] enable spotless for hudi [hudi]
hudi-bot commented on PR #11613: URL: https://github.com/apache/hudi/pull/11613#issuecomment-024178 ## CI report: * b1246eca2843a2bf080d0e9df74ffa7045a5935c UNKNOWN * 47c7f9f4728cfb71f1887b65d1befcde718aee34 UNKNOWN
Re: [PR] [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class [hudi]
hudi-bot commented on PR #11612: URL: https://github.com/apache/hudi/pull/11612#issuecomment-024075 ## CI report: * 8669c1c9afa99b08f866c97ed18eac0446cb1b36 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24816)
[jira] [Updated] (HUDI-7709) ClassCastException while reading the data using TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7709: Summary: ClassCastException while reading the data using TimestampBasedKeyGenerator (was: Class Cast Exception while reading the data using TimestampBasedKeyGenerator) > ClassCastException while reading the data using TimestampBasedKeyGenerator > -- > > Key: HUDI-7709 > URL: https://issues.apache.org/jira/browse/HUDI-7709 > Project: Apache Hudi > Issue Type: Bug > Components: reader-core >Reporter: Aditya Goenka >Assignee: Geser Dugarov >Priority: Critical > Labels: pull-request-available > Fix For: 1.0.0 > > > Github Issue - [https://github.com/apache/hudi/issues/11140]
Re: [PR] [HUDI-304] enable spotless for hudi [hudi]
hudi-bot commented on PR #11613: URL: https://github.com/apache/hudi/pull/11613#issuecomment-017030 ## CI report: * b1246eca2843a2bf080d0e9df74ffa7045a5935c UNKNOWN
Re: [PR] [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class [hudi]
hudi-bot commented on PR #11612: URL: https://github.com/apache/hudi/pull/11612#issuecomment-016941 ## CI report: * 8669c1c9afa99b08f866c97ed18eac0446cb1b36 UNKNOWN
Re: [I] Exception while using HoodieStreamer protobuf data from Kafka [hudi]
danny0405 commented on issue #11598: URL: https://github.com/apache/hudi/issues/11598#issuecomment-016668 @the-other-tim-brown any insights here?
[PR] [HUDI-304] enable spotless for hudi [hudi]
HuangZhenQiu opened a new pull request, #11613: URL: https://github.com/apache/hudi/pull/11613 ### Change Logs Enable the spotless plugin for Hudi ### Impact No API change ### Risk level (write none, low medium or high below) No Risk ### Documentation No Documentation ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
Re: [PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]
hudi-bot commented on PR #11611: URL: https://github.com/apache/hudi/pull/11611#issuecomment-008533 ## CI report: * 4de9be3830a22f5dc11538353b469c2083f1a35e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24815)
[jira] [Updated] (HUDI-7976) Fix BUG introduced in HUDI-7955 due to usage of wrong class
[ https://issues.apache.org/jira/browse/HUDI-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7976: - Labels: pull-request-available (was: ) > Fix BUG introduced in HUDI-7955 due to usage of wrong class > --- > > Key: HUDI-7976 > URL: https://issues.apache.org/jira/browse/HUDI-7976 > Project: Apache Hudi > Issue Type: Bug >Reporter: voon >Assignee: voon >Priority: Major > Labels: pull-request-available > > In the bugfix for HUDI-7955, the wrong class for invoking {{getTimestamp}} was used. > # {*}Wrong{*}: org.apache.hadoop.hive.common.type.Timestamp > # {*}Correct{*}: org.apache.hadoop.hive.serde2.io.TimestampWritableV2 > > !https://git.garena.com/shopee/data-infra/hudi/uploads/eeff29b3e741c65eeb48f9901fa28da0/image.png|width=468,height=235! > > Submitting a bugfix to fix this bugfix... > Log levels for the exception block are also changed to warn so errors will be > printed out. > On top of that, we have simplified the {{getMillis}} shim to remove the > method that was added in HUDI-7955 to standardise it with how {{getDays}} is > written.
[PR] [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class [hudi]
voonhous opened a new pull request, #11612: URL: https://github.com/apache/hudi/pull/11612 ### Change Logs In the bugfix for [HUDI-7955](https://issues.apache.org/jira/browse/HUDI-7955), the wrong class for invoking `getTimestamp` was used. 1. **Wrong**: org.apache.hadoop.hive.common.type.Timestamp 2. **Correct**: org.apache.hadoop.hive.serde2.io.TimestampWritableV2 ![image](https://github.com/apache/hudi/assets/6312314/3d1bca3a-2ad4-4e25-b421-daec91d7c65c) Submitting a bugfix to fix this bugfix... Log levels for the exception block are also changed to warn so errors will be printed out. However, false positives might be printed when the environment is indeed in Hive2. On top of that, we have simplified the `getMillis` shim to remove the method that was added in [HUDI-7955](https://issues.apache.org/jira/browse/HUDI-7955) to standardise it with how `getDays` is written. ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
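The underlying pattern in this fix is a version shim: resolve the correct class for the runtime environment, and log failures at warn level instead of swallowing them, which is what hid the original wrong-class bug. An illustrative sketch of that pattern (placeholder class names, not the actual Hudi/Hive shim code):

```python
import importlib
import logging

logger = logging.getLogger("shim")

def load_first_available(*class_paths):
    """Return the first importable class from a list of fully-qualified names.

    Mirrors the shim idea: prefer the newer class and fall back, logging at
    WARN so a wrong class name stays visible instead of being silently
    swallowed inside an exception block.
    """
    for path in class_paths:
        module_name, _, cls_name = path.rpartition(".")
        try:
            return getattr(importlib.import_module(module_name), cls_name)
        except (ImportError, AttributeError) as e:
            logger.warning("could not load %s: %s", path, e)
    raise RuntimeError("none of the candidate classes are available")

# Placeholders standing in for TimestampWritableV2 vs. the wrong Timestamp class:
cls = load_first_available("nonexistent.pkg.TimestampWritableV2", "datetime.datetime")
print(cls.__name__)  # datetime
```

As the PR notes, warn-level logging can produce false positives on the fallback path (e.g. on Hive2), which is the accepted trade-off for visibility.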
Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]
hudi-bot commented on PR #11539: URL: https://github.com/apache/hudi/pull/11539#issuecomment-2221956386 ## CI report: * dac29c7e89201f0ced6d394bf6fd4a5c0622167b UNKNOWN * e8ad55251f5c98b061208ceb2e52637b345e9db0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24813)
[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader
[ https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7971:
--------------------------------------
    Description:
Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x tables

Readers : 1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 ** few write commits
 ** Pending clustering
 ** Completed Clustering
 ** Failed writes with no rollbacks
 ** Insert overwrite table/partition
 ** Savepoint for Time-travel query
 * MOR
 ** Same as COW
 ** Pending and completed async compaction (with log-files and no base file)
 ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
 ** Log block formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled (all combinations)
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries
 # Non-Partitioned dataset (all combinations)
 # CDC Reads
 # Incremental Reads
 # Time-travel query

What to test ?
 # Query Results Correctness
 # Performance : See the benefit of
 # Partition Pruning
 # Metadata table - col stats, RLI,

Corner Case Testing:
 # Schema Evolution with different file-groups having different generation of schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log files reading

  was:
Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x tables

Readers : 1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 ** few write commits
 ** Pending clustering
 ** Completed Clustering
 ** Failed writes with no rollbacks
 ** Insert overwrite table/partition
 ** Savepoint for Time-travel query
 * MOR
 ** Same as COW
 ** Pending and completed async compaction (with log-files and no base file)
 ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
 ** Log block formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries
 # Non-Partitioned dataset
 # CDC Reads
 # Incremental Reads
 # Time-travel query

What to test ?
 # Query Results Correctness
 # Performance : See the benefit of
 # Partition Pruning
 # Metadata table - col stats, RLI,

Corner Case Testing:
 # Schema Evolution with different file-groups having different generation of schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log files reading


> Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader
> -
>
> Key: HUDI-7971
> URL: https://issues.apache.org/jira/browse/HUDI-7971
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: sivabalan narayanan
> Priority: Major
> Fix For: 1.0.0
>
> Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x tables
> Readers : 1.x
> # Spark SQL
> # Spark Datasource
> # Trino/Presto
> # Hive
> # Flink
> Writer: 0.16
> Table State:
> * COW
> ** few write commits
> ** Pending clustering
> ** Completed Clustering
> ** Failed writes with no rollbacks
> ** Insert overwrite table/partition
> ** Savepoint for Time-travel query
> * MOR
> ** Same as COW
> ** Pending and completed async compaction (with log-files and no base file)
> ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
> ** Log block formats - DELETE, rollback block
> Other knobs:
> # Metadata enabled/disabled (all combinations)
> # Column Stats enabled/disabled and data-skipping enabled/disabled
> # RLI enabled with eq/IN queries
> # Non-Partitioned dataset (all combinations)
> # CDC Reads
> # Incremental Reads
> # Time-travel query
> What to test ?
> # Query Results Correctness
> # Performance : See the benefit of
> # Partition Pruning
> # Metadata table - col stats, RLI,
> Corner Case Testing:
> # Schema Evolution with different file-groups having different generation of schema
> # Dynamic Partition Pruning
> # Does Column Projection work correctly for log files reading
[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader
[ https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7971:
--------------------------------------
    Description:
Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x tables

Readers : 1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 ** few write commits
 ** Pending clustering
 ** Completed Clustering
 ** Failed writes with no rollbacks
 ** Insert overwrite table/partition
 ** Savepoint for Time-travel query
 * MOR
 ** Same as COW
 ** Pending and completed async compaction (with log-files and no base file)
 ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
 ** Log block formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries
 # Non-Partitioned dataset
 # CDC Reads
 # Incremental Reads
 # Time-travel query

What to test ?
 # Query Results Correctness
 # Performance : See the benefit of
 # Partition Pruning
 # Metadata table - col stats, RLI,

Corner Case Testing:
 # Schema Evolution with different file-groups having different generation of schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log files reading

  was:
Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x tables

Readers : 1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 ** few write commits
 ** Pending clustering
 ** Completed Clustering
 ** Failed writes with no rollbacks
 ** Insert overwrite table/partition
 ** Savepoint for Time-travel query
 * MOR
 ** Same as COW
 ** Pending and completed async compaction (with log-files and no base file)
 ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
 ** Rollback formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries
 # Non-Partitioned dataset
 # CDC Reads
 # Incremental Reads
 # Time-travel query

What to test ?
 # Query Results Correctness
 # Performance : See the benefit of
 # Partition Pruning
 # Metadata table - col stats, RLI,

Corner Case Testing:
 # Schema Evolution with different file-groups having different generation of schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log files reading


> Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader
> -
>
> Key: HUDI-7971
> URL: https://issues.apache.org/jira/browse/HUDI-7971
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: sivabalan narayanan
> Priority: Major
> Fix For: 1.0.0
>
> Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x tables
> Readers : 1.x
> # Spark SQL
> # Spark Datasource
> # Trino/Presto
> # Hive
> # Flink
> Writer: 0.16
> Table State:
> * COW
> ** few write commits
> ** Pending clustering
> ** Completed Clustering
> ** Failed writes with no rollbacks
> ** Insert overwrite table/partition
> ** Savepoint for Time-travel query
> * MOR
> ** Same as COW
> ** Pending and completed async compaction (with log-files and no base file)
> ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
> ** Log block formats - DELETE, rollback block
> Other knobs:
> # Metadata enabled/disabled
> # Column Stats enabled/disabled and data-skipping enabled/disabled
> # RLI enabled with eq/IN queries
> # Non-Partitioned dataset
> # CDC Reads
> # Incremental Reads
> # Time-travel query
> What to test ?
> # Query Results Correctness
> # Performance : See the benefit of
> # Partition Pruning
> # Metadata table - col stats, RLI,
> Corner Case Testing:
> # Schema Evolution with different file-groups having different generation of schema
> # Dynamic Partition Pruning
> # Does Column Projection work correctly for log files reading
(hudi) branch asf-site updated: [HUDI-7978][DOCS] Add a note on field oldering in partitioned by clause of create sql (#11610)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 488438fbb6a [HUDI-7978][DOCS] Add a note on field oldering in partitioned by clause of create sql (#11610) 488438fbb6a is described below commit 488438fbb6ae2f8dfcc9257016c66a38c0352171 Author: Sagar Sumit AuthorDate: Thu Jul 11 08:38:42 2024 +0530 [HUDI-7978][DOCS] Add a note on field oldering in partitioned by clause of create sql (#11610) --- website/docs/sql_ddl.md| 5 - website/versioned_docs/version-0.11.0/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.11.1/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.12.0/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.12.1/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.12.2/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.12.3/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.13.0/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.13.1/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.14.0/sql_ddl.md | 5 - website/versioned_docs/version-0.14.1/sql_ddl.md | 5 - website/versioned_docs/version-0.15.0/sql_ddl.md | 5 - 12 files changed, 72 insertions(+), 4 deletions(-) diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md index 61e7d33cd7f..a85d8a7bb04 100644 --- a/website/docs/sql_ddl.md +++ b/website/docs/sql_ddl.md @@ -67,7 +67,10 @@ PARTITIONED BY (dt); ``` :::note -You can also create a table partitioned by multiple fields by supplying comma-separated field names. For, e.g., "partitioned by dt, hh" +You can also create a table partitioned by multiple fields by supplying comma-separated field names. 
+When creating a table partitioned by multiple fields, ensure that you specify the columns in the `PARTITIONED BY` clause +in the same order as they appear in the `CREATE TABLE` schema. For example, for the above table, the partition fields +should be specified as `PARTITIONED BY (dt, hh)`. ::: ### Create table with record keys and ordering fields diff --git a/website/versioned_docs/version-0.11.0/quick-start-guide.md b/website/versioned_docs/version-0.11.0/quick-start-guide.md index 9f670871f6a..35336d3f4d4 100644 --- a/website/versioned_docs/version-0.11.0/quick-start-guide.md +++ b/website/versioned_docs/version-0.11.0/quick-start-guide.md @@ -298,6 +298,13 @@ partitioned by (dt, hh) location '/tmp/hudi/hudi_cow_pt_tbl'; ``` +:::note +You can also create a table partitioned by multiple fields by supplying comma-separated field names. +When creating a table partitioned by multiple fields, ensure that you specify the columns in the `PARTITIONED BY` clause +in the same order as they appear in the `CREATE TABLE` schema. For example, for the above table, the partition fields +should be specified as `PARTITIONED BY (dt, hh)`. +::: + **Create Table for an existing Hudi Table** We can create a table on an existing hudi table(created with spark-shell or deltastreamer). This is useful to diff --git a/website/versioned_docs/version-0.11.1/quick-start-guide.md b/website/versioned_docs/version-0.11.1/quick-start-guide.md index d45b535ef42..d0c32790d5a 100644 --- a/website/versioned_docs/version-0.11.1/quick-start-guide.md +++ b/website/versioned_docs/version-0.11.1/quick-start-guide.md @@ -296,6 +296,13 @@ partitioned by (dt, hh) location '/tmp/hudi/hudi_cow_pt_tbl'; ``` +:::note +You can also create a table partitioned by multiple fields by supplying comma-separated field names. +When creating a table partitioned by multiple fields, ensure that you specify the columns in the `PARTITIONED BY` clause +in the same order as they appear in the `CREATE TABLE` schema. 
For example, for the above table, the partition fields +should be specified as `PARTITIONED BY (dt, hh)`. +::: + **Create Table for an existing Hudi Table** We can create a table on an existing hudi table(created with spark-shell or deltastreamer). This is useful to diff --git a/website/versioned_docs/version-0.12.0/quick-start-guide.md b/website/versioned_docs/version-0.12.0/quick-start-guide.md index aac9a9bd048..9fc3a0414f5 100644 --- a/website/versioned_docs/version-0.12.0/quick-start-guide.md +++ b/website/versioned_docs/version-0.12.0/quick-start-guide.md @@ -322,6 +322,13 @@ partitioned by (dt, hh) location '/tmp/hudi/hudi_cow_pt_tbl'; ``` +:::note +You can also create a table partitioned by multiple fields by supplying comma-separated field names. +When creating a table partitioned by multiple fields, ensure that you specify the columns in the `PARTITIONED BY` clause +in the same order as they appear in the `CREATE TABLE` schema. For example, for the above table, the
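The ordering rule in the docs note above can be checked mechanically: the columns named in `PARTITIONED BY` must appear in the same relative order as they are declared in the `CREATE TABLE` schema. A minimal, hypothetical validator sketch (the schema and function below are illustrative, not part of Hudi):

```python
def partition_clause_matches_schema(schema_columns, partition_columns):
    """Return True if the PARTITIONED BY columns are listed in the same
    relative order as they appear in the CREATE TABLE schema."""
    # Position of each partition column within the table schema.
    positions = [schema_columns.index(c) for c in partition_columns]
    # Consistent iff those positions are non-decreasing, i.e. the
    # PARTITIONED BY order mirrors the schema order.
    return positions == sorted(positions)

# For the table in the docs above, the schema ends with (..., dt, hh).
schema = ["id", "name", "price", "ts", "dt", "hh"]
print(partition_clause_matches_schema(schema, ["dt", "hh"]))  # True
print(partition_clause_matches_schema(schema, ["hh", "dt"]))  # False
```

A check like this mirrors why `PARTITIONED BY (dt, hh)` is correct for that table while `PARTITIONED BY (hh, dt)` silently writes data into the wrong partitions on older releases.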
Re: [PR] [HUDI-7978][DOCS] Add a note on field oldering in partitioned by clause of create sql [hudi]
nsivabalan merged PR #11610: URL: https://github.com/apache/hudi/pull/11610 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]
hudi-bot commented on PR #11611: URL: https://github.com/apache/hudi/pull/11611#issuecomment-2221892912

## CI report:

* 3016b3dc7dff77b95cb349b5ef5b6ba436fa05b4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24809)
* 4de9be3830a22f5dc11538353b469c2083f1a35e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24815)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]
hudi-bot commented on PR #11611: URL: https://github.com/apache/hudi/pull/11611#issuecomment-2221885010

## CI report:

* 3016b3dc7dff77b95cb349b5ef5b6ba436fa05b4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24809)
* 4de9be3830a22f5dc11538353b469c2083f1a35e UNKNOWN
Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]
hudi-bot commented on PR #11539: URL: https://github.com/apache/hudi/pull/11539#issuecomment-2221884743

## CI report:

* dac29c7e89201f0ced6d394bf6fd4a5c0622167b UNKNOWN
* c4ccc63e66957214447210afa8365e08a2548ea8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24812)
* e8ad55251f5c98b061208ceb2e52637b345e9db0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24813)
Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]
hudi-bot commented on PR #11539: URL: https://github.com/apache/hudi/pull/11539#issuecomment-2221877431

## CI report:

* dac29c7e89201f0ced6d394bf6fd4a5c0622167b UNKNOWN
* c4ccc63e66957214447210afa8365e08a2548ea8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24812)
* e8ad55251f5c98b061208ceb2e52637b345e9db0 UNKNOWN
Re: [PR] [HUDI-7974] Create empty clean commit at a cadence and make it configurable [hudi]
danny0405 commented on code in PR #11605: URL: https://github.com/apache/hudi/pull/11605#discussion_r1673317175 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java: ## @@ -97,6 +99,64 @@ private boolean needsCleaning(CleaningTriggerStrategy strategy) { } } + private HoodieCleanerPlan getEmptyCleanerPlan(Option earliestInstant, CleanPlanner planner) throws IOException { +LOG.info("Nothing to clean here. It is already clean"); +Option instantVal = getEarliestCommitToRetainToCreateEmptyCleanPlan(earliestInstant); +HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder(); +instantVal.map(x -> cleanBuilder.setPolicy(config.getCleanerPolicy().name()) +.setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION) +.setEarliestInstantToRetain(new HoodieActionInstant(x.getTimestamp(), x.getAction(), x.getState().name())) + .setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp()) + ).orElse(cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name())); +return cleanBuilder.build(); + } + + /** + * There is nothing to clean so this method will create an empty clean plan. But there is chance of optimizing + * the subsequent cleaner calls. Consider this scenarios in incremental cleaner mode, + * If clean timeline is empty or no clean commits were created for a while then every clean call will have to + * scan all the partitions, by creating an empty clean commit to update earliestCommitToRetain instant value, + * incremental clean policy does not have to look for file changes in all the partitions, rather it will look + * for partitions that are modified in last x hours. This value is configured through MAX_DURATION_TO_CREATE_EMPTY_CLEAN. Review Comment: > Do you think we should create separate partition in metadata table for storing these such key, value configs? 
I kind of think storing the single value `earliestCommitToRetain` somewhere in `.hoodie` is feasible; I'm just not sure whether we should utilize the MDT, because the MDT itself is very heavyweight. Maybe some auxiliary marker file is enough.
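The auxiliary marker file suggested above could look roughly like the sketch below. The file name, location, and JSON layout are purely hypothetical, and a real implementation would go through Hudi's storage abstraction rather than the local filesystem; this only illustrates the write-then-atomic-rename pattern for a single small value:

```python
import json
import os

def write_earliest_commit_marker(aux_dir, instant_ts, action, state):
    """Persist earliestCommitToRetain as a tiny marker file (hypothetical layout)."""
    os.makedirs(aux_dir, exist_ok=True)
    marker = {"earliestCommitToRetain": instant_ts, "action": action, "state": state}
    tmp = os.path.join(aux_dir, ".earliest_commit_to_retain.tmp")
    dst = os.path.join(aux_dir, "earliest_commit_to_retain")
    with open(tmp, "w") as f:
        json.dump(marker, f)
    # Atomic on a local POSIX filesystem: readers never observe a partial file.
    os.replace(tmp, dst)
    return dst

def read_earliest_commit_marker(aux_dir):
    """Return the marker contents, or None if no clean has recorded one yet."""
    path = os.path.join(aux_dir, "earliest_commit_to_retain")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

On object stores the rename trick does not apply, but a single-key overwrite is typically atomic per key, which is one reason a tiny marker can be cheaper than routing this through the metadata table.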
(hudi) branch master updated: [HUDI-7921] Fixing file system view closures in MDT (#11496)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 3789840be3d [HUDI-7921] Fixing file system view closures in MDT (#11496) 3789840be3d is described below commit 3789840be3d041cbcfc6b24786740210e4e6d6ac Author: Sivabalan Narayanan AuthorDate: Wed Jul 10 19:25:41 2024 -0700 [HUDI-7921] Fixing file system view closures in MDT (#11496) --- .../metadata/HoodieBackedTableMetadataWriter.java | 55 ++-- .../common/testutils/HoodieMetadataTestTable.java | 6 + .../java/org/apache/hudi/table/TestCleaner.java| 326 +++-- .../table/functional/TestCleanPlanExecutor.java| 325 ++-- .../hudi/testutils/HoodieCleanerTestBase.java | 31 +- .../hudi/metadata/HoodieBackedTableMetadata.java | 4 + .../hudi/metadata/HoodieTableMetadataUtil.java | 48 +-- .../hudi/common/testutils/HoodieTestTable.java | 8 +- 8 files changed, 440 insertions(+), 363 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java index 89d21e79b22..c38a68e37cf 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java @@ -1081,9 +1081,8 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM private HoodieData getFunctionalIndexUpdates(HoodieCommitMetadata commitMetadata, String indexPartition, String instantTime) throws Exception { HoodieIndexDefinition indexDefinition = getFunctionalIndexDefinition(indexPartition); List> partitionFileSlicePairs = new ArrayList<>(); -HoodieTableFileSystemView fsView = 
HoodieTableMetadataUtil.getFileSystemView(dataMetaClient); commitMetadata.getPartitionToWriteStats().forEach((dataPartition, value) -> { - List fileSlices = getPartitionLatestFileSlicesIncludingInflight(dataMetaClient, Option.ofNullable(fsView), dataPartition); + List fileSlices = getPartitionLatestFileSlicesIncludingInflight(dataMetaClient, Option.empty(), dataPartition); fileSlices.forEach(fileSlice -> { // Filter log files for the instant time and add to this partition fileSlice pairs List logFilesForInstant = fileSlice.getLogFiles() @@ -1411,35 +1410,35 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM HoodieData> partitionRecordsMap) { // The result set HoodieData allPartitionRecords = engineContext.emptyHoodieData(); +try (HoodieTableFileSystemView fsView = HoodieTableMetadataUtil.getFileSystemView(metadataMetaClient)) { + for (Map.Entry> entry : partitionRecordsMap.entrySet()) { +final String partitionName = HoodieIndexUtils.getPartitionNameFromPartitionType(entry.getKey(), dataMetaClient, dataWriteConfig.getIndexingConfig().getIndexName()); +HoodieData records = entry.getValue(); + +List fileSlices = + HoodieTableMetadataUtil.getPartitionLatestFileSlices(metadataMetaClient, Option.ofNullable(fsView), partitionName); +if (fileSlices.isEmpty()) { + // scheduling of INDEX only initializes the file group and not add commit + // so if there are no committed file slices, look for inflight slices + fileSlices = getPartitionLatestFileSlicesIncludingInflight(metadataMetaClient, Option.ofNullable(fsView), partitionName); +} +final int fileGroupCount = fileSlices.size(); +ValidationUtils.checkArgument(fileGroupCount > 0, String.format("FileGroup count for MDT partition %s should be >0", partitionName)); + +List finalFileSlices = fileSlices; +HoodieData rddSinglePartitionRecords = records.map(r -> { + FileSlice slice = finalFileSlices.get(HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(r.getRecordKey(), + fileGroupCount)); + 
r.unseal(); + r.setCurrentLocation(new HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId())); + r.seal(); + return r; +}); -HoodieTableFileSystemView fsView = HoodieTableMetadataUtil.getFileSystemView(metadataMetaClient); -for (Map.Entry> entry : partitionRecordsMap.entrySet()) { - final String partitionName = HoodieIndexUtils.getPartitionNameFromPartitionType(entry.getKey(), dataMetaClient, dataWriteConfig.getIndexingConfig().getIndexName()); - HoodieData records = entry.getValue(); - - List fileSlices = - HoodieTableMetadataUtil.getPartitionLatestFileSlices(metadataMetaClient, Option.ofNullable(fsView), partitionNa
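The hunk above routes each metadata record to a file group via `mapRecordKeyToFileGroupIndex(recordKey, fileGroupCount)`. The essential contract is a deterministic hash of the key folded into the file-group count, so a given key always lands in the same file group. A rough Python stand-in (Hudi's actual hash function may differ; Java's 32-bit `String.hashCode` is emulated here only to make the mapping stable across processes):

```python
def map_record_key_to_file_group_index(record_key: str, file_group_count: int) -> int:
    """Deterministically map a record key to one of file_group_count groups."""
    # Emulate Java's 32-bit String.hashCode; Python's built-in hash() is
    # randomized per interpreter run and would break cross-process stability.
    h = 0
    for ch in record_key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    if h >= 0x80000000:  # reinterpret as a signed 32-bit int, as Java would
        h -= 0x100000000
    return abs(h) % file_group_count

# Every record carrying the same key maps to the same file group index.
print(map_record_key_to_file_group_index("uuid-1234", 10))
```

This stability is what lets the writer set each record's location to the base instant and file id of the chosen slice, as the diff does with `setCurrentLocation`.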
Re: [PR] [HUDI-7921] Fixing file system view closures in MDT [hudi]
nsivabalan merged PR #11496: URL: https://github.com/apache/hudi/pull/11496
Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]
hudi-bot commented on PR #11539: URL: https://github.com/apache/hudi/pull/11539#issuecomment-2221839392

## CI report:

* dac29c7e89201f0ced6d394bf6fd4a5c0622167b UNKNOWN
* c14015c3618d231bc439c0a4fb14ce2dff32de00 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24803)
* c4ccc63e66957214447210afa8365e08a2548ea8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24812)
Re: [PR] [HUDI-7774] Add Avro Logical type support for Merciful Java convertor [hudi]
codope commented on code in PR #11265: URL: https://github.com/apache/hudi/pull/11265#discussion_r1673287677

## hudi-common/src/test/resources/date-type-invalid.avsc: ##

Review Comment: OK with separate resource files, but we should be conscious that too many files like these will bloat the bundle.
Re: [PR] [HUDI-7774] Add Avro Logical type support for Merciful Java convertor [hudi]
codope commented on code in PR #11265: URL: https://github.com/apache/hudi/pull/11265#discussion_r1673285947 ## hudi-common/src/main/java/org/apache/hudi/avro/MercifulJsonConverter.java: ## @@ -187,196 +178,774 @@ private static Object convertJsonToAvroField(Object value, String name, Schema s throw new HoodieJsonToAvroConversionException(null, name, schema, shouldSanitize, invalidCharMask); } -JsonToAvroFieldProcessor processor = FIELD_TYPE_PROCESSORS.get(schema.getType()); -if (null != processor) { +return JsonToAvroFieldProcessorUtil.convertToAvro(value, name, schema, shouldSanitize, invalidCharMask); + } + + private static class JsonToAvroFieldProcessorUtil { +/** + * Base Class for converting json to avro fields. + */ +private abstract static class JsonToAvroFieldProcessor implements Serializable { + + public Object convertToAvro(Object value, String name, Schema schema, boolean shouldSanitize, String invalidCharMask) { +Pair res = convert(value, name, schema, shouldSanitize, invalidCharMask); +if (!res.getLeft()) { + throw new HoodieJsonToAvroConversionException(value, name, schema, shouldSanitize, invalidCharMask); +} +return res.getRight(); + } + + protected abstract Pair convert(Object value, String name, Schema schema, boolean shouldSanitize, String invalidCharMask); +} + +public static Object convertToAvro(Object value, String name, Schema schema, boolean shouldSanitize, String invalidCharMask) { + JsonToAvroFieldProcessor processor = getProcessorForSchema(schema); return processor.convertToAvro(value, name, schema, shouldSanitize, invalidCharMask); } -throw new IllegalArgumentException("JsonConverter cannot handle type: " + schema.getType()); - } - /** - * Base Class for converting json to avro fields. 
- */ - private abstract static class JsonToAvroFieldProcessor implements Serializable { +private static JsonToAvroFieldProcessor getProcessorForSchema(Schema schema) { + JsonToAvroFieldProcessor processor = null; + + // 3 cases to consider: customized logicalType, logicalType, and type. + String customizedLogicalType = schema.getProp("logicalType"); + LogicalType logicalType = schema.getLogicalType(); + Type type = schema.getType(); + if (customizedLogicalType != null && !customizedLogicalType.isEmpty()) { +processor = AVRO_LOGICAL_TYPE_FIELD_PROCESSORS.get(customizedLogicalType); + } else if (logicalType != null) { +processor = AVRO_LOGICAL_TYPE_FIELD_PROCESSORS.get(logicalType.getName()); + } else { +processor = AVRO_TYPE_FIELD_TYPE_PROCESSORS.get(type); + } -public Object convertToAvro(Object value, String name, Schema schema, boolean shouldSanitize, String invalidCharMask) { - Pair res = convert(value, name, schema, shouldSanitize, invalidCharMask); - if (!res.getLeft()) { -throw new HoodieJsonToAvroConversionException(value, name, schema, shouldSanitize, invalidCharMask); + if (processor == null) { +throw new IllegalArgumentException(String.format("JsonConverter cannot handle type: %s", type)); } - return res.getRight(); + return processor; } -protected abstract Pair convert(Object value, String name, Schema schema, boolean shouldSanitize, String invalidCharMask); - } +// Avro primitive and complex type processors. +private static final Map AVRO_TYPE_FIELD_TYPE_PROCESSORS = getFieldTypeProcessors(); +// Avro logical type processors. +private static final Map AVRO_LOGICAL_TYPE_FIELD_PROCESSORS = getLogicalFieldTypeProcessors(); + +/** + * Build type processor map for each avro type. 
+ */ +private static Map getFieldTypeProcessors() { + Map fieldTypeProcessors = new EnumMap<>(Schema.Type.class); + fieldTypeProcessors.put(Type.STRING, generateStringTypeHandler()); + fieldTypeProcessors.put(Type.BOOLEAN, generateBooleanTypeHandler()); + fieldTypeProcessors.put(Type.DOUBLE, generateDoubleTypeHandler()); + fieldTypeProcessors.put(Type.FLOAT, generateFloatTypeHandler()); + fieldTypeProcessors.put(Type.INT, generateIntTypeHandler()); + fieldTypeProcessors.put(Type.LONG, generateLongTypeHandler()); + fieldTypeProcessors.put(Type.ARRAY, generateArrayTypeHandler()); + fieldTypeProcessors.put(Type.RECORD, generateRecordTypeHandler()); + fieldTypeProcessors.put(Type.ENUM, generateEnumTypeHandler()); + fieldTypeProcessors.put(Type.MAP, generateMapTypeHandler()); + fieldTypeProcessors.put(Type.BYTES, generateBytesTypeHandler()); + fieldTypeProcessors.put(Type.FIXED, generateFixedTypeHandler()); + return Collections.unmodifiableMap(fieldTypeProcessors); +} - private static JsonToAvroFieldProcessor generateBooleanTypeHandler() { -return new JsonToAvroFieldProcessor() { +private
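The `getProcessorForSchema` hunk above resolves a field processor in three tiers: a custom `logicalType` property set directly on the schema wins, then a parsed Avro logical type, and only then the plain Avro type. A small dispatch sketch of that precedence (the dict-based schema representation below is illustrative, not Avro's actual Java API):

```python
def get_processor(schema, logical_type_processors, type_processors):
    """Mirror the three-tier processor lookup sketched in the review hunk."""
    # 1. A "logicalType" property set directly on the schema (customized type).
    custom = schema.get("props", {}).get("logicalType")
    if custom:
        processor = logical_type_processors.get(custom)
    # 2. A parsed Avro logical type (decimal, date, timestamp-millis, ...).
    elif schema.get("logicalType"):
        processor = logical_type_processors.get(schema["logicalType"])
    # 3. Fall back to the plain Avro type (string, int, record, ...).
    else:
        processor = type_processors.get(schema["type"])
    if processor is None:
        raise ValueError(f"JsonConverter cannot handle type: {schema['type']}")
    return processor

type_procs = {"string": "string-processor"}
logical_procs = {"decimal": "decimal-processor", "my-custom-type": "custom-processor"}
print(get_processor({"type": "bytes", "logicalType": "decimal"},
                    logical_procs, type_procs))  # decimal-processor
```

Keeping the two processor maps separate, as the patch does with `AVRO_LOGICAL_TYPE_FIELD_PROCESSORS` and `AVRO_TYPE_FIELD_TYPE_PROCESSORS`, makes this precedence explicit at the lookup site.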
Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]
hudi-bot commented on PR #11539: URL: https://github.com/apache/hudi/pull/11539#issuecomment-2221830778

## CI report:

* dac29c7e89201f0ced6d394bf6fd4a5c0622167b UNKNOWN
* c14015c3618d231bc439c0a4fb14ce2dff32de00 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24803)
* c4ccc63e66957214447210afa8365e08a2548ea8 UNKNOWN
Re: [PR] [HUDI-7978][DOCS] Add a note on field oldering in partitioned by clause of create sql [hudi]
codope commented on code in PR #11610: URL: https://github.com/apache/hudi/pull/11610#discussion_r1673266828

## website/versioned_docs/version-0.15.0/sql_ddl.md: ##

@@ -67,7 +67,10 @@ PARTITIONED BY (dt); ```
:::note
-You can also create a table partitioned by multiple fields by supplying comma-separated field names. For, e.g., "partitioned by dt, hh"
+You can also create a table partitioned by multiple fields by supplying comma-separated field names.
+When creating a table partitioned by multiple fields, ensure that you specify the columns in the `PARTITIONED BY` clause
+in the same order as they appear in the `CREATE TABLE` schema. For example, for the above table, the partition fields
+should be specified as `PARTITIONED BY (dt, hh)`.

Review Comment: I could not find the SQL DDL docs in earlier versions, but later realized this was part of the quickstart itself. Tested back till 0.10, which errors out, but other versions create the table and insert data into the incorrect partition. Updated the PR.
(hudi) branch master updated: [HUDI-7961] Optimizing upsert partitioner for prepped write operations (#11581)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 01d905ff46b [HUDI-7961] Optimizing upsert partitioner for prepped write operations (#11581) 01d905ff46b is described below commit 01d905ff46bceb735ac6d3fa2960ec34908dfcd2 Author: Sivabalan Narayanan AuthorDate: Wed Jul 10 18:18:07 2024 -0700 [HUDI-7961] Optimizing upsert partitioner for prepped write operations (#11581) --- .../table/action/commit/BaseSparkCommitActionExecutor.java | 2 +- .../commit/SparkInsertOverwriteCommitActionExecutor.java | 2 +- .../table/action/commit/SparkInsertOverwritePartitioner.java | 5 +++-- .../apache/hudi/table/action/commit/UpsertPartitioner.java | 10 -- .../deltacommit/BaseSparkDeltaCommitActionExecutor.java | 2 +- .../deltacommit/SparkUpsertDeltaCommitPartitioner.java | 5 +++-- .../hudi/table/action/commit/TestUpsertPartitioner.java | 12 ++-- .../org/apache/hudi/common/model/WriteOperationType.java | 4 8 files changed, 27 insertions(+), 15 deletions(-) diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java index 32e4824b8b8..36902a8c3f2 100644 --- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java +++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java @@ -370,7 +370,7 @@ public abstract class BaseSparkCommitActionExecutor extends if (profile == null) { throw new HoodieUpsertException("Need workload profile to construct the upsert partitioner."); } -return new UpsertPartitioner<>(profile, context, table, config); +return new UpsertPartitioner<>(profile, context, table, config, 
operationType); } public Partitioner getInsertPartitioner(WorkloadProfile profile) { diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java index ac84475bfa4..63342989c79 100644 --- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java +++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java @@ -71,7 +71,7 @@ public class SparkInsertOverwriteCommitActionExecutor protected Partitioner getPartitioner(WorkloadProfile profile) { return table.getStorageLayout().layoutPartitionerClass() .map(c -> getLayoutPartitioner(profile, c)) -.orElseGet(() -> new SparkInsertOverwritePartitioner(profile, context, table, config)); +.orElseGet(() -> new SparkInsertOverwritePartitioner(profile, context, table, config, operationType)); } @Override diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwritePartitioner.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwritePartitioner.java index cdf2bcd0345..d2cef9250e6 100644 --- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwritePartitioner.java +++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwritePartitioner.java @@ -20,6 +20,7 @@ package org.apache.hudi.table.action.commit; import org.apache.hudi.common.engine.HoodieEngineContext; import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.WriteOperationType; import org.apache.hudi.config.HoodieWriteConfig; import org.apache.hudi.table.HoodieTable; import org.apache.hudi.table.WorkloadProfile; @@ -38,8 
+39,8 @@ public class SparkInsertOverwritePartitioner extends UpsertPartitioner { private static final Logger LOG = LoggerFactory.getLogger(SparkInsertOverwritePartitioner.class); public SparkInsertOverwritePartitioner(WorkloadProfile profile, HoodieEngineContext context, HoodieTable table, - HoodieWriteConfig config) { -super(profile, context, table, config); + HoodieWriteConfig config, WriteOperationType operationType) { +super(profile, context, table, config, operationType); } @Override diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.ja
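The commit above threads the `WriteOperationType` into `UpsertPartitioner` so that prepped operations, whose records already carry their target file locations, can avoid redundant index work during profiling. A hypothetical sketch of that distinction (the record layout and `lookup_index` helper are stand-ins for illustration, not Hudi APIs):

```python
PREPPED_OPERATIONS = {"UPSERT_PREPPED", "DELETE_PREPPED"}

def lookup_index(key):
    # Stand-in for an index probe (e.g. bloom-filter / record-index lookup).
    return {"fileId": f"fg-{sum(ord(c) for c in key) % 4}"}

def build_update_profile(records, operation_type):
    """Count records per target file group; prepped writes trust the
    location already attached to each record instead of probing the index."""
    profile = {}
    for record in records:
        if operation_type in PREPPED_OPERATIONS:
            location = record["currentLocation"]  # tagged upstream, trusted as-is
        else:
            location = lookup_index(record["key"])
        profile[location["fileId"]] = profile.get(location["fileId"], 0) + 1
    return profile
```

The payoff is the same as in the patch: for prepped operations the partitioner can be constructed with knowledge of the operation type and skip the expensive location-resolution path.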
Re: [PR] [HUDI-7961] Optimizing upsert partitioner for prepped write operations [hudi]
codope merged PR #11581: URL: https://github.com/apache/hudi/pull/11581
Re: [I] Exception while using HoodieStreamer protobuf data from Kafka [hudi]
gauravg1977 commented on issue #11598: URL: https://github.com/apache/hudi/issues/11598#issuecomment-2221804280

Please let me know if I am missing any details in describing the issue or if there is something basic that I am missing here.
Re: [PR] [HUDI-7974] Create empty clean commit at a cadence and make it configurable [hudi]
suryaprasanna commented on code in PR #11605: URL: https://github.com/apache/hudi/pull/11605#discussion_r1673243602

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
## @@ -97,6 +99,64 @@ private boolean needsCleaning(CleaningTriggerStrategy strategy) {
     }
   }

+  private HoodieCleanerPlan getEmptyCleanerPlan(Option earliestInstant, CleanPlanner planner) throws IOException {
+    LOG.info("Nothing to clean here. It is already clean");
+    Option instantVal = getEarliestCommitToRetainToCreateEmptyCleanPlan(earliestInstant);
+    HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder();
+    instantVal.map(x -> cleanBuilder.setPolicy(config.getCleanerPolicy().name())
+        .setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION)
+        .setEarliestInstantToRetain(new HoodieActionInstant(x.getTimestamp(), x.getAction(), x.getState().name()))
+        .setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp())
+    ).orElse(cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()));
+    return cleanBuilder.build();
+  }
+
+  /**
+   * There is nothing to clean, so this method creates an empty clean plan. But there is a chance of optimizing
+   * the subsequent cleaner calls. Consider this scenario in incremental cleaner mode:
+   * if the clean timeline is empty, or no clean commits were created for a while, then every clean call will have to
+   * scan all the partitions. By creating an empty clean commit to update the earliestCommitToRetain instant value,
+   * the incremental clean policy does not have to look for file changes in all the partitions; rather, it will look
+   * for partitions that were modified in the last x hours. This value is configured through MAX_DURATION_TO_CREATE_EMPTY_CLEAN.

Review Comment:
> Can we do force clean in the rewrite jobs

I think this may not work for our use case, since we need to have restore capabilities on the dataset for up to 24 hours.
> store the last earliestCommitToRetain separately in the metadata

Generally speaking, clean's earliestCommitToRetain value and the dataset's checkpoint information should be stored in a separate location, maybe hoodie.properties or something like a Hudi metaserver. Do you think we should create a separate partition in the metadata table for storing such key-value configs? That might solve the file scanning issue.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7975] Provide an API to create empty commit [hudi]
suryaprasanna commented on PR #11606: URL: https://github.com/apache/hudi/pull/11606#issuecomment-2221771756

> should we think through a way to improve the incremental query performance on the timeline instead of these tricky changes?

The reason for creating an empty commit is to trigger table service operations like rollback, clean, and archival. We have noticed that users do not call any Hudi APIs when there is no data to ingest, and our internal table services cannot handle all of those cases.

We also need this when data is ingested from multiple sources and each writer tracks its own checkpoint in the commit. In that setup, if one writer writes frequently and another does not, the checkpoints stored by the less frequent writer can be archived. So we need a better way to store that checkpoint information as well. This change helps us both trigger table services and carry checkpoints forward.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]
danny0405 commented on PR #11611: URL: https://github.com/apache/hudi/pull/11611#issuecomment-2221763683 Does the 15MB default come from real practice? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT]Hudi Failed to read MARKERS file [hudi]
danny0405 commented on issue #6900: URL: https://github.com/apache/hudi/issues/6900#issuecomment-2221758340 > Could not read commit details from hdfs://hacluster/user/kylin/flink/data/streaming_rdss_rcsp_lab/2024062815382133 Is this a real file on storage? Did you check its integrity? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]
danny0405 commented on PR #11440: URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221755217 > did you guys sync up on that. if not, can we sync up and make forward progress. I thought we had synced up and reached consensus in the last stand-up meeting, no? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] When Hive SQL queries a Hudi partitioned table and the partition column is not the last column, the partition value is injected at the partition column's position even when it is not selected, misaligning subsequent columns and causing type-conversion errors [hudi]
danny0405 commented on issue #11609: URL: https://github.com/apache/hudi/issues/11609#issuecomment-2221753169 @xicm Can you help with this case? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7974] Create empty clean commit at a cadence and make it configurable [hudi]
danny0405 commented on code in PR #11605: URL: https://github.com/apache/hudi/pull/11605#discussion_r1673232619

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
## @@ -97,6 +99,64 @@ private boolean needsCleaning(CleaningTriggerStrategy strategy) {
     }
   }

+  private HoodieCleanerPlan getEmptyCleanerPlan(Option earliestInstant, CleanPlanner planner) throws IOException {
+    LOG.info("Nothing to clean here. It is already clean");
+    Option instantVal = getEarliestCommitToRetainToCreateEmptyCleanPlan(earliestInstant);
+    HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder();
+    instantVal.map(x -> cleanBuilder.setPolicy(config.getCleanerPolicy().name())
+        .setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION)
+        .setEarliestInstantToRetain(new HoodieActionInstant(x.getTimestamp(), x.getAction(), x.getState().name()))
+        .setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp())
+    ).orElse(cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()));
+    return cleanBuilder.build();
+  }
+
+  /**
+   * There is nothing to clean, so this method creates an empty clean plan. But there is a chance of optimizing
+   * the subsequent cleaner calls. Consider this scenario in incremental cleaner mode:
+   * if the clean timeline is empty, or no clean commits were created for a while, then every clean call will have to
+   * scan all the partitions. By creating an empty clean commit to update the earliestCommitToRetain instant value,
+   * the incremental clean policy does not have to look for file changes in all the partitions; rather, it will look
+   * for partitions that were modified in the last x hours. This value is configured through MAX_DURATION_TO_CREATE_EMPTY_CLEAN.

Review Comment:
> Also, we cannot disable cleans for such datasets, since those datasets can be onboarded to clustering and can do ad-hoc rewrites on some partitions. So cleans have to be enabled all the time, and when cleans are enabled they scan all the partitions. So we thought of storing the earliestCommitToRetain value somewhere and using it to track the commits already scanned.

Can we do a force clean in the rewrite jobs, or store the last `earliestCommitToRetain` separately in the metadata, instead of writing empty clean commits? As you mentioned, most of the clean commits would just be useless and empty for this insert-only scenario. I kind of think we should come up with a more general solution for the append-only use case.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-7977) Improve bucket index partitioner algorithm
[ https://issues.apache.org/jira/browse/HUDI-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-7977.
----------------------------
    Resolution: Fixed

Fixed via master branch: 16ee6bc4e329e6ecd7887039cb5216aecf571e8c

> Improve bucket index partitioner algorithm
> ------------------------------------------
>
>                 Key: HUDI-7977
>                 URL: https://issues.apache.org/jira/browse/HUDI-7977
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: index
>            Reporter: KnightChess
>            Assignee: KnightChess
>            Priority: Major
>              Labels: pull-request-available
>
> Improve the {{BucketIndexUtil}} partitionIndex algorithm to make the data evenly
> distributed.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-7977) Improve bucket index partitioner algorithm
[ https://issues.apache.org/jira/browse/HUDI-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7977:
-----------------------------
    Summary: Improve bucket index partitioner algorithm  (was: improve bucket index partitioner)

> Improve bucket index partitioner algorithm
> ------------------------------------------
>
>                 Key: HUDI-7977
>                 URL: https://issues.apache.org/jira/browse/HUDI-7977
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: index
>            Reporter: KnightChess
>            Assignee: KnightChess
>            Priority: Major
>              Labels: pull-request-available
>
> Improve the {{BucketIndexUtil}} partitionIndex algorithm to make the data evenly
> distributed.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
(hudi) branch master updated: [HUDI-7977] Improve bucket index partitioner algorithm (#11608)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 16ee6bc4e32 [HUDI-7977] Improve bucket index partitioner algorithm (#11608)
16ee6bc4e32 is described below

commit 16ee6bc4e329e6ecd7887039cb5216aecf571e8c
Author: KnightChess <981159...@qq.com>
AuthorDate: Thu Jul 11 08:11:48 2024 +0800

    [HUDI-7977] Improve bucket index partitioner algorithm (#11608)
---
 .../hudi/common/util/hash/BucketIndexUtil.java     |  26 +---
 .../hudi/common/util/hash/TestBucketIndexUtil.java | 152 +
 2 files changed, 157 insertions(+), 21 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/util/hash/BucketIndexUtil.java b/hudi-common/src/main/java/org/apache/hudi/common/util/hash/BucketIndexUtil.java
index adfdd4540d8..ea3f6a2a12c 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/util/hash/BucketIndexUtil.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/util/hash/BucketIndexUtil.java
@@ -36,26 +36,10 @@ public class BucketIndexUtil {
    * @return The partition index of this bucket.
    */
   public static Functions.Function2 getPartitionIndexFunc(int bucketNum, int parallelism) {
-    if (parallelism < bucketNum) {
-      return (partition, curBucket) -> {
-        int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) / parallelism * bucketNum;
-        int globalIndex = partitionIndex + curBucket;
-        return globalIndex % parallelism;
-      };
-    } else {
-      if (parallelism % bucketNum == 0) {
-        return (partition, curBucket) -> {
-          int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) / (parallelism / bucketNum) * bucketNum;
-          int globalIndex = partitionIndex + curBucket;
-          return globalIndex % parallelism;
-        };
-      } else {
-        return (partition, curBucket) -> {
-          int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) / (parallelism / bucketNum + 1) * bucketNum;
-          int globalIndex = partitionIndex + curBucket;
-          return globalIndex % parallelism;
-        };
-      }
-    }
+    return (partition, curBucket) -> {
+      int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum;
+      int globalIndex = partitionIndex + curBucket;
+      return globalIndex % parallelism;
+    };
   }
 }

diff --git a/hudi-common/src/test/java/org/apache/hudi/common/util/hash/TestBucketIndexUtil.java b/hudi-common/src/test/java/org/apache/hudi/common/util/hash/TestBucketIndexUtil.java
new file mode 100644
index 000..91c0da003f4
--- /dev/null
+++ b/hudi-common/src/test/java/org/apache/hudi/common/util/hash/TestBucketIndexUtil.java
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.util.hash;
+
+import org.apache.hudi.common.util.Functions;
+
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Stream;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+public class TestBucketIndexUtil {
+
+  private static Stream partitionParams() {
+    List argsList = new ArrayList<>();
+    argsList.add(Arguments.of(10, 5, true));
+    argsList.add(Arguments.of(20, 5, true));
+    argsList.add(Arguments.of(21, 5, true));
+    argsList.add(Arguments.of(40, 5, true));
+    argsList.add(Arguments.of(41, 5, true));
+    argsList.add(Arguments.of(100, 5, true));
+    argsList.add(Arguments.of(101, 5, true));
+    argsList.add(Arguments.of(20, 100, true));
+    argsList.add(Arguments.of(21, 100, true));
+    argsList.add(Arguments.of(100, 100, true));
+    argsList.add(Arguments.of(101, 100, true));
+    argsList.add(Arguments.of(200, 100, true));
+    argsList.add(Arguments.of(201, 100, tr
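The merged mapping in the commit above can be exercised outside of Hudi. Below is a minimal, self-contained sketch (hypothetical class name; a plain `BiFunction` stands in for Hudi's `Functions.Function2`) showing that every (partition, bucket) pair lands on a task index within `[0, parallelism)`:

```java
import java.util.function.BiFunction;

public class BucketIndexSketch {

  // Mirrors the new single-branch algorithm: spread partitions over the
  // parallelism with a modulo, then offset by the bucket id and wrap around.
  public static BiFunction<String, Integer, Integer> partitionIndexFunc(int bucketNum, int parallelism) {
    return (partition, curBucket) -> {
      int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum;
      int globalIndex = partitionIndex + curBucket;
      return globalIndex % parallelism;
    };
  }

  public static void main(String[] args) {
    BiFunction<String, Integer, Integer> func = partitionIndexFunc(4, 10);
    for (String partition : new String[]{"2024/07/09", "2024/07/10", "2024/07/11"}) {
      for (int bucket = 0; bucket < 4; bucket++) {
        int taskIndex = func.apply(partition, bucket);
        // Every bucket of every partition must map into [0, parallelism).
        if (taskIndex < 0 || taskIndex >= 10) {
          throw new AssertionError("task index out of range: " + taskIndex);
        }
        System.out.println(partition + " / bucket " + bucket + " -> task " + taskIndex);
      }
    }
  }
}
```

Because consecutive buckets of the same partition get consecutive task indices (modulo the parallelism), the buckets of one hot partition no longer pile onto a single task, which is the evenness the test class added by this commit verifies.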
Re: [PR] [HUDI-7977] Improve bucket index partitioner algorithm [hudi]
danny0405 merged PR #11608: URL: https://github.com/apache/hudi/pull/11608 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch asf-site updated: fix: update home page title (#11530)
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new cbeff1cb9ca fix: update home page title (#11530)
cbeff1cb9ca is described below

commit cbeff1cb9ca3b7327cebfe6a032e354568ce72c5
Author: pintusoliya <37680791+pintusol...@users.noreply.github.com>
AuthorDate: Thu Jul 11 04:15:29 2024 +0530

    fix: update home page title (#11530)

    fix: merge conflicts

    fix: title
---
 website/src/pages/index.js                        |  7 +++
 website/src/theme/LayoutHead/index.js             |  7 ---
 website/src/theme/LayoutHead/useTitleFormatter.js | 14 ++
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/website/src/pages/index.js b/website/src/pages/index.js
index ed6f4f9d050..faa5eefbcd4 100644
--- a/website/src/pages/index.js
+++ b/website/src/pages/index.js
@@ -24,10 +24,9 @@ export default function Home() {
   const { siteConfig } = useDocusaurusContext();
   return (
+      title={`Apache Hudi | An Open Source Data Lake Platform`}
+      shouldShowOnlyTitle={true}
+      description="Description will go into a meta tag in ">

diff --git a/website/src/theme/LayoutHead/index.js b/website/src/theme/LayoutHead/index.js
index d0841677f5b..7c3abb4d2de 100644
--- a/website/src/theme/LayoutHead/index.js
+++ b/website/src/theme/LayoutHead/index.js
@@ -12,10 +12,10 @@ import SearchMetadata from '@theme/SearchMetadata';
 import Seo from '@theme/Seo';
 import {
   DEFAULT_SEARCH_TAG,
-  useTitleFormatter,
   useAlternatePageUtils,
   useThemeConfig,
 } from '@docusaurus/theme-common';
+import { useTitleFormatter } from './useTitleFormatter';
 import {useLocation} from '@docusaurus/router';
 // Useful for SEO
 // See https://developers.google.com/search/docs/advanced/crawling/localized-versions
 // See https://github.com/facebook/docusaurus/issues/3317
@@ -82,9 +82,10 @@ export default function LayoutHead(props) {
     i18n: {currentLocale, localeConfigs},
   } = useDocusaurusContext();
   const {metadata, image: defaultImage} = useThemeConfig();
-  const {title, description, image, keywords, searchMetadata} = props;
+  const {title, description, image, keywords, searchMetadata, shouldShowOnlyTitle} = props;
   const faviconUrl = useBaseUrl(favicon);
-  const pageTitle = useTitleFormatter(title); // See https://github.com/facebook/docusaurus/issues/3317#issuecomment-754661855
+  const pageTitle = useTitleFormatter(title, shouldShowOnlyTitle); // See https://github.com/facebook/docusaurus/issues/3317#issuecomment-754661855
+  // const htmlLang = currentLocale.split('-')[0];
   const htmlLang = currentLocale; // should we allow the user to override htmlLang with localeConfig?

diff --git a/website/src/theme/LayoutHead/useTitleFormatter.js b/website/src/theme/LayoutHead/useTitleFormatter.js
new file mode 100644
index 000..ee628fc29cc
--- /dev/null
+++ b/website/src/theme/LayoutHead/useTitleFormatter.js
@@ -0,0 +1,14 @@
+import useDocusaurusContext from '@docusaurus/useDocusaurusContext';
+
+export const useTitleFormatter = (title, shouldShowOnlyTitle) => {
+    const {siteConfig} = useDocusaurusContext();
+    const {title: siteTitle, titleDelimiter} = siteConfig;
+
+    if (shouldShowOnlyTitle && title && title.trim().length) {
+        return title.trim();
+    }
+
+    return title && title.trim().length
+        ? `${title.trim()} ${titleDelimiter} ${siteTitle}`
+        : siteTitle;
+};
Re: [PR] [DOCS] fix: update home page title [hudi]
bhasudha merged PR #11530: URL: https://github.com/apache/hudi/pull/11530 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]
hudi-bot commented on PR #11611: URL: https://github.com/apache/hudi/pull/11611#issuecomment-2221642367 ## CI report: * 3016b3dc7dff77b95cb349b5ef5b6ba436fa05b4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24809) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7977] import bucket index partitioner [hudi]
CTTY commented on code in PR #11608: URL: https://github.com/apache/hudi/pull/11608#discussion_r1673121749

## hudi-common/src/main/java/org/apache/hudi/common/util/hash/BucketIndexUtil.java:
## @@ -36,26 +36,10 @@ public class BucketIndexUtil {
    * @return The partition index of this bucket.
    */
   public static Functions.Function2 getPartitionIndexFunc(int bucketNum, int parallelism) {
-    if (parallelism < bucketNum) {
-      return (partition, curBucket) -> {
-        int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) / parallelism * bucketNum;
-        int globalIndex = partitionIndex + curBucket;
-        return globalIndex % parallelism;
-      };
-    } else {
-      if (parallelism % bucketNum == 0) {
-        return (partition, curBucket) -> {
-          int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) / (parallelism / bucketNum) * bucketNum;
-          int globalIndex = partitionIndex + curBucket;
-          return globalIndex % parallelism;
-        };
-      } else {
-        return (partition, curBucket) -> {
-          int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) / (parallelism / bucketNum + 1) * bucketNum;
-          int globalIndex = partitionIndex + curBucket;
-          return globalIndex % parallelism;
-        };
-      }
-    }
+    return (partition, curBucket) -> {
+      int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum;
+      int globalIndex = partitionIndex + curBucket;
+      return globalIndex % parallelism;
+    };

Review Comment:
nit: Can we update the comment to reflect the logic of this new algorithm

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]
nsivabalan commented on PR #11440: URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221624233 @danny0405 @lokeshj1703: are we aligned on any optimizations needed on top of the patch? For example, we were discussing the replace commit timeline, right? Did you guys sync up on that? If not, can we sync up and make forward progress? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7974] Create empty clean commit at a cadence and make it configurable [hudi]
suryaprasanna commented on code in PR #11605: URL: https://github.com/apache/hudi/pull/11605#discussion_r1673110615

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
## @@ -97,6 +99,64 @@ private boolean needsCleaning(CleaningTriggerStrategy strategy) {
     }
   }

+  private HoodieCleanerPlan getEmptyCleanerPlan(Option earliestInstant, CleanPlanner planner) throws IOException {
+    LOG.info("Nothing to clean here. It is already clean");
+    Option instantVal = getEarliestCommitToRetainToCreateEmptyCleanPlan(earliestInstant);
+    HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder();
+    instantVal.map(x -> cleanBuilder.setPolicy(config.getCleanerPolicy().name())
+        .setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION)
+        .setEarliestInstantToRetain(new HoodieActionInstant(x.getTimestamp(), x.getAction(), x.getState().name()))
+        .setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp())
+    ).orElse(cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()));
+    return cleanBuilder.build();
+  }
+
+  /**
+   * There is nothing to clean, so this method creates an empty clean plan. But there is a chance of optimizing
+   * the subsequent cleaner calls. Consider this scenario in incremental cleaner mode:
+   * if the clean timeline is empty, or no clean commits were created for a while, then every clean call will have to
+   * scan all the partitions. By creating an empty clean commit to update the earliestCommitToRetain instant value,
+   * the incremental clean policy does not have to look for file changes in all the partitions; rather, it will look
+   * for partitions that were modified in the last x hours. This value is configured through MAX_DURATION_TO_CREATE_EMPTY_CLEAN.

Review Comment:
@danny0405 The use case we are trying to solve here: for a dataset that does not receive upserts and does not use small-file handling, cleans run for every commit and scan all the partitions every time.

Also, we cannot disable cleans for such datasets, since they can be onboarded to clustering and can do ad-hoc rewrites on some partitions. So cleaning has to stay enabled, and while it is enabled it scans all the partitions. We therefore thought of storing the earliestCommitToRetain value somewhere and using it to track the commits already scanned. That way, when the incremental clean policy is enabled and earliestCommitToRetain progresses commit by commit, every clean only scans the partitions touched by commits since earliestCommitToRetain. This reduces a full scan of all partitions to scanning one or two partitions. CC @nsivabalan

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
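The optimization discussed in this thread — persisting earliestCommitToRetain (even via an empty clean) so that incremental cleaning only inspects partitions touched since then — can be illustrated with a toy timeline. All names below are hypothetical and deliberately simplified; this is not Hudi's actual cleaner API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class IncrementalCleanSketch {

  // Toy timeline: commit time -> partitions touched by that commit.
  static final NavigableMap<String, List<String>> TIMELINE = new TreeMap<>();

  static {
    TIMELINE.put("001", Arrays.asList("2024/07/09", "2024/07/10"));
    TIMELINE.put("002", Arrays.asList("2024/07/10"));
    TIMELINE.put("003", Arrays.asList("2024/07/11"));
  }

  // Incremental mode: scan only partitions modified strictly after the
  // earliestCommitToRetain recorded by the previous (possibly empty) clean.
  static Set<String> partitionsToScan(String lastEarliestCommitToRetain) {
    Set<String> partitions = new TreeSet<>();
    TIMELINE.tailMap(lastEarliestCommitToRetain, false)
        .values().forEach(partitions::addAll);
    return partitions;
  }

  public static void main(String[] args) {
    // With no prior clean recorded, every partition must be scanned.
    System.out.println(partitionsToScan(""));
    // After a clean (even an empty one) recorded earliestCommitToRetain = "002",
    // only the partition touched by commit 003 needs scanning.
    System.out.println(partitionsToScan("002"));
  }
}
```

The second call returns only `2024/07/11`, which is the "full scan reduced to one or two partitions" effect described above: the empty clean commit's only job in this scheme is to advance the recorded watermark.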
Re: [PR] [DOCS] fix: update home page title [hudi]
xushiyan commented on PR #11530: URL: https://github.com/apache/hudi/pull/11530#issuecomment-2221566110 @pintusoliya looks good. can you fix the CI failure pls? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]
hudi-bot commented on PR #11611: URL: https://github.com/apache/hudi/pull/11611#issuecomment-2221551975 ## CI report: * 3016b3dc7dff77b95cb349b5ef5b6ba436fa05b4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24809) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7961] Optimizing upsert partitioner for prepped write operations [hudi]
nsivabalan commented on code in PR #11581: URL: https://github.com/apache/hudi/pull/11581#discussion_r1673013661

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java:

## @@ -86,16 +87,21 @@ public class UpsertPartitioner extends SparkHoodiePartitioner {
   private HashMap bucketInfoMap;
   protected final HoodieWriteConfig config;
+  private final WriteOperationType operationType;

   public UpsertPartitioner(WorkloadProfile profile, HoodieEngineContext context, HoodieTable table,
-      HoodieWriteConfig config) {
+      HoodieWriteConfig config, WriteOperationType operationType) {
     super(profile, table);
     updateLocationToBucket = new HashMap<>();
     partitionPathToInsertBucketInfos = new HashMap<>();
     bucketInfoMap = new HashMap<>();
     this.config = config;
+    this.operationType = operationType;
     assignUpdates(profile);
-    assignInserts(profile, context);
+    long totalInserts = profile.getInputPartitionPathStatMap().values().stream().mapToLong(stat -> stat.getNumInserts()).sum();
+    if (!WriteOperationType.isPreppedWriteOperation(operationType) || totalInserts > 0) { // skip if it's a prepped write operation, or if totalInserts == 0
+      assignInserts(profile, context);

Review Comment: Yes, that's the main purpose of prepped write operations: we expect the location to already be set on the records, so we do not invoke tag location. Hence assignInserts() is unnecessary overhead and effectively a no-op. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
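The guard discussed in this review can be modeled with a small sketch (Python, purely illustrative; the record shape, field names, and function name are assumptions for the example, not Hudi's API):

```python
# Illustrative model of the partitioner guard: prepped records already carry a
# file-group location, so when the operation is prepped and there are no
# inserts, the assign-inserts pass can be skipped entirely.

def maybe_assign_inserts(records, is_prepped_operation):
    """Returns True if the insert-assignment pass actually ran."""
    total_inserts = sum(1 for r in records if r.get("location") is None)
    if not is_prepped_operation or total_inserts > 0:
        for r in records:
            if r.get("location") is None:
                r["location"] = "newly-assigned-bucket"  # stand-in for real bucket logic
        return True
    return False  # prepped operation with zero inserts: nothing to do

prepped = [{"key": "k1", "location": "fg-01"}, {"key": "k2", "location": "fg-02"}]
print(maybe_assign_inserts(prepped, is_prepped_operation=True))   # skipped
print(maybe_assign_inserts(prepped, is_prepped_operation=False))  # non-prepped: always runs
```

The condition mirrors the one in the diff: skip only when the operation is prepped *and* the workload has zero inserts.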
Re: [PR] [HUDI-7961] Optimizing upsert partitioner for prepped write operations [hudi]
nsivabalan commented on code in PR #11581: URL: https://github.com/apache/hudi/pull/11581#discussion_r1673012152

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwritePartitioner.java:

## @@ -38,8 +39,8 @@ public class SparkInsertOverwritePartitioner extends UpsertPartitioner {
   private static final Logger LOG = LoggerFactory.getLogger(SparkInsertOverwritePartitioner.class);

   public SparkInsertOverwritePartitioner(WorkloadProfile profile, HoodieEngineContext context, HoodieTable table,
-      HoodieWriteConfig config) {
-    super(profile, context, table, config);
+      HoodieWriteConfig config, WriteOperationType operationType) {

Review Comment: Yes. Neither HoodieTable nor HoodieWriteConfig carries the write operation type, so I had to add an extra argument. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]
hudi-bot commented on PR #11611: URL: https://github.com/apache/hudi/pull/11611#issuecomment-2221430587 ## CI report: * 3016b3dc7dff77b95cb349b5ef5b6ba436fa05b4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7979) Fix out of the box defaults with spillable memory configs
[ https://issues.apache.org/jira/browse/HUDI-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7979: - Labels: pull-request-available (was: )

> Fix out of the box defaults with spillable memory configs
> ----------------------------------------------------------
>
> Key: HUDI-7979
> URL: https://issues.apache.org/jira/browse/HUDI-7979
> Project: Apache Hudi
> Issue Type: Improvement
> Components: reader-core, writer-core
> Reporter: sivabalan narayanan
> Priority: Major
> Labels: pull-request-available
>
> Looks like we are very conservative w.r.t. the memory configs used for the spillable-map-based FSV.
>
> For example, we are only allocating 15 MB out of the box to file groups when using the spillable-map-based FSV:
>
> public long getMaxMemoryForFileGroupMap() {
>   long totalMemory = getLong(SPILLABLE_MEMORY);
>   return totalMemory - getMaxMemoryForPendingCompaction() - getMaxMemoryForBootstrapBaseFile();
> }
>
> SPILLABLE_MEMORY: default is 100 MB.
> getMaxMemoryForPendingCompaction = 80% of 100 MB.
> getMaxMemoryForBootstrapBaseFile = 5% of 100 MB.
> So, overall, out of the box we are allocating only 15 MB for getMaxMemoryForFileGroupMap.
> ref: [https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-[…]/apache/hudi/common/table/view/FileSystemViewStorageConfig.java|https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java#L224]
> Wondering whether we even need 80% for the pending compaction tracker in our FSV. I am thinking of making it 15%, so that we can give more memory to the actual file groups. We may not have a lot of pending compactions for a given table.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]
nsivabalan opened a new pull request, #11611: URL: https://github.com/apache/hudi/pull/11611 ### Change Logs Adjusting defaults for spillable map memory. Reduced the 80 MB allocation for pending compaction to 15 MB. ### Impact More memory for file groups in FSVs. ### Risk level (write none, low medium or high below) low ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-7979) Fix out of the box defaults with spillable memory configs
sivabalan narayanan created HUDI-7979: - Summary: Fix out of the box defaults with spillable memory configs Key: HUDI-7979 URL: https://issues.apache.org/jira/browse/HUDI-7979 Project: Apache Hudi Issue Type: Improvement Components: reader-core, writer-core Reporter: sivabalan narayanan Looks like we are very conservative wrt memory configs used for spillable map based FSV. For eg, we are only allocating 15Mb out of the box to file groups when using spillable map based FSV. public long getMaxMemoryForFileGroupMap() \{ long totalMemory = getLong(SPILLABLE_MEMORY); return totalMemory - getMaxMemoryForPendingCompaction() - getMaxMemoryForBootstrapBaseFile(); } SPILLABLE_MEMORY = default is 100Mb. getMaxMemoryForPendingCompaction = 80% of 100MB. getMaxMemoryForBootstrapBaseFile = 5% of 100Mb. so, overall, out of the box we are allocating only 15Mb for getMaxMemoryForFileGroupMap. ref: [https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-[…]/apache/hudi/common/table/view/FileSystemViewStorageConfig.java|https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java#L224] Wondering do we even need 80% for pending compaction tracker in our FSV. I am thinking to make it 15%. so that we can give more memory to actual file groups. We may not have lot of pending compactions for a given table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
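The 15 MB figure in the ticket follows from simple arithmetic; a quick sketch using the default values quoted above (integer MB math for clarity):

```python
# Default spillable-memory split quoted in HUDI-7979 (MB, integer percentages).
SPILLABLE_MEMORY_MB = 100        # SPILLABLE_MEMORY default
PENDING_COMPACTION_PCT = 80      # getMaxMemoryForPendingCompaction: 80%
BOOTSTRAP_BASE_FILE_PCT = 5      # getMaxMemoryForBootstrapBaseFile: 5%

pending_mb = SPILLABLE_MEMORY_MB * PENDING_COMPACTION_PCT // 100     # 80 MB
bootstrap_mb = SPILLABLE_MEMORY_MB * BOOTSTRAP_BASE_FILE_PCT // 100  # 5 MB
file_group_map_mb = SPILLABLE_MEMORY_MB - pending_mb - bootstrap_mb
print(file_group_map_mb)  # 15 -- only 15 MB left for the file group map

# Change proposed in the ticket: shrink pending compaction to 15%.
proposed_pending_mb = SPILLABLE_MEMORY_MB * 15 // 100
proposed_file_group_mb = SPILLABLE_MEMORY_MB - proposed_pending_mb - bootstrap_mb
print(proposed_file_group_mb)  # 80 -- much more memory for actual file groups
```

So lowering the pending-compaction share from 80% to 15% flips the split: the file group map goes from 15 MB to 80 MB out of the box.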
Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]
hudi-bot commented on PR #11440: URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221276113 ## CI report: * 7e8e5336ab1f079d93ccaa0981e70798169532f1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24807) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7978][DOCS] Add a note on field ordering in partitioned by clause of create sql [hudi]
nsivabalan commented on code in PR #11610: URL: https://github.com/apache/hudi/pull/11610#discussion_r1672779411

## website/versioned_docs/version-0.15.0/sql_ddl.md:

## @@ -67,7 +67,10 @@ PARTITIONED BY (dt); ```

:::note
-You can also create a table partitioned by multiple fields by supplying comma-separated field names. For, e.g., "partitioned by dt, hh"
+You can also create a table partitioned by multiple fields by supplying comma-separated field names.
+When creating a table partitioned by multiple fields, ensure that you specify the columns in the `PARTITIONED BY` clause
+in the same order as they appear in the `CREATE TABLE` schema. For example, for the above table, the partition fields
+should be specified as `PARTITIONED BY (dt, hh)`.

Review Comment: Is this not an issue in older versions as well? Or do we have the sql_ddl page only from 0.14.x onwards? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]
hudi-bot commented on PR #11553: URL: https://github.com/apache/hudi/pull/11553#issuecomment-2221181265 ## CI report: * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN * 0960a9f7454d0b7daae0a78d1993af11b94cea53 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24808) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]
hudi-bot commented on PR #11553: URL: https://github.com/apache/hudi/pull/11553#issuecomment-2221109881 ## CI report: * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN * 77afe6dfa6dfa6f6180929e052d6ad75d8d3618b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24770) * 0960a9f7454d0b7daae0a78d1993af11b94cea53 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]
hudi-bot commented on PR #11440: URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221109430 ## CI report: * 4bb072faa37b2ea398144fbbc24deff966153cfa Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24471) * 7e8e5336ab1f079d93ccaa0981e70798169532f1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24807) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]
hudi-bot commented on PR #11440: URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221096095 ## CI report: * 4bb072faa37b2ea398144fbbc24deff966153cfa Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24471) * 7e8e5336ab1f079d93ccaa0981e70798169532f1 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7978) Update docs for older versions to state that partitions should be ordered when creating multiple partitions
[ https://issues.apache.org/jira/browse/HUDI-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7978: - Labels: pull-request-available (was: ) > Update docs for older versions to state that partitions should be ordered > when creating multiple partitions > --- > > Key: HUDI-7978 > URL: https://issues.apache.org/jira/browse/HUDI-7978 > Project: Apache Hudi > Issue Type: Sub-task > Components: docs >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7978][DOCS] Add a note on field ordering in partitioned by clause of create sql [hudi]
codope opened a new pull request, #11610: URL: https://github.com/apache/hudi/pull/11610 ### Change Logs If we specify partition fields in an order different from the order in the create table schema, then the query does not fail, but the whole partitioning is incorrect. This has been an issue even in 0.15 and 0.14. We will fix this in HUDI-7964. Meanwhile, it's important that we update the docs. ### Impact Docs improvement. ### Risk level (write none, low medium or high below) low ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
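Why mis-ordered `PARTITIONED BY` fields silently corrupt partitioning can be illustrated with a toy model (Python; this only models the path construction and is not Hudi's actual code — all names here are hypothetical):

```python
# Toy model: partition values are resolved in schema order, while path keys
# come from the declared PARTITIONED BY order. If the two orders differ, the
# values land under the wrong keys -- and no error is raised.

def partition_path(row, schema_order, declared_order):
    values = [row[col] for col in schema_order]  # values follow schema order
    return "/".join(f"{key}={val}" for key, val in zip(declared_order, values))

row = {"dt": "2024-07-10", "hh": "09"}
print(partition_path(row, ["dt", "hh"], ["dt", "hh"]))  # dt=2024-07-10/hh=09
print(partition_path(row, ["dt", "hh"], ["hh", "dt"]))  # hh=2024-07-10/dt=09 (silently wrong)
```

The second call shows the failure mode the docs note warns about: the statement succeeds, but every row is written under swapped partition values.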
Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]
lokeshj1703 commented on PR #11440: URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221025516 I have explained the scenario in the comments. PTAL @nbalajee @danny0405 I have also fixed the earliestCommitToNotArchive logic to take the last clean instant into account, and removed the separate handling in the clean timeline archival. The case where earliestCommitToNotArchive > last clean instant should not be possible, since this timestamp is generated by the cleaner, but I still added a check to take the minimum of the last clean instant and earliestCommitToNotArchive. cc @nsivabalan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
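The defensive check described in this comment boils down to taking a minimum over instant timestamps (which, as strings, compare lexicographically the way Hudi instants do). A sketch — illustrative only, the function name is an assumption:

```python
# Sketch of the archival guard: archival must never move past the cleaner's
# progress, so the boundary is capped at the last completed clean instant.

def archival_boundary(earliest_commit_to_not_archive, last_clean_instant):
    """Both arguments are instant timestamps as strings, or None."""
    if last_clean_instant is None:
        return earliest_commit_to_not_archive
    if earliest_commit_to_not_archive is None:
        return last_clean_instant
    # A candidate beyond the last clean instant "should not happen" (the
    # cleaner generates it), but min() keeps archival safely behind anyway.
    return min(earliest_commit_to_not_archive, last_clean_instant)

print(archival_boundary("20240710120000", "20240709090000"))  # 20240709090000
```

The printed value shows the guard in action: even with a later candidate, archival is held back to the last clean instant.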
[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader
[ https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7971: -- Description:

Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x tables

Readers : 1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 ** few write commits
 ** Pending clustering
 ** Completed Clustering
 ** Failed writes with no rollbacks
 ** Insert overwrite table/partition
 ** Savepoint for Time-travel query
 * MOR
 ** Same as COW
 ** Pending and completed async compaction (with log-files and no base file)
 ** Custom Payloads (for MOR snapshot queries) (e.g. SQL Expression Payload)
 ** Rollback formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries
 # Non-Partitioned dataset
 # CDC Reads
 # Incremental Reads
 # Time-travel query

What to test?
 # Query Results Correctness
 # Performance : See the benefit of
 # Partition Pruning
 # Metadata table - col stats, RLI

Corner Case Testing:
 # Schema Evolution with different file-groups having different generation of schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log files reading

was:

Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x tables

Readers : 1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 * Pending clustering
 * Completed Clustering
 * Failed writes with no rollbacks
 * Insert overwrite table/partition
 * Savepoint for Time-travel query
 * MOR
 * Same as COW
 * Pending and completed async compaction (with log-files and no base file)
 * Custom Payloads (for MOR snapshot queries) (e.g. SQL Expression Payload)
 * Rollback formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries
 # Non-Partitioned dataset
 # CDC Reads
 # Incremental Reads
 # Time-travel query

What to test?
 # Query Results Correctness
 # Performance : See the benefit of
 # Partition Pruning
 # Metadata table - col stats, RLI

Corner Case Testing:
 # Schema Evolution with different file-groups having different generation of schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log files reading

> Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader
> -------------------------------------------------------------------------
>
> Key: HUDI-7971
> URL: https://issues.apache.org/jira/browse/HUDI-7971
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: sivabalan narayanan
> Priority: Major
> Fix For: 1.0.0
>
> Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x tables
>
> Readers : 1.x
> # Spark SQL
> # Spark Datasource
> # Trino/Presto
> # Hive
> # Flink
> Writer: 0.16
> Table State:
> * COW
> ** few write commits
> ** Pending clustering
> ** Completed Clustering
> ** Failed writes with no rollbacks
> ** Insert overwrite table/partition
> ** Savepoint for Time-travel query
> * MOR
> ** Same as COW
> ** Pending and completed async compaction (with log-files and no base file)
> ** Custom Payloads (for MOR snapshot queries) (e.g. SQL Expression Payload)
> ** Rollback formats - DELETE, rollback block
> Other knobs:
> # Metadata enabled/disabled
> # Column Stats enabled/disabled and data-skipping enabled/disabled
> # RLI enabled with eq/IN queries
> # Non-Partitioned dataset
> # CDC Reads
> # Incremental Reads
> # Time-travel query
>
> What to test ?
> # Query Results Correctness
> # Performance : See the benefit of
> # Partition Pruning
> # Metadata table - col stats, RLI
>
> Corner Case Testing:
>
> # Schema Evolution with different file-groups having different generation of schema
> # Dynamic Partition Pruning
> # Does Column Projection work correctly for log files reading

-- This message was sent by Atlassian Jira (v8.20.10#820010)
(hudi-rs) branch release-0.1.x updated: build: bump version to 0.1.0-rc1
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch release-0.1.x in repository https://gitbox.apache.org/repos/asf/hudi-rs.git

The following commit(s) were added to refs/heads/release-0.1.x by this push:
     new 973ae75  build: bump version to 0.1.0-rc1
973ae75 is described below

commit 973ae75da14c96c18d411c91e5c3d74448cbb88e
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Wed Jul 10 09:41:45 2024 -0500

    build: bump version to 0.1.0-rc1
---
 Cargo.toml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Cargo.toml b/Cargo.toml
index 8259243..5a26c92 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -23,7 +23,7 @@ members = [
 resolver = "2"

 [workspace.package]
-version = "0.1.0"
+version = "0.1.0-rc1"
 edition = "2021"
 license = "Apache-2.0"
 rust-version = "1.75.0"
Re: [PR] [HUDI-7977] import bucket index partitioner [hudi]
hudi-bot commented on PR #11608: URL: https://github.com/apache/hudi/pull/11608#issuecomment-2220692616 ## CI report: * 3175c04ed10009da40aa6cb9d8e2b4432678d38f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24806) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] Tracking issue for `hudi-rs` release 0.1.0 [hudi-rs]
xushiyan opened a new issue, #62: URL: https://github.com/apache/hudi-rs/issues/62

*This issue is for tracking tasks of releasing `hudi-rs` 0.1.0.*

## Tasks

### Issues
- [ ] This issue is added to [the target milestone](https://github.com/apache/hudi-rs/milestone/1)
- [ ] All remaining issues in the milestone should be closed

> [!CAUTION]
> Blockers to highlight

- [ ] https://github.com/apache/hudi-rs/issues/41
- [ ] https://github.com/apache/hudi-rs/issues/42

### GitHub
- [ ] Bump version
- [ ] Push release tag

### ASF
- [ ] Create an ASF release
- [ ] Upload artifacts to the SVN dist repo
- [ ] Start VOTE in dev email list

> [!CAUTION]
> Proceed from here only after VOTE passes.

### Official release
- [ ] Push the release git tag
- [ ] Publish artifacts to SVN RELEASE branch
- [ ] Send `ANNOUNCE` email to dev and user email lists

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] Write contribution guide and document dev setup [hudi-rs]
xushiyan closed issue #44: Write contribution guide and document dev setup URL: https://github.com/apache/hudi-rs/issues/44 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] Hive SQL query on a Hudi partitioned table: when the partition field is the last column of the table, the data parsed from Parquet gets the partition field's value inserted at the partition column's position, shifting the subsequent columns and causing a type-cast error [hudi]
liucongjy opened a new issue, #11609: URL: https://github.com/apache/hudi/issues/11609 **_Tips before filing an issue_** - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? - Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org. - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly. **Describe the problem you faced** A clear and concise description of the problem. **To Reproduce** Steps to reproduce the behavior:

1. Create the table with Flink SQL and insert data:

CREATE CATALOG hoodie_catalog WITH ( 'type'='hudi', 'catalog.path' = '/tmp/hudi', 'hive.conf.dir' = '/opt/hive-2.3.9/conf', 'mode'='hms' );
USE CATALOG hoodie_catalog;
create database hoodie_catalog.flink_hudi;
CREATE CATALOG myhive WITH ( 'type' = 'hive', 'hive-conf-dir' = '/opt/hive-2.3.9/conf' );
CREATE TABLE hoodie_catalog.flink_hudi.TEST_COW( jllsh VARCHAR(100), syh VARCHAR(50) PRIMARY KEY NOT ENFORCED, jzlsh VARCHAR(40), grbsh VARCHAR(20), dalx VARCHAR(10), dzjkkkh VARCHAR(40), yzzh VARCHAR(40), yzbs VARCHAR(40), fqsj VARCHAR(10), dj DOUBLE, ze DOUBLE, zzsj TIMESTAMP ) PARTITIONED BY (`fqsj`) WITH ( 'connector' = 'hudi', 'path' = 'hdfs://bigdata01:8020/tmp/hudi/flink_hudi/TEST_COW', 'table.type' = 'COPY_ON_WRITE', 'precombine.field' = 'syh', 'write.operation' = 'upsert', 'hoodie.datasource.hive_sync.support_timestamp' = 'true' );
insert into hoodie_catalog.flink_hudi.TEST_COW select jllsh,syh,jzlsh,grbsh,dalx,dzjkkkh,yzzh,yzbs,fqsj,dj,ze,zzsj from myhive.ods.prdata where pch='202407090102000' limit 100;

2. Read the table with Flink SQL — the data is read back correctly: select * from hoodie_catalog.flink_hudi.EHR_ZYYZMX_PRE_COW;

3. Read the data with Hive SQL: use flink_hudi; select dj,ze,zzsj from test_cow;

**Expected behavior** Failed with exception java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.DoubleWritable

**Environment Description** * Hudi version : 0.14 * Spark version : 3.3 * Hive version : 2.3.9 * Hadoop version : 3.0 * Storage (HDFS/S3/GCS..) : HDFS * Running on Docker? (yes/no) : no

**Additional context** Debugging the Hive log output, I can see the row being read back as [2014-06-06, 0.0, 0.0]. In my Hive query (select dj,ze,zzsj from test_cow;) the first column is dj, a DOUBLE field, yet the value read back at that position is the partition field's value. The class that logs the [2014-06-06, 0.0, 0.0] row is org.apache.hadoop.hive.ql.exec.ListSinkOperator, and the method invoked is (log messages translated from Chinese):

```
public void process(Object row, int tag) throws HiveException {
  try {
    LOG.info("row data: {}, tag: {}, class: {}", row, tag, row.getClass().getName());
    LOG.info("row object inspectors: {}", inputObjInspectors);
    ClassLoader classLoader = fetcher.getClass().getClassLoader();
    if (classLoader != null) {
      // Try to find the resource URLs for the class
      try {
        Enumeration urls = classLoader.getResources(fetcher.getClass().getName().replace('.', '/') + ".class");
        while (urls.hasMoreElements()) {
          URL url = urls.nextElement();
          LOG.info("Class " + fetcher.getClass().getName() + " is loaded from: " + url);
        }
      } catch (Exception e) {
        e.printStackTrace();
      }
    } else {
      LOG.info("Class " + fetcher.getClass().getName() + " is loaded by the bootstrap class loader.");
    }
    res.add(fetcher.convert(row, inputObjInspectors[0]));
    numRows++;
    runTimeNumRows++;
  } catch (Exception e) {
    throw new HiveException(e);
  }
}
```

**Stacktrace** (reconstructed from a garbled capture)

2024-07-08T14:42:03,814 ERROR [62dfdf81-d91C-40f8-91a2-d84f761a3671 main] CliDriver: Failed with exception java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.DoubleWritable
java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.DoubleWritable
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:165)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:257)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:407)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:825)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:763)
    at org.apache.hadoop.hive.cli.CliDriv
Re: [PR] build: bump version to 0.2.0 [hudi-rs]
codecov[bot] commented on PR #61: URL: https://github.com/apache/hudi-rs/pull/61#issuecomment-2220625487

## [Codecov](https://app.codecov.io/gh/apache/hudi-rs/pull/61?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report

All modified and coverable lines are covered by tests :white_check_mark:

> Project coverage is 87.19%. Comparing base [(`2bb004b`)](https://app.codecov.io/gh/apache/hudi-rs/commit/2bb004b48efb5624813671e38c890c6abff01712?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`1a6c604`)](https://app.codecov.io/gh/apache/hudi-rs/commit/1a6c60409f71a7a359ed60620e11f624684869fb?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main      #61   +/-   ##
=======================================
  Coverage   87.19%   87.19%
=======================================
  Files          13       13
  Lines         687      687
=======================================
  Hits          599      599
  Misses         88       88
```

[:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/hudi-rs/pull/61?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] build: bump version to 0.2.0 [hudi-rs]
xushiyan opened a new pull request, #61: URL: https://github.com/apache/hudi-rs/pull/61 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi-rs) branch release-0.1.x created (now 2bb004b)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a change to branch release-0.1.x in repository https://gitbox.apache.org/repos/asf/hudi-rs.git at 2bb004b build: add info for rust and python artifacts (#60) No new revisions were added by this update.
[jira] [Updated] (HUDI-7978) Update docs for older versions to state that partitions should be ordered when creating multiple partitions
[ https://issues.apache.org/jira/browse/HUDI-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7978: -- Component/s: docs > Update docs for older versions to state that partitions should be ordered > when creating multiple partitions > --- > > Key: HUDI-7978 > URL: https://issues.apache.org/jira/browse/HUDI-7978 > Project: Apache Hudi > Issue Type: Sub-task > Components: docs >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7978) Update docs for older versions to state that partitions should be ordered when creating multiple partitions
[ https://issues.apache.org/jira/browse/HUDI-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit reassigned HUDI-7978: - Assignee: Sagar Sumit > Update docs for older versions to state that partitions should be ordered > when creating multiple partitions > --- > > Key: HUDI-7978 > URL: https://issues.apache.org/jira/browse/HUDI-7978 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7978) Update docs for older versions to state that partitions should be ordered when creating multiple partitions
Sagar Sumit created HUDI-7978: - Summary: Update docs for older versions to state that partitions should be ordered when creating multiple partitions Key: HUDI-7978 URL: https://issues.apache.org/jira/browse/HUDI-7978 Project: Apache Hudi Issue Type: Sub-task Reporter: Sagar Sumit -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7977] import bucket index partitioner [hudi]
KnightChess commented on PR #11608: URL: https://github.com/apache/hudi/pull/11608#issuecomment-2220529640 @danny0405 @xicm cc: -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7977] import bucket index partitioner [hudi]
KnightChess closed pull request #11608: [HUDI-7977] import bucket index partitioner URL: https://github.com/apache/hudi/pull/11608 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7977] import bucket index partitioner [hudi]
hudi-bot commented on PR #11608: URL: https://github.com/apache/hudi/pull/11608#issuecomment-2220434167 ## CI report: * 3175c04ed10009da40aa6cb9d8e2b4432678d38f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24806) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7977] import bucket index partitioner [hudi]
hudi-bot commented on PR #11608: URL: https://github.com/apache/hudi/pull/11608#issuecomment-2220417516 ## CI report: * 3175c04ed10009da40aa6cb9d8e2b4432678d38f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] [HUDI-7977] import bucket index partitioner [hudi]
KnightChess opened a new pull request, #11608: URL: https://github.com/apache/hudi/pull/11608 ### Change Logs As https://github.com/apache/hudi/pull/11578 describes, the current algorithm has drawbacks in some cases, but this new algorithm also has drawbacks, so we can design the UT based on practical considerations. old algorithm: ![image](https://github.com/apache/hudi/assets/20125927/101cfedf-3a63-4f28-be20-9572adc8f5a7) new algorithm: ![image](https://github.com/apache/hudi/assets/20125927/c4637dbe-1e91-4322-b3aa-020545b7e023) Both of them perform badly in the case https://github.com/apache/hudi/pull/11578#issuecomment-2219339643 where partition values are discontinuous: parallelism = 10, bucketNumber = 5 and partition = ["2021-01-01", "2021-01-03"] old: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] new: [2, 2, 2, 2, 2] parallelism = 20, bucketNumber = 5 and partition = ["2021-01-01", "2021-01-03"] old: [2, 2, 2, 2, 2] new: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] ### Impact none ### Risk level (write none, low medium or high below) low ### Documentation Update none ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
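The load arrays in the PR description ([1, 1, ...] vs [2, 2, ...]) count how many (partition, bucket) write units land on each task. A minimal sketch of how such a distribution can be measured is below; note that `modulo_assign` is an illustrative placeholder assignment function, not the actual old or new Hudi `BucketIndexUtil` algorithm.

```python
# Hypothetical sketch: measuring how evenly (partition, bucket) pairs are
# spread across writer tasks for a given assignment function. This is NOT
# the actual Hudi BucketIndexUtil code, only an illustration of the metric
# discussed in the PR description.

def modulo_assign(partition_idx, bucket_id, bucket_num, parallelism):
    # Placeholder strategy: flatten (partition, bucket) into a global index
    # and take it modulo the parallelism.
    return (partition_idx * bucket_num + bucket_id) % parallelism

def load_distribution(assign, partitions, bucket_num, parallelism):
    # Count how many (partition, bucket) units each task receives and
    # report only the non-empty tasks, matching the PR's notation.
    loads = [0] * parallelism
    for p_idx, _ in enumerate(partitions):
        for bucket_id in range(bucket_num):
            loads[assign(p_idx, bucket_id, bucket_num, parallelism)] += 1
    return [n for n in loads if n > 0]

# 2 partitions x 5 buckets = 10 units over 10 tasks:
print(load_distribution(modulo_assign, ["2021-01-01", "2021-01-03"], 5, 10))
```

An ideal partitioner would keep this list flat (all tasks carrying the same count) for any combination of parallelism, bucket number, and partition set, which is exactly the trade-off the old and new algorithms make differently in the examples above.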
[jira] [Updated] (HUDI-7977) improve bucket index partitioner
[ https://issues.apache.org/jira/browse/HUDI-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7977: - Labels: pull-request-available (was: ) > improve bucket index partitioner > --- > > Key: HUDI-7977 > URL: https://issues.apache.org/jira/browse/HUDI-7977 > Project: Apache Hudi > Issue Type: Improvement > Components: index >Reporter: KnightChess >Assignee: KnightChess >Priority: Major > Labels: pull-request-available > > Improve the {{BucketIndexUtil}} partitionIndex algorithm so that the data is > evenly distributed. -- This message was sent by Atlassian Jira (v8.20.10#820010)