Re: [PR] [HUDI-7980] Optimize the configuration content when performing clustering with row writer [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11614:
URL: https://github.com/apache/hudi/pull/11614#issuecomment-169886

   
   ## CI report:
   
   * 305cfba4c163a2d70bcbeff8029c9f2a2d205a3c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24819)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7980] Optimize the configuration content when performing clustering with row writer [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11614:
URL: https://github.com/apache/hudi/pull/11614#issuecomment-154563

   
   ## CI report:
   
   * 305cfba4c163a2d70bcbeff8029c9f2a2d205a3c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11612:
URL: https://github.com/apache/hudi/pull/11612#issuecomment-143878

   
   ## CI report:
   
   * 8669c1c9afa99b08f866c97ed18eac0446cb1b36 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24816)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Setup Python release pipeline [hudi-rs]

2024-07-10 Thread via GitHub


xushiyan closed issue #42: Setup Python release pipeline
URL: https://github.com/apache/hudi-rs/issues/42


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7980) Optimize the configuration content when performing clustering with row writer

2024-07-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7980:
-
Labels: pull-request-available  (was: )

> Optimize the configuration content when performing clustering with row writer
> -
>
> Key: HUDI-7980
> URL: https://issues.apache.org/jira/browse/HUDI-7980
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ma Jian
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the row writer defaults to snapshot reads for all tables. However, 
> this method is relatively inefficient for MOR (Merge on Read) tables when 
> there are no logs. Therefore, we should optimize this part of the 
> configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7980] Optimize the configuration content when performing clustering with row writer [hudi]

2024-07-10 Thread via GitHub


majian1998 opened a new pull request, #11614:
URL: https://github.com/apache/hudi/pull/11614

   Currently, the row writer defaults to snapshot reads for all tables. However, this method is relatively inefficient for MOR tables when there are no log files. Additionally, we have already configured the glob path for queries, so we can read all the files from the glob path without requiring additional configurations for time travel queries. Therefore, we should optimize this part of the configuration.
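
   To make the trade-off concrete, here is a minimal, hypothetical sketch of the read-side decision described above (this is not the PR's actual change; `hasNoLogFiles` is an assumed input, while `hoodie.datasource.query.type` is a standard Hudi read option):

   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;

   public class RowWriterReadSketch {
     // For a MOR file group with no log files, a read-optimized scan of the
     // base files returns the same rows as a snapshot read, but skips setting
     // up log-file merging.
     public static Dataset<Row> readForClustering(SparkSession spark, String globPath, boolean hasNoLogFiles) {
       String queryType = hasNoLogFiles ? "read_optimized" : "snapshot";
       return spark.read()
           .format("hudi")
           .option("hoodie.datasource.query.type", queryType)
           // the glob path already pins the exact file slices to cluster,
           // so no extra time-travel configuration is needed
           .load(globPath);
     }
   }
   ```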
   
   ### Change Logs
   
   None
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]Exception when executing log compaction : Unsupported Operation Exception [hudi]

2024-07-10 Thread via GitHub


xuzifu666 commented on issue #10982:
URL: https://github.com/apache/hudi/issues/10982#issuecomment-107992

   This issue can be closed now. @xushiyan @danny0405 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7980) Optimize the configuration content when performing clustering with row writer

2024-07-10 Thread Ma Jian (Jira)
Ma Jian created HUDI-7980:
-

 Summary: Optimize the configuration content when performing 
clustering with row writer
 Key: HUDI-7980
 URL: https://issues.apache.org/jira/browse/HUDI-7980
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ma Jian


Currently, the row writer defaults to snapshot reads for all tables. However, 
this method is relatively inefficient for MOR (Merge on Read) tables when there 
are no logs. Therefore, we should optimize this part of the configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi-rs) branch main updated: build: add release workflow (#63)

2024-07-10 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/hudi-rs.git


The following commit(s) were added to refs/heads/main by this push:
 new cb9f4d8  build: add release workflow (#63)
cb9f4d8 is described below

commit cb9f4d8a64fd6e8c2e06fc2c79a6c6740eeadba4
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Thu Jul 11 00:52:47 2024 -0500

build: add release workflow (#63)

- Add `release.yml` workflow in github to publish artifacts to crates.io 
and pypi.org
- Add various metadata to populate artifact repository main pages
- Fix python artifacts issues wrt macos aarch64

Closes #41 #42
---
 .github/workflows/release.yml | 146 ++
 README.md |  34 +-
 crates/core/Cargo.toml|   5 ++
 crates/datafusion/Cargo.toml  |   7 +-
 crates/hudi/Cargo.toml|  12 +++-
 crates/tests/Cargo.toml   |   5 ++
 python/.cargo/config.toml |   1 +
 python/Cargo.toml |  12 ++--
 python/README.md  |  46 -
 python/pyproject.toml |   7 +-
 10 files changed, 199 insertions(+), 76 deletions(-)

diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
new file mode 100644
index 000..e7aa911
--- /dev/null
+++ b/.github/workflows/release.yml
@@ -0,0 +1,146 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Publish artifacts
+
+on:
+  push:
+    tags:
+      - 'release-[0-9]+.[0-9]+.[0-9]+**'
+
+jobs:
+  validate-release-tag:
+    name: Validate git tag
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: compare git tag with cargo metadata
+        run: |
+          CURR_VER=$( grep version Cargo.toml | head -n 1 | awk '{print $3}' | tr -d '"' )
+          if [[ "${GITHUB_REF_NAME}" != "release-${CURR_VER}" ]]; then
+            echo "Pushed tag ${GITHUB_REF_NAME} does not match with the Cargo package version ${CURR_VER}."
+            exit 1
+          fi
+
+  release-crates:
+    name: Release to crates.io
+    needs: validate-release-tag
+    runs-on: ubuntu-latest
+    strategy:
+      max-parallel: 1
+      matrix:
+        # order matters here as later crates depend on previous ones
+        package:
+          - "hudi-core"
+          - "hudi-datafusion"
+          - "hudi"
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions-rs/toolchain@v1
+        with:
+          profile: minimal
+          toolchain: stable
+          override: true
+
+      - name: cargo publish
+        uses: actions-rs/cargo@v1
+        env:
+          CARGO_REGISTRY_TOKEN: ${{ secrets.CARGO_REGISTRY_TOKEN }}
+        with:
+          command: publish
+          args: -p ${{ matrix.package }} --all-features
+
+  release-pypi-mac:
+    name: PyPI release on Mac
+    needs: validate-release-tag
+    strategy:
+      fail-fast: false
+      matrix:
+        target: [ x86_64-apple-darwin, aarch64-apple-darwin ]
+    runs-on: macos-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.12'
+
+      - name: Publish to pypi (without sdist)
+        uses: PyO3/maturin-action@v1
+        env:
+          MATURIN_PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }}
+          MATURIN_REPOSITORY: pypi
+        with:
+          target: ${{ matrix.target }}
+          command: publish
+          args: --skip-existing -m python/Cargo.toml --no-sdist
+
+  release-pypi-windows:
+    name: PyPI release on Windows
+    needs: validate-release-tag
+    runs-on: windows-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.12'
+
+      - name: Publish to pypi (without sdist)
+        uses: PyO3/maturin-action@v1
+        env:
+          MATURIN_PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }}
+          MATURIN_REPOSITORY: pypi
+        with:
+          target: x86_64-pc-windows-msvc
+          command: publish
+          args: --skip-existing -m python/Cargo.toml --no-sdist
+
+  release-pypi-manylinux:
+    name: PyPI rele

Re: [I] Setup Rust release pipeline [hudi-rs]

2024-07-10 Thread via GitHub


xushiyan closed issue #41: Setup Rust release pipeline
URL: https://github.com/apache/hudi-rs/issues/41


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Re: [PR] build: add release workflow [hudi-rs]

2024-07-10 Thread via GitHub


xushiyan merged PR #63:
URL: https://github.com/apache/hudi-rs/pull/63


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] build: add release workflow [hudi-rs]

2024-07-10 Thread via GitHub


codecov[bot] commented on PR #63:
URL: https://github.com/apache/hudi-rs/pull/63#issuecomment-080421

   ## 
[Codecov](https://app.codecov.io/gh/apache/hudi-rs/pull/63?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Project coverage is 87.19%. Comparing base 
[(`2bb004b`)](https://app.codecov.io/gh/apache/hudi-rs/commit/2bb004b48efb5624813671e38c890c6abff01712?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`b7e85d2`)](https://app.codecov.io/gh/apache/hudi-rs/commit/b7e85d285b458562063b26a7b4759619a7e8cfce?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   Additional details and impacted files
   
   
   ```diff
   @@           Coverage Diff           @@
   ##             main      #63   +/-   ##
   =======================================
     Coverage   87.19%   87.19%
   =======================================
     Files          13       13
     Lines         687      687
   =======================================
     Hits          599      599
     Misses         88       88
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/hudi-rs/pull/63?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] build: add release workflow [hudi-rs]

2024-07-10 Thread via GitHub


xushiyan opened a new pull request, #63:
URL: https://github.com/apache/hudi-rs/pull/63

   - Add `release.yml` workflow in github to publish artifacts to crates.io and 
pypi.org
   - Add various metadata to populate artifact repository main pages
   - Fix python artifacts issues wrt macos aarch64
   
   Closes #41 #42


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-304] enable spotless for hudi [hudi]

2024-07-10 Thread via GitHub


HuangZhenQiu closed pull request #11613: [HUDI-304] enable spotless for hudi
URL: https://github.com/apache/hudi/pull/11613


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-304] enable spotless for hudi [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11613:
URL: https://github.com/apache/hudi/pull/11613#issuecomment-024178

   
   ## CI report:
   
   * b1246eca2843a2bf080d0e9df74ffa7045a5935c UNKNOWN
   * 47c7f9f4728cfb71f1887b65d1befcde718aee34 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11612:
URL: https://github.com/apache/hudi/pull/11612#issuecomment-024075

   
   ## CI report:
   
   * 8669c1c9afa99b08f866c97ed18eac0446cb1b36 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24816)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7709) ClassCastException while reading the data using TimestampBasedKeyGenerator

2024-07-10 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7709:

Summary: ClassCastException while reading the data using 
TimestampBasedKeyGenerator  (was: Class Cast Exception while reading the data 
using TimestampBasedKeyGenerator)

> ClassCastException while reading the data using TimestampBasedKeyGenerator
> --
>
> Key: HUDI-7709
> URL: https://issues.apache.org/jira/browse/HUDI-7709
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Aditya Goenka
>Assignee: Geser Dugarov
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/11140]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-304] enable spotless for hudi [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11613:
URL: https://github.com/apache/hudi/pull/11613#issuecomment-017030

   
   ## CI report:
   
   * b1246eca2843a2bf080d0e9df74ffa7045a5935c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11612:
URL: https://github.com/apache/hudi/pull/11612#issuecomment-016941

   
   ## CI report:
   
   * 8669c1c9afa99b08f866c97ed18eac0446cb1b36 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Exception while using HoodieStreamer protobuf data from Kafka [hudi]

2024-07-10 Thread via GitHub


danny0405 commented on issue #11598:
URL: https://github.com/apache/hudi/issues/11598#issuecomment-016668

   @the-other-tim-brown any insights here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-304] enable spotless for hudi [hudi]

2024-07-10 Thread via GitHub


HuangZhenQiu opened a new pull request, #11613:
URL: https://github.com/apache/hudi/pull/11613

   ### Change Logs
   
   Enable the spotless plugin for Hudi.
   ### Impact
   
   No API change
   
   ### Risk level (write none, low medium or high below)
   
   No Risk
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11611:
URL: https://github.com/apache/hudi/pull/11611#issuecomment-008533

   
   ## CI report:
   
   * 4de9be3830a22f5dc11538353b469c2083f1a35e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24815)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7976) Fix BUG introduced in HUDI-7955 due to usage of wrong class

2024-07-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7976:
-
Labels: pull-request-available  (was: )

> Fix BUG introduced in HUDI-7955 due to usage of wrong class
> ---
>
> Key: HUDI-7976
> URL: https://issues.apache.org/jira/browse/HUDI-7976
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Assignee: voon
>Priority: Major
>  Labels: pull-request-available
>
> In the bugfix for HUDI-7955, the wrong class for invoking {{getTimestamp}} 
> was used.
>  # {*}Wrong{*}: org.apache.hadoop.hive.common.type.Timestamp
>  # {*}Correct{*}: org.apache.hadoop.hive.serde2.io.TimestampWritableV2
>  
> !https://git.garena.com/shopee/data-infra/hudi/uploads/eeff29b3e741c65eeb48f9901fa28da0/image.png|width=468,height=235!
>  
> Submitting a bugfix to fix this bugfix... 
> The log level for the exception block was also changed to warn so errors 
> will be printed out.
> On top of that, we have simplified the {{getMillis}} shim to remove the 
> method that was added in HUDI-7955 to standardise it with how {{getDays}} is 
> written.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class [hudi]

2024-07-10 Thread via GitHub


voonhous opened a new pull request, #11612:
URL: https://github.com/apache/hudi/pull/11612

   ### Change Logs
   
   In the bugfix for [HUDI-7955](https://issues.apache.org/jira/browse/HUDI-7955), the wrong class for invoking `getTimestamp` was used.
   
   1. **Wrong**: org.apache.hadoop.hive.common.type.Timestamp
   2. **Correct**: org.apache.hadoop.hive.serde2.io.TimestampWritableV2
   
   
![image](https://github.com/apache/hudi/assets/6312314/3d1bca3a-2ad4-4e25-b421-daec91d7c65c)
   
   Submitting a bugfix to fix this bugfix... 
   
   The log level for the exception block was also changed to warn so errors will be printed out.
   However, false positives might be printed when the environment is indeed Hive2.
   
   On top of that, we have simplified the `getMillis` shim to remove the method 
that was added in [HUDI-7955](https://issues.apache.org/jira/browse/HUDI-7955) 
to standardise it with how `getDays` is written.
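
   For illustration, a minimal reflection-based sketch of the shim behaviour described above (assumed Hive 3 signatures; this is not the PR's actual code). Per the description, a failure in this path is logged at WARN because it may be a false positive on Hive2:

   ```java
   import java.lang.reflect.Method;

   public final class TimestampShimSketch {
     // Extracts epoch millis from the writable that Hive 3 hands to the
     // ObjectInspector. The point of the fix: getTimestamp lives on
     // TimestampWritableV2, not on org.apache.hadoop.hive.common.type.Timestamp.
     public static long getMillis(Object writable) throws Exception {
       Class<?> writableV2 = Class.forName("org.apache.hadoop.hive.serde2.io.TimestampWritableV2");
       Method getTimestamp = writableV2.getMethod("getTimestamp");
       Object hiveTimestamp = getTimestamp.invoke(writable); // Hive 3's own Timestamp
       Method toEpochMilli = hiveTimestamp.getClass().getMethod("toEpochMilli");
       return (long) toEpochMilli.invoke(hiveTimestamp);
     }
   }
   ```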
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11539:
URL: https://github.com/apache/hudi/pull/11539#issuecomment-2221956386

   
   ## CI report:
   
   * dac29c7e89201f0ced6d394bf6fd4a5c0622167b UNKNOWN
   * e8ad55251f5c98b061208ceb2e52637b345e9db0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24813)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader

2024-07-10 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7971:
--
Description: 
Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x 
tables 

 

Readers :  1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 ** few write commits 
 ** Pending clustering
 ** Completed Clustering
 ** Failed writes with no rollbacks
 ** Insert overwrite table/partition
 ** Savepoint for Time-travel query

 * MOR
 ** Same as COW
 ** Pending and completed async compaction (with log-files and no base file)
 ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
 ** Log block formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled (all combinations)
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries

 # Non-Partitioned dataset (all combinations)
 # CDC Reads 
 # Incremental Reads
 # Time-travel query

 

What to test ?
 # Query Results Correctness
 # Performance : See the benefit of 
 # Partition Pruning
 # Metadata  table - col stats, RLI,

 

Corner Case Testing:

 
 # Schema Evolution with different file-groups having different generation of 
schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log files reading 

  was:
Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x 
tables 

 

Readers :  1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 ** few write commits 
 ** Pending clustering
 ** Completed Clustering
 ** Failed writes with no rollbacks
 ** Insert overwrite table/partition
 ** Savepoint for Time-travel query

 * MOR
 ** Same as COW
 ** Pending and completed async compaction (with log-files and no base file)
 ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
 ** Log block formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries

 # Non-Partitioned dataset
 # CDC Reads 
 # Incremental Reads
 # Time-travel query

 

What to test ?
 # Query Results Correctness
 # Performance : See the benefit of 
 # Partition Pruning
 # Metadata  table - col stats, RLI,

 

Corner Case Testing:

 
 # Schema Evolution with different file-groups having different generation of 
schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log files reading 


> Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader 
> -
>
> Key: HUDI-7971
> URL: https://issues.apache.org/jira/browse/HUDI-7971
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 1.0.0
>
>
> Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x 
> tables 
>  
> Readers :  1.x
>  # Spark SQL
>  # Spark Datasource
>  # Trino/Presto
>  # Hive
>  # Flink
> Writer: 0.16
> Table State:
>  * COW
>  ** few write commits 
>  ** Pending clustering
>  ** Completed Clustering
>  ** Failed writes with no rollbacks
>  ** Insert overwrite table/partition
>  ** Savepoint for Time-travel query
>  * MOR
>  ** Same as COW
>  ** Pending and completed async compaction (with log-files and no base file)
>  ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
>  ** Log block formats - DELETE, rollback block
> Other knobs:
>  # Metadata enabled/disabled (all combinations)
>  # Column Stats enabled/disabled and data-skipping enabled/disabled
>  # RLI enabled with eq/IN queries
>  # Non-Partitioned dataset (all combinations)
>  # CDC Reads 
>  # Incremental Reads
>  # Time-travel query
>  
> What to test ?
>  # Query Results Correctness
>  # Performance : See the benefit of 
>  # Partition Pruning
>  # Metadata  table - col stats, RLI,
>  
> Corner Case Testing:
>  
>  # Schema Evolution with different file-groups having different generation of 
> schema
>  # Dynamic Partition Pruning
>  # Does Column Projection work correctly for log files reading 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader

2024-07-10 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7971:
--
Description: 
Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x 
tables 

 

Readers :  1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 ** few write commits 
 ** Pending clustering
 ** Completed Clustering
 ** Failed writes with no rollbacks
 ** Insert overwrite table/partition
 ** Savepoint for Time-travel query

 * MOR
 ** Same as COW
 ** Pending and completed async compaction (with log-files and no base file)
 ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
 ** Log block formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries

 # Non-Partitioned dataset
 # CDC Reads 
 # Incremental Reads
 # Time-travel query

 

What to test ?
 # Query Results Correctness
 # Performance : See the benefit of 
 # Partition Pruning
 # Metadata  table - col stats, RLI,

 

Corner Case Testing:

 
 # Schema Evolution with different file-groups having different generation of 
schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log files reading 

  was:
Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x 
tables 

 

Readers :  1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 ** few write commits 
 ** Pending clustering
 ** Completed Clustering
 ** Failed writes with no rollbacks
 ** Insert overwrite table/partition
 ** Savepoint for Time-travel query

 * MOR
 ** Same as COW
 ** Pending and completed async compaction (with log-files and no base file)
 ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
 ** Rollback formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries

 # Non-Partitioned dataset
 # CDC Reads 
 # Incremental Reads
 # Time-travel query

 

What to test ?
 # Query Results Correctness
 # Performance : See the benefit of 
 # Partition Pruning
 # Metadata  table - col stats, RLI,

 

Corner Case Testing:

 
 # Schema Evolution with different file-groups having different generation of 
schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log files reading 


> Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader 
> -
>
> Key: HUDI-7971
> URL: https://issues.apache.org/jira/browse/HUDI-7971
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 1.0.0
>
>
> Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x 
> tables 
>  
> Readers :  1.x
>  # Spark SQL
>  # Spark Datasource
>  # Trino/Presto
>  # Hive
>  # Flink
> Writer: 0.16
> Table State:
>  * COW
>  ** few write commits 
>  ** Pending clustering
>  ** Completed Clustering
>  ** Failed writes with no rollbacks
>  ** Insert overwrite table/partition
>  ** Savepoint for Time-travel query
>  * MOR
>  ** Same as COW
>  ** Pending and completed async compaction (with log-files and no base file)
>  ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
>  ** Log block formats - DELETE, rollback block
> Other knobs:
>  # Metadata enabled/disabled
>  # Column Stats enabled/disabled and data-skipping enabled/disabled
>  # RLI enabled with eq/IN queries
>  # Non-Partitioned dataset
>  # CDC Reads 
>  # Incremental Reads
>  # Time-travel query
>  
> What to test ?
>  # Query Results Correctness
>  # Performance : See the benefit of 
>  # Partition Pruning
>  # Metadata  table - col stats, RLI,
>  
> Corner Case Testing:
>  
>  # Schema Evolution with different file-groups having different generation of 
> schema
>  # Dynamic Partition Pruning
>  # Does Column Projection work correctly for log files reading 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch asf-site updated: [HUDI-7978][DOCS] Add a note on field ordering in partitioned by clause of create sql (#11610)

2024-07-10 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 488438fbb6a [HUDI-7978][DOCS] Add a note on field ordering in partitioned by clause of create sql (#11610)
488438fbb6a is described below

commit 488438fbb6ae2f8dfcc9257016c66a38c0352171
Author: Sagar Sumit 
AuthorDate: Thu Jul 11 08:38:42 2024 +0530

[HUDI-7978][DOCS] Add a note on field ordering in partitioned by clause of create sql (#11610)
---
 website/docs/sql_ddl.md| 5 -
 website/versioned_docs/version-0.11.0/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.11.1/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.12.0/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.12.1/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.12.2/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.12.3/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.13.0/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.13.1/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.14.0/sql_ddl.md   | 5 -
 website/versioned_docs/version-0.14.1/sql_ddl.md   | 5 -
 website/versioned_docs/version-0.15.0/sql_ddl.md   | 5 -
 12 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md
index 61e7d33cd7f..a85d8a7bb04 100644
--- a/website/docs/sql_ddl.md
+++ b/website/docs/sql_ddl.md
@@ -67,7 +67,10 @@ PARTITIONED BY (dt);
 ```
 
 :::note
-You can also create a table partitioned by multiple fields by supplying 
comma-separated field names. For, e.g., "partitioned by dt, hh"
+You can also create a table partitioned by multiple fields by supplying 
comma-separated field names.
+When creating a table partitioned by multiple fields, ensure that you specify 
the columns in the `PARTITIONED BY` clause 
+in the same order as they appear in the `CREATE TABLE` schema. For example, 
for the above table, the partition fields 
+should be specified as `PARTITIONED BY (dt, hh)`.
 :::
 
 ### Create table with record keys and ordering fields
diff --git a/website/versioned_docs/version-0.11.0/quick-start-guide.md 
b/website/versioned_docs/version-0.11.0/quick-start-guide.md
index 9f670871f6a..35336d3f4d4 100644
--- a/website/versioned_docs/version-0.11.0/quick-start-guide.md
+++ b/website/versioned_docs/version-0.11.0/quick-start-guide.md
@@ -298,6 +298,13 @@ partitioned by (dt, hh)
 location '/tmp/hudi/hudi_cow_pt_tbl';
 ```
 
+:::note
+You can also create a table partitioned by multiple fields by supplying 
comma-separated field names.
+When creating a table partitioned by multiple fields, ensure that you specify 
the columns in the `PARTITIONED BY` clause
+in the same order as they appear in the `CREATE TABLE` schema. For example, 
for the above table, the partition fields
+should be specified as `PARTITIONED BY (dt, hh)`.
+:::
+
 **Create Table for an existing Hudi Table**
 
 We can create a table on an existing hudi table(created with spark-shell or 
deltastreamer). This is useful to
diff --git a/website/versioned_docs/version-0.11.1/quick-start-guide.md 
b/website/versioned_docs/version-0.11.1/quick-start-guide.md
index d45b535ef42..d0c32790d5a 100644
--- a/website/versioned_docs/version-0.11.1/quick-start-guide.md
+++ b/website/versioned_docs/version-0.11.1/quick-start-guide.md
@@ -296,6 +296,13 @@ partitioned by (dt, hh)
 location '/tmp/hudi/hudi_cow_pt_tbl';
 ```
 
+:::note
+You can also create a table partitioned by multiple fields by supplying 
comma-separated field names.
+When creating a table partitioned by multiple fields, ensure that you specify 
the columns in the `PARTITIONED BY` clause
+in the same order as they appear in the `CREATE TABLE` schema. For example, 
for the above table, the partition fields
+should be specified as `PARTITIONED BY (dt, hh)`.
+:::
+
 **Create Table for an existing Hudi Table**
 
 We can create a table on an existing hudi table(created with spark-shell or 
deltastreamer). This is useful to
diff --git a/website/versioned_docs/version-0.12.0/quick-start-guide.md 
b/website/versioned_docs/version-0.12.0/quick-start-guide.md
index aac9a9bd048..9fc3a0414f5 100644
--- a/website/versioned_docs/version-0.12.0/quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.0/quick-start-guide.md
@@ -322,6 +322,13 @@ partitioned by (dt, hh)
 location '/tmp/hudi/hudi_cow_pt_tbl';
 ```
 
+:::note
+You can also create a table partitioned by multiple fields by supplying 
comma-separated field names.
+When creating a table partitioned by multiple fields, ensure that you specify 
the columns in the `PARTITIONED BY` clause
+in the same order as they appear in the `CREATE TABLE` schema. For example, 
for the above table, the 

Re: [PR] [HUDI-7978][DOCS] Add a note on field ordering in partitioned by clause of create sql [hudi]

2024-07-10 Thread via GitHub


nsivabalan merged PR #11610:
URL: https://github.com/apache/hudi/pull/11610


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11611:
URL: https://github.com/apache/hudi/pull/11611#issuecomment-2221892912

   
   ## CI report:
   
   * 3016b3dc7dff77b95cb349b5ef5b6ba436fa05b4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24809)
 
   * 4de9be3830a22f5dc11538353b469c2083f1a35e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24815)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11611:
URL: https://github.com/apache/hudi/pull/11611#issuecomment-2221885010

   
   ## CI report:
   
   * 3016b3dc7dff77b95cb349b5ef5b6ba436fa05b4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24809)
 
   * 4de9be3830a22f5dc11538353b469c2083f1a35e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11539:
URL: https://github.com/apache/hudi/pull/11539#issuecomment-2221884743

   
   ## CI report:
   
   * dac29c7e89201f0ced6d394bf6fd4a5c0622167b UNKNOWN
   * c4ccc63e66957214447210afa8365e08a2548ea8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24812)
 
   * e8ad55251f5c98b061208ceb2e52637b345e9db0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24813)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11539:
URL: https://github.com/apache/hudi/pull/11539#issuecomment-2221877431

   
   ## CI report:
   
   * dac29c7e89201f0ced6d394bf6fd4a5c0622167b UNKNOWN
   * c4ccc63e66957214447210afa8365e08a2548ea8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24812)
 
   * e8ad55251f5c98b061208ceb2e52637b345e9db0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7974] Create empty clean commit at a cadence and make it configurable [hudi]

2024-07-10 Thread via GitHub


danny0405 commented on code in PR #11605:
URL: https://github.com/apache/hudi/pull/11605#discussion_r1673317175


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##
@@ -97,6 +99,64 @@ private boolean needsCleaning(CleaningTriggerStrategy strategy) {
     }
   }
 
+  private HoodieCleanerPlan getEmptyCleanerPlan(Option<HoodieInstant> earliestInstant, CleanPlanner planner) throws IOException {
+    LOG.info("Nothing to clean here. It is already clean");
+    Option<HoodieInstant> instantVal = getEarliestCommitToRetainToCreateEmptyCleanPlan(earliestInstant);
+    HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder();
+    instantVal.map(x -> cleanBuilder.setPolicy(config.getCleanerPolicy().name())
+        .setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION)
+        .setEarliestInstantToRetain(new HoodieActionInstant(x.getTimestamp(), x.getAction(), x.getState().name()))
+        .setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp())
+    ).orElse(cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()));
+    return cleanBuilder.build();
+  }
+
+  /**
+   * There is nothing to clean, so this method will create an empty clean plan. But there is a chance of optimizing
+   * the subsequent cleaner calls. Consider this scenario in incremental cleaner mode:
+   * if the clean timeline is empty or no clean commits were created for a while, then every clean call has to
+   * scan all the partitions. By creating an empty clean commit to update the earliestCommitToRetain instant value,
+   * the incremental clean policy does not have to look for file changes in all the partitions; rather, it looks
+   * for partitions that were modified in the last x hours. This value is configured through MAX_DURATION_TO_CREATE_EMPTY_CLEAN.

Review Comment:
   > Do you think we should create a separate partition in the metadata table for storing such key-value configs?
   
   I kind of think storing the single value `earliestCommitToRetain` somewhere in the .hoodie directory is feasible; I'm just not sure whether we should utilize the MDT, because the MDT itself is very heavyweight. Maybe some auxiliary marker file is enough.
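
   To make the "auxiliary marker file" idea concrete, here is a rough sketch (purely illustrative, not Hudi code; the file name and layout under `.hoodie` are assumptions): the single `earliestCommitToRetain` value is persisted once per clean cycle so incremental cleaning can resume from it without rescanning every partition.

   ```java
   import java.io.IOException;
   import java.nio.charset.StandardCharsets;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.util.Optional;

   public final class EarliestCommitMarkerSketch {
     private final Path markerFile;

     public EarliestCommitMarkerSketch(Path tableBasePath) {
       // Hypothetical location; Hudi does not currently ship such a file.
       this.markerFile = tableBasePath.resolve(".hoodie").resolve(".earliest_commit_to_retain");
     }

     // Record the instant time that the last (possibly empty) clean decided to retain.
     public void write(String instantTime) throws IOException {
       Files.write(markerFile, instantTime.getBytes(StandardCharsets.UTF_8));
     }

     // Read it back on the next clean; empty means "fall back to a full partition scan".
     public Optional<String> read() throws IOException {
       return Files.exists(markerFile)
           ? Optional.of(new String(Files.readAllBytes(markerFile), StandardCharsets.UTF_8))
           : Optional.empty();
     }
   }
   ```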



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7921] Fixing file system view closures in MDT (#11496)

2024-07-10 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 3789840be3d [HUDI-7921] Fixing file system view closures in MDT 
(#11496)
3789840be3d is described below

commit 3789840be3d041cbcfc6b24786740210e4e6d6ac
Author: Sivabalan Narayanan 
AuthorDate: Wed Jul 10 19:25:41 2024 -0700

[HUDI-7921] Fixing file system view closures in MDT (#11496)
---
 .../metadata/HoodieBackedTableMetadataWriter.java  |  55 ++--
 .../common/testutils/HoodieMetadataTestTable.java  |   6 +
 .../java/org/apache/hudi/table/TestCleaner.java| 326 +++--
 .../table/functional/TestCleanPlanExecutor.java| 325 ++--
 .../hudi/testutils/HoodieCleanerTestBase.java  |  31 +-
 .../hudi/metadata/HoodieBackedTableMetadata.java   |   4 +
 .../hudi/metadata/HoodieTableMetadataUtil.java |  48 +--
 .../hudi/common/testutils/HoodieTestTable.java |   8 +-
 8 files changed, 440 insertions(+), 363 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 89d21e79b22..c38a68e37cf 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -1081,9 +1081,8 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM
   private HoodieData<HoodieRecord> getFunctionalIndexUpdates(HoodieCommitMetadata commitMetadata, String indexPartition, String instantTime) throws Exception {
     HoodieIndexDefinition indexDefinition = getFunctionalIndexDefinition(indexPartition);
     List<Pair<String, FileSlice>> partitionFileSlicePairs = new ArrayList<>();
-    HoodieTableFileSystemView fsView = HoodieTableMetadataUtil.getFileSystemView(dataMetaClient);
     commitMetadata.getPartitionToWriteStats().forEach((dataPartition, value) -> {
-      List<FileSlice> fileSlices = getPartitionLatestFileSlicesIncludingInflight(dataMetaClient, Option.ofNullable(fsView), dataPartition);
+      List<FileSlice> fileSlices = getPartitionLatestFileSlicesIncludingInflight(dataMetaClient, Option.empty(), dataPartition);
       fileSlices.forEach(fileSlice -> {
         // Filter log files for the instant time and add to this partition fileSlice pairs
         List<HoodieLogFile> logFilesForInstant = fileSlice.getLogFiles()
@@ -1411,35 +1410,35 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM
       Map<MetadataPartitionType, HoodieData<HoodieRecord>> partitionRecordsMap) {
     // The result set
     HoodieData<HoodieRecord> allPartitionRecords = engineContext.emptyHoodieData();
+    try (HoodieTableFileSystemView fsView = HoodieTableMetadataUtil.getFileSystemView(metadataMetaClient)) {
+      for (Map.Entry<MetadataPartitionType, HoodieData<HoodieRecord>> entry : partitionRecordsMap.entrySet()) {
+        final String partitionName = HoodieIndexUtils.getPartitionNameFromPartitionType(entry.getKey(), dataMetaClient, dataWriteConfig.getIndexingConfig().getIndexName());
+        HoodieData<HoodieRecord> records = entry.getValue();
+
+        List<FileSlice> fileSlices =
+            HoodieTableMetadataUtil.getPartitionLatestFileSlices(metadataMetaClient, Option.ofNullable(fsView), partitionName);
+        if (fileSlices.isEmpty()) {
+          // scheduling of INDEX only initializes the file group and not add commit
+          // so if there are no committed file slices, look for inflight slices
+          fileSlices = getPartitionLatestFileSlicesIncludingInflight(metadataMetaClient, Option.ofNullable(fsView), partitionName);
+        }
+        final int fileGroupCount = fileSlices.size();
+        ValidationUtils.checkArgument(fileGroupCount > 0, String.format("FileGroup count for MDT partition %s should be >0", partitionName));
+
+        List<FileSlice> finalFileSlices = fileSlices;
+        HoodieData<HoodieRecord> rddSinglePartitionRecords = records.map(r -> {
+          FileSlice slice = finalFileSlices.get(HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(r.getRecordKey(), fileGroupCount));
+          r.unseal();
+          r.setCurrentLocation(new HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
+          r.seal();
+          return r;
+        });
 
-    HoodieTableFileSystemView fsView = HoodieTableMetadataUtil.getFileSystemView(metadataMetaClient);
-    for (Map.Entry<MetadataPartitionType, HoodieData<HoodieRecord>> entry : partitionRecordsMap.entrySet()) {
-      final String partitionName = HoodieIndexUtils.getPartitionNameFromPartitionType(entry.getKey(), dataMetaClient, dataWriteConfig.getIndexingConfig().getIndexName());
-      HoodieData<HoodieRecord> records = entry.getValue();
-
-      List<FileSlice> fileSlices =
-          HoodieTableMetadataUtil.getPartitionLatestFileSlices(metadataMetaClient, Option.ofNullable(fsView), partitionNa

Re: [PR] [HUDI-7921] Fixing file system view closures in MDT [hudi]

2024-07-10 Thread via GitHub


nsivabalan merged PR #11496:
URL: https://github.com/apache/hudi/pull/11496


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11539:
URL: https://github.com/apache/hudi/pull/11539#issuecomment-2221839392

   
   ## CI report:
   
   * dac29c7e89201f0ced6d394bf6fd4a5c0622167b UNKNOWN
   * c14015c3618d231bc439c0a4fb14ce2dff32de00 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24803)
 
   * c4ccc63e66957214447210afa8365e08a2548ea8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24812)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7774] Add Avro Logical type support for Merciful Java convertor [hudi]

2024-07-10 Thread via GitHub


codope commented on code in PR #11265:
URL: https://github.com/apache/hudi/pull/11265#discussion_r1673287677


##
hudi-common/src/test/resources/date-type-invalid.avsc:
##


Review Comment:
   OK with separate resource files, but we should be mindful that too many files like these will bloat the bundle.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7774] Add Avro Logical type support for Merciful Java convertor [hudi]

2024-07-10 Thread via GitHub


codope commented on code in PR #11265:
URL: https://github.com/apache/hudi/pull/11265#discussion_r1673285947


##
hudi-common/src/main/java/org/apache/hudi/avro/MercifulJsonConverter.java:
##
@@ -187,196 +178,774 @@ private static Object convertJsonToAvroField(Object 
value, String name, Schema s
   throw new HoodieJsonToAvroConversionException(null, name, schema, 
shouldSanitize, invalidCharMask);
 }
 
-JsonToAvroFieldProcessor processor = 
FIELD_TYPE_PROCESSORS.get(schema.getType());
-if (null != processor) {
+return JsonToAvroFieldProcessorUtil.convertToAvro(value, name, schema, 
shouldSanitize, invalidCharMask);
+  }
+
+  private static class JsonToAvroFieldProcessorUtil {
+/**
+ * Base Class for converting json to avro fields.
+ */
+private abstract static class JsonToAvroFieldProcessor implements 
Serializable {
+
+  public Object convertToAvro(Object value, String name, Schema schema, 
boolean shouldSanitize, String invalidCharMask) {
+Pair res = convert(value, name, schema, 
shouldSanitize, invalidCharMask);
+if (!res.getLeft()) {
+  throw new HoodieJsonToAvroConversionException(value, name, schema, 
shouldSanitize, invalidCharMask);
+}
+return res.getRight();
+  }
+
+  protected abstract Pair convert(Object value, String 
name, Schema schema, boolean shouldSanitize, String invalidCharMask);
+}
+
+public static Object convertToAvro(Object value, String name, Schema 
schema, boolean shouldSanitize, String invalidCharMask) {
+  JsonToAvroFieldProcessor processor = getProcessorForSchema(schema);
   return processor.convertToAvro(value, name, schema, shouldSanitize, 
invalidCharMask);
 }
-throw new IllegalArgumentException("JsonConverter cannot handle type: " + 
schema.getType());
-  }
 
-  /**
-   * Base Class for converting json to avro fields.
-   */
-  private abstract static class JsonToAvroFieldProcessor implements 
Serializable {
+private static JsonToAvroFieldProcessor getProcessorForSchema(Schema 
schema) {
+  JsonToAvroFieldProcessor processor = null;
+
+  // 3 cases to consider: customized logicalType, logicalType, and type.
+  String customizedLogicalType = schema.getProp("logicalType");
+  LogicalType logicalType = schema.getLogicalType();
+  Type type = schema.getType();
+  if (customizedLogicalType != null && !customizedLogicalType.isEmpty()) {
+processor = 
AVRO_LOGICAL_TYPE_FIELD_PROCESSORS.get(customizedLogicalType);
+  } else if (logicalType != null) {
+processor = 
AVRO_LOGICAL_TYPE_FIELD_PROCESSORS.get(logicalType.getName());
+  } else {
+processor = AVRO_TYPE_FIELD_TYPE_PROCESSORS.get(type);
+  }
 
-public Object convertToAvro(Object value, String name, Schema schema, 
boolean shouldSanitize, String invalidCharMask) {
-  Pair res = convert(value, name, schema, shouldSanitize, 
invalidCharMask);
-  if (!res.getLeft()) {
-throw new HoodieJsonToAvroConversionException(value, name, schema, 
shouldSanitize, invalidCharMask);
+  if (processor == null) {
+throw new IllegalArgumentException(String.format("JsonConverter cannot 
handle type: %s", type));
   }
-  return res.getRight();
+  return processor;
 }
 
-protected abstract Pair convert(Object value, String 
name, Schema schema, boolean shouldSanitize, String invalidCharMask);
-  }
+// Avro primitive and complex type processors.
+private static final Map 
AVRO_TYPE_FIELD_TYPE_PROCESSORS = getFieldTypeProcessors();
+// Avro logical type processors.
+private static final Map 
AVRO_LOGICAL_TYPE_FIELD_PROCESSORS = getLogicalFieldTypeProcessors();
+
+/**
+ * Build type processor map for each avro type.
+ */
+private static Map 
getFieldTypeProcessors() {
+  Map fieldTypeProcessors = new 
EnumMap<>(Schema.Type.class);
+  fieldTypeProcessors.put(Type.STRING, generateStringTypeHandler());
+  fieldTypeProcessors.put(Type.BOOLEAN, generateBooleanTypeHandler());
+  fieldTypeProcessors.put(Type.DOUBLE, generateDoubleTypeHandler());
+  fieldTypeProcessors.put(Type.FLOAT, generateFloatTypeHandler());
+  fieldTypeProcessors.put(Type.INT, generateIntTypeHandler());
+  fieldTypeProcessors.put(Type.LONG, generateLongTypeHandler());
+  fieldTypeProcessors.put(Type.ARRAY, generateArrayTypeHandler());
+  fieldTypeProcessors.put(Type.RECORD, generateRecordTypeHandler());
+  fieldTypeProcessors.put(Type.ENUM, generateEnumTypeHandler());
+  fieldTypeProcessors.put(Type.MAP, generateMapTypeHandler());
+  fieldTypeProcessors.put(Type.BYTES, generateBytesTypeHandler());
+  fieldTypeProcessors.put(Type.FIXED, generateFixedTypeHandler());
+  return Collections.unmodifiableMap(fieldTypeProcessors);
+}
 
-  private static JsonToAvroFieldProcessor generateBooleanTypeHandler() {
-return new JsonToAvroFieldProcessor() {
+private
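
For illustration, here is a minimal standalone sketch of the three-way dispatch being reviewed above: a customized "logicalType" property first, then the Avro LogicalType, then the plain Schema.Type. The Processor interface and the two maps are simplified stand-ins for the PR's processor hierarchy; only the Avro Schema accessors (getProp, getLogicalType, getType) are real API.

```java
import org.apache.avro.LogicalType;
import org.apache.avro.Schema;

import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

public class ProcessorLookupSketch {
  // Stand-in for the JsonToAvroFieldProcessor hierarchy in the PR.
  interface Processor { }

  // In the PR these are unmodifiable maps built once; left empty here for brevity.
  private static final Map<String, Processor> LOGICAL_TYPE_PROCESSORS = new HashMap<>();
  private static final Map<Schema.Type, Processor> TYPE_PROCESSORS = new EnumMap<>(Schema.Type.class);

  static Processor getProcessorForSchema(Schema schema) {
    Processor processor;
    // 1. A customized "logicalType" property wins, even when Avro itself
    //    did not recognize it as a built-in logical type.
    String customizedLogicalType = schema.getProp("logicalType");
    LogicalType logicalType = schema.getLogicalType();
    if (customizedLogicalType != null && !customizedLogicalType.isEmpty()) {
      processor = LOGICAL_TYPE_PROCESSORS.get(customizedLogicalType);
    } else if (logicalType != null) {
      // 2. Otherwise use the logical type Avro parsed from the schema.
      processor = LOGICAL_TYPE_PROCESSORS.get(logicalType.getName());
    } else {
      // 3. Finally, dispatch on the plain Avro type (STRING, INT, ...).
      processor = TYPE_PROCESSORS.get(schema.getType());
    }
    if (processor == null) {
      throw new IllegalArgumentException("JsonConverter cannot handle type: " + schema.getType());
    }
    return processor;
  }
}
```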

Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11539:
URL: https://github.com/apache/hudi/pull/11539#issuecomment-2221830778

   
   ## CI report:
   
   * dac29c7e89201f0ced6d394bf6fd4a5c0622167b UNKNOWN
   * c14015c3618d231bc439c0a4fb14ce2dff32de00 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24803)
 
   * c4ccc63e66957214447210afa8365e08a2548ea8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7978][DOCS] Add a note on field ordering in partitioned by clause of create sql [hudi]

2024-07-10 Thread via GitHub


codope commented on code in PR #11610:
URL: https://github.com/apache/hudi/pull/11610#discussion_r1673266828


##
website/versioned_docs/version-0.15.0/sql_ddl.md:
##
@@ -67,7 +67,10 @@ PARTITIONED BY (dt);
 ```
 
 :::note
-You can also create a table partitioned by multiple fields by supplying 
comma-separated field names. For, e.g., "partitioned by dt, hh"
+You can also create a table partitioned by multiple fields by supplying 
comma-separated field names.
+When creating a table partitioned by multiple fields, ensure that you specify 
the columns in the `PARTITIONED BY` clause
+in the same order as they appear in the `CREATE TABLE` schema. For example, 
for the above table, the partition fields
+should be specified as `PARTITIONED BY (dt, hh)`.

Review Comment:
   I could not find the sql ddl docs in earlier versions, but later realized this was part of the quickstart itself. Tested back till 0.10, which errors out; the other versions create the table but insert data into an incorrect partition. Updated the PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7961] Optimizing upsert partitioner for prepped write operations (#11581)

2024-07-10 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 01d905ff46b [HUDI-7961] Optimizing upsert partitioner for prepped 
write operations (#11581)
01d905ff46b is described below

commit 01d905ff46bceb735ac6d3fa2960ec34908dfcd2
Author: Sivabalan Narayanan 
AuthorDate: Wed Jul 10 18:18:07 2024 -0700

[HUDI-7961] Optimizing upsert partitioner for prepped write operations 
(#11581)
---
 .../table/action/commit/BaseSparkCommitActionExecutor.java   |  2 +-
 .../commit/SparkInsertOverwriteCommitActionExecutor.java |  2 +-
 .../table/action/commit/SparkInsertOverwritePartitioner.java |  5 +++--
 .../apache/hudi/table/action/commit/UpsertPartitioner.java   | 10 --
 .../deltacommit/BaseSparkDeltaCommitActionExecutor.java  |  2 +-
 .../deltacommit/SparkUpsertDeltaCommitPartitioner.java   |  5 +++--
 .../hudi/table/action/commit/TestUpsertPartitioner.java  | 12 ++--
 .../org/apache/hudi/common/model/WriteOperationType.java |  4 
 8 files changed, 27 insertions(+), 15 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
index 32e4824b8b8..36902a8c3f2 100644
--- 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
+++ 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
@@ -370,7 +370,7 @@ public abstract class BaseSparkCommitActionExecutor 
extends
 if (profile == null) {
   throw new HoodieUpsertException("Need workload profile to construct the 
upsert partitioner.");
 }
-return new UpsertPartitioner<>(profile, context, table, config);
+return new UpsertPartitioner<>(profile, context, table, config, 
operationType);
   }
 
   public Partitioner getInsertPartitioner(WorkloadProfile profile) {
diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java
index ac84475bfa4..63342989c79 100644
--- 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java
+++ 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java
@@ -71,7 +71,7 @@ public class SparkInsertOverwriteCommitActionExecutor
   protected Partitioner getPartitioner(WorkloadProfile profile) {
 return table.getStorageLayout().layoutPartitionerClass()
 .map(c -> getLayoutPartitioner(profile, c))
-.orElseGet(() -> new SparkInsertOverwritePartitioner(profile, context, 
table, config));
+.orElseGet(() -> new SparkInsertOverwritePartitioner(profile, context, 
table, config, operationType));
   }
 
   @Override
diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwritePartitioner.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwritePartitioner.java
index cdf2bcd0345..d2cef9250e6 100644
--- 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwritePartitioner.java
+++ 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwritePartitioner.java
@@ -20,6 +20,7 @@ package org.apache.hudi.table.action.commit;
 
 import org.apache.hudi.common.engine.HoodieEngineContext;
 import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.WriteOperationType;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.table.HoodieTable;
 import org.apache.hudi.table.WorkloadProfile;
@@ -38,8 +39,8 @@ public class SparkInsertOverwritePartitioner extends 
UpsertPartitioner {
   private static final Logger LOG = 
LoggerFactory.getLogger(SparkInsertOverwritePartitioner.class);
 
   public SparkInsertOverwritePartitioner(WorkloadProfile profile, 
HoodieEngineContext context, HoodieTable table,
- HoodieWriteConfig config) {
-super(profile, context, table, config);
+ HoodieWriteConfig config, 
WriteOperationType operationType) {
+super(profile, context, table, config, operationType);
   }
 
   @Override
diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.ja

Re: [PR] [HUDI-7961] Optimizing upsert partitioner for prepped write operations [hudi]

2024-07-10 Thread via GitHub


codope merged PR #11581:
URL: https://github.com/apache/hudi/pull/11581


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Exception while using HoodieStreamer protobuf data from Kafka [hudi]

2024-07-10 Thread via GitHub


gauravg1977 commented on issue #11598:
URL: https://github.com/apache/hudi/issues/11598#issuecomment-2221804280

   Please let me know if I am missing any details in describing the issue or if 
there is something basic that I am missing here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7974] Create empty clean commit at a cadence and make it configurable [hudi]

2024-07-10 Thread via GitHub


suryaprasanna commented on code in PR #11605:
URL: https://github.com/apache/hudi/pull/11605#discussion_r1673243602


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##
@@ -97,6 +99,64 @@ private boolean needsCleaning(CleaningTriggerStrategy 
strategy) {
 }
   }
 
+  private HoodieCleanerPlan getEmptyCleanerPlan(Option 
earliestInstant, CleanPlanner planner) throws IOException {
+LOG.info("Nothing to clean here. It is already clean");
+Option instantVal = 
getEarliestCommitToRetainToCreateEmptyCleanPlan(earliestInstant);
+HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder();
+instantVal.map(x -> 
cleanBuilder.setPolicy(config.getCleanerPolicy().name())
+.setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION)
+.setEarliestInstantToRetain(new HoodieActionInstant(x.getTimestamp(), 
x.getAction(), x.getState().name()))
+
.setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp())
+
).orElse(cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()));
+return cleanBuilder.build();
+  }
+
+  /**
+   * There is nothing to clean so this method will create an empty clean plan. 
But there is chance of optimizing
+   * the subsequent cleaner calls. Consider this scenarios in incremental 
cleaner mode,
+   * If clean timeline is empty or no clean commits were created for a while 
then every clean call will have to
+   * scan all the partitions, by creating an empty clean commit to update 
earliestCommitToRetain instant value,
+   * incremental clean policy does not have to look for file changes in all 
the partitions, rather it will look
+   * for partitions that are modified in last x hours. This value is 
configured through MAX_DURATION_TO_CREATE_EMPTY_CLEAN.

Review Comment:
   > Can we do force clean in the rewrite jobs
   
   I think this may not work, since we need to have restore capabilities on the dataset for 24 hours.
   
   > store the last earliestCommitToRetain separately in the metadata
   
   Generally speaking, the clean's earliestCommitToRetain value and the dataset's checkpoint information should be stored in a separate location, maybe hoodie.properties or something like a Hudi metaserver. Do you think we should create a separate partition in the metadata table for storing such key-value configs?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7974] Create empty clean commit at a cadence and make it configurable [hudi]

2024-07-10 Thread via GitHub


suryaprasanna commented on code in PR #11605:
URL: https://github.com/apache/hudi/pull/11605#discussion_r1673243602


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##
@@ -97,6 +99,64 @@ private boolean needsCleaning(CleaningTriggerStrategy 
strategy) {
 }
   }
 
+  private HoodieCleanerPlan getEmptyCleanerPlan(Option 
earliestInstant, CleanPlanner planner) throws IOException {
+LOG.info("Nothing to clean here. It is already clean");
+Option instantVal = 
getEarliestCommitToRetainToCreateEmptyCleanPlan(earliestInstant);
+HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder();
+instantVal.map(x -> 
cleanBuilder.setPolicy(config.getCleanerPolicy().name())
+.setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION)
+.setEarliestInstantToRetain(new HoodieActionInstant(x.getTimestamp(), 
x.getAction(), x.getState().name()))
+
.setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp())
+
).orElse(cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()));
+return cleanBuilder.build();
+  }
+
+  /**
+   * There is nothing to clean so this method will create an empty clean plan. 
But there is chance of optimizing
+   * the subsequent cleaner calls. Consider this scenarios in incremental 
cleaner mode,
+   * If clean timeline is empty or no clean commits were created for a while 
then every clean call will have to
+   * scan all the partitions, by creating an empty clean commit to update 
earliestCommitToRetain instant value,
+   * incremental clean policy does not have to look for file changes in all 
the partitions, rather it will look
+   * for partitions that are modified in last x hours. This value is 
configured through MAX_DURATION_TO_CREATE_EMPTY_CLEAN.

Review Comment:
   > Can we do force clean in the rewrite jobs
   
   I think this may not work for our use case, since we need to have restore capabilities on the dataset for up to 24 hours.
   
   > store the last earliestCommitToRetain separately in the metadata
   
   Generally speaking, the clean's earliestCommitToRetain value and the dataset's checkpoint information should be stored in a separate location, maybe hoodie.properties or something like a Hudi metaserver. Do you think we should create a separate partition in the metadata table for storing such key-value configs? That might solve the file scanning issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7974] Create empty clean commit at a cadence and make it configurable [hudi]

2024-07-10 Thread via GitHub


suryaprasanna commented on code in PR #11605:
URL: https://github.com/apache/hudi/pull/11605#discussion_r1673243602


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##
@@ -97,6 +99,64 @@ private boolean needsCleaning(CleaningTriggerStrategy 
strategy) {
 }
   }
 
+  private HoodieCleanerPlan getEmptyCleanerPlan(Option 
earliestInstant, CleanPlanner planner) throws IOException {
+LOG.info("Nothing to clean here. It is already clean");
+Option instantVal = 
getEarliestCommitToRetainToCreateEmptyCleanPlan(earliestInstant);
+HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder();
+instantVal.map(x -> 
cleanBuilder.setPolicy(config.getCleanerPolicy().name())
+.setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION)
+.setEarliestInstantToRetain(new HoodieActionInstant(x.getTimestamp(), 
x.getAction(), x.getState().name()))
+
.setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp())
+
).orElse(cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()));
+return cleanBuilder.build();
+  }
+
+  /**
+   * There is nothing to clean so this method will create an empty clean plan. 
But there is chance of optimizing
+   * the subsequent cleaner calls. Consider this scenarios in incremental 
cleaner mode,
+   * If clean timeline is empty or no clean commits were created for a while 
then every clean call will have to
+   * scan all the partitions, by creating an empty clean commit to update 
earliestCommitToRetain instant value,
+   * incremental clean policy does not have to look for file changes in all 
the partitions, rather it will look
+   * for partitions that are modified in last x hours. This value is 
configured through MAX_DURATION_TO_CREATE_EMPTY_CLEAN.

Review Comment:
   > Can we do force clean in the rewrite jobs
   
   I think this may not work for our use case, since we need to have restore capabilities on the dataset for up to 24 hours.
   
   > store the last earliestCommitToRetain separately in the metadata
   
   Generally speaking, the clean's earliestCommitToRetain value and the dataset's checkpoint information should be stored in a separate location, maybe hoodie.properties or something like a Hudi metaserver. Do you think we should create a separate partition in the metadata table for storing such key-value configs?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7975] Provide an API to create empty commit [hudi]

2024-07-10 Thread via GitHub


suryaprasanna commented on PR #11606:
URL: https://github.com/apache/hudi/pull/11606#issuecomment-2221771756

   > should we think through a way to improve the incremental query performance on the timeline instead of these tricky changes?
   
   The reason for creating an empty commit is to trigger table service operations like rollback, clean and archival. We have noticed that users do not call any Hudi APIs when there is no data to ingest, and our internal table services cannot handle all of those cases. Another reason is that we need this when data is ingested from multiple sources and each writer tracks its own checkpoint in the commit. In that setup, if one writer makes frequent writes and another writes less frequently, the checkpoints stored by the second writer can be archived. So we need a better way to store that checkpoint information as well.
   This change helps us both in triggering table services and in copying checkpoints.
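   As a rough sketch, the flow could look like the snippet below from a Spark writer. It assumes SparkRDDWriteClient's current startCommit/commit signatures, which should be verified against the Hudi version in use; the checkpoint map content is purely illustrative.

```java
import org.apache.hudi.client.SparkRDDWriteClient;
import org.apache.hudi.client.WriteStatus;
import org.apache.hudi.common.util.Option;

import org.apache.spark.api.java.JavaSparkContext;

import java.util.Collections;
import java.util.Map;

public class EmptyCommitSketch {
  // Advance the timeline without writing data, so clean/archival can trigger
  // and checkpoint metadata is carried forward into a newer instant.
  public static void createEmptyCommit(SparkRDDWriteClient<?> writeClient, JavaSparkContext jsc) {
    String instantTime = writeClient.startCommit();
    // Hypothetical checkpoint payload; in practice each writer would put its own here.
    Map<String, String> checkpointMeta = Collections.singletonMap("checkpoint", "source-1:offset-42");
    // No write statuses: nothing is written, only the commit metadata lands.
    writeClient.commit(instantTime, jsc.<WriteStatus>emptyRDD(), Option.of(checkpointMeta));
  }
}
```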


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7974] Create empty clean commit at a cadence and make it configurable [hudi]

2024-07-10 Thread via GitHub


suryaprasanna commented on code in PR #11605:
URL: https://github.com/apache/hudi/pull/11605#discussion_r1673243602


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##
@@ -97,6 +99,64 @@ private boolean needsCleaning(CleaningTriggerStrategy 
strategy) {
 }
   }
 
+  private HoodieCleanerPlan getEmptyCleanerPlan(Option 
earliestInstant, CleanPlanner planner) throws IOException {
+LOG.info("Nothing to clean here. It is already clean");
+Option instantVal = 
getEarliestCommitToRetainToCreateEmptyCleanPlan(earliestInstant);
+HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder();
+instantVal.map(x -> 
cleanBuilder.setPolicy(config.getCleanerPolicy().name())
+.setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION)
+.setEarliestInstantToRetain(new HoodieActionInstant(x.getTimestamp(), 
x.getAction(), x.getState().name()))
+
.setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp())
+
).orElse(cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()));
+return cleanBuilder.build();
+  }
+
+  /**
+   * There is nothing to clean so this method will create an empty clean plan. 
But there is chance of optimizing
+   * the subsequent cleaner calls. Consider this scenarios in incremental 
cleaner mode,
+   * If clean timeline is empty or no clean commits were created for a while 
then every clean call will have to
+   * scan all the partitions, by creating an empty clean commit to update 
earliestCommitToRetain instant value,
+   * incremental clean policy does not have to look for file changes in all 
the partitions, rather it will look
+   * for partitions that are modified in last x hours. This value is 
configured through MAX_DURATION_TO_CREATE_EMPTY_CLEAN.

Review Comment:
   > Can we do force clean in the rewrite jobs
   
   I think this may not work, since we need to have restore capabilities on the dataset for 24 hours.
   
   > store the last earliestCommitToRetain separately in the metadata
   
   Generally speaking, the clean's earliestCommitToRetain value and the dataset's checkpoint information should be stored in a separate location, maybe hoodie.properties or something like a Hudi metaserver. Do you think we should create a separate partition in the metadata table for storing such key-value configs?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]

2024-07-10 Thread via GitHub


danny0405 commented on PR #11611:
URL: https://github.com/apache/hudi/pull/11611#issuecomment-2221763683

   Does the 15MB come from real practice?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]Hudi Failed to read MARKERS file [hudi]

2024-07-10 Thread via GitHub


danny0405 commented on issue #6900:
URL: https://github.com/apache/hudi/issues/6900#issuecomment-2221758340

   > Could not read commit details from 
hdfs://hacluster/user/kylin/flink/data/streaming_rdss_rcsp_lab/2024062815382133
   
   Is this a real file on storage? Did you check the integrity of it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-10 Thread via GitHub


danny0405 commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221755217

   > Did you guys sync up on that? If not, can we sync up and make forward progress.
   
   I thought we had synced up and reached consensus in the last stand-up meeting, no?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] When querying a Hudi partitioned table with Hive SQL, if the partition field is not the table's last column, the partition value is injected at the partition column's position in the parsed Parquet data even though it was not selected, shifting later columns and causing type-conversion errors [hudi]

2024-07-10 Thread via GitHub


danny0405 commented on issue #11609:
URL: https://github.com/apache/hudi/issues/11609#issuecomment-2221753169

   @xicm Can you help with this case?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7974] Create empty clean commit at a cadence and make it configurable [hudi]

2024-07-10 Thread via GitHub


danny0405 commented on code in PR #11605:
URL: https://github.com/apache/hudi/pull/11605#discussion_r1673232619


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##
@@ -97,6 +99,64 @@ private boolean needsCleaning(CleaningTriggerStrategy 
strategy) {
 }
   }
 
+  private HoodieCleanerPlan getEmptyCleanerPlan(Option 
earliestInstant, CleanPlanner planner) throws IOException {
+LOG.info("Nothing to clean here. It is already clean");
+Option instantVal = 
getEarliestCommitToRetainToCreateEmptyCleanPlan(earliestInstant);
+HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder();
+instantVal.map(x -> 
cleanBuilder.setPolicy(config.getCleanerPolicy().name())
+.setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION)
+.setEarliestInstantToRetain(new HoodieActionInstant(x.getTimestamp(), 
x.getAction(), x.getState().name()))
+
.setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp())
+
).orElse(cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()));
+return cleanBuilder.build();
+  }
+
+  /**
+   * There is nothing to clean so this method will create an empty clean plan. 
But there is chance of optimizing
+   * the subsequent cleaner calls. Consider this scenarios in incremental 
cleaner mode,
+   * If clean timeline is empty or no clean commits were created for a while 
then every clean call will have to
+   * scan all the partitions, by creating an empty clean commit to update 
earliestCommitToRetain instant value,
+   * incremental clean policy does not have to look for file changes in all 
the partitions, rather it will look
+   * for partitions that are modified in last x hours. This value is 
configured through MAX_DURATION_TO_CREATE_EMPTY_CLEAN.

Review Comment:
   > Also, we cannot disable cleans for such datasets, since those datasets can be onboarded to clustering and can do ad-hoc rewrites on some partitions. So cleaning has to be enabled all the time, and when cleaning is enabled it scans all the partitions. So we thought of storing the earliestCommitToRetain value somewhere and using it to track the commits it has scanned.
   
   Can we do a force clean in the rewrite jobs, or store the last `earliestCommitToRetain` separately in the metadata, instead of empty clean commits? As you mentioned, most of the clean commits would just be useless and empty for this special inserts scenario. I kind of think we should come up with a more general solution for the append-only use case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-7977) Improve bucket index partitioner algorithm

2024-07-10 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7977.

Resolution: Fixed

Fixed via master branch: 16ee6bc4e329e6ecd7887039cb5216aecf571e8c

> Improve bucket index partitioner algorithm
> -
>
> Key: HUDI-7977
> URL: https://issues.apache.org/jira/browse/HUDI-7977
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> Improve the {{BucketIndexUtil}} partitionIndex algorithm so that the data is evenly
> distributed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7977) Improve bucket index partitioner algorithm

2024-07-10 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7977:
-
Summary: Improve bucket index partitioner algorithm  (was: improve bucket
index paritioner)

> Improve bucket index partitioner algorithm
> -
>
> Key: HUDI-7977
> URL: https://issues.apache.org/jira/browse/HUDI-7977
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> Improve the {{BucketIndexUtil}} partitionIndex algorithm so that the data is evenly
> distributed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated: [HUDI-7977] Improve bucket index partitioner algorithm (#11608)

2024-07-10 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 16ee6bc4e32 [HUDI-7977] Improve bucket index partitioner algorithm 
(#11608)
16ee6bc4e32 is described below

commit 16ee6bc4e329e6ecd7887039cb5216aecf571e8c
Author: KnightChess <981159...@qq.com>
AuthorDate: Thu Jul 11 08:11:48 2024 +0800

[HUDI-7977] Improve bucket index partitioner algorithm (#11608)
---
 .../hudi/common/util/hash/BucketIndexUtil.java |  26 +---
 .../hudi/common/util/hash/TestBucketIndexUtil.java | 152 +
 2 files changed, 157 insertions(+), 21 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/util/hash/BucketIndexUtil.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/util/hash/BucketIndexUtil.java
index adfdd4540d8..ea3f6a2a12c 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/util/hash/BucketIndexUtil.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/util/hash/BucketIndexUtil.java
@@ -36,26 +36,10 @@ public class BucketIndexUtil {
* @return The partition index of this bucket.
*/
   public static Functions.Function2 
getPartitionIndexFunc(int bucketNum, int parallelism) {
-if (parallelism < bucketNum) {
-  return (partition, curBucket) -> {
-int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) / 
parallelism * bucketNum;
-int globalIndex = partitionIndex + curBucket;
-return globalIndex % parallelism;
-  };
-} else {
-  if (parallelism % bucketNum == 0) {
-return (partition, curBucket) -> {
-  int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) / 
(parallelism / bucketNum) * bucketNum;
-  int globalIndex = partitionIndex + curBucket;
-  return globalIndex % parallelism;
-};
-  } else {
-return (partition, curBucket) -> {
-  int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) / 
(parallelism / bucketNum + 1) * bucketNum;
-  int globalIndex = partitionIndex + curBucket;
-  return globalIndex % parallelism;
-};
-  }
-}
+return (partition, curBucket) -> {
+  int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) % 
parallelism * bucketNum;
+  int globalIndex = partitionIndex + curBucket;
+  return globalIndex % parallelism;
+};
   }
 }
diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/util/hash/TestBucketIndexUtil.java
 
b/hudi-common/src/test/java/org/apache/hudi/common/util/hash/TestBucketIndexUtil.java
new file mode 100644
index 000..91c0da003f4
--- /dev/null
+++ 
b/hudi-common/src/test/java/org/apache/hudi/common/util/hash/TestBucketIndexUtil.java
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.util.hash;
+
+import org.apache.hudi.common.util.Functions;
+
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Stream;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+public class TestBucketIndexUtil {
+
+  private static Stream partitionParams() {
+List argsList = new ArrayList<>();
+argsList.add(Arguments.of(10, 5, true));
+argsList.add(Arguments.of(20, 5, true));
+argsList.add(Arguments.of(21, 5, true));
+argsList.add(Arguments.of(40, 5, true));
+argsList.add(Arguments.of(41, 5, true));
+argsList.add(Arguments.of(100, 5, true));
+argsList.add(Arguments.of(101, 5, true));
+argsList.add(Arguments.of(20, 100, true));
+argsList.add(Arguments.of(21, 100, true));
+argsList.add(Arguments.of(100, 100, true));
+argsList.add(Arguments.of(101, 100, true));
+argsList.add(Arguments.of(200, 100, true));
+argsList.add(Arguments.of(201, 100, tr

Re: [PR] [HUDI-7977] Improve bucket index partitioner algorithm [hudi]

2024-07-10 Thread via GitHub


danny0405 merged PR #11608:
URL: https://github.com/apache/hudi/pull/11608


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch asf-site updated: fix: update home page title (#11530)

2024-07-10 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new cbeff1cb9ca fix: update home page title (#11530)
cbeff1cb9ca is described below

commit cbeff1cb9ca3b7327cebfe6a032e354568ce72c5
Author: pintusoliya <37680791+pintusol...@users.noreply.github.com>
AuthorDate: Thu Jul 11 04:15:29 2024 +0530

fix: update home page title (#11530)

fix: merge conflicts

fix: title
---
 website/src/pages/index.js|  7 +++
 website/src/theme/LayoutHead/index.js |  7 ---
 website/src/theme/LayoutHead/useTitleFormatter.js | 14 ++
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/website/src/pages/index.js b/website/src/pages/index.js
index ed6f4f9d050..faa5eefbcd4 100644
--- a/website/src/pages/index.js
+++ b/website/src/pages/index.js
@@ -24,10 +24,9 @@ export default function Home() {
   const { siteConfig } = useDocusaurusContext();
   return (
 
+  title={`Apache Hudi | An Open Source Data Lake Platform`}
+  shouldShowOnlyTitle={true}
+  description="Description will go into a meta tag in <head />">
   
   
   
diff --git a/website/src/theme/LayoutHead/index.js 
b/website/src/theme/LayoutHead/index.js
index d0841677f5b..7c3abb4d2de 100644
--- a/website/src/theme/LayoutHead/index.js
+++ b/website/src/theme/LayoutHead/index.js
@@ -12,10 +12,10 @@ import SearchMetadata from '@theme/SearchMetadata';
 import Seo from '@theme/Seo';
 import {
   DEFAULT_SEARCH_TAG,
-  useTitleFormatter,
   useAlternatePageUtils,
   useThemeConfig,
 } from '@docusaurus/theme-common';
+import { useTitleFormatter } from './useTitleFormatter';
 import {useLocation} from '@docusaurus/router'; // Useful for SEO
 // See 
https://developers.google.com/search/docs/advanced/crawling/localized-versions
 // See https://github.com/facebook/docusaurus/issues/3317
@@ -82,9 +82,10 @@ export default function LayoutHead(props) {
 i18n: {currentLocale, localeConfigs},
   } = useDocusaurusContext();
   const {metadata, image: defaultImage} = useThemeConfig();
-  const {title, description, image, keywords, searchMetadata} = props;
+  const {title, description, image, keywords, searchMetadata, 
shouldShowOnlyTitle} = props;
   const faviconUrl = useBaseUrl(favicon);
-  const pageTitle = useTitleFormatter(title); // See 
https://github.com/facebook/docusaurus/issues/3317#issuecomment-754661855
+  const pageTitle = useTitleFormatter(title, shouldShowOnlyTitle); // See 
https://github.com/facebook/docusaurus/issues/3317#issuecomment-754661855
+
   // const htmlLang = currentLocale.split('-')[0];
 
   const htmlLang = currentLocale; // should we allow the user to override 
htmlLang with localeConfig?
diff --git a/website/src/theme/LayoutHead/useTitleFormatter.js 
b/website/src/theme/LayoutHead/useTitleFormatter.js
new file mode 100644
index 000..ee628fc29cc
--- /dev/null
+++ b/website/src/theme/LayoutHead/useTitleFormatter.js
@@ -0,0 +1,14 @@
+import useDocusaurusContext from '@docusaurus/useDocusaurusContext';
+
+export const useTitleFormatter = (title, shouldShowOnlyTitle)=> {
+const {siteConfig} = useDocusaurusContext();
+const {title: siteTitle, titleDelimiter} = siteConfig;
+
+if (shouldShowOnlyTitle && title && title.trim().length) {
+return title.trim();
+}
+
+return title && title.trim().length
+? `${title.trim()} ${titleDelimiter} ${siteTitle}`
+: siteTitle;
+};



Re: [PR] [DOCS] fix: update home page title [hudi]

2024-07-10 Thread via GitHub


bhasudha merged PR #11530:
URL: https://github.com/apache/hudi/pull/11530


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11611:
URL: https://github.com/apache/hudi/pull/11611#issuecomment-2221642367

   
   ## CI report:
   
   * 3016b3dc7dff77b95cb349b5ef5b6ba436fa05b4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24809)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7977] Improve bucket index partitioner [hudi]

2024-07-10 Thread via GitHub


CTTY commented on code in PR #11608:
URL: https://github.com/apache/hudi/pull/11608#discussion_r1673121749


##
hudi-common/src/main/java/org/apache/hudi/common/util/hash/BucketIndexUtil.java:
##
@@ -36,26 +36,10 @@ public class BucketIndexUtil {
* @return The partition index of this bucket.
*/
   public static Functions.Function2 
getPartitionIndexFunc(int bucketNum, int parallelism) {
-if (parallelism < bucketNum) {
-  return (partition, curBucket) -> {
-int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) / 
parallelism * bucketNum;
-int globalIndex = partitionIndex + curBucket;
-return globalIndex % parallelism;
-  };
-} else {
-  if (parallelism % bucketNum == 0) {
-return (partition, curBucket) -> {
-  int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) / 
(parallelism / bucketNum) * bucketNum;
-  int globalIndex = partitionIndex + curBucket;
-  return globalIndex % parallelism;
-};
-  } else {
-return (partition, curBucket) -> {
-  int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) / 
(parallelism / bucketNum + 1) * bucketNum;
-  int globalIndex = partitionIndex + curBucket;
-  return globalIndex % parallelism;
-};
-  }
-}
+return (partition, curBucket) -> {
+  int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) % 
parallelism * bucketNum;
+  int globalIndex = partitionIndex + curBucket;
+  return globalIndex % parallelism;
+};

Review Comment:
   nit: Can we update the comment to reflect the logic of this new algorithm?
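   For reference, a self-contained sketch that replays the merged arithmetic and tallies how many buckets land on each task, which is handy for eyeballing the evenness the new test asserts. The partition names and sizes are arbitrary.

```java
import java.util.Arrays;
import java.util.TreeMap;

public class BucketDistributionSketch {
  public static void main(String[] args) {
    int bucketNum = 8;
    int parallelism = 5;
    TreeMap<Integer, Integer> taskLoad = new TreeMap<>();
    for (String partition : Arrays.asList("2024/07/01", "2024/07/02", "2024/07/03")) {
      for (int curBucket = 0; curBucket < bucketNum; curBucket++) {
        // Same arithmetic as the merged BucketIndexUtil#getPartitionIndexFunc:
        // hash the partition to a base offset, then spread its buckets from there.
        int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum;
        int globalIndex = partitionIndex + curBucket;
        taskLoad.merge(globalIndex % parallelism, 1, Integer::sum);
      }
    }
    // Each partition's buckets wrap around all tasks, so per-task counts
    // differ by at most one per partition.
    taskLoad.forEach((task, count) -> System.out.println("task " + task + " -> " + count + " buckets"));
  }
}
```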



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-10 Thread via GitHub


nsivabalan commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221624233

   @danny0405 @lokeshj1703 : are we aligned on any optimizations we need on top of the patch? For e.g., we were discussing the replace commit timeline, right? Did you guys sync up on that? If not, can we sync up and make forward progress.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7974] Create empty clean commit at a cadence and make it configurable [hudi]

2024-07-10 Thread via GitHub


suryaprasanna commented on code in PR #11605:
URL: https://github.com/apache/hudi/pull/11605#discussion_r1673110615


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##
@@ -97,6 +99,64 @@ private boolean needsCleaning(CleaningTriggerStrategy 
strategy) {
 }
   }
 
+  private HoodieCleanerPlan getEmptyCleanerPlan(Option 
earliestInstant, CleanPlanner planner) throws IOException {
+LOG.info("Nothing to clean here. It is already clean");
+Option instantVal = 
getEarliestCommitToRetainToCreateEmptyCleanPlan(earliestInstant);
+HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder();
+instantVal.map(x -> 
cleanBuilder.setPolicy(config.getCleanerPolicy().name())
+.setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION)
+.setEarliestInstantToRetain(new HoodieActionInstant(x.getTimestamp(), 
x.getAction(), x.getState().name()))
+
.setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp())
+
).orElse(cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()));
+return cleanBuilder.build();
+  }
+
+  /**
+   * There is nothing to clean so this method will create an empty clean plan. 
But there is chance of optimizing
+   * the subsequent cleaner calls. Consider this scenarios in incremental 
cleaner mode,
+   * If clean timeline is empty or no clean commits were created for a while 
then every clean call will have to
+   * scan all the partitions, by creating an empty clean commit to update 
earliestCommitToRetain instant value,
+   * incremental clean policy does not have to look for file changes in all 
the partitions, rather it will look
+   * for partitions that are modified in last x hours. This value is 
configured through MAX_DURATION_TO_CREATE_EMPTY_CLEAN.

Review Comment:
   @danny0405 
   The use case we are trying to solve here: for a dataset that does not receive upserts and does not use small-file handling, cleaning runs for every commit and scans all the partitions every time. Also, we cannot disable cleans for such datasets, since those datasets can be onboarded to clustering and can do ad-hoc rewrites on some partitions. So cleaning has to be enabled all the time, and when it is enabled it scans all the partitions. So we thought of storing the earliestCommitToRetain value somewhere and using it to track the commits it has already scanned. That way every clean will only scan the partitions that were touched in a commit, since earliestCommitToRetain progresses commit by commit when the incremental clean policy is enabled. This reduces a full scan of all partitions to scanning one or two partitions; see the sketch below.
   CC @nsivabalan 
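   To make the cadence concrete, a rough sketch of the trigger is below. maxDurationToCreateEmptyClean mirrors the MAX_DURATION_TO_CREATE_EMPTY_CLEAN config mentioned in the javadoc; the method name and time source are illustrative, not the PR's actual API.

```java
import java.time.Duration;
import java.time.Instant;

public class EmptyCleanCadenceSketch {
  // Decide whether to write an empty clean commit purely to advance
  // earliestCommitToRetain, so incremental cleaning keeps a recent baseline
  // instead of re-scanning every partition. Illustrative only.
  static boolean shouldCreateEmptyClean(Instant lastCleanTime, Duration maxDurationToCreateEmptyClean) {
    return lastCleanTime == null
        || Duration.between(lastCleanTime, Instant.now()).compareTo(maxDurationToCreateEmptyClean) > 0;
  }

  public static void main(String[] args) {
    Instant lastClean = Instant.now().minus(Duration.ofHours(30));
    // With a 24h cadence, a 30h-old last clean triggers an empty clean plan.
    System.out.println(shouldCreateEmptyClean(lastClean, Duration.ofHours(24)));
  }
}
```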



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DOCS] fix: update home page title [hudi]

2024-07-10 Thread via GitHub


xushiyan commented on PR #11530:
URL: https://github.com/apache/hudi/pull/11530#issuecomment-2221566110

   @pintusoliya Looks good. Can you fix the CI failure, please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11611:
URL: https://github.com/apache/hudi/pull/11611#issuecomment-2221551975

   
   ## CI report:
   
   * 3016b3dc7dff77b95cb349b5ef5b6ba436fa05b4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24809)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7961] Optimizing upsert partitioner for prepped write operations [hudi]

2024-07-10 Thread via GitHub


nsivabalan commented on code in PR #11581:
URL: https://github.com/apache/hudi/pull/11581#discussion_r1673013661


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java:
##
@@ -86,16 +87,21 @@ public class UpsertPartitioner extends 
SparkHoodiePartitioner {
   private HashMap bucketInfoMap;
 
   protected final HoodieWriteConfig config;
+  private final WriteOperationType operationType;
 
   public UpsertPartitioner(WorkloadProfile profile, HoodieEngineContext 
context, HoodieTable table,
-  HoodieWriteConfig config) {
+   HoodieWriteConfig config, WriteOperationType 
operationType) {
 super(profile, table);
 updateLocationToBucket = new HashMap<>();
 partitionPathToInsertBucketInfos = new HashMap<>();
 bucketInfoMap = new HashMap<>();
 this.config = config;
+this.operationType = operationType;
 assignUpdates(profile);
-assignInserts(profile, context);
+long totalInserts = 
profile.getInputPartitionPathStatMap().values().stream().mapToLong(stat -> 
stat.getNumInserts()).sum();
+if (!WriteOperationType.isPreppedWriteOperation(operationType) || 
totalInserts > 0) { // skip if its prepped write operation. or if totalInserts 
= 0.
+  assignInserts(profile, context);

Review Comment:
   Yes, that is the main purpose of a prepped write operation: we expect the location to already be set on the records, so we do not invoke tag location, and hence assignInserts() is unnecessary overhead and a no-op; see the condensed sketch below.
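   Condensed into one predicate, the guard reads roughly as below (names taken from the diff; shown only to spell out the reasoning, not as new code for the PR).

```java
import org.apache.hudi.common.model.WriteOperationType;
import org.apache.hudi.table.WorkloadProfile;

public class PreppedGuardSketch {
  // Prepped records arrive already tagged with their file locations, so
  // insert-bucket assignment is skipped unless the workload profile
  // actually reports inserts.
  static boolean shouldAssignInserts(WorkloadProfile profile, WriteOperationType operationType) {
    long totalInserts = profile.getInputPartitionPathStatMap().values().stream()
        .mapToLong(stat -> stat.getNumInserts())
        .sum();
    return !WriteOperationType.isPreppedWriteOperation(operationType) || totalInserts > 0;
  }
}
```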



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7961] Optimizing upsert partitioner for prepped write operations [hudi]

2024-07-10 Thread via GitHub


nsivabalan commented on code in PR #11581:
URL: https://github.com/apache/hudi/pull/11581#discussion_r1673012152


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwritePartitioner.java:
##
@@ -38,8 +39,8 @@ public class SparkInsertOverwritePartitioner extends 
UpsertPartitioner {
   private static final Logger LOG = 
LoggerFactory.getLogger(SparkInsertOverwritePartitioner.class);
 
   public SparkInsertOverwritePartitioner(WorkloadProfile profile, 
HoodieEngineContext context, HoodieTable table,
- HoodieWriteConfig config) {
-super(profile, context, table, config);
+ HoodieWriteConfig config, 
WriteOperationType operationType) {

Review Comment:
   Yes. Neither HoodieTable nor HoodieWriteConfig carries the write operation type, so I had to add an extra argument.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11611:
URL: https://github.com/apache/hudi/pull/11611#issuecomment-2221430587

   
   ## CI report:
   
   * 3016b3dc7dff77b95cb349b5ef5b6ba436fa05b4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7979) Fix out of the box defaults with spillable memory configs

2024-07-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7979:
-
Labels: pull-request-available  (was: )

> Fix out of the box defaults with spillable memory configs 
> --
>
> Key: HUDI-7979
> URL: https://issues.apache.org/jira/browse/HUDI-7979
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> Looks like we are very conservative wrt the memory configs used for the spillable-map-based FSV.
>
> For e.g., we are only allocating 15Mb out of the box to file groups when using the spillable-map-based FSV:
> public long getMaxMemoryForFileGroupMap() {
>   long totalMemory = getLong(SPILLABLE_MEMORY);
>   return totalMemory - getMaxMemoryForPendingCompaction() - getMaxMemoryForBootstrapBaseFile();
> }
>
> SPILLABLE_MEMORY default is 100Mb.
> getMaxMemoryForPendingCompaction = 80% of 100Mb.
> getMaxMemoryForBootstrapBaseFile = 5% of 100Mb.
> So, overall, out of the box we are allocating only 15Mb for getMaxMemoryForFileGroupMap.
> ref: 
> [https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-[…]/apache/hudi/common/table/view/FileSystemViewStorageConfig.java|https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java#L224]
> Wondering whether we even need 80% for the pending compaction tracker in our FSV. I am thinking of making it 15%, so that we can give more memory to the actual file groups. We may not have a lot of pending compactions for a given table.
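
Spelling the arithmetic out (fractions as stated in the description above; this is just the math, not Hudi code):

```java
public class SpillableMemorySplitSketch {
  public static void main(String[] args) {
    long spillableMemory = 100L * 1024 * 1024; // SPILLABLE_MEMORY default: 100Mb
    double pendingCompactionFraction = 0.80;   // current default share
    double bootstrapBaseFileFraction = 0.05;   // current default share

    long pendingCompaction = (long) (spillableMemory * pendingCompactionFraction);
    long bootstrapBaseFile = (long) (spillableMemory * bootstrapBaseFileFraction);
    long fileGroupMap = spillableMemory - pendingCompaction - bootstrapBaseFile;

    // Prints 15: only 15% of the budget is left for the file group map.
    System.out.println("file group map budget = " + fileGroupMap / (1024 * 1024) + "Mb");

    // With the proposed 0.15 pending-compaction fraction, the file group map
    // would get roughly 80Mb instead.
  }
}
```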



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7979] Adjusting defaults with spillable map memory [hudi]

2024-07-10 Thread via GitHub


nsivabalan opened a new pull request, #11611:
URL: https://github.com/apache/hudi/pull/11611

   ### Change Logs
   
   Adjusting defaults with spillable map memory. Reduced the 80Mb allocation 
for pending compaction to 15Mb. 
   
   ### Impact
   
   More memory for file groups in FSVs. 
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7979) Fix out of the box defaults with spillable memory configs

2024-07-10 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7979:
-

 Summary: Fix out of the box defaults with spillable memory configs 
 Key: HUDI-7979
 URL: https://issues.apache.org/jira/browse/HUDI-7979
 Project: Apache Hudi
  Issue Type: Improvement
  Components: reader-core, writer-core
Reporter: sivabalan narayanan


Looks like we are very conservative w.r.t. the memory configs used for the 
spillable-map-based FSV.

For example, we are only allocating 15 MB out of the box to file groups when 
using the spillable-map-based FSV:

  public long getMaxMemoryForFileGroupMap() {
    long totalMemory = getLong(SPILLABLE_MEMORY);
    return totalMemory - getMaxMemoryForPendingCompaction() -
        getMaxMemoryForBootstrapBaseFile();
  }

SPILLABLE_MEMORY default is 100 MB.
getMaxMemoryForPendingCompaction = 80% of 100 MB.
getMaxMemoryForBootstrapBaseFile = 5% of 100 MB.
So, overall, out of the box we are allocating only 15 MB for 
getMaxMemoryForFileGroupMap.
ref: 
[https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-[…]/apache/hudi/common/table/view/FileSystemViewStorageConfig.java|https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java#L224]
Wondering whether we even need 80% for the pending compaction tracker in our 
FSV. I am thinking of making it 15%, so that we can give more memory to the 
actual file groups. We may not have a lot of pending compactions for a given 
table. 
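
For a quick sanity check, here is a minimal runnable sketch of the arithmetic 
described above (the 80%/5% shares are quoted from this description, not 
re-verified against the current code):

```java
public class SpillableMemorySplit {
    public static void main(String[] args) {
        // Illustration only: reproduces the default split described above.
        long spillableMemory = 100L * 1024 * 1024;                // SPILLABLE_MEMORY default: 100 MB
        long pendingCompaction = (long) (spillableMemory * 0.80); // 80% of 100 MB = 80 MB
        long bootstrapBaseFile = (long) (spillableMemory * 0.05); // 5% of 100 MB = 5 MB
        long fileGroupMap = spillableMemory - pendingCompaction - bootstrapBaseFile;
        System.out.println(fileGroupMap / (1024 * 1024) + " MB left for file groups"); // prints 15
    }
}
```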



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221276113

   
   ## CI report:
   
   * 7e8e5336ab1f079d93ccaa0981e70798169532f1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24807)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7978][DOCS] Add a note on field ordering in partitioned by clause of create sql [hudi]

2024-07-10 Thread via GitHub


nsivabalan commented on code in PR #11610:
URL: https://github.com/apache/hudi/pull/11610#discussion_r1672779411


##
website/versioned_docs/version-0.15.0/sql_ddl.md:
##
@@ -67,7 +67,10 @@ PARTITIONED BY (dt);
 ```
 
 :::note
-You can also create a table partitioned by multiple fields by supplying 
comma-separated field names. For, e.g., "partitioned by dt, hh"
+You can also create a table partitioned by multiple fields by supplying 
comma-separated field names.
+When creating a table partitioned by multiple fields, ensure that you specify 
the columns in the `PARTITIONED BY` clause
+in the same order as they appear in the `CREATE TABLE` schema. For example, 
for the above table, the partition fields
+should be specified as `PARTITIONED BY (dt, hh)`.

Review Comment:
   Is it not an issue in older versions? Or do we have the sql_ddl page only 
from 0.14.x onwards?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11553:
URL: https://github.com/apache/hudi/pull/11553#issuecomment-2221181265

   
   ## CI report:
   
   * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN
   * 0960a9f7454d0b7daae0a78d1993af11b94cea53 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24808)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11553:
URL: https://github.com/apache/hudi/pull/11553#issuecomment-2221109881

   
   ## CI report:
   
   * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN
   * 77afe6dfa6dfa6f6180929e052d6ad75d8d3618b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24770)
 
   * 0960a9f7454d0b7daae0a78d1993af11b94cea53 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221109430

   
   ## CI report:
   
   * 4bb072faa37b2ea398144fbbc24deff966153cfa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24471)
 
   * 7e8e5336ab1f079d93ccaa0981e70798169532f1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24807)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221096095

   
   ## CI report:
   
   * 4bb072faa37b2ea398144fbbc24deff966153cfa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24471)
 
   * 7e8e5336ab1f079d93ccaa0981e70798169532f1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7978) Update docs for older versions to state that partitions should be ordered when creating multiple partitions

2024-07-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7978:
-
Labels: pull-request-available  (was: )

> Update docs for older versions to state that partitions should be ordered 
> when creating multiple partitions
> ---
>
> Key: HUDI-7978
> URL: https://issues.apache.org/jira/browse/HUDI-7978
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: docs
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7978][DOCS] Add a note on field ordering in partitioned by clause of create sql [hudi]

2024-07-10 Thread via GitHub


codope opened a new pull request, #11610:
URL: https://github.com/apache/hudi/pull/11610

   ### Change Logs
   
   If we specify partition fields in an order different from the order in the 
CREATE TABLE schema, the query does not fail, but the whole partitioning is 
incorrect. This has been an issue even in 0.15 and 0.14. We will fix it in 
HUDI-7964. Meanwhile, it's important that we update the docs. 
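   
   For illustration, a minimal sketch of what the documented note prescribes 
(the table and column names here are made up for the example, not taken from 
the PR):
   
   ```java
   import org.apache.spark.sql.SparkSession;
   
   public class PartitionOrderExample {
       public static void main(String[] args) {
           SparkSession spark = SparkSession.builder().getOrCreate();
           // PARTITIONED BY lists dt, hh in the same relative order as the schema.
           spark.sql("CREATE TABLE hudi_demo (id INT, name STRING, dt STRING, hh STRING) "
                   + "USING hudi "
                   + "PARTITIONED BY (dt, hh)");
           // PARTITIONED BY (hh, dt) -- an order different from the schema --
           // would not fail on affected versions but would mis-partition the
           // table (the bug tracked as HUDI-7964).
       }
   }
   ```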
   
   ### Impact
   
   Docs improvement.
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-10 Thread via GitHub


lokeshj1703 commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2221025516

   I have explained the scenario in the comments. PTAL @nbalajee @danny0405 
   I have also fixed the earliestCommitToNotArchive logic to take the last 
clean instant into account, and removed the separate handling in the cleaner 
timeline archival. The case where earliestCommitToNotArchive > last clean 
instant should not be possible, since this timestamp is generated by the 
cleaner, but I still added a check to take the minimum of the last clean 
instant and earliestCommitToNotArchive.
   cc @nsivabalan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader

2024-07-10 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7971:
--
Description: 
Let's ensure the 1.x reader is fully compatible with reading any of the 0.14.x 
to 0.16.x tables.

 

Readers :  1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 ** few write commits 
 ** Pending clustering
 ** Completed Clustering
 ** Failed writes with no rollbacks
 ** Insert overwrite table/partition
 ** Savepoint for Time-travel query

 * MOR
 ** Same as COW
 ** Pending and completed async compaction (with log-files and no base file)
 ** Custom Payloads (for MOR snapshot queries) (e.g., SQL Expression Payload)
 ** Rollback formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries

 # Non-Partitioned dataset
 # CDC Reads 
 # Incremental Reads
 # Time-travel query

 

What to test ?
 # Query Results Correctness
 # Performance : See the benefit of 
 # Partition Pruning
 # Metadata  table - col stats, RLI,

 

Corner Case Testing:

 
 # Schema Evolution with different file-groups having different generation of 
schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log files reading 

  was:
Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x 
tables 

 

Readers :  1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 * Pending clustering
 * Completed Clustering
 * Failed writes with no rollbacks
 * Insert overwrite table/partition
 * Savepoint for Time-travel query


 * MOR
 * Same as COW
 * Pending and completed async compaction (with log-files and no base file)
 * Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
 * Rollback formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries


 # Non-Partitioned dataset
 # CDC Reads 
 # Incremental Reads
 # Time-travel query

 

What to test ?
 # Query Results Correctness
 # Performance : See the benefit of 
 # Partition Pruning
 # Metadata  table - col stats, RLI,

 

Corner Case Testing:

 
 # Schema Evolution with different file-groups having different generation of 
schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log files reading 


> Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader 
> -
>
> Key: HUDI-7971
> URL: https://issues.apache.org/jira/browse/HUDI-7971
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 1.0.0
>
>
> Let's ensure the 1.x reader is fully compatible with reading any of the 
> 0.14.x to 0.16.x tables 
>  
> Readers :  1.x
>  # Spark SQL
>  # Spark Datasource
>  # Trino/Presto
>  # Hive
>  # Flink
> Writer: 0.16
> Table State:
>  * COW
>  ** few write commits 
>  ** Pending clustering
>  ** Completed Clustering
>  ** Failed writes with no rollbacks
>  ** Insert overwrite table/partition
>  ** Savepoint for Time-travel query
>  * MOR
>  ** Same as COW
>  ** Pending and completed async compaction (with log-files and no base file)
>  ** Custom Payloads (for MOR snapshot queries) (e.g., SQL Expression Payload)
>  ** Rollback formats - DELETE, rollback block
> Other knobs:
>  # Metadata enabled/disabled
>  # Column Stats enabled/disabled and data-skipping enabled/disabled
>  # RLI enabled with eq/IN queries
>  # Non-Partitioned dataset
>  # CDC Reads 
>  # Incremental Reads
>  # Time-travel query
>  
> What to test ?
>  # Query Results Correctness
>  # Performance : See the benefit of 
>  # Partition Pruning
>  # Metadata  table - col stats, RLI,
>  
> Corner Case Testing:
>  
>  # Schema Evolution with different file-groups having different generation of 
> schema
>  # Dynamic Partition Pruning
>  # Does Column Projection work correctly for log files reading 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi-rs) branch release-0.1.x updated: build: bump version to 0.1.0-rc1

2024-07-10 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch release-0.1.x
in repository https://gitbox.apache.org/repos/asf/hudi-rs.git


The following commit(s) were added to refs/heads/release-0.1.x by this push:
 new 973ae75  build: bump version to 0.1.0-rc1
973ae75 is described below

commit 973ae75da14c96c18d411c91e5c3d74448cbb88e
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Wed Jul 10 09:41:45 2024 -0500

build: bump version to 0.1.0-rc1
---
 Cargo.toml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Cargo.toml b/Cargo.toml
index 8259243..5a26c92 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -23,7 +23,7 @@ members = [
 resolver = "2"
 
 [workspace.package]
-version = "0.1.0"
+version = "0.1.0-rc1"
 edition = "2021"
 license = "Apache-2.0"
 rust-version = "1.75.0"



Re: [PR] [HUDI-7977] import bucket index partitioner [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11608:
URL: https://github.com/apache/hudi/pull/11608#issuecomment-2220692616

   
   ## CI report:
   
   * 3175c04ed10009da40aa6cb9d8e2b4432678d38f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24806)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Tracking issue for `hudi-rs` release 0.1.0 [hudi-rs]

2024-07-10 Thread via GitHub


xushiyan opened a new issue, #62:
URL: https://github.com/apache/hudi-rs/issues/62

   *This issue is for tracking tasks of releasing `hudi-rs` 0.1.0.*
   
   ## Tasks
   
   ### Issues
   
   - [ ] This issue is added to [the target 
milestone](https://github.com/apache/hudi-rs/milestone/1)
   - [ ] All remaining issues in the milestone should be closed
   
   > [!CAUTION]
   > Blockers to highlight
   
   - [ ] https://github.com/apache/hudi-rs/issues/41
   - [ ] https://github.com/apache/hudi-rs/issues/42
   
   ### GitHub
   
   - [ ] Bump version
   - [ ] Push release tag
   
   ### ASF
   
   - [ ] Create an ASF release
   - [ ] Upload artifacts to the SVN dist repo
   - [ ] Start VOTE in dev email list
   
   > [!CAUTION]
   > Proceed from here only after VOTE passes.
   
   ### Official release
   
   - [ ] Push the release git tag
   - [ ] Publish artifacts to SVN RELEASE branch
   - [ ] Send `ANNOUNCE` email to dev and user email lists
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Write contribution guide and document dev setup [hudi-rs]

2024-07-10 Thread via GitHub


xushiyan closed issue #44: Write contribution guide and document dev setup
URL: https://github.com/apache/hudi-rs/issues/44


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Hive SQL query on a Hudi partitioned table: when the partition field is the last column, the data parsed from Parquet gets the partition value inserted at the partition column's position, causing type-cast errors in the following columns [hudi]

2024-07-10 Thread via GitHub


liucongjy opened a new issue, #11609:
URL: https://github.com/apache/hudi/issues/11609

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   A clear and concise description of the problem.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Flink SQL: create the table and insert data
   CREATE CATALOG hoodie_catalog
 WITH (
   'type'='hudi',
   'catalog.path' = '/tmp/hudi',
   'hive.conf.dir' = '/opt/hive-2.3.9/conf',
   'mode'='hms'
 );
   USE CATALOG hoodie_catalog;
   create database hoodie_catalog.flink_hudi;
   CREATE CATALOG myhive WITH (
 'type' = 'hive',
 'hive-conf-dir' = '/opt/hive-2.3.9/conf'
   );
   CREATE TABLE hoodie_catalog.flink_hudi.TEST_COW(
   jllsh VARCHAR(100),
   syh VARCHAR(50) PRIMARY KEY NOT ENFORCED,
   jzlsh VARCHAR(40),
   grbsh VARCHAR(20),
   dalx VARCHAR(10),
   dzjkkkh VARCHAR(40),
   yzzh VARCHAR(40),
   yzbs VARCHAR(40),
   fqsj VARCHAR(10),
   dj DOUBLE,
   ze DOUBLE,
   zzsj TIMESTAMP
   )
   PARTITIONED BY (`fqsj`)
   WITH (
 'connector' = 'hudi',
 'path' = 'hdfs://bigdata01:8020/tmp/hudi/flink_hudi/TEST_COW',
 'table.type' = 'COPY_ON_WRITE',
 'precombine.field' = 'syh',
 'write.operation' = 'upsert',
 'hoodie.datasource.hive_sync.support_timestamp' = 'true'
   );
   insert into hoodie_catalog.flink_hudi.TEST_COW select 
jllsh,syh,jzlsh,grbsh,dalx,dzjkkkh,yzzh,yzbs,fqsj,dj,ze,zzsj from 
myhive.ods.prdata where pch='202407090102000' limit 100;
   
   2. Flink SQL: read the table; the data is read out normally 
   select * from hoodie_catalog.flink_hudi.EHR_ZYYZMX_PRE_COW;
   
   3. Use Hive SQL to read the data
   use flink_hudi;
   select dj,ze,zzsj from test_cow;
   
   
   **Expected behavior**
   Failed with exception 
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to 
org.apache.hadoop.io.DoubleWritable
   
   **Environment Description**
   
   * Hudi version : 0.14
   
   * Spark version : 3.3
   
   * Hive version : 2.3.9
   
   * Hadoop version : 3.0
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Debugging the Hive log output, I can see the data is read back as 
[2014-06-06,0.0,0.0]. But in my Hive query (select dj,ze,zzsj from test_cow;) 
the first column is dj, a double-typed field, yet what is read out here is the 
value of the partition field.
   
   
The class that logs the [2014-06-06,0.0,0.0] record in the Hive debug output is 
org.apache.hadoop.hive.ql.exec.ListSinkOperator, and the method invoked is:
   ```
   public void process(Object row, int tag) throws HiveException {
       try {
         LOG.info("row data: {}, tag: {}, class: {}", row, tag,
             row.getClass().getName());
         LOG.info("row object inspectors: {}", inputObjInspectors);
   
         ClassLoader classLoader = fetcher.getClass().getClassLoader();
         if (classLoader != null) {
           // Try to obtain the resource URLs
           try {
             Enumeration<URL> urls = classLoader.getResources(
                 fetcher.getClass().getName().replace('.', '/') + ".class");
             while (urls.hasMoreElements()) {
               URL url = urls.nextElement();
               LOG.info("Class " + fetcher.getClass().getName()
                   + " is loaded from: " + url);
             }
           } catch (Exception e) {
             e.printStackTrace();
           }
         } else {
           LOG.info("Class " + fetcher.getClass().getName()
               + " is loaded by the bootstrap class loader.");
         }
   
         res.add(fetcher.convert(row, inputObjInspectors[0]));
         numRows++;
         runTimeNumRows++;
       } catch (Exception e) {
         throw new HiveException(e);
       }
     }
   ```
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   2024-07-08T14:42:03,814 ERROR [62dfdf81-d91C-40f8-91a2-d84f761a3671 main] 
CliDriver: Failed with exception 
java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to 
org.apache.hadoop.io.DoubleWritable
   
java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to 
org.apache.hadoop.io.DoubleWritable
   at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:165)
   at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
   at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:257)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:407)
   at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:825)
   at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:763)
   at org.apache.hadoop.hive.cli.CliDriv

Re: [PR] build: bump version to 0.2.0 [hudi-rs]

2024-07-10 Thread via GitHub


codecov[bot] commented on PR #61:
URL: https://github.com/apache/hudi-rs/pull/61#issuecomment-2220625487

   ## 
[Codecov](https://app.codecov.io/gh/apache/hudi-rs/pull/61?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Project coverage is 87.19%. Comparing base 
[(`2bb004b`)](https://app.codecov.io/gh/apache/hudi-rs/commit/2bb004b48efb5624813671e38c890c6abff01712?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`1a6c604`)](https://app.codecov.io/gh/apache/hudi-rs/commit/1a6c60409f71a7a359ed60620e11f624684869fb?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   Additional details and impacted files
   
   
   ```diff
   @@   Coverage Diff   @@
   ## main  #61   +/-   ##
   ===
 Coverage   87.19%   87.19%   
   ===
 Files  13   13   
 Lines 687  687   
   ===
 Hits  599  599   
 Misses 88   88   
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/hudi-rs/pull/61?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] build: bump version to 0.2.0 [hudi-rs]

2024-07-10 Thread via GitHub


xushiyan opened a new pull request, #61:
URL: https://github.com/apache/hudi-rs/pull/61

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi-rs) branch release-0.1.x created (now 2bb004b)

2024-07-10 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a change to branch release-0.1.x
in repository https://gitbox.apache.org/repos/asf/hudi-rs.git


  at 2bb004b  build: add info for rust and python artifacts (#60)

No new revisions were added by this update.



[jira] [Updated] (HUDI-7978) Update docs for older versions to state that partitions should be ordered when creating multiple partitions

2024-07-10 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7978:
--
Component/s: docs

> Update docs for older versions to state that partitions should be ordered 
> when creating multiple partitions
> ---
>
> Key: HUDI-7978
> URL: https://issues.apache.org/jira/browse/HUDI-7978
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: docs
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7978) Update docs for older versions to state that partitions should be ordered when creating multiple partitions

2024-07-10 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-7978:
-

Assignee: Sagar Sumit

> Update docs for older versions to state that partitions should be ordered 
> when creating multiple partitions
> ---
>
> Key: HUDI-7978
> URL: https://issues.apache.org/jira/browse/HUDI-7978
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7978) Update docs for older versions to state that partitions should be ordered when creating multiple partitions

2024-07-10 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-7978:
-

 Summary: Update docs for older versions to state that partitions 
should be ordered when creating multiple partitions
 Key: HUDI-7978
 URL: https://issues.apache.org/jira/browse/HUDI-7978
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Sagar Sumit






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7977] import bucket index partitioner [hudi]

2024-07-10 Thread via GitHub


KnightChess commented on PR #11608:
URL: https://github.com/apache/hudi/pull/11608#issuecomment-2220529640

   @danny0405 @xicm cc:


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7977] import bucket index partitioner [hudi]

2024-07-10 Thread via GitHub


KnightChess closed pull request #11608: [HUDI-7977] import bucket index 
partitioner
URL: https://github.com/apache/hudi/pull/11608


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7977] import bucket index partitioner [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11608:
URL: https://github.com/apache/hudi/pull/11608#issuecomment-2220434167

   
   ## CI report:
   
   * 3175c04ed10009da40aa6cb9d8e2b4432678d38f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24806)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7977] import bucket index partitioner [hudi]

2024-07-10 Thread via GitHub


hudi-bot commented on PR #11608:
URL: https://github.com/apache/hudi/pull/11608#issuecomment-2220417516

   
   ## CI report:
   
   * 3175c04ed10009da40aa6cb9d8e2b4432678d38f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-7977] import bucket index partitioner [hudi]

2024-07-10 Thread via GitHub


KnightChess opened a new pull request, #11608:
URL: https://github.com/apache/hudi/pull/11608

   ### Change Logs
   
   As https://github.com/apache/hudi/pull/11578 describes, the current 
algorithm has drawbacks in some cases, but this new algorithm has drawbacks 
too, so we can design the unit tests based on practical considerations.
   
   old algorithm:
   https://github.com/apache/hudi/assets/20125927/101cfedf-3a63-4f28-be20-9572adc8f5a7
   
   new algorithm:
   
![image](https://github.com/apache/hudi/assets/20125927/c4637dbe-1e91-4322-b3aa-020545b7e023)
   
   Both of them perform badly in the case 
https://github.com/apache/hudi/pull/11578#issuecomment-2219339643: the 
partition values are discontinuous.
   parallelism = 10, bucketNumber = 5 and partition = ["2021-01-01", 
"2021-01-03"]
   old: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
   new: [2, 2, 2, 2, 2]
   
   parallelism = 20, bucketNumber = 5 and partition = ["2021-01-01", 
"2021-01-03"]
   old: [2, 2, 2, 2, 2]
   new: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
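   
   For context, here is a minimal sketch (explicitly NOT the actual 
BucketIndexUtil implementation) of the kind of (partition, bucket) -> task 
mapping being tuned here:
   
   ```java
   public class BucketTaskIndexSketch {
       // Hypothetical mapping: offset each partition by a stable hash so that
       // buckets of different partitions land on different task slots.
       // Discontinuous partition values can still collide, which is the bad
       // case discussed above.
       static int taskIndex(String partitionPath, int bucketId, int parallelism) {
           int partitionOffset = Math.floorMod(partitionPath.hashCode(), parallelism);
           return (partitionOffset + bucketId) % parallelism;
       }
   
       public static void main(String[] args) {
           int parallelism = 10, bucketNumber = 5;
           for (String p : new String[]{"2021-01-01", "2021-01-03"}) {
               for (int b = 0; b < bucketNumber; b++) {
                   System.out.println(p + "/bucket-" + b + " -> task "
                           + taskIndex(p, b, parallelism));
               }
           }
       }
   }
   ```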
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7977) improve bucket index partitioner

2024-07-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7977:
-
Labels: pull-request-available  (was: )

> improve bucket index partitioner
> ---
>
> Key: HUDI-7977
> URL: https://issues.apache.org/jira/browse/HUDI-7977
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> Improve the {{BucketIndexUtil}} partitionIndex algorithm so that the data is 
> evenly distributed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

