[hudi] branch asf-site updated: Updating 0.12.0 docs for known regression: (#6996)

bhavanisudha Wed, 19 Oct 2022 20:18:00 -0700

This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 664294894c Updating 0.12.0 docs for known regression: (#6996)
664294894c is described below

commit 664294894ce42d098ab63e1db59de576bc2d6a21
Author: Sivabalan Narayanan <n.siv...@gmail.com>
AuthorDate: Wed Oct 19 20:17:49 2022 -0700

    Updating 0.12.0 docs for known regression: (#6996)
---
 website/releases/release-0.12.0.md | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/website/releases/release-0.12.0.md b/website/releases/release-0.12.0.md
index a384681a8e..c76781d037 100644
--- a/website/releases/release-0.12.0.md
+++ b/website/releases/release-0.12.0.md
@@ -160,6 +160,43 @@ However, if you had intentionally named your partition as `default`, you can byp
 - Flink 1.14 will continue to be supported via `hudi-flink1.14-bundle`.
 - Flink 1.13 will continue to be supported via `hudi-flink1.13-bundle`.
 
+## Known Regressions:
+
+We discovered a regression in the Hudi 0.12.0 release related to Bloom
+Index metadata persisted within Parquet footers ([HUDI-4992](https://issues.apache.org/jira/browse/HUDI-4992)).
+
+The crux of the problem was that min/max statistics for the record keys were
+computed incorrectly during the (Spark-specific) [row-writing](https://hudi.apache.org/docs/next/configurations#hoodiedatasourcewriterowwriterenable)
+Bulk Insert operation. This affected the [Key Range Pruning flow](https://hudi.apache.org/docs/next/basic_configurations/#hoodiebloomindexprunebyranges)
+within the [Hoodie Bloom Index](https://hudi.apache.org/docs/next/faq/#how-do-i-configure-bloom-filter-when-bloomglobal_bloom-index-is-used)
+tagging sequence, so that updated records were incorrectly tagged
+as "inserts" rather than "updates", leading to duplicated records in the
+table.
+
+[PR#6883](https://github.com/apache/hudi/pull/6883), which addresses the problem, is incorporated into
+the Hudi 0.12.1 release.
+
+If all of the following apply to you:
+
+1. Using Spark as an execution engine
+2. Using Bulk Insert (with [row-writing](https://hudi.apache.org/docs/next/configurations#hoodiedatasourcewriterowwriterenable),
+   which is enabled *by default*)
+3. Using Bloom Index (with [range-pruning](https://hudi.apache.org/docs/next/basic_configurations/#hoodiebloomindexprunebyranges),
+   which is enabled *by default*) for "UPSERT" operations
+   - Note: The default index type is SIMPLE, so unless you have overridden the index type, you may not hit this issue.
+
+Please consider one of the following potential remediations to avoid
+getting duplicate records in your pipeline:
+
+- [Disabling Bloom Index range-pruning](https://hudi.apache.org/docs/next/basic_configurations/#hoodiebloomindexprunebyranges)
+  flow (this might
+  affect the performance of upsert operations)
+- Upgrading to 0.12.1
+- Making sure that the [fix](https://github.com/apache/hudi/pull/6883) is
+  included in your custom artifacts (if you build and use your own)
+
+We apologize for the inconvenience.
+
 ## Raw Release Notes
 
 The raw release notes are available [here](https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12351209).
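
As an illustration of the first remediation in the notes added above (disabling Bloom Index range-pruning), here is a minimal sketch of the writer options for an upsert. It is not part of the commit. The table name is a hypothetical placeholder, and the sketch assumes the index type has been overridden to BLOOM, since the notes say the default is SIMPLE. The key `hoodie.bloom.index.prune.by.ranges` is the setting that the range-pruning link in the notes points to.

```python
# Sketch only: Hudi writer options for an upsert with key range pruning disabled.
# "my_table" is a hypothetical placeholder, not a value from the commit.
upsert_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.operation": "upsert",
    # Assumption: the index type was overridden to BLOOM (the default is SIMPLE).
    "hoodie.index.type": "BLOOM",
    # Workaround from the notes: skip key range pruning during Bloom Index tagging.
    "hoodie.bloom.index.prune.by.ranges": "false",
}

# With a SparkSession, a DataFrame `df`, and a table path `base_path` available,
# the options would typically be passed to the writer like this:
# df.write.format("hudi").options(**upsert_options).mode("append").save(base_path)
```

As the notes point out, turning range pruning off may slow down upsert operations, so upgrading to 0.12.1 is likely the better option where possible.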
