Re: [I] [SUPPORT] insert into hudi table with columns specified(reordered and not in table schema order) throws exception [hudi]

2024-07-02 Thread via GitHub


KnightChess commented on issue #11552:
URL: https://github.com/apache/hudi/issues/11552#issuecomment-2205123555

   It looks like the Hoodie analysis does not implement handling of a 
user-specified column list. @leesf, are you working on the fix? I plan to take 
this up if you have not already.
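
   For context, a minimal repro sketch of the reported pattern (hypothetical table
   name and schema; the Hudi-specific Spark session configs are omitted):

   ```java
import org.apache.spark.sql.SparkSession;

public class InsertIntoReorderedColumnsRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-insert-into-repro")
        .master("local[1]")
        .getOrCreate();
    // Hypothetical Hudi table; for illustration only.
    spark.sql("CREATE TABLE t (id INT, name STRING, price DOUBLE) USING hudi "
        + "TBLPROPERTIES (primaryKey = 'id')");
    // Column list reordered relative to the table schema -- the case the
    // issue reports as failing during Hudi's SQL analysis.
    spark.sql("INSERT INTO t (name, id, price) VALUES ('a', 1, 10.0)");
    spark.stop();
  }
}
   ```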


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR][DO NOT MERGE] Create release branch for version 1.0.0-beta2 [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11558:
URL: https://github.com/apache/hudi/pull/11558#issuecomment-2205121244

   
   ## CI report:
   
   * bffdfa01172bdb9cf624930aff344dca1a075835 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11539:
URL: https://github.com/apache/hudi/pull/11539#issuecomment-2205121169

   
   ## CI report:
   
   * 24462930ca4dcf13f039fdebcd078c072367ac03 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24677)
 
   * 1692b6fd644f59d5b73eeb3f42504aacb8408641 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [MINOR][DO NOT MERGE] Create release branch for version 1.0.0-beta2 [hudi]

2024-07-02 Thread via GitHub


codope opened a new pull request, #11558:
URL: https://github.com/apache/hudi/pull/11558

   ### Change Logs
   
   Created this PR to run all tests on 1.0.0-beta2 branch and track any 
failures. Please **do not merge**.
   
   ### Impact
   
   For 1.0.0-beta2 release
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]

2024-07-02 Thread via GitHub


danny0405 commented on PR #11545:
URL: https://github.com/apache/hudi/pull/11545#issuecomment-2205110034

   Hi @balaji-varadarajan, can you confirm whether the file renaming is needed 
for 0.x release tables?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DO NOT MERGE] Create release branch for version 1.0.0-beta2 [hudi]

2024-07-02 Thread via GitHub


codope closed pull request #11557: [DO NOT MERGE] Create release branch for 
version 1.0.0-beta2
URL: https://github.com/apache/hudi/pull/11557


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch release-1.0.0-beta2 updated (dda09b4e11a -> bffdfa01172)

2024-07-02 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a change to branch release-1.0.0-beta2
in repository https://gitbox.apache.org/repos/asf/hudi.git


omit dda09b4e11a Ensure properties are copied when modifying schema (#11441)
omit ea2cb5c5d74 Create release branch for version 1.0.0-beta2
 add 8eed6aefbc4 [MINOR] Update release guide (#11433)
 add 3415a4a1a2c [HUDI-7819] Fix OptionsResolver#allowCommitOnEmptyBatch 
default value (#11370)
 add e38f78cedb9 [MINOR] Fix README for Slack link update and Linkedin 
badge (#11442)
 add 6f3539268f8 [HUDI-7854] Bump AWS SDK v2 version to 2.25.69 (#11426)
 add eb20273cf66 Ensure properties are copied when modifying schema (#11441)
 add 35d79270205 [MINOR] add `ad1happy2go` to github collaborators (#11447)
 add 8f0467b7229 remove tableconfig from filegroup reader params (#11449)
 add 35cdac0a645 [HUDI-7747] In MetaClient remove getBasePathV2() and 
return StoragePath from getBasePath() (#11385)
 add 6aea47a13f5 [HUDI-7671] Make Hudi timeline backward compatible (#11443)
 add 84381afe7aa [HUDI-7872] Recreate and sync glue and hive table when 
meta sync fails (#11451)
 add 9b77eb1f5c6 [HUDI-7879] Optimize the redundant creation of HoodieTable 
in DataSourceInternalWriterHelper and the unnecessary parameters in createTable 
within BaseHoodieWriteClien (#11456)
 add ec5244d5fe6 [HUDI-7880] Support extraMetadata in Spark SQL Insert Into 
(#11458)
 add b236396d877 [HUDI-7891] Fix 
HoodieActiveTimeline#deleteCompletedRollback missing check for Action type 
(#11462)
 add ed0295f8fbd [HUDI-7847] Infer record merge mode during table upgrade 
(#11439)
 add 662f0822bd4 [HUDI-7838] Remove the option hoodie.schema.cache.enable 
and always do the cache (#11444)
 add bf1df335442 [HUDI-7876] use properties to store log spill map configs 
for fg reader (#11455)
 add 9f0130442a5 [HUDI-7874] Fix Hudi being able to read 2-level structure 
(#11450)
 add 1ce97bae116 [MINOR][DNM] Test disabling new HFile reader (#11488)
 add 51c9c0e226a [HUDI-7906] Improve the parallelism deduce in rdd write 
(#11470)
 add 8a4bed03fa7 [HUDI-7849] Reduce time spent on running 
testFiltersInFileFormat (#11423)
 add 2cc45cc228a [HUDI-7881] Verify table base path as well for syncing 
table in bigquery metastore (#11460)
 add bb76de48e9f [HUDI-6508] Support compilation on Java 11 (#11479)
 add b0580ef56ca [HUDI-5956] Fix spark DAG ui when write (#11376)
 add 05a07cf76a0 [HUDI-7909] Add Comment to the FieldSchema returned by Aws 
Glue Client (#11474)
 add 7567eaef2c0 [HUDI-4123] Enchancing deltastreamer sql source tests 
(#6781)
 add dbc6ac50aec [HUDI-7395] Fix computation for metrics in 
HoodieMetadataMetrics (#10641)
 add 58b53f05980 [HUDI-4945] Add a test case for batch clean (#6845)
 add 7178fa6d497 [MINOR] Adding tests for streaming read mor with 
compaction (#6695)
 add 6192cfb0e95 [MINOR] Reduce logging volume (#11505)
 add c5ff6a2113f [MINOR] Removed useless checks from SqlBasedTransformers 
(#11499)
 add 4370178eb0b [HUDI-7927] Lazy init secondary view in FS view (#10652)
 add 3152e47876f [MINOR] Bump JUnit version to 5.8.2 (#11511)
 add 4b7e6e41573 [HUDI-7922] Add Hudi CLI bundle for Scala 2.13 (#11495)
 add 1c731769d60 [HUDI-7882] Picking RFC-78 for bridge release (#11515)
 add 7c2480f903b [HUDI-7403] Support Filter/Transformer to Hudi Exporter 
Utility (non-hudi export) (#11509)
 add ae1ee05ab8c [HUDI-7709] ClassCastException while reading the data 
using `TimestampBasedKeyGenerator` (#11501)
 add eb0725d1ef5 [HUDI-7877] Add record position to record index metadata 
payload (#11467)
 add c80b5596c1d [HUDI-7932] Fix Import ordering (#11524)
 add ea6a6145223 [HUDI-7763] Fix that multiple jmx reporter can exist if 
metadata enables (#11226)
 add 8e36fe91715 [HUDI-7924] Capture Latency and Failure Metrics For Hive 
Table recreation (#11498)
 add 63b7b15fff2 [MINOR] doap: don't nest multiple versions in a release 
(#11533)
 add 4c0fb6d951e [HUDI-7914] Use internal schema without metadata fields 
for delete-partitions OP (#11487)
 add 107cffca4af [HUDI-7908] HoodieFileGroupReader fails if preCombine and 
partition fields are the same (#11473)
 add 78d7ee7cba5 Closing fsview in HoodieTableFileIndex (#11497)
 add eeafa734a6c [HUDI-7903] Fix storage partition stats index to skip data 
(#11472)
 add 9f66ac0fb4b [HUDI-7942] Fix SmallFileAssignState#totalUnassigned 
naming error (#11544)
 add e942a27d446 [MINOR] Fix NPE after a clustering plan is finished 
(#11550)
 add 990c0c565b7 [HUDI-7933] Sync table in Glue/HMS if table base path is 
updated (#11529)
 add 9bc0164a387 [HUDI-7841] RLI and secondary index should consider only 
pruned partitions for file skipping (#11434)
 add a9a75f43a1c [HUDI-7941] add show_file_status procedure & 
run_rollback_inflight_tableservice 

(hudi) 01/01: Create release branch for version 1.0.0-beta2

2024-07-02 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch release-1.0.0-beta2
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit bffdfa01172bdb9cf624930aff344dca1a075835
Author: Sagar Sumit 
AuthorDate: Tue Jun 11 09:23:59 2024 -0700

Create release branch for version 1.0.0-beta2
---
 docker/hoodie/hadoop/base/pom.xml| 2 +-
 docker/hoodie/hadoop/base_java11/pom.xml | 2 +-
 docker/hoodie/hadoop/datanode/pom.xml| 2 +-
 docker/hoodie/hadoop/historyserver/pom.xml   | 2 +-
 docker/hoodie/hadoop/hive_base/pom.xml   | 2 +-
 docker/hoodie/hadoop/namenode/pom.xml| 2 +-
 docker/hoodie/hadoop/pom.xml | 2 +-
 docker/hoodie/hadoop/prestobase/pom.xml  | 2 +-
 docker/hoodie/hadoop/spark_base/pom.xml  | 2 +-
 docker/hoodie/hadoop/sparkadhoc/pom.xml  | 2 +-
 docker/hoodie/hadoop/sparkmaster/pom.xml | 2 +-
 docker/hoodie/hadoop/sparkworker/pom.xml | 2 +-
 docker/hoodie/hadoop/trinobase/pom.xml   | 2 +-
 docker/hoodie/hadoop/trinocoordinator/pom.xml| 2 +-
 docker/hoodie/hadoop/trinoworker/pom.xml | 2 +-
 hudi-aws/pom.xml | 4 ++--
 hudi-cli/pom.xml | 2 +-
 hudi-client/hudi-client-common/pom.xml   | 4 ++--
 hudi-client/hudi-flink-client/pom.xml| 4 ++--
 hudi-client/hudi-java-client/pom.xml | 4 ++--
 hudi-client/hudi-spark-client/pom.xml| 4 ++--
 hudi-client/pom.xml  | 2 +-
 hudi-common/pom.xml  | 2 +-
 hudi-examples/hudi-examples-common/pom.xml   | 2 +-
 hudi-examples/hudi-examples-flink/pom.xml| 2 +-
 hudi-examples/hudi-examples-java/pom.xml | 2 +-
 hudi-examples/hudi-examples-spark/pom.xml| 2 +-
 hudi-examples/pom.xml| 2 +-
 hudi-flink-datasource/hudi-flink/pom.xml | 4 ++--
 hudi-flink-datasource/hudi-flink1.14.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.15.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.16.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.17.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.18.x/pom.xml   | 4 ++--
 hudi-flink-datasource/pom.xml| 4 ++--
 hudi-gcp/pom.xml | 2 +-
 hudi-hadoop-common/pom.xml   | 2 +-
 hudi-hadoop-mr/pom.xml   | 2 +-
 hudi-integ-test/pom.xml  | 2 +-
 hudi-io/pom.xml  | 2 +-
 hudi-kafka-connect/pom.xml   | 4 ++--
 hudi-platform-service/hudi-metaserver/hudi-metaserver-client/pom.xml | 2 +-
 hudi-platform-service/hudi-metaserver/hudi-metaserver-server/pom.xml | 2 +-
 hudi-platform-service/hudi-metaserver/pom.xml| 4 ++--
 hudi-platform-service/pom.xml| 2 +-
 hudi-spark-datasource/hudi-spark-common/pom.xml  | 4 ++--
 hudi-spark-datasource/hudi-spark/pom.xml | 4 ++--
 hudi-spark-datasource/hudi-spark2-common/pom.xml | 2 +-
 hudi-spark-datasource/hudi-spark2/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3-common/pom.xml | 2 +-
 hudi-spark-datasource/hudi-spark3.0.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.1.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.2.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.2plus-common/pom.xml   | 2 +-
 hudi-spark-datasource/hudi-spark3.3.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.4.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.5.x/pom.xml| 4 ++--
 hudi-spark-datasource/pom.xml| 2 +-
 hudi-sync/hudi-adb-sync/pom.xml  | 2 +-
 hudi-sync/hudi-datahub-sync/pom.xml 

[PR] [DO NOT MERGE] Create release branch for version 1.0.0-beta2 [hudi]

2024-07-02 Thread via GitHub


codope opened a new pull request, #11557:
URL: https://github.com/apache/hudi/pull/11557

   ### Change Logs
   
   Created this PR to run all tests on 1.0.0-beta2 branch and track any 
failures. Please **do not merge**.
   
   ### Impact
   
   For 1.0.0-beta2 release.
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi-rs) branch main updated: refactor: improve error handling in storage module (#34)

2024-07-02 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/hudi-rs.git


The following commit(s) were added to refs/heads/main by this push:
 new 52a9245  refactor: improve error handling in storage module (#34)
52a9245 is described below

commit 52a924557ee18effadc02749ec7cdb1001ad6b58
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Tue Jul 2 22:18:26 2024 -0500

refactor: improve error handling in storage module (#34)
---
 crates/core/fixtures/leaf_dir/.gitkeep |   0
 crates/core/src/storage/mod.rs | 198 +
 crates/core/src/table/fs_view.rs   |  15 +--
 crates/core/src/table/mod.rs   |  11 +-
 crates/core/src/table/timeline.rs  |   4 +-
 python/hudi/_internal.pyi  |   2 +-
 python/hudi/table.py   |   2 +-
 python/tests/test_table_read.py|   6 +-
 8 files changed, 143 insertions(+), 95 deletions(-)

diff --git a/crates/core/fixtures/leaf_dir/.gitkeep 
b/crates/core/fixtures/leaf_dir/.gitkeep
new file mode 100644
index 000..e69de29
diff --git a/crates/core/src/storage/mod.rs b/crates/core/src/storage/mod.rs
index 0f09c05..43dd0e7 100644
--- a/crates/core/src/storage/mod.rs
+++ b/crates/core/src/storage/mod.rs
@@ -21,7 +21,8 @@ use std::collections::HashMap;
 use std::path::PathBuf;
 use std::sync::Arc;
 
-use anyhow::{anyhow, Result};
+use anyhow::{anyhow, Context, Result};
+use arrow::compute::concat_batches;
 use arrow::record_batch::RecordBatch;
 use async_recursion::async_recursion;
 use bytes::Bytes;
@@ -60,16 +61,21 @@ impl Storage {
 }
 }
 
-    #[allow(dead_code)]
-    pub async fn get_file_info(&self, relative_path: &str) -> FileInfo {
-        let obj_url = join_url_segments(&self.base_url, &[relative_path]).unwrap();
-        let obj_path = ObjPath::from_url_path(obj_url.path()).unwrap();
-        let meta = self.object_store.head(&obj_path).await.unwrap();
-        FileInfo {
-            uri: obj_url.to_string(),
-            name: obj_path.filename().unwrap().to_string(),
+    #[cfg(test)]
+    async fn get_file_info(&self, relative_path: &str) -> Result<FileInfo> {
+        let obj_url = join_url_segments(&self.base_url, &[relative_path])?;
+        let obj_path = ObjPath::from_url_path(obj_url.path())?;
+        let meta = self.object_store.head(&obj_path).await?;
+        let uri = obj_url.to_string();
+        let name = obj_path
+            .filename()
+            .ok_or(anyhow!("Failed to get file name for {}", obj_path))?
+            .to_string();
+        Ok(FileInfo {
+            uri,
+            name,
             size: meta.size,
-        }
+        })
     }
 
     pub async fn get_parquet_file_metadata(&self, relative_path: &str) -> Result<ParquetMetaData> {
@@ -79,79 +85,100 @@ impl Storage {
         let meta = obj_store.head(&obj_path).await?;
         let reader = ParquetObjectReader::new(obj_store, meta);
         let builder = ParquetRecordBatchStreamBuilder::new(reader).await?;
-        Ok(builder.metadata().as_ref().to_owned())
+        Ok(builder.metadata().as_ref().clone())
     }
 
-    pub async fn get_file_data(&self, relative_path: &str) -> Bytes {
-        let obj_url = join_url_segments(&self.base_url, &[relative_path]).unwrap();
-        let obj_path = ObjPath::from_url_path(obj_url.path()).unwrap();
-        let result = self.object_store.get(&obj_path).await.unwrap();
-        result.bytes().await.unwrap()
+    pub async fn get_file_data(&self, relative_path: &str) -> Result<Bytes> {
+        let obj_url = join_url_segments(&self.base_url, &[relative_path])?;
+        let obj_path = ObjPath::from_url_path(obj_url.path())?;
+        let result = self.object_store.get(&obj_path).await?;
+        let bytes = result.bytes().await?;
+        Ok(bytes)
     }
 
-    pub async fn get_parquet_file_data(&self, relative_path: &str) -> Vec<RecordBatch> {
-        let obj_url = join_url_segments(&self.base_url, &[relative_path]).unwrap();
-        let obj_path = ObjPath::from_url_path(obj_url.path()).unwrap();
+    pub async fn get_parquet_file_data(&self, relative_path: &str) -> Result<RecordBatch> {
+        let obj_url = join_url_segments(&self.base_url, &[relative_path])?;
+        let obj_path = ObjPath::from_url_path(obj_url.path())?;
         let obj_store = self.object_store.clone();
-        let meta = obj_store.head(&obj_path).await.unwrap();
+        let meta = obj_store.head(&obj_path).await?;
+
+        // read parquet
         let reader = ParquetObjectReader::new(obj_store, meta);
-        let stream = ParquetRecordBatchStreamBuilder::new(reader)
-            .await
-            .unwrap()
-            .build()
-            .unwrap();
-        stream
-            .collect::<Vec<_>>()
-            .await
-            .into_iter()
-            .map(|r| r.unwrap())
-            .collect()
+        let builder = ParquetRecordBatchStreamBuilder::new(reader).await?;
+        let schema = builder.schema().clone();
+        let mut stream = builder.build()?;
+        let mut batches = Vec::new();
+
+        while let Some(r) = 

Re: [PR] refactor: improve error handling in storage module [hudi-rs]

2024-07-02 Thread via GitHub


xushiyan merged PR #34:
URL: https://github.com/apache/hudi-rs/pull/34


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] refactor: improve error handling in storage module [hudi-rs]

2024-07-02 Thread via GitHub


codecov[bot] commented on PR #34:
URL: https://github.com/apache/hudi-rs/pull/34#issuecomment-2204996227

   ## 
[Codecov](https://app.codecov.io/gh/apache/hudi-rs/pull/34?dropdown=coverage=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache)
 Report
   Attention: Patch coverage is `88.75000%` with `9 lines` in your changes 
missing coverage. Please review.
   > Project coverage is 88.84%. Comparing base 
[(`199a25d`)](https://app.codecov.io/gh/apache/hudi-rs/commit/199a25d82ba09c0bedeb12430bf22299603209b2?dropdown=coverage=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache)
 to head 
[(`a3a9b87`)](https://app.codecov.io/gh/apache/hudi-rs/commit/a3a9b87e3a6cd26b19ce5d602e6bc746cc055e9f?dropdown=coverage=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache).
   
   | 
[Files](https://app.codecov.io/gh/apache/hudi-rs/pull/34?dropdown=coverage=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache)
 | Patch % | Lines |
   |---|---|---|
   | 
[crates/core/src/storage/mod.rs](https://app.codecov.io/gh/apache/hudi-rs/pull/34?src=pr=tree=crates%2Fcore%2Fsrc%2Fstorage%2Fmod.rs_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache#diff-Y3JhdGVzL2NvcmUvc3JjL3N0b3JhZ2UvbW9kLnJz)
 | 87.14% | [9 Missing :warning: 
](https://app.codecov.io/gh/apache/hudi-rs/pull/34?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache)
 |
   
   Additional details and impacted files
   
   
   ```diff
   @@Coverage Diff @@
   ## main  #34  +/-   ##
   ==
   - Coverage   90.63%   88.84%   -1.79% 
   ==
 Files  10   10  
 Lines 491  511  +20 
   ==
   + Hits  445  454   +9 
   - Misses 46   57  +11 
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/hudi-rs/pull/34?dropdown=coverage=pr=continue_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] refactor: improve error handling in storage module [hudi-rs]

2024-07-02 Thread via GitHub


xushiyan opened a new pull request, #34:
URL: https://github.com/apache/hudi-rs/pull/34

   Change return values to Result<> for APIs in the `storage` module for better 
error handling flows.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11556:
URL: https://github.com/apache/hudi/pull/11556#issuecomment-2204978524

   
   ## CI report:
   
   * 761085a6fa9cc6eeca493c1c116caea56b3693f8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24680)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11556:
URL: https://github.com/apache/hudi/pull/11556#issuecomment-2204918597

   
   ## CI report:
   
   * 761085a6fa9cc6eeca493c1c116caea56b3693f8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24680)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7883] Make Hudi timeline backward compatible (#11443) [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11464:
URL: https://github.com/apache/hudi/pull/11464#issuecomment-2204918367

   
   ## CI report:
   
   * 91aeccec3005287b1a3c4706048fd88062c90fcb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24679)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7883) Ensure 1.x commit instants are readable w/ 0.16.0

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7883:
-
Labels: pull-request-available  (was: )

> Ensure 1.x commit instants are readable w/ 0.16.0 
> --
>
> Key: HUDI-7883
> URL: https://issues.apache.org/jira/browse/HUDI-7883
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>
> Ensure 1.x commit instants are readable w/ 0.16.0 reader.
>  
> Maybe we need to migrate the HoodieInstant parsing logic to 0.16.0 in a 
> backwards-compatible manner, or it is already ported and we just need to write 
> tests and validate. 
> [https://github.com/apache/hudi/pull/9617] - contains some portion 
> (HoodieInstant changes and some method renames)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11556:
URL: https://github.com/apache/hudi/pull/11556#issuecomment-2204909852

   
   ## CI report:
   
   * 761085a6fa9cc6eeca493c1c116caea56b3693f8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11545:
URL: https://github.com/apache/hudi/pull/11545#issuecomment-2204909780

   
   ## CI report:
   
   * fe7aa032f4463035775029ad486ca73ea2d02ac0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24668)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7883] Make Hudi timeline backward compatible (#11443) [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11464:
URL: https://github.com/apache/hudi/pull/11464#issuecomment-2204909564

   
   ## CI report:
   
   * 7b98569040d4d34c16a7d0d6446708e9b741a3e4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24438)
 
   * 91aeccec3005287b1a3c4706048fd88062c90fcb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11545:
URL: https://github.com/apache/hudi/pull/11545#issuecomment-2204899621

   
   ## CI report:
   
   * fe7aa032f4463035775029ad486ca73ea2d02ac0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24668)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]

2024-07-02 Thread via GitHub


watermelon12138 commented on PR #11545:
URL: https://github.com/apache/hudi/pull/11545#issuecomment-2204884386

   @hudi-bot run azure
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7911) Enable cdc log for MOR table

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7911:

Status: Patch Available  (was: In Progress)

> Enable cdc log for MOR table
> 
>
> Key: HUDI-7911
> URL: https://issues.apache.org/jira/browse/HUDI-7911
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7911) Enable cdc log for MOR table

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7911:

Status: In Progress  (was: Open)

> Enable cdc log for MOR table
> 
>
> Key: HUDI-7911
> URL: https://issues.apache.org/jira/browse/HUDI-7911
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-07-02 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7507:
--
Status: In Progress  (was: Open)

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Krishen Bhan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
> Attachments: Flowchart (1).png, Flowchart.png
>
>
> *Scenarios:*
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is an ingestion commit and Job 1 is a compaction/log compaction, then 
> when Job 1 runs before Job 2 it can create a compaction plan for all instant 
> times (up to (x)) that doesn't include instant time (x-1). Later, Job 2 
> will create instant time (x-1), but timeline will be in a corrupted state 
> since compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
>  ** If the completed commit files include some sort of "checkpointing" with 
> another "downstream job" performing incremental reads on this dataset (such 
> as Hoodie Streamer/DeltaSync) then there may be incorrect behavior, such as 
> the incremental reader skipping some completed commits (that have a smaller 
> instant timestamp than latest completed commit but were created after).
> [Edit] I added a diagram to visualize the issue, specifically the second 
> scenario with clean
> !Flowchart (1).png!
> *Proposed approach:*
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
> Approach A
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately (A) has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan even 
> if it's an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this and would 
> require deprecating those APIs.
>  
> An alternate approach is to have every operation abort creating a .requested 
> file unless it has the latest timestamp. Specifically, for any instant type, 
> whenever an operation is about to create a .requested plan on timeline, it 
> should take the table lock and assert that there are no other instants on 
> timeline that are greater than it that could cause a conflict. If that 
> assertion fails, then throw a retry-able conflict resolution exception.
> Specifically, the following steps should be followed whenever any instant 
> (commit, table service, etc) is scheduled
> Approach B
>  # Acquire table lock. Assume that the desired instant time C and requested 
> file plan metadata have already been created, regardless of whether it was 
> before this step or right after acquiring the table lock.
>  # If there are any instants on the timeline that are greater than C 
> (regardless of their operation type or state) then release the table lock 
> and throw an exception
>  # Create requested plan on timeline (As usual)
>  # Release table lock
> Unlike (A), this approach (B) allows users to continue to use HUDI APIs where 
> caller can specify instant time (preventing the need from deprecating any 
> public API). It also allows the possibility of table service operations 
> computing their plan 
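
To make Approach B concrete, here is a minimal sketch of the guarded scheduling step, written against hypothetical lock/timeline helpers rather than the actual Hudi transaction manager or timeline APIs:

{code:java}
import java.util.NavigableSet;
import java.util.concurrent.locks.Lock;

class InstantSchedulingGuard {
  // Illustrative only: hypothetical helpers, not actual Hudi classes.
  void scheduleInstant(String instantTime,
                       NavigableSet<String> timelineInstantTimes,
                       Lock tableLock,
                       Runnable createRequestedFile) {
    tableLock.lock();
    try {
      // Step 2: abort if any instant with a greater timestamp already exists,
      // regardless of its operation type or state.
      if (!timelineInstantTimes.tailSet(instantTime, false).isEmpty()) {
        throw new IllegalStateException(
            "A newer instant exists on the timeline; retry with a fresh timestamp");
      }
      // Step 3: create the .requested plan while still holding the table lock.
      createRequestedFile.run();
    } finally {
      // Step 4: release the table lock.
      tableLock.unlock();
    }
  }
}
{code}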

[jira] [Updated] (HUDI-6416) Completion Markers for handling spark retries

2024-07-02 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-6416:
--
Status: In Progress  (was: Open)

> Completion Markers for handling spark retries
> -
>
> Key: HUDI-6416
> URL: https://issues.apache.org/jira/browse/HUDI-6416
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Balajee Nagasubramaniam
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> During Spark stage retries, the Spark driver may have all the information to 
> reconcile the commit and proceed with the next steps, while a stray executor may 
> still be writing to a data file and complete later (before the JVM exits). 
> Extra files left on the dataset, excluded from the reconcile-commit step, could 
> show up as a data quality issue (duplicate records) for query engines.
> This change brings completion markers, which try to prevent the dataset from 
> experiencing such data quality issues in these corner-case scenarios.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7921) Chase down memory leaks in Writeclient with MDT enabled

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7921:

Status: In Progress  (was: Open)

> Chase down memory leaks in Writeclient with MDT enabled
> ---
>
> Key: HUDI-7921
> URL: https://issues.apache.org/jira/browse/HUDI-7921
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We see OOMs when Deltastreamer is running continuously for days. We 
> suspect some memory leaks when the metadata table is enabled. Let's try to chase 
> down all of them and fix them. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7921) Chase down memory leaks in Writeclient with MDT enabled

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7921:

Status: Patch Available  (was: In Progress)

> Chase down memory leaks in Writeclient with MDT enabled
> ---
>
> Key: HUDI-7921
> URL: https://issues.apache.org/jira/browse/HUDI-7921
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We see OOMs when Deltastreamer is running continuously for days. We 
> suspect some memory leaks when the metadata table is enabled. Let's try to chase 
> down all of them and fix them. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7779) Guarding archival to not archive unintended commits

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7779:
---

Assignee: Lokesh Jain  (was: sivabalan narayanan)

> Guarding archival to not archive unintended commits
> ---
>
> Key: HUDI-7779
> URL: https://issues.apache.org/jira/browse/HUDI-7779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving
>Reporter: sivabalan narayanan
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> Archiving commits from the active timeline could lead to data consistency issues 
> on rare occasions. We should come up with proper guards to ensure 
> we do not make such unintended archivals. 
>  
> Major gap which we wanted to guard is:
> if someone disabled cleaner, archival should account for data consistency 
> issues and ensure it bails out.
> We have a base guarding condition, where archival will stop at the earliest 
> commit to retain based on the latest clean commit metadata. But there are a few 
> other scenarios that need to be accounted for. 
>  
> a. Keeping aside replace commits, lets dive into specifics for regular 
> commits and delta commits.
> Say the user configured clean commits to 4 and archival configs to 5 and 6. After 
> t10, the cleaner is supposed to clean up all file versions created at or before 
> t6. Say the cleaner did not run (for whatever reason) for the next 5 commits. 
>     Archival will certainly be guarded until earliest commit to retain based 
> on latest clean commits. 
> Corner case to consider: 
> A savepoint was added to say t3 and later removed. and still the cleaner was 
> never re-enabled. Even though archival would have been stopped at t3 (until 
> the savepoint is present), but once the savepoint is removed, if archival is executed, 
> it could archive commit t3. Which means, file versions tracked at t3 is still 
> not yet cleaned by the cleaner. 
> Reasoning: 
> We are good here w.r.t. data consistency. Until the cleaner runs next time, these 
> older file versions might be exposed to the end-user. But time travel query 
> is not intended for already cleaned up commits and hence this is not an 
> issue. None of snapshot, time travel query or incremental query will run into 
> issues as they are not supposed to poll for t3. 
> At any later point, if cleaner is re-enabled, it will take care of cleaning 
> up file versions tracked at t3 commit. Just that for interim period, some 
> older file versions might still be exposed to readers. 
>  
> b. The more tricky part is when replace commits are involved. Since replace 
> commit metadata in active timeline is what ensures the replaced file groups 
> are ignored for reads, before archiving the same, cleaner is expected to 
> clean them up fully. But are there chances that this could go wrong? 
> Corner case to consider: let's add onto the above scenario, where t3 has a 
> savepoint, and t4 is a replace commit which replaced file groups tracked in 
> t3. 
> Cleaner will skip cleaning up files tracked by t3(due to the presence of 
> savepoint), but will clean up t4, t5 and t6. So, earliest commit to retain 
> will be pointing to t6. And say savepoint for t3 is removed, but cleaner was 
> disabled. In this state of the timeline, if archival is executed, (since 
> t3.savepoint is removed), archival might archive t3 and t4.rc.  This could 
> lead to data duplicates as both replaced file groups and new file groups from 
> t4.rc would be exposed as valid file groups. 
>  
> In other words, if we were to summarize the different scenarios: 
> i. replaced file group is never cleaned up. 
>     - ECTR (earliest commit to retain) is less than this.rc and we are good. 
> ii. replaced file group is cleaned up. 
>     - ECTR is > this.rc and is good to archive.
> iii. tricky: ECTR moved ahead compared to this.rc, but due to savepoint, full 
> clean up did not happen.  After savepoint is removed, and when archival is 
> executed, we should avoid archiving the rc of interest. This is the gap we 
> don't account for as of now.
>  
> We have 3 options to go about to solve this.
> Option A: 
> Let Savepoint deletion flow take care of cleaning up the files its tracking. 
> cons:
> Savepoint's responsibility is not removing any data files, so from a single-
> responsibility standpoint, this may not be right. Also, this cleanup might 
> need to do what a clean planner might actually be doing. ie. build file 
> system view, understand if its supposed to be cleaned up already, and then 
> only clean up the files which are supposed to be cleaned up. For eg, if a 
> file group has only one file slice, it should not be cleaned up and scenarios 
> like this. 
>  
> Option B:
> Since archival is the one which might cause data 
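
The gap in scenario (iii) above boils down to a guard of roughly this shape; a minimal sketch with hypothetical inputs, not the actual Hudi archival code:

{code:java}
class ReplaceCommitArchivalGuard {
  // Illustrative only: archive a replacecommit only when the earliest commit to
  // retain (ECTR) has moved past it AND its replaced file groups are actually
  // cleaned, so a removed savepoint (scenario iii) cannot expose duplicates.
  boolean canArchiveReplaceCommit(String replaceCommitTime,
                                  String earliestCommitToRetain,
                                  boolean replacedFileGroupsFullyCleaned) {
    if (replaceCommitTime.compareTo(earliestCommitToRetain) >= 0) {
      return false; // scenario (i): ECTR has not moved past this commit yet
    }
    return replacedFileGroupsFullyCleaned; // (ii) ok to archive; (iii) stays blocked
  }
}
{code}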

[jira] [Updated] (HUDI-7503) Concurrent executions of table service plan should not corrupt dataset

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7503:

Status: In Progress  (was: Open)

> Concurrent executions of table service plan should not corrupt dataset
> --
>
> Key: HUDI-7503
> URL: https://issues.apache.org/jira/browse/HUDI-7503
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, table-service
>Reporter: Krishen Bhan
>Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> Some external workflow schedulers can accidentally (or) misbehave and 
> schedule duplicate executions of the same compaction plan. We need a way to 
> guard against this inside Hudi (vs user taking a lock externally). In such a 
> world, 2 instances of the job concurrently call 
> `org.apache.hudi.client.BaseHoodieTableServiceClient#compact` on the same 
> compaction instant. 
> This is since one writer might execute the instant and create an inflight, 
> while the other writer sees the inflight and tries to roll it back before 
> re-attempting to execute it (since it will assume said inflight was a 
> previously failed compaction attempt).
> This logic should be updated such that only one writer will actually execute 
> the compaction plan at a time (and the others will fail/abort).
> One approach is to use a transaction (base table lock) in conjunction with 
> heartbeating, to ensure that the writer triggers a heartbeat before executing 
> compaction, and any concurrent writers will use the heartbeat to check wether 
> the compaction is currently being executed by another writer. Specifically , 
> the compact API should execute the following steps
>  # Get the instant to compact C (as usual)
>  # Start a transaction
>  # Check if C has an active heartbeat; if so, finish the transaction and throw an 
> exception
>  # Start a heartbeat for C (this will implicitly re-start the heartbeat if it 
> has been started before by another job)
>  # Finish transaction
>  # Run the existing compact API logic on C 
>  # If execution succeeds, clean up the heartbeat file. If it fails, do nothing 
> (as the heartbeat will anyway be automatically expired later).
> Note that this approach only holds the table lock temporarily, when 
> checking/starting the heartbeat
> Also, this flow can be applied to execution of clean plans and other table 
> services
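
As an illustration of the proposed flow, a minimal sketch assuming a hypothetical heartbeat interface (not the actual Hudi heartbeat client API):

{code:java}
import java.util.concurrent.locks.Lock;

class CompactionExecutionGuard {
  // Hypothetical heartbeat abstraction; illustrative only.
  interface Heartbeat {
    boolean isActive(String instantTime);
    void start(String instantTime);
    void stop(String instantTime);
  }

  void compactWithGuard(String instantTime, Lock tableLock,
                        Heartbeat heartbeat, Runnable runCompaction) {
    // Steps 2-5: check/start the heartbeat inside a short transaction.
    tableLock.lock();
    try {
      if (heartbeat.isActive(instantTime)) {
        throw new IllegalStateException(
            "Compaction " + instantTime + " is already being executed by another writer");
      }
      heartbeat.start(instantTime);
    } finally {
      tableLock.unlock();
    }
    // Step 6: execute the plan outside the lock.
    runCompaction.run();
    // Step 7: clean up on success; heartbeat expiry covers the failure case.
    heartbeat.stop(instantTime);
  }
}
{code}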



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7903] Fix storage partition stats index to skip data [hudi]

2024-07-02 Thread via GitHub


yihua commented on PR #11472:
URL: https://github.com/apache/hudi/pull/11472#issuecomment-2204851228

   I've put up the fix here #11556.  Working on making it review-ready.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7945) Fix file pruning using PARTITION_STATS index in Spark

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7945:
-
Labels: pull-request-available  (was: )

> Fix file pruning using PARTITION_STATS index in Spark
> -
>
> Key: HUDI-7945
> URL: https://issues.apache.org/jira/browse/HUDI-7945
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0-beta2, 1.0.0
>
>
> The issue can be reproduced by 
> [https://github.com/apache/hudi/pull/11472#issuecomment-2199332859.]
> When there are more than one base files in a table partition, the 
> corresponding PARTITION_STATS index record in the metadata table contains 
> null as the file_path field in HoodieColumnRangeMetadata.
> {code:java}
> private static > HoodieColumnRangeMetadata 
> mergeRanges(HoodieColumnRangeMetadata one,
>   
> HoodieColumnRangeMetadata another) {
>   
> ValidationUtils.checkArgument(one.getColumnName().equals(another.getColumnName()),
>   "Column names should be the same for merging column ranges");
>   final T minValue = getMinValueForColumnRanges(one, another);
>   final T maxValue = getMaxValueForColumnRanges(one, another);
>   return HoodieColumnRangeMetadata.create(
>   null, one.getColumnName(), minValue, maxValue,
>   one.getNullCount() + another.getNullCount(),
>   one.getValueCount() + another.getValueCount(),
>   one.getTotalSize() + another.getTotalSize(),
>   one.getTotalUncompressedSize() + another.getTotalUncompressedSize());
> } 
> {code}
> The null causes NPE when loading the column stats per partition from 
> PARTITION_STATS index.  Also, the current implementation of 
> PartitionStatsIndexSupport assumes that the file_path field contains the 
> exact file name, and it does not work if the file path does not contain 
> null (even a stored list of file names does not work).  We have to 
> reimplement PartitionStatsIndexSupport so that it gives the pruned partitions 
> for further processing.
> {code:java}
> Caused by: java.lang.NullPointerException: element cannot be mapped to a null 
> key
>     at java.util.Objects.requireNonNull(Objects.java:228)
>     at java.util.stream.Collectors.lambda$groupingBy$45(Collectors.java:907)
>     at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
>     at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>     at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>     at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>     at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>     at java.util.Iterator.forEachRemaining(Iterator.java:116)
>     at 
> java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
>     at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
>     at 
> java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
>     at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>     at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>     at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
>     at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
>     at java.util.stream.AbstractTask.compute(AbstractTask.java:327)
>     at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
>     at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>     at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
>     at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
>     at 
> java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
>     at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>     at 
> org.apache.hudi.common.data.HoodieListPairData.groupByKey(HoodieListPairData.java:115)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.transpose(ColumnStatsIndexSupport.scala:253)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.$anonfun$loadTransposed$1(ColumnStatsIndexSupport.scala:149)
>     at 
> org.apache.hudi.HoodieCatalystUtils$.withPersistedData(HoodieCatalystUtils.scala:61)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.loadTransposed(ColumnStatsIndexSupport.scala:148)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.computeCandidateFileNames(ColumnStatsIndexSupport.scala:101)

[PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-02 Thread via GitHub


yihua opened a new pull request, #11556:
URL: https://github.com/apache/hudi/pull/11556

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi Metadata Compaction is not happeing [hudi]

2024-07-02 Thread via GitHub


Jason-liujc commented on issue #11535:
URL: https://github.com/apache/hudi/issues/11535#issuecomment-2204845455

   @danny0405 Ahh, gotcha, we do have an async cleaner that runs for our Hudi 
tables. 
   
   @ad1happy2go  I don't see any compaction on the metadata table since a given 
date (I believe that's when we moved Hudi cleaning from sync to async, based on 
Danny's comment). When I delete the metadata table and try to reinitialize it, I do 
see this error, which I believe lists the blocking instants:
   
   ```
   24/06/15 01:06:20 ip-10-0-157-87 WARN HoodieBackedTableMetadataWriter: 
Cannot initialize metadata table as operation(s) are in progress on the 
dataset: [[==>20240523221631416__commit__INFLIGHT__20240523224939000], 
[==>20240523225648799__commit__INFLIGHT__20240523232254000], 
[==>20240524111304660__commit__INFLIGHT__20240524142426000], 
[==>20240524235127638__commit__INFLIGHT__2024052500064], 
[==>20240525005114829__commit__INFLIGHT__20240525011802000], 
[==>20240525065356540__commit__INFLIGHT__20240525071004000], 
[==>20240525170219523__commit__INFLIGHT__20240525192315000], 
[==>20240527184608604__commit__INFLIGHT__20240527190327000], 
[==>20240528190417601__commit__INFLIGHT__20240528192418000], 
[==>20240529054718316__commit__INFLIGHT__20240529060542000], 
[==>20240530125710177__commit__INFLIGHT__20240531081522000], 
[==>20240530234238360__commit__INFLIGHT__20240530234726000], 
[==>20240531082713041__commit__REQUESTED__20240531082715000], 
[==>20240601164223688__commit__INFLIGHT__2024060
 1190853000], [==>20240602072248313__commit__INFLIGHT__20240603005951000], 
[==>20240603010859993__commit__INFLIGHT__20240603100305000], 
[==>20240604043334594__commit__INFLIGHT__20240604061732000], 
[==>20240605061406367__commit__REQUESTED__20240605061412000], 
[==>20240605063936872__commit__REQUESTED__20240605063943000], 
[==>20240605071904045__commit__REQUESTED__2024060507191], 
[==>20240605074456040__commit__REQUESTED__20240605074502000], 
[==>20240605082437667__commit__REQUESTED__20240605082443000], 
[==>20240605085008272__commit__REQUESTED__20240605085014000], 
[==>20240605123632368__commit__REQUESTED__20240605123638000], 
[==>20240605130201503__commit__REQUESTED__20240605130207000], 
[==>20240605134213113__commit__REQUESTED__20240605134219000], 
[==>20240605140741158__commit__REQUESTED__20240605140747000], 
[==>20240605144756228__commit__REQUESTED__20240605144802000], 
[==>20240605151313557__commit__REQUESTED__20240605151319000], 
[==>20240605195405678__commit__REQUESTED__202406051954110
 00], [==>20240605202017653__commit__REQUESTED__20240605202023000], 
[==>20240605205949232__commit__REQUESTED__20240605205955000], 
[==>20240605212536568__commit__REQUESTED__20240605212542000], 
[==>20240605220432089__commit__REQUESTED__20240605220438000], 
[==>20240606152537217__commit__INFLIGHT__20240607031027000], 
[==>20240606181110800__commit__INFLIGHT__2024060843000], 
[==>20240607112530977__commit__INFLIGHT__20240607212013000], 
[==>20240607213124841__commit__INFLIGHT__20240609024214000], 
[==>20240608001245366__commit__INFLIGHT__2024060904553], 
[==>20240609030620894__commit__INFLIGHT__2024060918031], 
[==>20240609181330488__commit__REQUESTED__20240609181336000], 
[==>20240609194304829__commit__INFLIGHT__20240611095337000], 
[==>20240611003906613__commit__INFLIGHT__20240611014341000], 
[==>20240611100258837__commit__INFLIGHT__20240612075536000], 
[==>20240611174425406__commit__INFLIGHT__20240611184626000], 
[==>20240612081821910__commit__INFLIGHT__20240612102427000], [==>2024061
 2204659323__commit__REQUESTED__20240612204705000], 
[==>20240613044301243__commit__INFLIGHT__20240613075101000], 
[==>20240613085334404__commit__INFLIGHT__20240613105718000], 
[==>20240613113055212__commit__REQUESTED__20240613113101000], 
[==>20240613122745696__commit__REQUESTED__20240613122751000], 
[==>20240614094542418__commit__REQUESTED__20240614094548000], 
[==>20240614172456990__commit__REQUESTED__20240614172503000], 
[==>20240614175526954__commit__REQUESTED__20240614175529000], 
[==>20240614181441857__commit__REQUESTED__20240614181444000], 
[==>20240614222012190__commit__REQUESTED__20240614222015000], 
[==>20240614225952031__commit__REQUESTED__20240614225954000], 
[==>20240614235545094__commit__REQUESTED__20240614235547000]]
   ```
   
   I guess my next questions are:
   
   1. Is there a way to run compaction of the metadata table asynchronously, 
without cleaning up commits, deleting the metadata table and recreating it again? 
That process is a bit expensive, and based on what Danny said, metadata table 
compaction still won't work going forward. 
   
   2. Also, if we just increase the 
`hoodie.metadata.max.deltacommits.when_pending` parameter to, say, 100 (see the 
sketch below), what type of performance hit would we expect it to take? Is it 
mostly at the S3 file listing level? 
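
   A minimal sketch of how that config could be raised on a Spark write, assuming 
the standard Hudi Spark datasource path; the table name, storage path, schema, and 
key/precombine fields below are made-up placeholders, not taken from this thread:
   
   ```scala
   // Hedged sketch: raising the pending-deltacommits threshold on a Hudi write.
   // Only the config key comes from the discussion above; everything else is an
   // illustrative assumption.
   import org.apache.spark.sql.{SaveMode, SparkSession}
   
   object RaisePendingDeltacommitsSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("hudi-metadata-config-sketch").getOrCreate()
       import spark.implicits._
   
       // A few made-up records to append; the schema is illustrative only.
       val updates = Seq((2, "a2", 11.0, 1001L)).toDF("id", "name", "price", "ts")
   
       updates.write.format("hudi")
         .option("hoodie.table.name", "example_table")
         .option("hoodie.datasource.write.recordkey.field", "id")
         .option("hoodie.datasource.write.precombine.field", "ts")
         // Tolerate more pending deltacommits on the data timeline before
         // metadata table compaction is held back.
         .option("hoodie.metadata.max.deltacommits.when_pending", "100")
         .mode(SaveMode.Append)
         .save("s3://bucket/warehouse/example_table")
   
       spark.stop()
     }
   }
   ```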
   



[PR] [RFC-79] Improving reliability of concurrent table service executions and rollbacks [hudi]

2024-07-02 Thread via GitHub


kbuci opened a new pull request, #11555:
URL: https://github.com/apache/hudi/pull/11555

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]

2024-07-02 Thread via GitHub


danny0405 commented on issue #11554:
URL: https://github.com/apache/hudi/issues/11554#issuecomment-2204769477

   We have drop partition command support, does that make sense to you?
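
   For illustration, a rough sketch of what that could look like from Spark SQL, 
assuming a table partitioned by a `dt` column; the table name and partition value 
are placeholders, not taken from this issue:
   
   ```scala
   // Hedged sketch: dropping one partition instead of rewriting the whole table
   // with INSERT_OVERWRITE. Table name and partition value are made up.
   import org.apache.spark.sql.SparkSession
   
   object DropPartitionSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("hudi-drop-partition-sketch").getOrCreate()
   
       // Removes only the named partition from the Hudi table.
       spark.sql("ALTER TABLE example_table DROP PARTITION (dt = '2024-06-01')")
   
       spark.stop()
     }
   }
   ```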





Re: [PR] [HUDI-7779] Guard archival on savepoint removal until cleaner is able to clean it up [hudi]

2024-07-02 Thread via GitHub


danny0405 commented on PR #11440:
URL: https://github.com/apache/hudi/pull/11440#issuecomment-2204767885

   > Whose value will either refer to first savepoint if first savepoint < 
earliest commit to retain.
   
   Does this mean we cannot archive beyond any savepoint now?
   
   > Last completed clean instant is required in the timeline to fetch earliest 
commit to not archive.
   
   I don't think this is reasonable at a high level: there are scenarios where 
people never enable the cleaning service, or the cleaning just gets stuck, and 
the halt of archival would impact ingestion performance.





(hudi) branch master updated: [HUDI-7926] Data skipping failure mode should be strict in query test (#11502)

2024-07-02 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 29e2f6cd0f0 [HUDI-7926] Data skipping failure mode should be strict in 
query test (#11502)
29e2f6cd0f0 is described below

commit 29e2f6cd0f0287158bc85ae36f8d2c081ff3a8c2
Author: KnightChess <981159...@qq.com>
AuthorDate: Wed Jul 3 08:19:38 2024 +0800

[HUDI-7926] Data skipping failure mode should be strict in query test 
(#11502)
---
 .../hudi/testutils/HoodieSparkClientTestHarness.java  |  6 ++
 .../main/scala/org/apache/hudi/HoodieFileIndex.scala  | 17 +++--
 .../apache/hudi/HoodieHadoopFsRelationFactory.scala   |  2 +-
 .../scala/org/apache/hudi/SparkBaseIndexSupport.scala | 19 +++
 .../main/scala/org/apache/hudi/cdc/HoodieCDCRDD.scala |  2 +-
 .../HoodieFileGroupReaderBasedParquetFileFormat.scala |  2 +-
 .../hudi/functional/ColumnStatIndexTestBase.scala |  1 +
 .../hudi/functional/PartitionStatsIndexTestBase.scala |  1 +
 .../hudi/functional/RecordLevelIndexTestBase.scala|  1 +
 .../hudi/functional/SecondaryIndexTestBase.scala  |  1 +
 .../functional/TestBloomFiltersIndexSupport.scala |  1 +
 .../functional/TestPartitionStatsIndexWithSql.scala   |  7 ++-
 .../sql/hudi/command/index/TestFunctionalIndex.scala  |  4 
 .../sql/hudi/common/HoodieSparkSqlTestBase.scala  | 13 -
 .../spark/sql/hudi/dml/TestDataSkippingQuery.scala|  6 ++
 15 files changed, 72 insertions(+), 11 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java
index eefa825bc5c..3f342f8e054 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java
@@ -137,6 +137,7 @@ public abstract class HoodieSparkClientTestHarness extends 
HoodieWriterClientTes
   protected SparkRDDWriteClient writeClient;
   protected SparkRDDReadClient readClient;
   protected HoodieTableFileSystemView tableView;
+  protected Map extraConf = new HashMap<>();
 
   protected TimelineService timelineService;
   protected final SparkTaskContextSupplier supplier = new 
SparkTaskContextSupplier();
@@ -200,6 +201,7 @@ public abstract class HoodieSparkClientTestHarness extends 
HoodieWriterClientTes
 
 // Initialize a local spark env
 SparkConf sc = HoodieClientTestUtils.getSparkConfForTest(appName + "#" + 
testMethodName);
+extraConf.forEach(sc::set);
 SparkContext sparkContext = new SparkContext(sc);
 HoodieClientTestUtils.overrideSparkHadoopConfiguration(sparkContext);
 jsc = new JavaSparkContext(sparkContext);
@@ -229,6 +231,10 @@ public abstract class HoodieSparkClientTestHarness extends 
HoodieWriterClientTes
 initSparkContexts(this.getClass().getSimpleName());
   }
 
+  protected void initQueryIndexConf() {
+extraConf.put("hoodie.fileIndex.dataSkippingFailureMode", "strict");
+  }
+
   /**
* Cleanups Spark contexts ({@link JavaSparkContext} and {@link SQLContext}).
*/
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
index 47090d73887..e987ae47fc7 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
@@ -22,14 +22,14 @@ import org.apache.hudi.HoodieSparkConfUtils.getConfigValue
 import 
org.apache.hudi.common.config.TimestampKeyGeneratorConfig.{TIMESTAMP_INPUT_DATE_FORMAT,
 TIMESTAMP_OUTPUT_DATE_FORMAT}
 import org.apache.hudi.common.config.{HoodieMetadataConfig, TypedProperties}
 import org.apache.hudi.common.model.{FileSlice, HoodieBaseFile, HoodieLogFile}
-import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
 import org.apache.hudi.common.util.StringUtils
 import org.apache.hudi.exception.HoodieException
 import org.apache.hudi.keygen.{TimestampBasedAvroKeyGenerator, 
TimestampBasedKeyGenerator}
 import org.apache.hudi.storage.{StoragePath, StoragePathInfo}
 import org.apache.hudi.util.JFunction
-
 import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.DataSourceWriteOptions.{PARTITIONPATH_FIELD, 
PRECOMBINE_FIELD, RECORDKEY_FIELD}
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.catalyst.InternalRow
@@ -43,7 +43,6 @@ import org.apache.spark.unsafe.types.UTF8String
 import 

Re: [PR] [HUDI-7926] dataskipping failure mode should be strict in query test [hudi]

2024-07-02 Thread via GitHub


danny0405 merged PR #11502:
URL: https://github.com/apache/hudi/pull/11502





Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11539:
URL: https://github.com/apache/hudi/pull/11539#issuecomment-2204763350

   
   ## CI report:
   
   * 24462930ca4dcf13f039fdebcd078c072367ac03 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24677)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11539:
URL: https://github.com/apache/hudi/pull/11539#issuecomment-2204753163

   
   ## CI report:
   
   * 4de4a316924123adc80b32615b72e5827f99f46f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24676)
 
   * 24462930ca4dcf13f039fdebcd078c072367ac03 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11539:
URL: https://github.com/apache/hudi/pull/11539#issuecomment-2204662455

   
   ## CI report:
   
   * 4de4a316924123adc80b32615b72e5827f99f46f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24676)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11539:
URL: https://github.com/apache/hudi/pull/11539#issuecomment-2204652177

   
   ## CI report:
   
   * dc811a79eba67aa96aca74f841d7a79fde1c35de Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24675)
 
   * 4de4a316924123adc80b32615b72e5827f99f46f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7921] Fixing file system view closures in MDT [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11496:
URL: https://github.com/apache/hudi/pull/11496#issuecomment-2204652066

   
   ## CI report:
   
   * 650cefb8de1f1b97b12174361b1f1acf10b64ecb Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24674)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Created] (HUDI-7946) [Umbrella] RFC-79 : Improving reliability of concurrent table service executions and rollbacks

2024-07-02 Thread Krishen Bhan (Jira)
Krishen Bhan created HUDI-7946:
--

 Summary: [Umbrella] RFC-79 : Improving reliability of concurrent 
table service executions and rollbacks
 Key: HUDI-7946
 URL: https://issues.apache.org/jira/browse/HUDI-7946
 Project: Apache Hudi
  Issue Type: Epic
  Components: multi-writer, table-service
Reporter: Krishen Bhan


This is the umbrella ticket that tracks the overall implementation of RFC-79
h4.





Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


vinothchandar commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1663256899


##
rfc/rfc-78/rfc-78.md:
##
@@ -0,0 +1,220 @@
+
+# RFC-76: [Bridge release for 1.x]
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ - @yihua
+ - @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7882
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+[Hudi 
1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md)
 is a powerful 
+re-imagination of the transactional database layer in Hudi to power continued 
innovation across the community in the coming 
+years. It introduces lot of differentiating features for Apache Hudi. We 
released beta releases which was meant for 
+enthusiastic developers/users to give a try of advanced features. But as we 
are working towards 1.0 GA, we are proposing 
+a bridge release (0.16.0) for smoother migration for existing hudi users. 
+
+## Objectives 
+Goal is to have a smooth migration experience for the users from 0.x to 1.0. 
We plan to have a 0.16.0 bridge release asking everyone to first migrate to 
0.16.0 before they can upgrade to 1.x.
+
+- 1.x reader should be able to read 0.16.x tables w/o any loss in 
functionality and no data inconsistencies.
+- 0.16.x should have read capability for 1.x tables w/ some limitations. For 
features ported over from 0.x, no loss in functionality should be guaranteed. 
But for new features that was introduced in 1.x, we may not be able to support 
all of them. Will be calling out which new features may not work with 0.16.x 
reader. In this case, we explicitly request users to not turn on these features 
till readers are completely in 1.x.
+- Document upgrade steps from 0.16.x to 1.x with limited user perceived 
latency. This will be auto upgrade, but document clearly what needs to be done.

Review Comment:
   ```suggestion
   - Document steps for rolling upgrade from 0.16.x to 1.x , with minimal 
downtime
   ```



##
rfc/rfc-78/rfc-78.md:
##
@@ -0,0 +1,220 @@
+
+# RFC-76: [Bridge release for 1.x]
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ - @yihua
+ - @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7882
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+[Hudi 
1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md)
 is a powerful 
+re-imagination of the transactional database layer in Hudi to power continued 
innovation across the community in the coming 
+years. It introduces lot of differentiating features for Apache Hudi. We 
released beta releases which was meant for 
+enthusiastic developers/users to give a try of advanced features. But as we 
are working towards 1.0 GA, we are proposing 
+a bridge release (0.16.0) for smoother migration for existing hudi users. 
+
+## Objectives 
+Goal is to have a smooth migration experience for the users from 0.x to 1.0. 
We plan to have a 0.16.0 bridge release asking everyone to first migrate to 
0.16.0 before they can upgrade to 1.x.
+
+- 1.x reader should be able to read 0.16.x tables w/o any loss in 
functionality and no data inconsistencies.
+- 0.16.x should have read capability for 1.x tables w/ some limitations. For 
features ported over from 0.x, no loss in functionality should be guaranteed. 
But for new features that was introduced in 1.x, we may not be able to support 
all of them. Will be calling out which new features may not work with 0.16.x 
reader. In this case, we explicitly request users to not turn on these features 
till readers are completely in 1.x.
+- Document upgrade steps from 0.16.x to 1.x with limited user perceived 
latency. This will be auto upgrade, but document clearly what needs to be done.
+- Downgrade from 1.x to 0.16.x documented with call outs on any functionality.
+
+### Considerations when choosing Migration strategy
+- While migration is happening, we want to allow readers to continue reading 
data. This means, we cannot employ a stop-the-world strategy when we are 
migrating. 
+All the actions that we are performing as part of table upgrade should not 
have any side-effects of breaking snapshot isolation for readers.
+- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do 
not want to add read support for very old versions of hudi in 1.x(for eg 
0.7.0). 
+- So, in an effort to bring everyone to latest hudi versions, 1.x reader will 
have full read capabilities for 0.16.x, but for older hudi versions, 1.x reader 
may not have full reader support. 

Review Comment:
   It may be good to document what works, but best to get everyone to 0.16; or 
those users can choose to take a downtime and do it directly? 



##
rfc/rfc-78/rfc-78.md:
##
@@ -0,0 +1,220 @@
+
+# RFC-76: [Bridge release for 1.x]
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ 

Re: [PR] [HUDI-7940] Pass HoodieIngestionMetrics to Error Table Writer to be able to emit metrics for Error Table Writer [hudi]

2024-07-02 Thread via GitHub


CTTY commented on PR #11541:
URL: https://github.com/apache/hudi/pull/11541#issuecomment-2204570297

   LGTM in general, could you fix the checkstyle issue as well?
   ```
   src/main/java/org/apache/hudi/utilities/streamer/ErrorTableUtils.java:[38,8] 
(imports) UnusedImports: Unused import - 
java.lang.reflect.InvocationTargetException.
   ```
   
   cc: @yihua 





[jira] [Updated] (HUDI-7940) Pass metrics to ErrorTableWriter to be able to emit metrics for Error Table

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7940:
-
Labels: pull-request-available  (was: )

> Pass metrics to ErrorTableWriter to be able to emit metrics for Error Table
> ---
>
> Key: HUDI-7940
> URL: https://issues.apache.org/jira/browse/HUDI-7940
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Minor
>  Labels: pull-request-available
>
> Pass metrics to ErrorTableWriter to be able to emit metrics for Error Table





Re: [PR] [HUDI-7940] Pass HoodieIngestionMetrics to Error Table Writer to be able to emit metrics for Error Table Writer [hudi]

2024-07-02 Thread via GitHub


CTTY commented on code in PR #11541:
URL: https://github.com/apache/hudi/pull/11541#discussion_r1663232180


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/ErrorTableUtils.java:
##
@@ -47,27 +48,29 @@ public static Option 
getErrorTableWriter(HoodieStreamer.Co
  SparkSession 
sparkSession,
  
TypedProperties props,
  
HoodieSparkEngineContext hoodieSparkContext,
- FileSystem 
fileSystem) {
+ FileSystem fs,
+ 
Option metrics) {
 String errorTableWriterClass = 
props.getString(ERROR_TABLE_WRITE_CLASS.key());
 
ValidationUtils.checkState(!StringUtils.isNullOrEmpty(errorTableWriterClass),
-"Missing error table config " + ERROR_TABLE_WRITE_CLASS);
+   "Missing error table config " + 
ERROR_TABLE_WRITE_CLASS);
 
-Class[] argClassArr = new Class[] {HoodieStreamer.Config.class,
-SparkSession.class, TypedProperties.class, 
HoodieSparkEngineContext.class,
-FileSystem.class};
-String errMsg = "Unable to instantiate ErrorTableWriter with arguments 
type "
-+ Arrays.toString(argClassArr);
-ValidationUtils.checkArgument(
-ReflectionUtils.hasConstructor(BaseErrorTableWriter.class.getName(), 
argClassArr, false),
-errMsg);
+Class[] legacyArgClass = new Class[]{HoodieStreamer.Config.class,
+SparkSession.class, TypedProperties.class, 
HoodieSparkEngineContext.class, FileSystem.class};
+Class[] argClassV1 = new Class[]{HoodieStreamer.Config.class,
+SparkSession.class, TypedProperties.class, 
HoodieSparkEngineContext.class, FileSystem.class, Option.class};

Review Comment:
   Should we make these arg classes static as well?
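
   A self-contained sketch of the pattern being suggested here: hold the 
constructor arg-class arrays as constants and pick the newer constructor via 
reflection, falling back to the legacy one. `ErrorWriterSketch` below is a made-up 
stand-in, not Hudi's BaseErrorTableWriter, and its signatures are illustrative 
assumptions only:
   
   ```scala
   // Hedged sketch: constant arg-class arrays plus reflective constructor fallback.
   class ErrorWriterSketch(val label: String) {
     // "v1" constructor with an extra metrics argument.
     def this(label: String, metrics: Option[String]) =
       this(label + metrics.map(m => s" [$m]").getOrElse(""))
   }
   
   object ErrorWriterFactorySketch {
     // Built once instead of being rebuilt on every call.
     private val LegacyArgs: Array[Class[_]] = Array(classOf[String])
     private val V1Args: Array[Class[_]] = Array(classOf[String], classOf[Option[_]])
   
     def create(label: String, metrics: Option[String]): ErrorWriterSketch = {
       val clazz = classOf[ErrorWriterSketch]
       // Prefer the newer constructor when it exists; otherwise use the legacy one.
       val hasV1 = clazz.getConstructors.exists(_.getParameterTypes.sameElements(V1Args))
       if (hasV1) clazz.getConstructor(V1Args: _*).newInstance(label, metrics)
       else clazz.getConstructor(LegacyArgs: _*).newInstance(label)
     }
   
     def main(args: Array[String]): Unit =
       println(create("error-table", Some("metrics-on")).label)
   }
   ```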






[jira] [Closed] (HUDI-7854) Bump AWS SDK v2 version to 2.25.69

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7854.
---
Resolution: Fixed

> Bump AWS SDK v2 version to 2.25.69
> --
>
> Key: HUDI-7854
> URL: https://issues.apache.org/jira/browse/HUDI-7854
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> The current version of AWS SDK v2 used is 2.18.40 which is 1.5 years old.





[jira] [Commented] (HUDI-6510) Java 17 compile time support

2024-07-02 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862608#comment-17862608
 ] 

Ethan Guo commented on HUDI-6510:
-

[~yc2523] is going to take this up as part of supporting Spark 4.0 for Hudi 
(Spark 4.0 requires Java 17 compile time support).

> Java 17 compile time support
> 
>
> Key: HUDI-6510
> URL: https://issues.apache.org/jira/browse/HUDI-6510
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Shawn Chang
>Priority: Major
> Fix For: 1.0.0
>
>
> Certify Hudi with Java 17 compile time support





[jira] [Assigned] (HUDI-6510) Java 17 compile time support

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6510:
---

Assignee: Shawn Chang  (was: Ethan Guo)

> Java 17 compile time support
> 
>
> Key: HUDI-6510
> URL: https://issues.apache.org/jira/browse/HUDI-6510
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Shawn Chang
>Priority: Major
> Fix For: 1.0.0
>
>
> Certify Hudi with Java 17 compile time support





[jira] [Closed] (HUDI-7922) Add Hudi CLI bundle for Scala 2.13

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7922.
---
Resolution: Fixed

> Add Hudi CLI bundle for Scala 2.13
> --
>
> Key: HUDI-7922
> URL: https://issues.apache.org/jira/browse/HUDI-7922
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Build of Hudi CLI bundle should succeed on Scala 2.13 and work on Spark 3.5 
> and Scala 2.13.





[jira] [Closed] (HUDI-6508) Java 11 compile time support

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-6508.
---
Resolution: Fixed

> Java 11 compile time support
> 
>
> Key: HUDI-6508
> URL: https://issues.apache.org/jira/browse/HUDI-6508
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Udit Mehrotra
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Certify Hudi with Java 11 runtime support





[jira] [Updated] (HUDI-7945) Fix file pruning using PARTITION_STATS index in Spark

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7945:

Status: In Progress  (was: Open)

> Fix file pruning using PARTITION_STATS index in Spark
> -
>
> Key: HUDI-7945
> URL: https://issues.apache.org/jira/browse/HUDI-7945
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0-beta2, 1.0.0
>
>
> The issue can be reproduced by 
> [https://github.com/apache/hudi/pull/11472#issuecomment-2199332859.]
> When there are more than one base files in a table partition, the 
> corresponding PARTITION_STATS index record in the metadata table contains 
> null as the file_path field in HoodieColumnRangeMetadata.
> {code:java}
> private static > HoodieColumnRangeMetadata 
> mergeRanges(HoodieColumnRangeMetadata one,
>   
> HoodieColumnRangeMetadata another) {
>   
> ValidationUtils.checkArgument(one.getColumnName().equals(another.getColumnName()),
>   "Column names should be the same for merging column ranges");
>   final T minValue = getMinValueForColumnRanges(one, another);
>   final T maxValue = getMaxValueForColumnRanges(one, another);
>   return HoodieColumnRangeMetadata.create(
>   null, one.getColumnName(), minValue, maxValue,
>   one.getNullCount() + another.getNullCount(),
>   one.getValueCount() + another.getValueCount(),
>   one.getTotalSize() + another.getTotalSize(),
>   one.getTotalUncompressedSize() + another.getTotalUncompressedSize());
> } 
> {code}
> The null causes NPE when loading the column stats per partition from 
> PARTITION_STATS index.  Also, current implementation of 
> PartitionStatsIndexSupport assumes that the file_path field contains the 
> exact file name and it does not work if the the file path does not contain 
> null (even a list of file names stored does not work).  We have to 
> reimplement PartitionStatsIndexSupport so that it gives the pruned partitions 
> for further processing.
> {code:java}
> Caused by: java.lang.NullPointerException: element cannot be mapped to a null 
> key
>     at java.util.Objects.requireNonNull(Objects.java:228)
>     at java.util.stream.Collectors.lambda$groupingBy$45(Collectors.java:907)
>     at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
>     at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>     at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>     at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>     at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>     at java.util.Iterator.forEachRemaining(Iterator.java:116)
>     at 
> java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
>     at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
>     at 
> java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
>     at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>     at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>     at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
>     at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
>     at java.util.stream.AbstractTask.compute(AbstractTask.java:327)
>     at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
>     at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>     at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
>     at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
>     at 
> java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
>     at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>     at 
> org.apache.hudi.common.data.HoodieListPairData.groupByKey(HoodieListPairData.java:115)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.transpose(ColumnStatsIndexSupport.scala:253)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.$anonfun$loadTransposed$1(ColumnStatsIndexSupport.scala:149)
>     at 
> org.apache.hudi.HoodieCatalystUtils$.withPersistedData(HoodieCatalystUtils.scala:61)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.loadTransposed(ColumnStatsIndexSupport.scala:148)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.computeCandidateFileNames(ColumnStatsIndexSupport.scala:101)
>     at 
> 

Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11539:
URL: https://github.com/apache/hudi/pull/11539#issuecomment-2204550911

   
   ## CI report:
   
   * dc811a79eba67aa96aca74f841d7a79fde1c35de Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24675)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-7945) Fix file pruning using PARTITION_STATS index in Spark

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7945:

Story Points: 6

> Fix file pruning using PARTITION_STATS index in Spark
> -
>
> Key: HUDI-7945
> URL: https://issues.apache.org/jira/browse/HUDI-7945
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0-beta2, 1.0.0
>
>
> The issue can be reproduced by 
> [https://github.com/apache/hudi/pull/11472#issuecomment-2199332859.]
> When there are more than one base files in a table partition, the 
> corresponding PARTITION_STATS index record in the metadata table contains 
> null as the file_path field in HoodieColumnRangeMetadata.
> {code:java}
> private static > HoodieColumnRangeMetadata 
> mergeRanges(HoodieColumnRangeMetadata one,
>   
> HoodieColumnRangeMetadata another) {
>   
> ValidationUtils.checkArgument(one.getColumnName().equals(another.getColumnName()),
>   "Column names should be the same for merging column ranges");
>   final T minValue = getMinValueForColumnRanges(one, another);
>   final T maxValue = getMaxValueForColumnRanges(one, another);
>   return HoodieColumnRangeMetadata.create(
>   null, one.getColumnName(), minValue, maxValue,
>   one.getNullCount() + another.getNullCount(),
>   one.getValueCount() + another.getValueCount(),
>   one.getTotalSize() + another.getTotalSize(),
>   one.getTotalUncompressedSize() + another.getTotalUncompressedSize());
> } 
> {code}
> The null causes NPE when loading the column stats per partition from 
> PARTITION_STATS index.  Also, current implementation of 
> PartitionStatsIndexSupport assumes that the file_path field contains the 
> exact file name and it does not work if the the file path does not contain 
> null (even a list of file names stored does not work).  We have to 
> reimplement PartitionStatsIndexSupport so that it gives the pruned partitions 
> for further processing.
> {code:java}
> Caused by: java.lang.NullPointerException: element cannot be mapped to a null 
> key
>     at java.util.Objects.requireNonNull(Objects.java:228)
>     at java.util.stream.Collectors.lambda$groupingBy$45(Collectors.java:907)
>     at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
>     at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>     at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>     at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>     at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>     at java.util.Iterator.forEachRemaining(Iterator.java:116)
>     at 
> java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
>     at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
>     at 
> java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
>     at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>     at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>     at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
>     at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
>     at java.util.stream.AbstractTask.compute(AbstractTask.java:327)
>     at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
>     at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>     at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
>     at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
>     at 
> java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
>     at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>     at 
> org.apache.hudi.common.data.HoodieListPairData.groupByKey(HoodieListPairData.java:115)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.transpose(ColumnStatsIndexSupport.scala:253)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.$anonfun$loadTransposed$1(ColumnStatsIndexSupport.scala:149)
>     at 
> org.apache.hudi.HoodieCatalystUtils$.withPersistedData(HoodieCatalystUtils.scala:61)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.loadTransposed(ColumnStatsIndexSupport.scala:148)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.computeCandidateFileNames(ColumnStatsIndexSupport.scala:101)
>     at 
> 

[jira] [Updated] (HUDI-7945) Fix file pruning using PARTITION_STATS index in Spark

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7945:

Description: 
The issue can be reproduced by 
[https://github.com/apache/hudi/pull/11472#issuecomment-2199332859.]

When there is more than one base file in a table partition, the corresponding 
PARTITION_STATS index record in the metadata table contains null as the 
file_path field in HoodieColumnRangeMetadata.
{code:java}
private static > HoodieColumnRangeMetadata 
mergeRanges(HoodieColumnRangeMetadata one,

  HoodieColumnRangeMetadata another) {
  
ValidationUtils.checkArgument(one.getColumnName().equals(another.getColumnName()),
  "Column names should be the same for merging column ranges");
  final T minValue = getMinValueForColumnRanges(one, another);
  final T maxValue = getMaxValueForColumnRanges(one, another);

  return HoodieColumnRangeMetadata.create(
  null, one.getColumnName(), minValue, maxValue,
  one.getNullCount() + another.getNullCount(),
  one.getValueCount() + another.getValueCount(),
  one.getTotalSize() + another.getTotalSize(),
  one.getTotalUncompressedSize() + another.getTotalUncompressedSize());
} 
{code}
The null causes an NPE when loading the column stats per partition from the 
PARTITION_STATS index.  Also, the current implementation of 
PartitionStatsIndexSupport assumes that the file_path field contains the exact 
file name, and it does not work if the file path does not contain null (even 
storing a list of file names does not work).  We have to reimplement 
PartitionStatsIndexSupport so that it gives the pruned partitions for further 
processing.
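
A minimal, self-contained illustration (not from the ticket; the ColumnRange class 
and values below are made up) of the null-key failure that produces the stack 
trace below: java.util.stream's groupingBy rejects null keys, so a merged 
column-range row whose file path is null fails in exactly this way.
{code}
// Hedged sketch: reproducing "element cannot be mapped to a null key" with a
// made-up stand-in for the merged column-range record.
import java.util.stream.Collectors
import scala.jdk.CollectionConverters._

object NullKeyGroupingSketch {
  final case class ColumnRange(filePath: String, column: String, min: Int, max: Int)

  def main(args: Array[String]): Unit = {
    val ranges = List(
      ColumnRange("file-1.parquet", "price", 1, 10),
      ColumnRange(null, "price", 2, 20) // merged record with a null file path
    )
    val byFile: java.util.function.Function[ColumnRange, String] = r => r.filePath
    // Throws java.lang.NullPointerException: element cannot be mapped to a null key
    val grouped = ranges.asJava.stream().collect(Collectors.groupingBy(byFile))
    println(grouped)
  }
}
{code}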
{code:java}
Caused by: java.lang.NullPointerException: element cannot be mapped to a null 
key
    at java.util.Objects.requireNonNull(Objects.java:228)
    at java.util.stream.Collectors.lambda$groupingBy$45(Collectors.java:907)
    at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at 
java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
    at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
    at 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
    at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
    at java.util.stream.AbstractTask.compute(AbstractTask.java:327)
    at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
    at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
    at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
    at 
org.apache.hudi.common.data.HoodieListPairData.groupByKey(HoodieListPairData.java:115)
    at 
org.apache.hudi.ColumnStatsIndexSupport.transpose(ColumnStatsIndexSupport.scala:253)
    at 
org.apache.hudi.ColumnStatsIndexSupport.$anonfun$loadTransposed$1(ColumnStatsIndexSupport.scala:149)
    at 
org.apache.hudi.HoodieCatalystUtils$.withPersistedData(HoodieCatalystUtils.scala:61)
    at 
org.apache.hudi.ColumnStatsIndexSupport.loadTransposed(ColumnStatsIndexSupport.scala:148)
    at 
org.apache.hudi.ColumnStatsIndexSupport.computeCandidateFileNames(ColumnStatsIndexSupport.scala:101)
    at 
org.apache.hudi.HoodieFileIndex.$anonfun$lookupCandidateFilesInMetadataTable$3(HoodieFileIndex.scala:354)
    at 
org.apache.hudi.HoodieFileIndex.$anonfun$lookupCandidateFilesInMetadataTable$3$adapted(HoodieFileIndex.scala:351)
    at 
scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
    at 
org.apache.hudi.HoodieFileIndex.$anonfun$lookupCandidateFilesInMetadataTable$1(HoodieFileIndex.scala:351)
    at scala.util.Try$.apply(Try.scala:213)
    at 

[jira] [Updated] (HUDI-7945) Fix file pruning using PARTITION_STATS index in Spark

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7945:

Story Points: 8  (was: 6)

> Fix file pruning using PARTITION_STATS index in Spark
> -
>
> Key: HUDI-7945
> URL: https://issues.apache.org/jira/browse/HUDI-7945
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0-beta2, 1.0.0
>
>
> The issue can be reproduced by 
> [https://github.com/apache/hudi/pull/11472#issuecomment-2199332859.]
> When there are more than one base files in a table partition, the 
> corresponding PARTITION_STATS index record in the metadata table contains 
> null as the file_path field in HoodieColumnRangeMetadata.
> {code:java}
> private static > HoodieColumnRangeMetadata 
> mergeRanges(HoodieColumnRangeMetadata one,
>   
> HoodieColumnRangeMetadata another) {
>   
> ValidationUtils.checkArgument(one.getColumnName().equals(another.getColumnName()),
>   "Column names should be the same for merging column ranges");
>   final T minValue = getMinValueForColumnRanges(one, another);
>   final T maxValue = getMaxValueForColumnRanges(one, another);
>   return HoodieColumnRangeMetadata.create(
>   null, one.getColumnName(), minValue, maxValue,
>   one.getNullCount() + another.getNullCount(),
>   one.getValueCount() + another.getValueCount(),
>   one.getTotalSize() + another.getTotalSize(),
>   one.getTotalUncompressedSize() + another.getTotalUncompressedSize());
> } 
> {code}
> The null causes NPE when loading the column stats per partition from 
> PARTITION_STATS index.  Also, current implementation of 
> PartitionStatsIndexSupport assumes that the file_path field contains the 
> exact file name and it does not work if the the file path does not contain 
> null (even a list of file names stored does not work).  We have to 
> reimplement PartitionStatsIndexSupport so that it gives the pruned partitions 
> for further processing.
> {code:java}
> Caused by: java.lang.NullPointerException: element cannot be mapped to a null 
> key
>     at java.util.Objects.requireNonNull(Objects.java:228)
>     at java.util.stream.Collectors.lambda$groupingBy$45(Collectors.java:907)
>     at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
>     at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>     at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>     at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>     at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>     at java.util.Iterator.forEachRemaining(Iterator.java:116)
>     at 
> java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
>     at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
>     at 
> java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
>     at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>     at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>     at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
>     at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
>     at java.util.stream.AbstractTask.compute(AbstractTask.java:327)
>     at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
>     at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>     at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
>     at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
>     at 
> java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
>     at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>     at 
> org.apache.hudi.common.data.HoodieListPairData.groupByKey(HoodieListPairData.java:115)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.transpose(ColumnStatsIndexSupport.scala:253)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.$anonfun$loadTransposed$1(ColumnStatsIndexSupport.scala:149)
>     at 
> org.apache.hudi.HoodieCatalystUtils$.withPersistedData(HoodieCatalystUtils.scala:61)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.loadTransposed(ColumnStatsIndexSupport.scala:148)
>     at 
> org.apache.hudi.ColumnStatsIndexSupport.computeCandidateFileNames(ColumnStatsIndexSupport.scala:101)
>     at 
> 

Re: [PR] [HUDI-7903] Fix storage partition stats index to skip data [hudi]

2024-07-02 Thread via GitHub


yihua commented on PR #11472:
URL: https://github.com/apache/hudi/pull/11472#issuecomment-2204550371

   HUDI-7945 to track the fix.





Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11539:
URL: https://github.com/apache/hudi/pull/11539#issuecomment-2204539613

   
   ## CI report:
   
   * 183768aef83c6090cfd27b8c6b360dbf139d832a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24658)
 
   * dc811a79eba67aa96aca74f841d7a79fde1c35de UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7921] Fixing file system view closures in MDT [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11496:
URL: https://github.com/apache/hudi/pull/11496#issuecomment-2204539482

   
   ## CI report:
   
   * c9b592e06870a6a6b368528697217b3f9c2b63da Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24667)
 
   * 650cefb8de1f1b97b12174361b1f1acf10b64ecb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24674)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-7945) Fix file pruning using PARTITION_STATS index in Spark

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7945:

Fix Version/s: 1.0.0-beta2
   1.0.0

> Fix file pruning using PARTITION_STATS index in Spark
> -
>
> Key: HUDI-7945
> URL: https://issues.apache.org/jira/browse/HUDI-7945
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0-beta2, 1.0.0
>
>






[jira] [Assigned] (HUDI-7945) Fix file pruning using PARTITION_STATS index in Spark

2024-07-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7945:
---

Assignee: Ethan Guo

> Fix file pruning using PARTITION_STATS index in Spark
> -
>
> Key: HUDI-7945
> URL: https://issues.apache.org/jira/browse/HUDI-7945
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>






[jira] [Created] (HUDI-7945) Fix file pruning using PARTITION_STATS index in Spark

2024-07-02 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7945:
---

 Summary: Fix file pruning using PARTITION_STATS index in Spark
 Key: HUDI-7945
 URL: https://issues.apache.org/jira/browse/HUDI-7945
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo








Re: [PR] [HUDI-7921] Fixing file system view closures in MDT [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11496:
URL: https://github.com/apache/hudi/pull/11496#issuecomment-2204530009

   
   ## CI report:
   
   * c9b592e06870a6a6b368528697217b3f9c2b63da Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24667)
 
   * 650cefb8de1f1b97b12174361b1f1acf10b64ecb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7903] Fix storage partition stats index to skip data [hudi]

2024-07-02 Thread via GitHub


yihua commented on code in PR #11472:
URL: https://github.com/apache/hudi/pull/11472#discussion_r1663191099


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala:
##
@@ -34,74 +43,294 @@ class TestPartitionStatsIndexWithSql extends 
HoodieSparkSqlTestBase {
   val sqlTempTable = "hudi_tbl"
 
   test("Test partition stats index following insert, merge into, update and 
delete") {
-withTempDir { tmp =>
-  val tableName = generateTableName
-  val tablePath = s"${tmp.getCanonicalPath}/$tableName"
-  // Create table with date type partition
-  spark.sql(
-s"""
-   | create table $tableName using hudi
-   | partitioned by (dt)
-   | tblproperties(
-   |primaryKey = 'id',
-   |preCombineField = 'ts',
-   |'hoodie.metadata.index.partition.stats.enable' = 'true'
-   | )
-   | location '$tablePath'
-   | AS
-   | select 1 as id, 'a1' as name, 10 as price, 1000 as ts, 
cast('2021-05-06' as date) as dt
+Seq("cow", "mor").foreach { tableType =>
+  withTempDir { tmp =>
+val tableName = generateTableName
+val tablePath = s"${tmp.getCanonicalPath}/$tableName"
+// Create table with date type partition
+spark.sql(
+  s"""
+ | create table $tableName using hudi
+ | partitioned by (dt)
+ | tblproperties(
+ |type = '$tableType',
+ |primaryKey = 'id',
+ |preCombineField = 'ts',
+ |'hoodie.metadata.index.partition.stats.enable' = 'true',
+ |'hoodie.metadata.index.column.stats.column.list' = 'name'
+ | )
+ | location '$tablePath'
+ | AS
+ | select 1 as id, 'a1' as name, 10 as price, 1000 as ts, 
cast('2021-05-06' as date) as dt
  """.stripMargin
-  )
+)
+
+assertResult(WriteOperationType.BULK_INSERT) {
+  HoodieSparkSqlTestBase.getLastCommitMetadata(spark, 
tablePath).getOperationType
+}
+checkAnswer(s"select id, name, price, ts, cast(dt as string) from 
$tableName")(
+  Seq(1, "a1", 10, 1000, "2021-05-06")
+)
+
+val partitionValue = "2021-05-06"
+
+// Check the missing properties for spark sql
+val metaClient = HoodieTableMetaClient.builder()
+  .setBasePath(tablePath)
+  .setConf(HoodieTestUtils.getDefaultStorageConf)
+  .build()
+val properties = metaClient.getTableConfig.getProps.asScala.toMap
+
assertResult(true)(properties.contains(HoodieTableConfig.CREATE_SCHEMA.key))
+assertResult("dt")(properties(HoodieTableConfig.PARTITION_FIELDS.key))
+assertResult("ts")(properties(HoodieTableConfig.PRECOMBINE_FIELD.key))
+assertResult(tableName)(metaClient.getTableConfig.getTableName)
+// Validate partition_stats index exists
+
assertTrue(metaClient.getTableConfig.getMetadataPartitions.contains(PARTITION_STATS.getPartitionPath))
+
+// Test insert into
+spark.sql(s"insert into $tableName values(2, 'a2', 10, 1000, 
cast('$partitionValue' as date))")
+checkAnswer(s"select _hoodie_record_key, _hoodie_partition_path, id, 
name, price, ts, cast(dt as string) from $tableName order by id")(
+  Seq("1", s"dt=$partitionValue", 1, "a1", 10, 1000, partitionValue),
+  Seq("2", s"dt=$partitionValue", 2, "a2", 10, 1000, partitionValue)
+)
+// Test merge into
+spark.sql(
+  s"""
+ |merge into $tableName h0
+ |using (select 1 as id, 'a1' as name, 11 as price, 1001 as ts, 
'$partitionValue' as dt) s0
+ |on h0.id = s0.id
+ |when matched then update set *
+ |""".stripMargin)
+checkAnswer(s"select _hoodie_record_key, _hoodie_partition_path, id, 
name, price, ts, cast(dt as string) from $tableName order by id")(
+  Seq("1", s"dt=$partitionValue", 1, "a1", 11, 1001, partitionValue),
+  Seq("2", s"dt=$partitionValue", 2, "a2", 10, 1000, partitionValue)
+)
+// Test update
+spark.sql(s"update $tableName set price = price + 1 where id = 2")
+checkAnswer(s"select _hoodie_record_key, _hoodie_partition_path, id, 
name, price, ts, cast(dt as string) from $tableName order by id")(
+  Seq("1", s"dt=$partitionValue", 1, "a1", 11, 1001, partitionValue),
+  Seq("2", s"dt=$partitionValue", 2, "a2", 11, 1000, partitionValue)
+)
+// Test delete
+spark.sql(s"delete from $tableName where id = 1")
+checkAnswer(s"select _hoodie_record_key, _hoodie_partition_path, id, 
name, price, ts, cast(dt as string) from $tableName order by id")(
+  Seq("2", s"dt=$partitionValue", 2, "a2", 11, 1000, partitionValue)
+)
+  }
+}
+  }
+
+  test("Test partition 

(hudi-rs) branch main updated: feat: add APIs for time-travel read (#33)

2024-07-02 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/hudi-rs.git


The following commit(s) were added to refs/heads/main by this push:
 new 199a25d  feat: add APIs for time-travel read (#33)
199a25d is described below

commit 199a25d82ba09c0bedeb12430bf22299603209b2
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Tue Jul 2 15:48:05 2024 -0500

feat: add APIs for time-travel read (#33)
---
 crates/core/src/file_group/mod.rs | 27 ---
 crates/core/src/storage/mod.rs| 12 ++---
 crates/core/src/table/fs_view.rs  | 38 +--
 crates/core/src/table/mod.rs  | 98 +++
 crates/core/src/table/timeline.rs | 70 
 crates/datafusion/src/lib.rs  | 14 ++
 crates/tests/src/lib.rs   |  7 +++
 python/hudi/_internal.pyi | 12 +++--
 python/hudi/table.py  | 22 +
 python/src/lib.rs | 31 -
 python/tests/test_table_read.py   | 44 ++
 11 files changed, 279 insertions(+), 96 deletions(-)

diff --git a/crates/core/src/file_group/mod.rs 
b/crates/core/src/file_group/mod.rs
index 6b9b22c..c0af0b3 100644
--- a/crates/core/src/file_group/mod.rs
+++ b/crates/core/src/file_group/mod.rs
@@ -100,7 +100,7 @@ impl FileSlice {
 if self.base_file.stats.is_none() {
 let parquet_meta = storage
 .get_parquet_file_metadata(_file_relative_path())
-.await;
+.await?;
 let num_records = parquet_meta.file_metadata().num_rows();
 let stats = FileStats { num_records };
 self.base_file.stats = Some(stats);
@@ -163,12 +163,22 @@ impl FileGroup {
 }
 }
 
-    pub fn get_latest_file_slice(&self) -> Option<&FileSlice> {
-        return self.file_slices.values().next_back();
+    pub fn get_file_slice_as_of(&self, timestamp: &str) -> Option<&FileSlice> {
+        let as_of = timestamp.to_string();
+        return if let Some((_, file_slice)) = self.file_slices.range(..=as_of).next_back() {
+            Some(file_slice)
+        } else {
+            None
+        };
     }
 
-    pub fn get_latest_file_slice_mut(&mut self) -> Option<&mut FileSlice> {
-        return self.file_slices.values_mut().next_back();
+    pub fn get_file_slice_mut_as_of(&mut self, timestamp: &str) -> Option<&mut FileSlice> {
+        let as_of = timestamp.to_string();
+        return if let Some((_, file_slice)) = self.file_slices.range_mut(..=as_of).next_back() {
+            Some(file_slice)
+        } else {
+            None
+        };
     }
 }
 
@@ -203,8 +213,11 @@ mod tests {
 let commit_times: Vec<> = fg.file_slices.keys().map(|k| 
k.as_str()).collect();
 assert_eq!(commit_times, vec!["20240402123035233", 
"20240402144910683"]);
 assert_eq!(
-fg.get_latest_file_slice().unwrap().base_file.commit_time,
-"20240402144910683"
+fg.get_file_slice_as_of("20240402123035233")
+.unwrap()
+.base_file
+.commit_time,
+"20240402123035233"
 )
 }
 
diff --git a/crates/core/src/storage/mod.rs b/crates/core/src/storage/mod.rs
index 76e085f..0f09c05 100644
--- a/crates/core/src/storage/mod.rs
+++ b/crates/core/src/storage/mod.rs
@@ -72,14 +72,14 @@ impl Storage {
 }
 }
 
-pub async fn get_parquet_file_metadata(, relative_path: ) -> 
ParquetMetaData {
-let obj_url = join_url_segments(_url, 
&[relative_path]).unwrap();
-let obj_path = ObjPath::from_url_path(obj_url.path()).unwrap();
+pub async fn get_parquet_file_metadata(, relative_path: ) -> 
Result {
+let obj_url = join_url_segments(_url, &[relative_path])?;
+let obj_path = ObjPath::from_url_path(obj_url.path())?;
 let obj_store = self.object_store.clone();
-let meta = obj_store.head(_path).await.unwrap();
+let meta = obj_store.head(_path).await?;
 let reader = ParquetObjectReader::new(obj_store, meta);
-let builder = 
ParquetRecordBatchStreamBuilder::new(reader).await.unwrap();
-builder.metadata().as_ref().to_owned()
+let builder = ParquetRecordBatchStreamBuilder::new(reader).await?;
+Ok(builder.metadata().as_ref().to_owned())
 }
 
 pub async fn get_file_data(, relative_path: ) -> Bytes {
diff --git a/crates/core/src/table/fs_view.rs b/crates/core/src/table/fs_view.rs
index 8f278dd..b7cd77c 100644
--- a/crates/core/src/table/fs_view.rs
+++ b/crates/core/src/table/fs_view.rs
@@ -44,7 +44,7 @@ impl FileSystemView {
 props: Arc>,
 ) -> Result {
 let storage = Storage::new(base_url, storage_options)?;
-let partition_paths = Self::get_partition_paths().await?;
+let partition_paths = Self::load_partition_paths().await?;
 let partition_to_file_groups =
 

Re: [PR] feat: add APIs for time-travel read [hudi-rs]

2024-07-02 Thread via GitHub


xushiyan merged PR #33:
URL: https://github.com/apache/hudi-rs/pull/33


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat: add APIs for time-travel read [hudi-rs]

2024-07-02 Thread via GitHub


codecov[bot] commented on PR #33:
URL: https://github.com/apache/hudi-rs/pull/33#issuecomment-2204272848

   ## 
[Codecov](https://app.codecov.io/gh/apache/hudi-rs/pull/33?dropdown=coverage=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache)
 Report
   Attention: Patch coverage is `83.58209%` with `11 lines` in your changes 
missing coverage. Please review.
   > Project coverage is 89.54%. Comparing base 
[(`d5f2231`)](https://app.codecov.io/gh/apache/hudi-rs/commit/d5f2231d838c854f6992902855df9732cfd878cf?dropdown=coverage=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache)
 to head 
[(`9d73320`)](https://app.codecov.io/gh/apache/hudi-rs/commit/9d733201100c6b3e7b7c18c77abb9a63edd5120f?dropdown=coverage=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache).
   
   | 
[Files](https://app.codecov.io/gh/apache/hudi-rs/pull/33?dropdown=coverage=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache)
 | Patch % | Lines |
   |---|---|---|
   | 
[crates/core/src/table/mod.rs](https://app.codecov.io/gh/apache/hudi-rs/pull/33?src=pr=tree=crates%2Fcore%2Fsrc%2Ftable%2Fmod.rs_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache#diff-Y3JhdGVzL2NvcmUvc3JjL3RhYmxlL21vZC5ycw==)
 | 81.81% | [4 Missing :warning: 
](https://app.codecov.io/gh/apache/hudi-rs/pull/33?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache)
 |
   | 
[crates/core/src/table/timeline.rs](https://app.codecov.io/gh/apache/hudi-rs/pull/33?src=pr=tree=crates%2Fcore%2Fsrc%2Ftable%2Ftimeline.rs_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache#diff-Y3JhdGVzL2NvcmUvc3JjL3RhYmxlL3RpbWVsaW5lLnJz)
 | 76.47% | [4 Missing :warning: 
](https://app.codecov.io/gh/apache/hudi-rs/pull/33?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache)
 |
   | 
[crates/core/src/file\_group/mod.rs](https://app.codecov.io/gh/apache/hudi-rs/pull/33?src=pr=tree=crates%2Fcore%2Fsrc%2Ffile_group%2Fmod.rs_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache#diff-Y3JhdGVzL2NvcmUvc3JjL2ZpbGVfZ3JvdXAvbW9kLnJz)
 | 77.77% | [2 Missing :warning: 
](https://app.codecov.io/gh/apache/hudi-rs/pull/33?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache)
 |
   | 
[crates/datafusion/src/lib.rs](https://app.codecov.io/gh/apache/hudi-rs/pull/33?src=pr=tree=crates%2Fdatafusion%2Fsrc%2Flib.rs_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache#diff-Y3JhdGVzL2RhdGFmdXNpb24vc3JjL2xpYi5ycw==)
 | 66.66% | [1 Missing :warning: 
](https://app.codecov.io/gh/apache/hudi-rs/pull/33?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache)
 |
   
   Additional details and impacted files
   
   
   ```diff
   @@Coverage Diff @@
   ## main  #33  +/-   ##
   ==
   - Coverage   90.92%   89.54%   -1.38% 
   ==
 Files  10   10  
 Lines 463  488  +25 
   ==
   + Hits  421  437  +16 
   - Misses 42   51   +9 
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/hudi-rs/pull/33?dropdown=coverage=pr=continue_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] feat: add APIs for time-travel read [hudi-rs]

2024-07-02 Thread via GitHub


xushiyan opened a new pull request, #33:
URL: https://github.com/apache/hudi-rs/pull/33

   Add new table APIs `read_snapshot()` and `read_snapshot_as_of()` for 
snapshot and time travel reads.
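
   As a quick illustration of the intended usage, a minimal sketch from the Python
   binding side (the `HudiTable` class name and the Arrow-batch return type are
   assumptions based on the python test files touched in this PR; the exact API
   surface may differ):

   ```python
   # Hypothetical usage sketch; only the method names read_snapshot() and
   # read_snapshot_as_of() come from this PR's description.
   from hudi import HudiTable

   table = HudiTable("file:///tmp/trips_table")

   # Latest snapshot: record batches for the newest file slice of each file group.
   latest_batches = table.read_snapshot()

   # Time travel: read the table as of an earlier commit timestamp, so file slices
   # completed after this instant are skipped.
   as_of_batches = table.read_snapshot_as_of("20240402123035233")
   ```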


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1663046076


##
rfc/rfc-78/rfc-78.md:
##
@@ -0,0 +1,220 @@
+
+# RFC-76: [Bridge release for 1.x]
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ - @yihua
+ - @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7882
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+[Hudi 
1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md)
 is a powerful 
+re-imagination of the transactional database layer in Hudi to power continued 
innovation across the community in the coming 
+years. It introduces lot of differentiating features for Apache Hudi. We 
released beta releases which was meant for 
+enthusiastic developers/users to give a try of advanced features. But as we 
are working towards 1.0 GA, we are proposing 
+a bridge release (0.16.0) for smoother migration for existing hudi users. 
+
+## Objectives 
+Goal is to have a smooth migration experience for the users from 0.x to 1.0. 
We plan to have a 0.16.0 bridge release asking everyone to first migrate to 
0.16.0 before they can upgrade to 1.x.
+
+- 1.x reader should be able to read 0.16.x tables w/o any loss in 
functionality and no data inconsistencies.
+- 0.16.x should have read capability for 1.x tables w/ some limitations. For 
features ported over from 0.x, no loss in functionality should be guaranteed. 
But for new features that was introduced in 1.x, we may not be able to support 
all of them. Will be calling out which new features may not work with 0.16.x 
reader. In this case, we explicitly request users to not turn on these features 
till readers are completely in 1.x.
+- Document upgrade steps from 0.16.x to 1.x with limited user perceived 
latency. This will be auto upgrade, but document clearly what needs to be done.
+- Downgrade from 1.x to 0.16.x documented with call outs on any functionality.
+
+### Considerations when choosing Migration strategy
+- While migration is happening, we want to allow readers to continue reading 
data. This means, we cannot employ a stop-the-world strategy when we are 
migrating. 
+All the actions that we are performing as part of table upgrade should not 
have any side-effects of breaking snapshot isolation for readers.
+- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do 
not want to add read support for very old versions of hudi in 1.x(for eg 
0.7.0). 
+- So, in an effort to bring everyone to latest hudi versions, 1.x reader will 
have full read capabilities for 0.16.x, but for older hudi versions, 1.x reader 
may not have full reader support. 
+The recommended guideline is to upgrade all readers and writers to 0.16.x and 
then slowly start upgrading to 1.x (readers followed by writers). 
+
+Before we dive in further, let's understand the format changes:
+
+## Format changes
+### Table properties
+- Payload class ➝ payload type.
+- New metadata partitions could be added (optionally enabled)
+
+### MDT changes
+- New MDT partitions are available in 1.x. MDT schema upgraded.
+- RLI schema is upgraded to hold row position
+
+### Timeline:
+- [storage changes] Completed write commits have completed times in the file 
name.
+- [storage changes] Completed and inflight write commits are in avro format 
which were json in 0.x.
+- We are switching the action type for clustering from “replace commit” to 
“cluster”.
+- Similarly, for completed compaction, we are switching from “commit” to 
“compaction” in an effort to standardize actions for a given write operation.
+- [storage changes] Timeline ➝ LST timeline. There is no archived timeline in 
1.x
+- [In-memory changes] HoodieInstant changes due to presence of completion time 
for completed HoodieInstants.
+
+### Filegroup/FileSlice changes:
+- Log files contain delta commit time instead of base instant time.
+- Log appends are disabled in 1.x. In other words, each log block is already 
appended to a new log file.
+- File Slice determination logic for log files changed (in 0.x, we have base 
instant time in log files and its straight forward. In 1.x, we find completion 
time for a log file and find the base instant time (parsed from base files) 
which has the highest value lesser than the completion time of the log file).
+- Log file ordering within a file slice. (in 0.x, we use base instant time ➝l 
log file versions ➝ write token) to order diff log files. in 1.x, we will be 
using completion time to order).
+
+### Log format changes:
+- We have added new header types in 1.x. (IS_PARTIAL)
+
+## Changes to be ported over 0.16.x to support reading 1.x tables
+### What will be supported
+- For features introduced in 0.x, and tables written in 1.x, 0.16.0 reader 
should be able to provide consistent reads w/o any breakage.
+### What will not be supported
+- A 0.16 writer cannot write to a table that has been upgraded-to/created 

Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1663043878


##
rfc/rfc-78/rfc-78.md:
##
@@ -0,0 +1,220 @@
+
+# RFC-76: [Bridge release for 1.x]
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ - @yihua
+ - @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7882
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+[Hudi 
1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md)
 is a powerful 
+re-imagination of the transactional database layer in Hudi to power continued 
innovation across the community in the coming 
+years. It introduces lot of differentiating features for Apache Hudi. We 
released beta releases which was meant for 
+enthusiastic developers/users to give a try of advanced features. But as we 
are working towards 1.0 GA, we are proposing 
+a bridge release (0.16.0) for smoother migration for existing hudi users. 
+
+## Objectives 
+Goal is to have a smooth migration experience for the users from 0.x to 1.0. 
We plan to have a 0.16.0 bridge release asking everyone to first migrate to 
0.16.0 before they can upgrade to 1.x.
+
+- 1.x reader should be able to read 0.16.x tables w/o any loss in 
functionality and no data inconsistencies.
+- 0.16.x should have read capability for 1.x tables w/ some limitations. For 
features ported over from 0.x, no loss in functionality should be guaranteed. 
But for new features that was introduced in 1.x, we may not be able to support 
all of them. Will be calling out which new features may not work with 0.16.x 
reader. In this case, we explicitly request users to not turn on these features 
till readers are completely in 1.x.
+- Document upgrade steps from 0.16.x to 1.x with limited user perceived 
latency. This will be auto upgrade, but document clearly what needs to be done.
+- Downgrade from 1.x to 0.16.x documented with call outs on any functionality.
+
+### Considerations when choosing Migration strategy
+- While migration is happening, we want to allow readers to continue reading 
data. This means, we cannot employ a stop-the-world strategy when we are 
migrating. 
+All the actions that we are performing as part of table upgrade should not 
have any side-effects of breaking snapshot isolation for readers.
+- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do 
not want to add read support for very old versions of hudi in 1.x(for eg 
0.7.0). 
+- So, in an effort to bring everyone to latest hudi versions, 1.x reader will 
have full read capabilities for 0.16.x, but for older hudi versions, 1.x reader 
may not have full reader support. 
+The recommended guideline is to upgrade all readers and writers to 0.16.x and 
then slowly start upgrading to 1.x (readers followed by writers). 
+
+Before we dive in further, let's understand the format changes:
+
+## Format changes
+### Table properties
+- Payload class ➝ payload type.
+- New metadata partitions could be added (optionally enabled)
+
+### MDT changes
+- New MDT partitions are available in 1.x. MDT schema upgraded.
+- RLI schema is upgraded to hold row position
+
+### Timeline:
+- [storage changes] Completed write commits have completed times in the file 
name.
+- [storage changes] Completed and inflight write commits are in avro format 
which were json in 0.x.
+- We are switching the action type for clustering from “replace commit” to 
“cluster”.
+- Similarly, for completed compaction, we are switching from “commit” to 
“compaction” in an effort to standardize actions for a given write operation.
+- [storage changes] Timeline ➝ LST timeline. There is no archived timeline in 
1.x
+- [In-memory changes] HoodieInstant changes due to presence of completion time 
for completed HoodieInstants.

Review Comment:
   Good point. We might need to solve this elegantly.
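
   To make the completion-time point concrete, a small illustrative sketch (not
   Hudi code; the `_`-separated completed-instant file name layout is only an
   assumption for the example) of carrying a completion time on completed
   instants and ordering by it:

   ```python
   from dataclasses import dataclass
   from typing import Optional

   @dataclass(frozen=True)
   class Instant:
       """Illustrative stand-in for a completed instant on the timeline."""
       action: str
       requested_time: str
       completion_time: Optional[str] = None  # present only for 1.x-style instants

   def parse_completed_instant(file_name: str) -> Instant:
       # Assumed layouts, for illustration only:
       #   0.x style: 20240402122000000.commit
       #   1.x style: 20240402123035233_20240402123036000.commit
       stem, action = file_name.rsplit(".", 1)
       if "_" in stem:
           requested, completed = stem.split("_", 1)
           return Instant(action, requested, completed)
       return Instant(action, stem, None)

   def completion_order_key(instant: Instant) -> str:
       # Order by completion time when available, falling back to the requested
       # time for instants written by older (0.x) writers.
       return instant.completion_time or instant.requested_time

   instants = [
       parse_completed_instant("20240402123035233_20240402123036000.deltacommit"),
       parse_completed_instant("20240402122000000.commit"),
   ]
   print(sorted(instants, key=completion_order_key))
   ```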



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1663038794


##
rfc/rfc-78/rfc-78.md:
##
@@ -0,0 +1,220 @@
+
+# RFC-76: [Bridge release for 1.x]
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ - @yihua
+ - @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7882
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+[Hudi 
1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md)
 is a powerful 
+re-imagination of the transactional database layer in Hudi to power continued 
innovation across the community in the coming 
+years. It introduces lot of differentiating features for Apache Hudi. We 
released beta releases which was meant for 
+enthusiastic developers/users to give a try of advanced features. But as we 
are working towards 1.0 GA, we are proposing 
+a bridge release (0.16.0) for smoother migration for existing hudi users. 
+
+## Objectives 
+Goal is to have a smooth migration experience for the users from 0.x to 1.0. 
We plan to have a 0.16.0 bridge release asking everyone to first migrate to 
0.16.0 before they can upgrade to 1.x.
+
+- 1.x reader should be able to read 0.16.x tables w/o any loss in 
functionality and no data inconsistencies.
+- 0.16.x should have read capability for 1.x tables w/ some limitations. For 
features ported over from 0.x, no loss in functionality should be guaranteed. 
But for new features that was introduced in 1.x, we may not be able to support 
all of them. Will be calling out which new features may not work with 0.16.x 
reader. In this case, we explicitly request users to not turn on these features 
till readers are completely in 1.x.
+- Document upgrade steps from 0.16.x to 1.x with limited user perceived 
latency. This will be auto upgrade, but document clearly what needs to be done.
+- Downgrade from 1.x to 0.16.x documented with call outs on any functionality.
+
+### Considerations when choosing Migration strategy
+- While migration is happening, we want to allow readers to continue reading 
data. This means, we cannot employ a stop-the-world strategy when we are 
migrating. 
+All the actions that we are performing as part of table upgrade should not 
have any side-effects of breaking snapshot isolation for readers.
+- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do 
not want to add read support for very old versions of hudi in 1.x(for eg 
0.7.0). 
+- So, in an effort to bring everyone to latest hudi versions, 1.x reader will 
have full read capabilities for 0.16.x, but for older hudi versions, 1.x reader 
may not have full reader support. 
+The recommended guideline is to upgrade all readers and writers to 0.16.x and 
then slowly start upgrading to 1.x (readers followed by writers). 
+
+Before we dive in further, let's understand the format changes:
+
+## Format changes
+### Table properties
+- Payload class ➝ payload type.
+- New metadata partitions could be added (optionally enabled)
+
+### MDT changes
+- New MDT partitions are available in 1.x. MDT schema upgraded.
+- RLI schema is upgraded to hold row position
+
+### Timeline:
+- [storage changes] Completed write commits have completed times in the file 
name.
+- [storage changes] Completed and inflight write commits are in avro format 
which were json in 0.x.
+- We are switching the action type for clustering from “replace commit” to 
“cluster”.
+- Similarly, for completed compaction, we are switching from “commit” to 
“compaction” in an effort to standardize actions for a given write operation.
+- [storage changes] Timeline ➝ LST timeline. There is no archived timeline in 
1.x
+- [In-memory changes] HoodieInstant changes due to presence of completion time 
for completed HoodieInstants.
+
+### Filegroup/FileSlice changes:
+- Log files contain delta commit time instead of base instant time.
+- Log appends are disabled in 1.x. In other words, each log block is already 
appended to a new log file.
+- File Slice determination logic for log files changed (in 0.x, we have base 
instant time in log files and its straight forward. In 1.x, we find completion 
time for a log file and find the base instant time (parsed from base files) 
which has the highest value lesser than the completion time of the log file).
+- Log file ordering within a file slice. (in 0.x, we use base instant time ➝l 
log file versions ➝ write token) to order diff log files. in 1.x, we will be 
using completion time to order).
+
+### Log format changes:
+- We have added new header types in 1.x. (IS_PARTIAL)
+
+## Changes to be ported over 0.16.x to support reading 1.x tables
+### What will be supported
+- For features introduced in 0.x, and tables written in 1.x, 0.16.0 reader 
should be able to provide consistent reads w/o any breakage.
+### What will not be supported
+- A 0.16 writer cannot write to a table that has been upgraded-to/created 

Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1663037956


##
rfc/rfc-78/rfc-78.md:
##
@@ -0,0 +1,220 @@
+
+# RFC-76: [Bridge release for 1.x]
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ - @yihua
+ - @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7882
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+[Hudi 
1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md)
 is a powerful 
+re-imagination of the transactional database layer in Hudi to power continued 
innovation across the community in the coming 
+years. It introduces lot of differentiating features for Apache Hudi. We 
released beta releases which was meant for 
+enthusiastic developers/users to give a try of advanced features. But as we 
are working towards 1.0 GA, we are proposing 
+a bridge release (0.16.0) for smoother migration for existing hudi users. 
+
+## Objectives 
+Goal is to have a smooth migration experience for the users from 0.x to 1.0. 
We plan to have a 0.16.0 bridge release asking everyone to first migrate to 
0.16.0 before they can upgrade to 1.x.
+
+- 1.x reader should be able to read 0.16.x tables w/o any loss in 
functionality and no data inconsistencies.
+- 0.16.x should have read capability for 1.x tables w/ some limitations. For 
features ported over from 0.x, no loss in functionality should be guaranteed. 
But for new features that was introduced in 1.x, we may not be able to support 
all of them. Will be calling out which new features may not work with 0.16.x 
reader. In this case, we explicitly request users to not turn on these features 
till readers are completely in 1.x.
+- Document upgrade steps from 0.16.x to 1.x with limited user perceived 
latency. This will be auto upgrade, but document clearly what needs to be done.
+- Downgrade from 1.x to 0.16.x documented with call outs on any functionality.
+
+### Considerations when choosing Migration strategy
+- While migration is happening, we want to allow readers to continue reading 
data. This means, we cannot employ a stop-the-world strategy when we are 
migrating. 
+All the actions that we are performing as part of table upgrade should not 
have any side-effects of breaking snapshot isolation for readers.
+- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do 
not want to add read support for very old versions of hudi in 1.x(for eg 
0.7.0). 
+- So, in an effort to bring everyone to latest hudi versions, 1.x reader will 
have full read capabilities for 0.16.x, but for older hudi versions, 1.x reader 
may not have full reader support. 
+The recommended guideline is to upgrade all readers and writers to 0.16.x and 
then slowly start upgrading to 1.x (readers followed by writers). 
+
+Before we dive in further, let's understand the format changes:
+
+## Format changes
+### Table properties
+- Payload class ➝ payload type.
+- New metadata partitions could be added (optionally enabled)
+
+### MDT changes
+- New MDT partitions are available in 1.x. MDT schema upgraded.
+- RLI schema is upgraded to hold row position
+
+### Timeline:
+- [storage changes] Completed write commits have completed times in the file 
name.
+- [storage changes] Completed and inflight write commits are in avro format 
which were json in 0.x.
+- We are switching the action type for clustering from “replace commit” to 
“cluster”.
+- Similarly, for completed compaction, we are switching from “commit” to 
“compaction” in an effort to standardize actions for a given write operation.
+- [storage changes] Timeline ➝ LST timeline. There is no archived timeline in 
1.x
+- [In-memory changes] HoodieInstant changes due to presence of completion time 
for completed HoodieInstants.
+
+### Filegroup/FileSlice changes:
+- Log files contain delta commit time instead of base instant time.
+- Log appends are disabled in 1.x. In other words, each log block is already 
appended to a new log file.
+- File Slice determination logic for log files changed (in 0.x, we have base 
instant time in log files and its straight forward. In 1.x, we find completion 
time for a log file and find the base instant time (parsed from base files) 
which has the highest value lesser than the completion time of the log file).
+- Log file ordering within a file slice. (in 0.x, we use base instant time ➝l 
log file versions ➝ write token) to order diff log files. in 1.x, we will be 
using completion time to order).

Review Comment:
   Or anyway, we are adding the filterUncommittedLogs capability in the FSV so 
that the 0.16.x reader can read 1.x tables. So we should be good, but let's add 
tests for these.
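
   As a rough illustration of what a filterUncommittedLogs-style check could look
   like in the file system view (purely a sketch, not the actual Hudi
   implementation; the class and field names are made up):

   ```python
   from dataclasses import dataclass
   from typing import Iterable, List, Set

   @dataclass(frozen=True)
   class LogFile:
       """Illustrative stand-in; in 1.x the log file name carries the delta commit time."""
       file_id: str
       delta_commit_time: str

   def filter_uncommitted_logs(log_files: Iterable[LogFile],
                               completed_delta_commits: Set[str]) -> List[LogFile]:
       # Keep only log files whose delta commit made it onto the completed timeline;
       # log files from in-flight or rolled-back writes are ignored by the view.
       return [lf for lf in log_files if lf.delta_commit_time in completed_delta_commits]

   logs = [LogFile("f1-0", "20240702100000000"), LogFile("f1-0", "20240702110000000")]
   completed = {"20240702100000000"}
   assert filter_uncommitted_logs(logs, completed) == [LogFile("f1-0", "20240702100000000")]
   ```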



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, 

Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1663036653


##
rfc/rfc-78/rfc-78.md:
##
@@ -0,0 +1,220 @@
+
+# RFC-76: [Bridge release for 1.x]
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ - @yihua
+ - @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7882
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+[Hudi 
1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md)
 is a powerful 
+re-imagination of the transactional database layer in Hudi to power continued 
innovation across the community in the coming 
+years. It introduces lot of differentiating features for Apache Hudi. We 
released beta releases which was meant for 
+enthusiastic developers/users to give a try of advanced features. But as we 
are working towards 1.0 GA, we are proposing 
+a bridge release (0.16.0) for smoother migration for existing hudi users. 
+
+## Objectives 
+Goal is to have a smooth migration experience for the users from 0.x to 1.0. 
We plan to have a 0.16.0 bridge release asking everyone to first migrate to 
0.16.0 before they can upgrade to 1.x.
+
+- 1.x reader should be able to read 0.16.x tables w/o any loss in 
functionality and no data inconsistencies.
+- 0.16.x should have read capability for 1.x tables w/ some limitations. For 
features ported over from 0.x, no loss in functionality should be guaranteed. 
But for new features that was introduced in 1.x, we may not be able to support 
all of them. Will be calling out which new features may not work with 0.16.x 
reader. In this case, we explicitly request users to not turn on these features 
till readers are completely in 1.x.
+- Document upgrade steps from 0.16.x to 1.x with limited user perceived 
latency. This will be auto upgrade, but document clearly what needs to be done.
+- Downgrade from 1.x to 0.16.x documented with call outs on any functionality.
+
+### Considerations when choosing Migration strategy
+- While migration is happening, we want to allow readers to continue reading 
data. This means, we cannot employ a stop-the-world strategy when we are 
migrating. 
+All the actions that we are performing as part of table upgrade should not 
have any side-effects of breaking snapshot isolation for readers.
+- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do 
not want to add read support for very old versions of hudi in 1.x(for eg 
0.7.0). 
+- So, in an effort to bring everyone to latest hudi versions, 1.x reader will 
have full read capabilities for 0.16.x, but for older hudi versions, 1.x reader 
may not have full reader support. 
+The recommended guideline is to upgrade all readers and writers to 0.16.x and 
then slowly start upgrading to 1.x (readers followed by writers). 
+
+Before we dive in further, let's understand the format changes:
+
+## Format changes
+### Table properties
+- Payload class ➝ payload type.
+- New metadata partitions could be added (optionally enabled)
+
+### MDT changes
+- New MDT partitions are available in 1.x. MDT schema upgraded.
+- RLI schema is upgraded to hold row position
+
+### Timeline:
+- [storage changes] Completed write commits have completed times in the file 
name.
+- [storage changes] Completed and inflight write commits are in avro format 
which were json in 0.x.
+- We are switching the action type for clustering from “replace commit” to 
“cluster”.
+- Similarly, for completed compaction, we are switching from “commit” to 
“compaction” in an effort to standardize actions for a given write operation.
+- [storage changes] Timeline ➝ LST timeline. There is no archived timeline in 
1.x
+- [In-memory changes] HoodieInstant changes due to presence of completion time 
for completed HoodieInstants.
+
+### Filegroup/FileSlice changes:
+- Log files contain delta commit time instead of base instant time.
+- Log appends are disabled in 1.x. In other words, each log block is already 
appended to a new log file.
+- File Slice determination logic for log files changed (in 0.x, we have base 
instant time in log files and its straight forward. In 1.x, we find completion 
time for a log file and find the base instant time (parsed from base files) 
which has the highest value lesser than the completion time of the log file).
+- Log file ordering within a file slice. (in 0.x, we use base instant time ➝l 
log file versions ➝ write token) to order diff log files. in 1.x, we will be 
using completion time to order).

Review Comment:
   During the 1.x upgrade, we need to ensure we do not use the new way of rolling 
back log files for a failed write, because there could be a concurrent 0.16.x 
reader reading the table. 
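
   For intuition, an illustrative sketch (not Hudi's actual classes) of the two
   behaviors in question here: assigning a 1.x log file to a file slice by
   completion time, and the 0.x vs 1.x ordering of log files within a slice:

   ```python
   import bisect
   from dataclasses import dataclass
   from typing import List, Optional

   @dataclass(frozen=True)
   class LogFile:
       """Illustrative stand-in for a 1.x log file plus timeline-derived metadata."""
       delta_commit_time: str   # requested time embedded in the log file name (1.x)
       completion_time: str     # looked up from the timeline
       version: int
       write_token: str

   def assign_to_file_slice(log_completion_time: str, base_instants: List[str]) -> Optional[str]:
       # 1.x rule quoted above: pick the greatest base instant time that is
       # smaller than the log file's completion time.
       ordered = sorted(base_instants)
       idx = bisect.bisect_left(ordered, log_completion_time)
       return ordered[idx - 1] if idx > 0 else None

   def log_order_key_0x(lf: LogFile):
       # 0.x ordering within a slice: base instant -> log version -> write token
       # (base instant omitted here because all logs of one slice share it).
       return (lf.version, lf.write_token)

   def log_order_key_1x(lf: LogFile):
       # 1.x ordering within a slice: completion time.
       return lf.completion_time

   assert assign_to_file_slice("20240702120500000",
                               ["20240702110000000", "20240702120000000"]) == "20240702120000000"
   ```

   A reader that also knows the table upgrade instant could switch between the two
   order keys per file slice, so that file slices written before the upgrade keep
   their 0.x semantics.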
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: 

Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1663035029


##
rfc/rfc-78/rfc-78.md:
##
@@ -0,0 +1,220 @@
+
+# RFC-76: [Bridge release for 1.x]
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ - @yihua
+ - @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7882
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+[Hudi 
1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md)
 is a powerful 
+re-imagination of the transactional database layer in Hudi to power continued 
innovation across the community in the coming 
+years. It introduces lot of differentiating features for Apache Hudi. We 
released beta releases which was meant for 
+enthusiastic developers/users to give a try of advanced features. But as we 
are working towards 1.0 GA, we are proposing 
+a bridge release (0.16.0) for smoother migration for existing hudi users. 
+
+## Objectives 
+Goal is to have a smooth migration experience for the users from 0.x to 1.0. 
We plan to have a 0.16.0 bridge release asking everyone to first migrate to 
0.16.0 before they can upgrade to 1.x.
+
+- 1.x reader should be able to read 0.16.x tables w/o any loss in 
functionality and no data inconsistencies.
+- 0.16.x should have read capability for 1.x tables w/ some limitations. For 
features ported over from 0.x, no loss in functionality should be guaranteed. 
But for new features that was introduced in 1.x, we may not be able to support 
all of them. Will be calling out which new features may not work with 0.16.x 
reader. In this case, we explicitly request users to not turn on these features 
till readers are completely in 1.x.
+- Document upgrade steps from 0.16.x to 1.x with limited user perceived 
latency. This will be auto upgrade, but document clearly what needs to be done.
+- Downgrade from 1.x to 0.16.x documented with call outs on any functionality.
+
+### Considerations when choosing Migration strategy
+- While migration is happening, we want to allow readers to continue reading 
data. This means, we cannot employ a stop-the-world strategy when we are 
migrating. 
+All the actions that we are performing as part of table upgrade should not 
have any side-effects of breaking snapshot isolation for readers.
+- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do 
not want to add read support for very old versions of hudi in 1.x(for eg 
0.7.0). 
+- So, in an effort to bring everyone to latest hudi versions, 1.x reader will 
have full read capabilities for 0.16.x, but for older hudi versions, 1.x reader 
may not have full reader support. 
+The recommended guideline is to upgrade all readers and writers to 0.16.x and 
then slowly start upgrading to 1.x (readers followed by writers). 
+
+Before we dive in further, let's understand the format changes:
+
+## Format changes
+### Table properties
+- Payload class ➝ payload type.
+- New metadata partitions could be added (optionally enabled)
+
+### MDT changes
+- New MDT partitions are available in 1.x. MDT schema upgraded.
+- RLI schema is upgraded to hold row position
+
+### Timeline:
+- [storage changes] Completed write commits have completed times in the file 
name.
+- [storage changes] Completed and inflight write commits are in avro format 
which were json in 0.x.
+- We are switching the action type for clustering from “replace commit” to 
“cluster”.
+- Similarly, for completed compaction, we are switching from “commit” to 
“compaction” in an effort to standardize actions for a given write operation.
+- [storage changes] Timeline ➝ LST timeline. There is no archived timeline in 
1.x
+- [In-memory changes] HoodieInstant changes due to presence of completion time 
for completed HoodieInstants.
+
+### Filegroup/FileSlice changes:
+- Log files contain delta commit time instead of base instant time.
+- Log appends are disabled in 1.x. In other words, each log block is already 
appended to a new log file.

Review Comment:
   sure. 
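
   For intuition, a toy contrast of the two write paths (illustrative only; real
   Hudi log file names also carry the file id, write token, and extension details):

   ```python
   def next_log_file_0x(existing_logs, base_instant):
       # 0.x behaviour: keep appending blocks to the latest log file if one exists.
       if existing_logs:
           return existing_logs[-1]          # append to the current log file
       return f"{base_instant}.log.1"        # otherwise start at version 1

   def next_log_file_1x(existing_logs, delta_commit_time):
       # 1.x behaviour: appends are disabled, so every write rolls over to a new
       # log file, named with the delta commit time per the section quoted above.
       version = len(existing_logs) + 1
       return f"{delta_commit_time}.log.{version}"

   print(next_log_file_0x(["20240702100000000.log.1"], "20240702100000000"))
   print(next_log_file_1x(["20240702100000000.log.1"], "20240702110000000"))
   ```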



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1663032167


##
rfc/rfc-78/rfc-78.md:
##
@@ -0,0 +1,220 @@
+
+# RFC-76: [Bridge release for 1.x]
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ - @yihua
+ - @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7882
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+[Hudi 
1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md)
 is a powerful 
+re-imagination of the transactional database layer in Hudi to power continued 
innovation across the community in the coming 
+years. It introduces lot of differentiating features for Apache Hudi. We 
released beta releases which was meant for 
+enthusiastic developers/users to give a try of advanced features. But as we 
are working towards 1.0 GA, we are proposing 
+a bridge release (0.16.0) for smoother migration for existing hudi users. 
+
+## Objectives 
+Goal is to have a smooth migration experience for the users from 0.x to 1.0. 
We plan to have a 0.16.0 bridge release asking everyone to first migrate to 
0.16.0 before they can upgrade to 1.x.
+
+- 1.x reader should be able to read 0.16.x tables w/o any loss in 
functionality and no data inconsistencies.
+- 0.16.x should have read capability for 1.x tables w/ some limitations. For 
features ported over from 0.x, no loss in functionality should be guaranteed. 
But for new features that was introduced in 1.x, we may not be able to support 
all of them. Will be calling out which new features may not work with 0.16.x 
reader. In this case, we explicitly request users to not turn on these features 
till readers are completely in 1.x.
+- Document upgrade steps from 0.16.x to 1.x with limited user perceived 
latency. This will be auto upgrade, but document clearly what needs to be done.
+- Downgrade from 1.x to 0.16.x documented with call outs on any functionality.
+
+### Considerations when choosing Migration strategy
+- While migration is happening, we want to allow readers to continue reading 
data. This means, we cannot employ a stop-the-world strategy when we are 
migrating. 
+All the actions that we are performing as part of table upgrade should not 
have any side-effects of breaking snapshot isolation for readers.
+- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do 
not want to add read support for very old versions of hudi in 1.x(for eg 
0.7.0). 
+- So, in an effort to bring everyone to latest hudi versions, 1.x reader will 
have full read capabilities for 0.16.x, but for older hudi versions, 1.x reader 
may not have full reader support. 
+The recommended guideline is to upgrade all readers and writers to 0.16.x and 
then slowly start upgrading to 1.x (readers followed by writers). 
+
+Before we dive in further, let's understand the format changes:
+
+## Format changes
+### Table properties
+- Payload class ➝ payload type.
+- New metadata partitions could be added (optionally enabled)
+
+### MDT changes
+- New MDT partitions are available in 1.x. MDT schema upgraded.
+- RLI schema is upgraded to hold row position
+
+### Timeline:
+- [storage changes] Completed write commits have completed times in the file 
name.
+- [storage changes] Completed and inflight write commits are in avro format 
which were json in 0.x.
+- We are switching the action type for clustering from “replace commit” to 
“cluster”.
+- Similarly, for completed compaction, we are switching from “commit” to 
“compaction” in an effort to standardize actions for a given write operation.
+- [storage changes] Timeline ➝ LST timeline. There is no archived timeline in 
1.x

Review Comment:
   Are you talking about the 1.x reader or the 0.16.x reader?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1663030969


##
rfc/rfc-78/rfc-78.md:
##
@@ -0,0 +1,220 @@
+
+# RFC-76: [Bridge release for 1.x]
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ - @yihua
+ - @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7882
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+[Hudi 
1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md)
 is a powerful 
+re-imagination of the transactional database layer in Hudi to power continued 
innovation across the community in the coming 
+years. It introduces lot of differentiating features for Apache Hudi. We 
released beta releases which was meant for 
+enthusiastic developers/users to give a try of advanced features. But as we 
are working towards 1.0 GA, we are proposing 
+a bridge release (0.16.0) for smoother migration for existing hudi users. 
+
+## Objectives 
+Goal is to have a smooth migration experience for the users from 0.x to 1.0. 
We plan to have a 0.16.0 bridge release asking everyone to first migrate to 
0.16.0 before they can upgrade to 1.x.
+
+- 1.x reader should be able to read 0.16.x tables w/o any loss in 
functionality and no data inconsistencies.
+- 0.16.x should have read capability for 1.x tables w/ some limitations. For 
features ported over from 0.x, no loss in functionality should be guaranteed. 
But for new features that was introduced in 1.x, we may not be able to support 
all of them. Will be calling out which new features may not work with 0.16.x 
reader. In this case, we explicitly request users to not turn on these features 
till readers are completely in 1.x.
+- Document upgrade steps from 0.16.x to 1.x with limited user perceived 
latency. This will be auto upgrade, but document clearly what needs to be done.
+- Downgrade from 1.x to 0.16.x documented with call outs on any functionality.
+
+### Considerations when choosing Migration strategy
+- While migration is happening, we want to allow readers to continue reading 
data. This means, we cannot employ a stop-the-world strategy when we are 
migrating. 
+All the actions that we are performing as part of table upgrade should not 
have any side-effects of breaking snapshot isolation for readers.
+- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do 
not want to add read support for very old versions of hudi in 1.x(for eg 
0.7.0). 
+- So, in an effort to bring everyone to latest hudi versions, 1.x reader will 
have full read capabilities for 0.16.x, but for older hudi versions, 1.x reader 
may not have full reader support. 
+The recommended guideline is to upgrade all readers and writers to 0.16.x and 
then slowly start upgrading to 1.x (readers followed by writers). 
+
+Before we dive in further, let's understand the format changes:
+
+## Format changes
+### Table properties
+- Payload class ➝ payload type.

Review Comment:
   If it's custom, where are the custom configs picked up from? It's a table prop 
in 0.x, but what about in 1.x? 
   Let's validate the flows.
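
   A toy sketch of the mapping being discussed (the type names here are
   placeholders, not Hudi's actual values): resolve a payload type from the 0.x
   payload class, and keep the class name around when it is a custom
   implementation, since some property still has to carry it in 1.x:

   ```python
   from typing import Optional, Tuple

   # Placeholder mapping; the real set of well-known payload classes and the 1.x
   # payload type names live in Hudi itself.
   KNOWN_PAYLOAD_CLASSES = {
       "org.apache.hudi.common.model.DefaultHoodieRecordPayload": "DEFAULT",
       "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload": "OVERWRITE_LATEST",
   }

   def resolve_payload_type(payload_class: Optional[str]) -> Tuple[str, Optional[str]]:
       """Return (payload_type, custom_class); custom_class is non-None only when the
       class is not a well-known one and therefore must still be stored somewhere."""
       if payload_class is None:
           return "DEFAULT", None
       if payload_class in KNOWN_PAYLOAD_CLASSES:
           return KNOWN_PAYLOAD_CLASSES[payload_class], None
       return "CUSTOM", payload_class

   print(resolve_payload_type("com.example.MyPayload"))  # ('CUSTOM', 'com.example.MyPayload')
   ```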



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1663024868


##
rfc/rfc-78/rfc-78.md:
##
@@ -0,0 +1,220 @@
+
+# RFC-76: [Bridge release for 1.x]
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ - @yihua
+ - @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7882
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+[Hudi 
1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md)
 is a powerful 
+re-imagination of the transactional database layer in Hudi to power continued 
innovation across the community in the coming 
+years. It introduces lot of differentiating features for Apache Hudi. We 
released beta releases which was meant for 
+enthusiastic developers/users to give a try of advanced features. But as we 
are working towards 1.0 GA, we are proposing 
+a bridge release (0.16.0) for smoother migration for existing hudi users. 
+
+## Objectives 
+Goal is to have a smooth migration experience for the users from 0.x to 1.0. 
We plan to have a 0.16.0 bridge release asking everyone to first migrate to 
0.16.0 before they can upgrade to 1.x.
+
+- 1.x reader should be able to read 0.16.x tables w/o any loss in 
functionality and no data inconsistencies.
+- 0.16.x should have read capability for 1.x tables w/ some limitations. For 
features ported over from 0.x, no loss in functionality should be guaranteed. 
But for new features that was introduced in 1.x, we may not be able to support 
all of them. Will be calling out which new features may not work with 0.16.x 
reader. In this case, we explicitly request users to not turn on these features 
till readers are completely in 1.x.
+- Document upgrade steps from 0.16.x to 1.x with limited user perceived 
latency. This will be auto upgrade, but document clearly what needs to be done.
+- Downgrade from 1.x to 0.16.x documented with call outs on any functionality.
+
+### Considerations when choosing Migration strategy
+- While migration is happening, we want to allow readers to continue reading 
data. This means, we cannot employ a stop-the-world strategy when we are 
migrating. 
+All the actions that we are performing as part of table upgrade should not 
have any side-effects of breaking snapshot isolation for readers.
+- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do 
not want to add read support for very old versions of hudi in 1.x(for eg 
0.7.0). 
+- So, in an effort to bring everyone to latest hudi versions, 1.x reader will 
have full read capabilities for 0.16.x, but for older hudi versions, 1.x reader 
may not have full reader support. 
+The recommended guideline is to upgrade all readers and writers to 0.16.x and 
then slowly start upgrading to 1.x (readers followed by writers). 
+
+Before we dive in further, let's understand the format changes:
+
+## Format changes
+### Table properties
+- Payload class ➝ payload type.
+- New metadata partitions could be added (optionally enabled)
+
+### MDT changes
+- New MDT partitions are available in 1.x. MDT schema upgraded.
+- RLI schema is upgraded to hold row position
+
+### Timeline:
+- [storage changes] Completed write commits have completed times in the file 
name.
+- [storage changes] Completed and inflight write commits are in avro format 
which were json in 0.x.
+- We are switching the action type for clustering from “replace commit” to 
“cluster”.
+- Similarly, for completed compaction, we are switching from “commit” to 
“compaction” in an effort to standardize actions for a given write operation.
+- [storage changes] Timeline ➝ LST timeline. There is no archived timeline in 
1.x
+- [In-memory changes] HoodieInstant changes due to presence of completion time 
for completed HoodieInstants.
+
+### Filegroup/FileSlice changes:
+- Log files contain delta commit time instead of base instant time.
+- Log appends are disabled in 1.x. In other words, each log block is already 
appended to a new log file.
+- File Slice determination logic for log files changed (in 0.x, we have base 
instant time in log files and its straight forward. In 1.x, we find completion 
time for a log file and find the base instant time (parsed from base files) 
which has the highest value lesser than the completion time of the log file).
+- Log file ordering within a file slice. (in 0.x, we use base instant time ➝l 
log file versions ➝ write token) to order diff log files. in 1.x, we will be 
using completion time to order).
+
+### Log format changes:
+- We have added new header types in 1.x. (IS_PARTIAL)
+
+## Changes to be ported over 0.16.x to support reading 1.x tables
+### What will be supported
+- For features introduced in 0.x, and tables written in 1.x, 0.16.0 reader 
should be able to provide consistent reads w/o any breakage.
+### What will not be supported
+- A 0.16 writer cannot write to a table that has been upgraded-to/created 

Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1663022140


##
rfc/rfc-78/rfc-78.md:
##
@@ -0,0 +1,220 @@
+
+# RFC-76: [Bridge release for 1.x]
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ - @yihua
+ - @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7882
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+[Hudi 
1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md)
 is a powerful 
+re-imagination of the transactional database layer in Hudi to power continued 
innovation across the community in the coming 
+years. It introduces lot of differentiating features for Apache Hudi. We 
released beta releases which was meant for 
+enthusiastic developers/users to give a try of advanced features. But as we 
are working towards 1.0 GA, we are proposing 
+a bridge release (0.16.0) for smoother migration for existing hudi users. 
+
+## Objectives 
+Goal is to have a smooth migration experience for the users from 0.x to 1.0. 
We plan to have a 0.16.0 bridge release asking everyone to first migrate to 
0.16.0 before they can upgrade to 1.x.
+
+- 1.x reader should be able to read 0.16.x tables w/o any loss in 
functionality and no data inconsistencies.
+- 0.16.x should have read capability for 1.x tables w/ some limitations. For 
features ported over from 0.x, no loss in functionality should be guaranteed. 
But for new features that was introduced in 1.x, we may not be able to support 
all of them. Will be calling out which new features may not work with 0.16.x 
reader. In this case, we explicitly request users to not turn on these features 
till readers are completely in 1.x.
+- Document upgrade steps from 0.16.x to 1.x with limited user perceived 
latency. This will be auto upgrade, but document clearly what needs to be done.
+- Downgrade from 1.x to 0.16.x documented with call outs on any functionality.
+
+### Considerations when choosing Migration strategy
+- While migration is happening, we want to allow readers to continue reading 
data. This means, we cannot employ a stop-the-world strategy when we are 
migrating. 
+All the actions that we are performing as part of table upgrade should not 
have any side-effects of breaking snapshot isolation for readers.
+- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do 
not want to add read support for very old versions of hudi in 1.x(for eg 
0.7.0). 
+- So, in an effort to bring everyone to latest hudi versions, 1.x reader will 
have full read capabilities for 0.16.x, but for older hudi versions, 1.x reader 
may not have full reader support. 
+The recommended guideline is to upgrade all readers and writers to 0.16.x and 
then slowly start upgrading to 1.x (readers followed by writers). 
+
+Before we dive in further, let's understand the format changes:
+
+## Format changes
+### Table properties
+- Payload class ➝ payload type.
+- New metadata partitions could be added (optionally enabled)
+
+### MDT changes
+- New MDT partitions are available in 1.x. MDT schema upgraded.
+- RLI schema is upgraded to hold row position
+
+### Timeline:
+- [storage changes] Completed write commits have completed times in the file 
name.
+- [storage changes] Completed and inflight write commits are in avro format 
which were json in 0.x.
+- We are switching the action type for clustering from “replace commit” to 
“cluster”.
+- Similarly, for completed compaction, we are switching from “commit” to 
“compaction” in an effort to standardize actions for a given write operation.
+- [storage changes] Timeline ➝ LSM timeline. There is no archived timeline in 
1.x.
+- [In-memory changes] HoodieInstant changes due to presence of completion time 
for completed HoodieInstants.
+
+### Filegroup/FileSlice changes:
+- Log files contain delta commit time instead of base instant time.
+- Log appends are disabled in 1.x. In other words, each log block is written 
to a new log file.
+- File slice determination logic for log files changed (in 0.x, we have the base 
instant time in log file names and it is straightforward. In 1.x, we find the completion 
time for a log file and pick the base instant time (parsed from base files) 
which has the highest value less than the completion time of the log file).
+- Log file ordering within a file slice changed (in 0.x, we use base instant time ➝ 
log file version ➝ write token to order different log files; in 1.x, we will be 
using completion time to order).
+
+### Log format changes:
+- We have added new header types in 1.x. (IS_PARTIAL)
+
+## Changes to be ported over 0.16.x to support reading 1.x tables
+### What will be supported
+- For features introduced in 0.x, and tables written in 1.x, 0.16.0 reader 
should be able to provide consistent reads w/o any breakage.
+### What will not be supported
+- A 0.16 writer cannot write to a table that has been upgraded-to/created 

Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1663014814


##
rfc/rfc-78/rfc-78.md:
##
@@ -191,15 +198,15 @@ We need to add back these older methods to 
HoodieDefaultTimeline, so that we do
 - e. We need to port code changes which account for uncommitted log files. In 
0.16.0, from the FSV standpoint, all log files (including partially failed ones) are 
valid. We let the log record reader ignore the partially failed log files. But
   in 1.x, log files could be rolled back (deleted) by a concurrent rollback. 
So, the FSV should ensure it ignores the uncommitted log files.
 - f. Looks like we only have to make changes/appends to a few methods in 
HoodieDefaultTimeline. But one option to potentially consider (if we find ourselves 
making a lot of changes to the 0.16.0 HoodieDefaultTimeline in order to support 
reading 1.x tables): we could introduce Hoodie016xDefaultTimeline and 
Hoodie1xDefaultTimeline and use the delegate pattern to delegate to either of the 
timelines. Using the hoodie table version, we could instantiate (internally to 
HoodieDefaultTimeline) either Hoodie016xDefaultTimeline or 
Hoodie1xDefaultTimeline. But for now, we don’t think we need to take this 
route. Just calling it out as an option depending on the changes we end up making.
+- g. Since the log file ordering logic will differ between 0.16.x and 1.x, and we 
have a table upgrade commit time, we could leverage that to use different log file 
ordering logic based on whether a file slice's base instant time is less than or 
greater than the table upgrade commit time. 
 
 ### FileSystemView changes
 Once all timeline changes are incorporated, we need to account for FSV 
changes. The major change, as called out earlier, is the completion-time-based log 
files from the 1.x writer and the log file naming referring to the delta commit time 
instead of the base commit time. So, w/o any changes to the 
FSV/HoodieFileGroup/HoodieFileSlice code snippets, our file slice deduction 
logic might be wrong. Each log file could be tagged as its own file slice since 
each has a different base commit time (that's how 0.16.x HoodieLogFile would 
deduce it). So, we might have to port over the CompletionTimeQueryView class and 
associated logic to 0.16.0. The file slice deduction logic in 0.16.0 will then 
be pretty much similar to the 1.x reader. But for the log file ordering used for 
log reading purposes, we do not need to maintain parity with the 1.x reader as of 
yet (unless we make NBCC default with MDT).
 Assuming the 1.x reader and 1.x FSV should be able to read data written in older 
hudi versions, we also have a potential option here to avoid making nit-picky 
changes, similar to the option called out earlier.
 We could instantiate two different FSVs depending on the table version. If the 
table version is 7 (0.16.0), we could instantiate FSV_V0, and if the table 
version is 8 (1.0.0), we could instantiate FSV_V1. That way we don’t 
break/regress any of the 0.16.0 read functionality in the interest of supporting 
1.x table reads. We should strive to cover all scenarios and not let any bugs 
creep in, but we are trying to see if we can keep the changes isolated so that 
battle-tested code (FSV) is not touched or changed for the purpose of supporting 1.x 
table reads. If we run into any bugs with 1.x reads, we could ask users to not 
upgrade any of the writers to 1.x and stick with 0.16.0 unless we have, say, a 
1.0.1 or something. But it would be really bad if we break 0.16.0 table reads in 
some edge case. Just calling it out as one of the safer options for the upgrade.
 
-
  Pending exploration:
-How partially failed log files are ignored in 1.x. I see all log files are 
accounted for while building FSV.
+1. We removed the special suffixes for MDT operations in 1.x. We need to test the 
flow and flesh out the details of anything to be added to the 0.16.x reader. 

Review Comment:
   We need to be sure the new commit time generation logic is foolproof. What in case 
a concurrent ingestion in the data table coincidentally generates the 
same commit time? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]

2024-07-02 Thread via GitHub


Limess commented on issue #11554:
URL: https://github.com/apache/hudi/issues/11554#issuecomment-2204071448

   We should probably add some management ourselves to limit the partitions. Is 
there any advice or a pre-canned example of limiting to, say, the 450 most recent 
partitions and deleting any older ones using Spark?
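   
   Something along these lines is roughly what we had in mind -- an untested 
sketch assuming the partition column is `dt`, the partition values sort 
lexicographically by date, and a retention of 450 partitions:
   
      // Untested sketch: list partitions, keep the newest 450, drop the rest.
      val retained = 450
      val partitions = spark.sql(s"show partitions $tableName")
        .collect()
        .map(_.getString(0))            // e.g. "dt=2024-07-01"
        .sorted                         // lexicographic == chronological for ISO dates
      val olderPartitions = partitions.dropRight(retained)
      olderPartitions.foreach { p =>
        val value = p.stripPrefix("dt=")
        spark.sql(s"alter table $tableName drop partition (dt='$value')")
      }
   
   (If I understand the docs correctly, `alter table ... drop partition` on a 
Hudi table translates to a delete-partition write; we would still need to 
confirm how/when the corresponding Glue catalog entries get removed.)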


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1663011966


##
rfc/rfc-78/rfc-78.md:
##
@@ -179,6 +183,9 @@ Let’s reiterate what we need to support w/ 0.16.0 reader.
 On a high level, we need to ensure commit metadata in either format (avro or 
json) need to be supported. And “cluster” and completed “compaction”s need to 
be readable in 0.16.0 reader.
 - But the challenging part is, for every commit metadata, we might have to 
deserialize to avro and on exception try json. We could deduce the format using 
completion file name, but as per the current code layering, the deserialization methods 
do not know the file name (the method takes byte[]).
 - Similarly for clustering commits, unless we have some kind of watermark, we 
have to keep considering replace commits as well in the FSV building logic to 
ensure we do not miss any clustering commits.
+- To be decided: We also need to use different LogFileComparators depending on the 
file slice's base instant time. If the file slice's base instant time is < 
the table upgrade commit time, we use the older log file comparator to order log files, 
but if the file slice's base instant time is > the table upgrade commit time, we have to 
use the new log file comparator (completion time). The tricky part is if a file slice 
contains a mix of log files. 
+ This fix definitely needs to go into 1.x, but whether we want to port this 
change to 0.16.x or not is yet to be discussed and decided. Let's zoom in a bit 
to see what will happen if a single file slice contains a mix of log files 
using the 1.x reader (this is a basic requirement to support 0.16.x tables in 1.x). 

Review Comment:
   which means the log file comparator logic from 1.x needs to be ported to the 
0.16.x reader 
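   
   Roughly, the idea would look something like the sketch below (types and 
fields are hypothetical placeholders, not existing Hudi APIs): pick the ordering 
per file slice by comparing its base instant time against the table upgrade 
commit time recorded in table config.
   
      // Hypothetical sketch only -- LogFileMeta is an illustrative stand-in,
      // not a real Hudi class.
      case class LogFileMeta(logVersion: Int, writeToken: String, completionTime: String)

      def logFileOrdering(sliceBaseInstant: String,
                          upgradeCommitTime: String): Ordering[LogFileMeta] =
        if (sliceBaseInstant < upgradeCommitTime)
          // pre-upgrade slice: 0.x behaviour, order by log version then write token
          Ordering.by((f: LogFileMeta) => (f.logVersion, f.writeToken))
        else
          // post-upgrade slice: 1.x behaviour, order by completion time
          Ordering.by((f: LogFileMeta) => f.completionTime)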



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1663010943


##
rfc/rfc-78/rfc-78.md:
##
@@ -179,6 +183,9 @@ Let’s reiterate what we need to support w/ 0.16.0 reader.
 On a high level, we need to ensure commit metadata in either format (avro or 
json) need to be supported. And “cluster” and completed “compaction”s need to 
be readable in 0.16.0 reader.
 - But the challenging part is, for every commit metadata, we might have to 
deserialize to avro and on exception try json. We could deduce the format using 
completion file name, but as per the current code layering, the deserialization methods 
do not know the file name (the method takes byte[]).
 - Similarly for clustering commits, unless we have some kind of watermark, we 
have to keep considering replace commits as well in the FSV building logic to 
ensure we do not miss any clustering commits.
+- To be decided: We also need to use different LogFileComparators depending on the 
file slice's base instant time. If the file slice's base instant time is < 
the table upgrade commit time, we use the older log file comparator to order log files, 
but if the file slice's base instant time is > the table upgrade commit time, we have to 
use the new log file comparator (completion time). The tricky part is if a file slice 
contains a mix of log files. 
+ This fix definitely needs to go into 1.x, but whether we want to port this 
change to 0.16.x or not is yet to be discussed and decided. Let's zoom in a bit 
to see what will happen if a single file slice contains a mix of log files 
using the 1.x reader (this is a basic requirement to support 0.16.x tables in 1.x). 

Review Comment:
   We need to fix the 1.x reader to enforce completion-time-based log file ordering 
for a file slice. After the fix, from our understanding, the same logic should work 
for a file slice completely written in 0.x, because the completion time will match 
for all the log files, and then we should use the log version to determine the 
ordering. 
   We need to have a lot of tests covering all these scenarios. 
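   
   In other words, a single comparator along these lines might cover both cases 
(hypothetical sketch, not actual Hudi code): completion time first, with log 
version and write token as tie-breakers for log files whose completion times are 
equal, e.g. a slice fully written in 0.x.
   
      // Hypothetical sketch -- illustrative type, not a real Hudi class.
      case class LogFileMeta(completionTime: String, logVersion: Int, writeToken: String)

      // Completion time first; log version and write token break ties, which is
      // what effectively orders a file slice written entirely in 0.x.
      val completionTimeOrdering: Ordering[LogFileMeta] =
        Ordering.by((f: LogFileMeta) => (f.completionTime, f.logVersion, f.writeToken))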
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1662991098


##
rfc/rfc-78/rfc-78.md:
##
@@ -145,7 +147,9 @@ This will be an automatic upgrade for users when they start 
using 1.x hudi libra
 - No changes to log reader.
 - Check custom payload class in table properties and switch to payload type.
 - Trigger compaction for latest file slices. We do not want a single file 
slice having a mix of log files from 0.x and log files from 1.x. So, we will 
trigger a full compaction 
-of the table to ensure all latest file slices has just the base files.
+of the table to ensure all latest file slices have just the base files. 
+  - Let's dissect and see what it would take to avoid requiring the full 
compaction. In general, we plan to add a table config to track the commit time 
(more on this later in this doc) when the upgrade was done. 
+So, using the upgrade commit time, we should be able to use a different log 
file comparator to order log files within a given file slice. 

Review Comment:
   The 0.16.x reader is not going to order log files based on completion time and 
will only order based on log version, even for 1.x tables. 
   Which means, even for a single file slice having a mix of 0.x log files and 
1.x log files, we should be good here. 
   
   file slice determination:
   HoodieLogFile.getBaseInstantTime() has to work for both kinds of log files (0.x log 
files and 1.x log files). If we ensure this is intact, we should be good. 
   From skimming 1.x master, it should work OOB for 0.x log files, but let's 
test it out. 
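   
   For what it's worth, the instant time sits in the same slot of the log file 
name in both versions (base instant time in 0.x, delta commit time in 1.x), so 
the name parsing itself should be version-agnostic. A rough sketch of that 
assumption (illustrative only; the real parsing lives in HoodieLogFile/FSUtils):
   
      // Illustrative only -- real logic lives in HoodieLogFile / FSUtils.
      // Log file names look like: .<fileId>_<instantTime>.log.<version>_<writeToken>
      // 0.x puts the base instant time in <instantTime>; 1.x puts the delta commit
      // time there, but the slot is the same, so one parser covers both.
      val LogFileName = """\.(.+)_(\d+)\.log\.(\d+)_(.+)""".r

      def parseInstantTime(fileName: String): Option[String] = fileName match {
        case LogFileName(_, instantTime, _, _) => Some(instantTime)
        case _                                 => None
      }

      // e.g. parseInstantTime(".file-0_20240702101530000.log.1_1-0-1") == Some("20240702101530000")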
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7882:
-
Labels: pull-request-available  (was: )

> Umbrella ticket to track all changes required to support reading 1.x tables 
> with 0.16.0 
> 
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
> umbrella ticket to track all of them.
>  
> RFC in progress: [https://github.com/apache/hudi/pull/11514] 
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing is able to handle both json and avro formats. 
> Scope might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version. 
>  https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7890 Siva. 
> 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
>  
> 2. Table property changes 
> 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
> https://issues.apache.org/jira/browse/HUDI-7865 LJ
>  
> 3. MDT table changes
> 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ
> 3.b MDT payload schema changes. 
> https://issues.apache.org/jira/browse/HUDI-7886 LJ
>  
> 4. Log format changes
> 4.a All metadata header types porting 
> https://issues.apache.org/jira/browse/HUDI-7887 Jon
> 4.b Meaningful error for incompatible features from 1.x 
> https://issues.apache.org/jira/browse/HUDI-7888 Jon
>  
> 5. Log file slice or grouping detection compatibility 
>  
> 5. Tests 
> 5.a Tests to validate that 1.x tables can be read w/ 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 
>  
> 6 Doc changes 
> 6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
> https://issues.apache.org/jira/browse/HUDI-7889 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-02 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1662991098


##
rfc/rfc-78/rfc-78.md:
##
@@ -145,7 +147,9 @@ This will be an automatic upgrade for users when they start 
using 1.x hudi libra
 - No changes to log reader.
 - Check custom payload class in table properties and switch to payload type.
 - Trigger compaction for latest file slices. We do not want a single file 
slice having a mix of log files from 0.x and log files from 1.x. So, we will 
trigger a full compaction 
-of the table to ensure all latest file slices has just the base files.
+of the table to ensure all latest file slices have just the base files. 
+  - Let's dissect and see what it would take to avoid requiring the full 
compaction. In general, we plan to add a table config to track the commit time 
(more on this later in this doc) when the upgrade was done. 
+So, using the upgrade commit time, we should be able to use a different log 
file comparator to order log files within a given file slice. 

Review Comment:
   The 0.16.x reader is not going to order log files based on completion time and 
will only order based on log version, even for 1.x tables. 
   Which means, even for a single file slice having a mix of 0.x log files and 
1.x log files, we should be good here. 
   
   pending:
   file slice determination:
   HoodieLogFile.getBaseInstantTime() has to work for both kinds of log files (0.x log 
files and 1.x log files). If we ensure this is intact, we should be good. 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]

2024-07-02 Thread via GitHub


Limess opened a new issue, #11554:
URL: https://github.com/apache/hudi/issues/11554

   **Describe the problem you faced**
   
   Glue sync fails with INSERT_OVERWRITE when previous partitions are not 
included in the new load.
   
   In our case, we have a couple of years' worth of data, but only want to load 
the last `n` days, overwriting the old table. Deleting the old, now-defunct 
partitions fails.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a Hudi table in AWS Glue with partition name strings exceeding 
2048 characters combined, plus additional partitions which *will* be included in step 2
   2. INSERT_OVERWRITE into the same table, excluding any previous partitions 
exceeding 2048 characters combined
   3. Observe failure
   
   **Expected behavior**
   
   Partitions are removed from AWS Glue catalog correctly.
   
   **Environment Description**
   
   * Hudi version : 0.14.1-amzn-0
   
   * Spark version : 3.5.0
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   EMR 7.1.0
   
   **Stacktrace**
   
   
[stacktrace.txt](https://github.com/user-attachments/files/16072826/stacktrace.txt)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7903] Fix storage partition stats index to skip data [hudi]

2024-07-02 Thread via GitHub


yihua commented on code in PR #11472:
URL: https://github.com/apache/hudi/pull/11472#discussion_r1662945410


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala:
##
@@ -34,74 +43,294 @@ class TestPartitionStatsIndexWithSql extends 
HoodieSparkSqlTestBase {
   val sqlTempTable = "hudi_tbl"
 
   test("Test partition stats index following insert, merge into, update and 
delete") {
-withTempDir { tmp =>
-  val tableName = generateTableName
-  val tablePath = s"${tmp.getCanonicalPath}/$tableName"
-  // Create table with date type partition
-  spark.sql(
-s"""
-   | create table $tableName using hudi
-   | partitioned by (dt)
-   | tblproperties(
-   |primaryKey = 'id',
-   |preCombineField = 'ts',
-   |'hoodie.metadata.index.partition.stats.enable' = 'true'
-   | )
-   | location '$tablePath'
-   | AS
-   | select 1 as id, 'a1' as name, 10 as price, 1000 as ts, 
cast('2021-05-06' as date) as dt
+Seq("cow", "mor").foreach { tableType =>
+  withTempDir { tmp =>
+val tableName = generateTableName
+val tablePath = s"${tmp.getCanonicalPath}/$tableName"
+// Create table with date type partition
+spark.sql(
+  s"""
+ | create table $tableName using hudi
+ | partitioned by (dt)
+ | tblproperties(
+ |type = '$tableType',
+ |primaryKey = 'id',
+ |preCombineField = 'ts',
+ |'hoodie.metadata.index.partition.stats.enable' = 'true',
+ |'hoodie.metadata.index.column.stats.column.list' = 'name'
+ | )
+ | location '$tablePath'
+ | AS
+ | select 1 as id, 'a1' as name, 10 as price, 1000 as ts, 
cast('2021-05-06' as date) as dt
  """.stripMargin
-  )
+)
+
+assertResult(WriteOperationType.BULK_INSERT) {
+  HoodieSparkSqlTestBase.getLastCommitMetadata(spark, 
tablePath).getOperationType
+}
+checkAnswer(s"select id, name, price, ts, cast(dt as string) from 
$tableName")(
+  Seq(1, "a1", 10, 1000, "2021-05-06")
+)
+
+val partitionValue = "2021-05-06"
+
+// Check the missing properties for spark sql
+val metaClient = HoodieTableMetaClient.builder()
+  .setBasePath(tablePath)
+  .setConf(HoodieTestUtils.getDefaultStorageConf)
+  .build()
+val properties = metaClient.getTableConfig.getProps.asScala.toMap
+
assertResult(true)(properties.contains(HoodieTableConfig.CREATE_SCHEMA.key))
+assertResult("dt")(properties(HoodieTableConfig.PARTITION_FIELDS.key))
+assertResult("ts")(properties(HoodieTableConfig.PRECOMBINE_FIELD.key))
+assertResult(tableName)(metaClient.getTableConfig.getTableName)
+// Validate partition_stats index exists
+
assertTrue(metaClient.getTableConfig.getMetadataPartitions.contains(PARTITION_STATS.getPartitionPath))
+
+// Test insert into
+spark.sql(s"insert into $tableName values(2, 'a2', 10, 1000, 
cast('$partitionValue' as date))")
+checkAnswer(s"select _hoodie_record_key, _hoodie_partition_path, id, 
name, price, ts, cast(dt as string) from $tableName order by id")(
+  Seq("1", s"dt=$partitionValue", 1, "a1", 10, 1000, partitionValue),
+  Seq("2", s"dt=$partitionValue", 2, "a2", 10, 1000, partitionValue)
+)
+// Test merge into
+spark.sql(
+  s"""
+ |merge into $tableName h0
+ |using (select 1 as id, 'a1' as name, 11 as price, 1001 as ts, 
'$partitionValue' as dt) s0
+ |on h0.id = s0.id
+ |when matched then update set *
+ |""".stripMargin)
+checkAnswer(s"select _hoodie_record_key, _hoodie_partition_path, id, 
name, price, ts, cast(dt as string) from $tableName order by id")(
+  Seq("1", s"dt=$partitionValue", 1, "a1", 11, 1001, partitionValue),
+  Seq("2", s"dt=$partitionValue", 2, "a2", 10, 1000, partitionValue)
+)
+// Test update
+spark.sql(s"update $tableName set price = price + 1 where id = 2")
+checkAnswer(s"select _hoodie_record_key, _hoodie_partition_path, id, 
name, price, ts, cast(dt as string) from $tableName order by id")(
+  Seq("1", s"dt=$partitionValue", 1, "a1", 11, 1001, partitionValue),
+  Seq("2", s"dt=$partitionValue", 2, "a2", 11, 1000, partitionValue)
+)
+// Test delete
+spark.sql(s"delete from $tableName where id = 1")
+checkAnswer(s"select _hoodie_record_key, _hoodie_partition_path, id, 
name, price, ts, cast(dt as string) from $tableName order by id")(
+  Seq("2", s"dt=$partitionValue", 2, "a2", 11, 1000, partitionValue)
+)
+  }
+}
+  }
+
+  test("Test partition 

Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11553:
URL: https://github.com/apache/hudi/pull/11553#issuecomment-2203114173

   
   ## CI report:
   
   * 1e2b60af5b816e1b9cd89fb40a47c8550337fe5f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7841] RLI and secondary index should consider only pruned partitions for file skipping [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11434:
URL: https://github.com/apache/hudi/pull/11434#issuecomment-2200258512

   
   ## CI report:
   
   * e4f9c002561656c0d2facad919e45ceca4d4b66a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24401)
 
   * 1bfdc8258af9758ce439a683d4c29825a526c763 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11553:
URL: https://github.com/apache/hudi/pull/11553#issuecomment-2203276376

   
   ## CI report:
   
   * 1e2b60af5b816e1b9cd89fb40a47c8550337fe5f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24672)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Fix NPE after a clustering plan is finished [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11550:
URL: https://github.com/apache/hudi/pull/11550#issuecomment-2200259199

   
   ## CI report:
   
   * fce22b2e55a9dcf09b57fd3d60f552dccb389575 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24654)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7905) Use cluster action for clustering pending instants

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7905:
-
Labels: pull-request-available  (was: )

> Use cluster action for clustering pending instants
> --
>
> Key: HUDI-7905
> URL: https://issues.apache.org/jira/browse/HUDI-7905
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Currently, we use replacecommit for clustering, insert overwrite and delete 
> partition. Clustering should be a separate action for requested and inflight 
> instant. This simplifies a few things such as we do not need to scan the 
> replacecommit.requested to determine whether we are looking at clustering 
> plan or not. This would simplify the usage of pending clustering related 
> APIs. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] Fix NPE after a clustering plan is finished [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11550:
URL: https://github.com/apache/hudi/pull/11550#issuecomment-2200241276

   
   ## CI report:
   
   * fce22b2e55a9dcf09b57fd3d60f552dccb389575 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24654)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7859) Rename instant files to be consistent with 0.x naming format

2024-07-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7859:
-
Labels: pull-request-available  (was: )

> Rename instant files to be consistent with 0.x naming format
> 
>
> Key: HUDI-7859
> URL: https://issues.apache.org/jira/browse/HUDI-7859
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: YangXuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Needed for downgrade



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] [SUPPORT]Exception when executing log compaction : Unsupported Operation Exception [hudi]

2024-07-02 Thread via GitHub


duntonr commented on issue #10982:
URL: https://github.com/apache/hudi/issues/10982#issuecomment-2203158879

   Thanks @xuzifu666 .  I did build this package but I used the 0.15 release 
branch, you can see the full build log here: 
https://gitlab.com/therackio/big-data/binaries/apache-hudi-bin-arm64/-/jobs/7140291592/raw
   
   I did not modify the branch after checking out the release.
   
   I did actually try applying the patch file from PR 10194, eg 
https://github.com/apache/hudi/pull/10194.patch but it did not take - 
https://gitlab.com/therackio/big-data/binaries/apache-hudi-bin-arm64/-/jobs/7150452377


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11553:
URL: https://github.com/apache/hudi/pull/11553#issuecomment-2203129962

   
   ## CI report:
   
   * 1e2b60af5b816e1b9cd89fb40a47c8550337fe5f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24672)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]

2024-07-02 Thread via GitHub


hudi-bot commented on PR #11545:
URL: https://github.com/apache/hudi/pull/11545#issuecomment-2203114070

   
   ## CI report:
   
   * fe7aa032f4463035775029ad486ca73ea2d02ac0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24668)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


