[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--
Description: 
Archiving commits from the active timeline could lead to data consistency issues on 
rare occasions. We should come up with proper guards to ensure we do not perform 
such unintended archival. 

 

The major gap we want to guard against is:

if someone disables the cleaner, archival should account for data consistency 
issues and ensure it bails out.

We have a base guarding condition, where archival will stop at the earliest 
commit to retain based on the latest clean commit metadata. But there are a few 
other scenarios that need to be accounted for. 
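As a rough sketch of this base guard (all names invented for illustration, not Hudi's actual API; instants are modeled as plain integers t1..tn instead of Hudi timestamps): archival's boundary is the earliest commit to retain (ECTR) from the latest clean commit metadata, further capped by any savepoint still on the timeline.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative sketch only (invented names, not Hudi's actual API).
// The base guard: archival must not proceed past the earliest commit to retain
// (ECTR) recorded in the latest clean commit metadata, nor past any
// savepointed commit still on the timeline.
public class ArchivalGuard {
    /** Returns the first instant archival must NOT archive (exclusive upper bound). */
    public static int archivalBoundary(int earliestCommitToRetain, SortedSet<Integer> savepoints) {
        int boundary = earliestCommitToRetain;
        if (!savepoints.isEmpty()) {
            // A savepoint pins everything from that instant onward.
            boundary = Math.min(boundary, savepoints.first());
        }
        return boundary;
    }
}
```

With a savepoint at t3 and ECTR at t6, archival stops at t3; once the savepoint is removed, the boundary jumps to t6, which is exactly the window the corner cases below exploit.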

 

a. Keeping replace commits aside, let's dive into specifics for regular commits 
and delta commits.

Say the user configured the cleaner to retain 4 commits and the archival configs 
to 5 and 6. After t10, the cleaner is supposed to clean up all file versions 
created at or before t6. Say the cleaner did not run (for whatever reason) for 
the next 5 commits. 

    Archival will certainly be guarded until the earliest commit to retain based 
on the latest clean commit. 

Corner case to consider: 

A savepoint was added at, say, t3 and later removed, and still the cleaner was 
never re-enabled. Even though archival would have stopped at t3 (while the 
savepoint was present), once the savepoint is removed, if archival is executed, 
it could archive commit t3, even though the file versions tracked at t3 are 
still not yet cleaned by the cleaner. 

Reasoning: 

We are good here w.r.t. data consistency. Until the cleaner next runs, these 
older file versions might be exposed to the end user. But time travel queries 
are not intended for un-cleaned commits, and hence this is not an issue. None of 
the snapshot, time travel, or incremental queries will run into issues, as they 
are not supposed to poll for t3. 

At any later point, if the cleaner is re-enabled, it will take care of cleaning 
up the file versions tracked at the t3 commit. 

 

b. The trickier part is when replace commits are involved. Since the replace 
commit metadata in the active timeline is what ensures the replaced file groups 
are ignored for reads, the cleaner is expected to clean them up fully before 
that metadata is archived. But are there chances that this could go wrong? 

Corner case to consider: let's add onto the above scenario, where t3 has a 
savepoint, and t4 is a replace commit which replaced file groups tracked in t3. 

The cleaner will skip cleaning up files tracked by t3, but will clean up t4, t5 
and t6. So, the earliest commit to retain will be pointing to t6. And say the 
savepoint for t3 is removed, but the cleaner is disabled. In this state of the 
timeline, if archival is executed (since t3.savepoint is removed), archival 
might archive t3 and t4.rc. This could lead to data duplicates, as both the 
replaced file groups and the new file groups from t4.rc would be exposed as 
valid file groups. 
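The duplicate exposure can be sketched with a hypothetical model (all names invented): a reader treats a file group as replaced only while some replace commit in the active timeline marks it as such, so archiving t4.rc before its replaced groups are deleted makes the old groups visible again.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model (invented names): reads ignore a file group only while
// some replace commit in the ACTIVE timeline marks it as replaced. If t4.rc is
// archived while its replaced file groups still exist on storage, those groups
// look valid again and show up alongside their replacements.
public class ReplaceCommitExposure {
    public static Set<String> visibleFileGroups(Set<String> groupsOnStorage,
                                                Map<String, Set<String>> activeReplaceCommits) {
        Set<String> visible = new HashSet<>(groupsOnStorage);
        for (Set<String> replaced : activeReplaceCommits.values()) {
            visible.removeAll(replaced);  // replaced groups are ignored for reads
        }
        return visible;
    }
}
```

While t4.rc is in the active timeline only the new group is visible; remove t4.rc (archive it) without cleaning the old group, and both groups read as valid, i.e. duplicates.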

 

In other words, if we were to summarize the different scenarios: 

i. The replaced file group is never cleaned up. 
    - ECTR (earliest commit to retain) is less than this.rc, and we are good. 
ii. The replaced file group is cleaned up. 
    - ECTR is > this.rc, and it is good to archive.
iii. Tricky: ECTR moved ahead of this.rc, but due to the savepoint, full 
clean-up did not happen. After the savepoint is removed, when archival is 
executed, we should avoid archiving the rc of interest. This is the gap we 
don't account for as of now.
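The guard implied by cases i-iii can be sketched as follows (invented names; instants as plain integers): a replace commit is archivable only once ECTR has moved past it AND its replaced file groups are known to be fully deleted; case iii is exactly the second check.

```java
// Sketch of the guard implied by cases i-iii (invented names, not Hudi's API).
public class ReplaceCommitArchivalGuard {
    public static boolean canArchive(int replaceCommitInstant,
                                     int earliestCommitToRetain,
                                     boolean replacedGroupsFullyDeleted) {
        if (replaceCommitInstant >= earliestCommitToRetain) {
            return false;  // case i: ECTR has not moved past this.rc yet
        }
        // case ii (true) vs case iii (false): ECTR moved ahead, but archival
        // must still confirm the replaced file groups were fully deleted.
        return replacedGroupsFullyDeleted;
    }
}
```

The two options below differ only in how `replacedGroupsFullyDeleted` would be determined.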

 

We have two options to solve this.

*Option A:* 

Before the archiver archives any replace commit, explicitly check that all 
replaced file groups are fully deleted. 

Cons: might need FileSystemView polling, which might be costly. 
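Option A as a sketch (invented names; the file-system view is stubbed as a plain set of file group ids): before archiving a replace commit, poll storage and verify every replaced file group is gone.

```java
import java.util.Set;

// Option A sketch (invented names): archival polls the (stubbed) file-system
// view and verifies every replaced file group is gone before archiving the rc.
public class OptionAGuard {
    public static boolean allReplacedGroupsDeleted(Set<String> replacedGroups,
                                                   Set<String> groupsStillOnStorage) {
        for (String fg : replacedGroups) {
            if (groupsStillOnStorage.contains(fg)) {
                return false;  // a replaced group still exists; do not archive
            }
        }
        return true;
    }
}
```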

*Option B:*

The cleaner also tracks additional metadata named "fully cleaned up file 
groups" at the end of clean planning and in the completed clean commit metadata. 

So, instead of polling the FileSystemView (which might be costly), archival can 
check the clean commit metadata for the list of file groups and deduce whether 
all file groups replaced by X.rc are fully deleted. 

Pros: 

Since the clean planner polls the file system view anyway and already has all 
the file group info, no additional work might be required to deduce the "fully 
cleaned up file groups"; it just needs to add the additional metadata. 
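Option B as a sketch (invented names; clean commit metadata modeled as one set of "fully cleaned up file groups" per completed clean): archival unions those lists and checks containment instead of polling the FileSystemView.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Option B sketch (invented names): each completed clean commit carries a
// "fully cleaned up file groups" list; archival unions the lists and checks
// that every group replaced by X.rc appears in them.
public class OptionBGuard {
    public static boolean canArchiveReplaceCommit(Set<String> groupsReplacedByRc,
                                                  List<Set<String>> fullyCleanedPerCleanCommit) {
        Set<String> cleaned = new HashSet<>();
        fullyCleanedPerCleanCommit.forEach(cleaned::addAll);
        return cleaned.containsAll(groupsReplacedByRc);
    }
}
```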

[jira] [Created] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-18 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7779:
-

 Summary: Guarding archival to not archive unintended commits
 Key: HUDI-7779
 URL: https://issues.apache.org/jira/browse/HUDI-7779
 Project: Apache Hudi
  Issue Type: Bug
  Components: archiving
Reporter: sivabalan narayanan



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7761) Make the manifest Writer Extendable

2024-05-18 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7761.

Fix Version/s: 1.0.0
   Resolution: Fixed

Fixed via master branch: 0804a79f20c680b1d369c5e84d7c9ae4412ad472

> Make the manifest Writer Extendable
> ---
>
> Key: HUDI-7761
> URL: https://issues.apache.org/jira/browse/HUDI-7761
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sivaguru Kannan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> * Make the manifest writer extendable such that clients can plugin in the 
> custom instance of manifest writer for their syncs





(hudi) branch master updated: [HUDI-7761] Make the ManifestWriter Extendable (#11253)

2024-05-18 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 0804a79f20c [HUDI-7761] Make the ManifestWriter Extendable (#11253)
0804a79f20c is described below

commit 0804a79f20c680b1d369c5e84d7c9ae4412ad472
Author: csivaguru 
AuthorDate: Sat May 18 17:35:18 2024 -0700

[HUDI-7761] Make the ManifestWriter Extendable (#11253)
---
 .../java/org/apache/hudi/gcp/bigquery/BigQuerySchemaResolver.java  | 4 ++--
 .../java/org/apache/hudi/sync/common/util/ManifestFileWriter.java  | 7 +++
 .../org/apache/hudi/sync/common/util/TestManifestFileWriter.java   | 3 +--
 3 files changed, 6 insertions(+), 8 deletions(-)

diff --git 
a/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySchemaResolver.java
 
b/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySchemaResolver.java
index 361f869a9fe..27b5143944f 100644
--- 
a/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySchemaResolver.java
+++ 
b/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySchemaResolver.java
@@ -38,7 +38,7 @@ import java.util.stream.Collectors;
 /**
  * Extracts the BigQuery schema from a Hudi table.
  */
-class BigQuerySchemaResolver {
+public class BigQuerySchemaResolver {
   private static final BigQuerySchemaResolver INSTANCE = new 
BigQuerySchemaResolver(TableSchemaResolver::new);
 
   private final Function 
tableSchemaResolverSupplier;
@@ -48,7 +48,7 @@ class BigQuerySchemaResolver {
 this.tableSchemaResolverSupplier = tableSchemaResolverSupplier;
   }
 
-  static BigQuerySchemaResolver getInstance() {
+  public static BigQuerySchemaResolver getInstance() {
 return INSTANCE;
   }
 
diff --git 
a/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/ManifestFileWriter.java
 
b/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/ManifestFileWriter.java
index 3eaf80dddfe..eb52fe4586d 100644
--- 
a/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/ManifestFileWriter.java
+++ 
b/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/ManifestFileWriter.java
@@ -54,7 +54,7 @@ public class ManifestFileWriter {
   private final HoodieTableMetaClient metaClient;
   private final boolean useFileListingFromMetadata;
 
-  private ManifestFileWriter(HoodieTableMetaClient metaClient, boolean 
useFileListingFromMetadata) {
+  protected ManifestFileWriter(HoodieTableMetaClient metaClient, boolean 
useFileListingFromMetadata) {
 this.metaClient = metaClient;
 this.useFileListingFromMetadata = useFileListingFromMetadata;
   }
@@ -64,7 +64,7 @@ public class ManifestFileWriter {
*/
   public synchronized void writeManifestFile(boolean useAbsolutePath) {
 try {
-  List baseFiles = 
fetchLatestBaseFilesForAllPartitions(metaClient, useFileListingFromMetadata, 
useAbsolutePath)
+  List baseFiles = 
fetchLatestBaseFilesForAllPartitions(useAbsolutePath)
   .collect(Collectors.toList());
   if (baseFiles.isEmpty()) {
 LOG.warn("No base file to generate manifest file.");
@@ -86,8 +86,7 @@ public class ManifestFileWriter {
   }
 
   @VisibleForTesting
-  public static Stream 
fetchLatestBaseFilesForAllPartitions(HoodieTableMetaClient metaClient,
-  boolean useFileListingFromMetadata, boolean useAbsolutePath) {
+  public Stream fetchLatestBaseFilesForAllPartitions(boolean 
useAbsolutePath) {
 try {
   StorageConfiguration storageConf = metaClient.getStorageConf();
   HoodieLocalEngineContext engContext = new 
HoodieLocalEngineContext(storageConf);
diff --git 
a/hudi-sync/hudi-sync-common/src/test/java/org/apache/hudi/sync/common/util/TestManifestFileWriter.java
 
b/hudi-sync/hudi-sync-common/src/test/java/org/apache/hudi/sync/common/util/TestManifestFileWriter.java
index 2db602b65c2..13caefd1d0f 100644
--- 
a/hudi-sync/hudi-sync-common/src/test/java/org/apache/hudi/sync/common/util/TestManifestFileWriter.java
+++ 
b/hudi-sync/hudi-sync-common/src/test/java/org/apache/hudi/sync/common/util/TestManifestFileWriter.java
@@ -34,7 +34,6 @@ import java.util.List;
 import java.util.stream.IntStream;
 
 import static 
org.apache.hudi.common.testutils.HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS;
-import static 
org.apache.hudi.sync.common.util.ManifestFileWriter.fetchLatestBaseFilesForAllPartitions;
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertFalse;
 import static org.junit.jupiter.api.Assertions.assertTrue;
@@ -51,7 +50,7 @@ public class TestManifestFileWriter extends 
HoodieCommonTestHarness {
 // Generate 10 files under each partition
 createTestDataForPartitionedTable(metaClient, 10);
 ManifestFileWriter manifestFileWriter = 
ManifestFileWriter.builder().setMetaClient(metaClient).build();
-
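The effect of the diff above (constructor widened to protected, the fetch method made an overridable instance method) is that clients can subclass the writer. A self-contained analogue, with the real Hudi types (HoodieTableMetaClient, base-file listing, etc.) deliberately omitted and all names simplified:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Self-contained analogue of the change above (real Hudi types omitted): with
// a protected constructor and an overridable instance method, clients can plug
// in a custom manifest writer for their syncs.
public class ManifestWriterSketch {
    protected ManifestWriterSketch() {}

    // Instance method (previously static), so subclasses can override the listing.
    protected Stream<String> fetchLatestBaseFiles() {
        return Stream.of("part-0001.parquet", "part-0002.parquet");
    }

    public List<String> writeManifest() {
        return fetchLatestBaseFiles().collect(Collectors.toList());
    }

    /** A client-side subclass supplying a custom file listing. */
    public static class CustomWriter extends ManifestWriterSketch {
        @Override
        protected Stream<String> fetchLatestBaseFiles() {
            return Stream.of("custom-0001.parquet");
        }
    }
}
```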

Re: [PR] [HUDI-7761] Make the ManifestWriter Extendable [hudi]

2024-05-18 Thread via GitHub


danny0405 merged PR #11253:
URL: https://github.com/apache/hudi/pull/11253


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7769] Fix Hudi CDC read with legacy parquet file format on Spark [hudi]

2024-05-18 Thread via GitHub


hudi-bot commented on PR #11242:
URL: https://github.com/apache/hudi/pull/11242#issuecomment-2118885294

   
   ## CI report:
   
   * 922efda55e668b992e1b12b873be49c7f1645fba Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23966)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7769] Fix Hudi CDC read with legacy parquet file format on Spark [hudi]

2024-05-18 Thread via GitHub


hudi-bot commented on PR #11242:
URL: https://github.com/apache/hudi/pull/11242#issuecomment-2118883500

   
   ## CI report:
   
   * 922efda55e668b992e1b12b873be49c7f1645fba UNKNOWN
   
   





Re: [I] Cloudwatch metrics not published in moving from 0.12.1 to 0.14[SUPPORT] [hudi]

2024-05-18 Thread via GitHub


soumilshah1995 commented on issue #11205:
URL: https://github.com/apache/hudi/issues/11205#issuecomment-2118833255

   
   Posting the right jar versions
   
   * 
https://repo.maven.apache.org/maven2/org/apache/hudi/hudi-aws-bundle/0.14.0/hudi-aws-bundle-0.14.0.jar
   * 
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.3-bundle_2.12/0.14.0/hudi-spark3.3-bundle_2.12-0.14.0.jar
   * 
https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.13/3.3.0/spark-avro_2.13-3.3.0.jar
   





Re: [I] Cloudwatch metrics not published in moving from 0.12.1 to 0.14[SUPPORT] [hudi]

2024-05-18 Thread via GitHub


soumilshah1995 commented on issue #11205:
URL: https://github.com/apache/hudi/issues/11205#issuecomment-2118830033

   @ajain-cohere  @ad1happy2go 
   
   Hey @Ayush Jain 
   
   I'm working on a community video where I want to show the CloudWatch metrics. 
   
   Can you please confirm which jar files resolved your issue: 
   
   * hudi-spark3.3-bundle_2.12-0.14.0.jar
   * spark-avro_2.13-3.3.0.jar
   * hudi-aws-0.14.0.jar
   
   When I tried a test run, I got the same issue: 
   
   ```
   An error occurred while calling o134.save. 
org.apache.hudi.aws.cloudwatch.CloudWatchReporter.forRegistry(Lorg/apache/hudi/com/codahale/metrics/MetricRegistry;)Lorg/apache/hudi/aws/cloudwatch/CloudWatchReporter$Builder;
   
   ```
   





Re: [PR] [HUDI-7778] Fixing global index for duplicate updates [hudi]

2024-05-18 Thread via GitHub


hudi-bot commented on PR #11256:
URL: https://github.com/apache/hudi/pull/11256#issuecomment-2118658240

   
   ## CI report:
   
   * 89005916c14107710828a1a76d68cfa58e80bf88 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23991)
 
   
   

