[GitHub] [hudi] nsivabalan commented on a diff in pull request #5030: [HUDI-3617] MOR compact improve

2022-10-19 Thread GitBox


nsivabalan commented on code in PR #5030:
URL: https://github.com/apache/hudi/pull/5030#discussion_r1000139459


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java:
##
@@ -123,25 +133,24 @@ public long getNumMergedRecordsInLog() {
 return numMergedRecordsInLog;
   }
 
-  /**
-   * Returns the builder for {@code HoodieMergedLogRecordScanner}.
-   */
-  public static HoodieMergedLogRecordScanner.Builder newBuilder() {
-return new Builder();
-  }
-
   @Override
   protected void processNextRecord(HoodieRecord 
hoodieRecord) throws IOException {
 String key = hoodieRecord.getRecordKey();
 if (records.containsKey(key)) {
   // Merge and store the merged record. The HoodieRecordPayload 
implementation is free to decide what should be
   // done when a delete (empty payload) is encountered before or after an 
insert/update.
-
-  HoodieRecord oldRecord = records.get(key);
-  HoodieRecordPayload oldValue = oldRecord.getData();
-  HoodieRecordPayload combinedValue = 
hoodieRecord.getData().preCombine(oldValue);
-  // If combinedValue is oldValue, no need rePut oldRecord
-  if (combinedValue != oldValue) {
+  HoodieRecord storeRecord = 
records.get(key);
+  HoodieRecordPayload storeValue = storeRecord.getData();
+  HoodieRecordPayload combinedValue;
+  // If revertLogFile = false, storeRecord is the old record.
+  // If revertLogFile = true, incoming data (hoodieRecord) is the old 
record.
+  if (!revertLogFile) {

Review Comment:
   can we close the patch then since we can't generalize. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #5030: [HUDI-3617] MOR compact improve

2022-09-07 Thread GitBox


nsivabalan commented on code in PR #5030:
URL: https://github.com/apache/hudi/pull/5030#discussion_r965448515


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java:
##
@@ -123,25 +133,24 @@ public long getNumMergedRecordsInLog() {
 return numMergedRecordsInLog;
   }
 
-  /**
-   * Returns the builder for {@code HoodieMergedLogRecordScanner}.
-   */
-  public static HoodieMergedLogRecordScanner.Builder newBuilder() {
-return new Builder();
-  }
-
   @Override
   protected void processNextRecord(HoodieRecord 
hoodieRecord) throws IOException {
 String key = hoodieRecord.getRecordKey();
 if (records.containsKey(key)) {
   // Merge and store the merged record. The HoodieRecordPayload 
implementation is free to decide what should be
   // done when a delete (empty payload) is encountered before or after an 
insert/update.
-
-  HoodieRecord oldRecord = records.get(key);
-  HoodieRecordPayload oldValue = oldRecord.getData();
-  HoodieRecordPayload combinedValue = 
hoodieRecord.getData().preCombine(oldValue);
-  // If combinedValue is oldValue, no need rePut oldRecord
-  if (combinedValue != oldValue) {
+  HoodieRecord storeRecord = 
records.get(key);
+  HoodieRecordPayload storeValue = storeRecord.getData();
+  HoodieRecordPayload combinedValue;
+  // If revertLogFile = false, storeRecord is the old record.
+  // If revertLogFile = true, incoming data (hoodieRecord) is the old 
record.
+  if (!revertLogFile) {

Review Comment:
   oh I see we have put in a fix here. sounds good. 
   
   but does below one holds good?
   ```
   delta commit1: insert rec1: val1. preCombine: 2
   delta commit2: delete rec1: 
   delta commit2: insert rec1: val2. preCombine: 1
   ```
   
   as per master, 
   guess final snapshot will return val2 for rec1. and not the deleted one. 
   
   can you tell me what will happen w/ this patch where in we reverse the 
ordering. 
   
   
   
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #5030: [HUDI-3617] MOR compact improve

2022-09-07 Thread GitBox


nsivabalan commented on code in PR #5030:
URL: https://github.com/apache/hudi/pull/5030#discussion_r965446760


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java:
##
@@ -280,8 +281,11 @@ HoodieCompactionPlan generateCompactionPlan(
 .getLatestFileSlices(partitionPath)
 .filter(slice -> 
!fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
 .map(s -> {
+  // In most business scenarios, the latest data is in the latest 
delta log file, so we sort it from large
+  // to small according to the instance time, which can largely avoid 
rewriting the data in the
+  // compact process, and then optimize the compact time
   List logFiles =
-  
s.getLogFiles().sorted(HoodieLogFile.getLogFileComparator()).collect(toList());
+  
s.getLogFiles().sorted(HoodieLogFile.getLogFileComparator().reversed()).collect(toList());

Review Comment:
   We might have to consider few cases before can flip the ordering here:
   case 1: when OverwriteWithLatestAvro is used, if preCombine matches, we pick 
the latest. 
   for eg: 
   say for eg:
   delta commit1: insert rec1: val1. preCombine: 1
   delta commit2: update rec1: val2. preCombine: 1
   delta commit2: insert rec1: val3.  preCombine: 1
   
   if we merge as usual (master), 
   final value fo rec1 should be val3. 
   but if we do reverse, then it could result in val1. 
   
   Case2: 
   some payload implementations take values from old record and combine w/ 
newer ones. in other words they may not be commutative. 
   for eg,
   rec1.combineAndGetUpdate(rec2) != rec2.combineAndGetUpdate(rec1) 
   
   or preCombine() for that matter. 
   
   I do really like the intend behind this patch. but not sure if its as easy 
as flipping the order of log file merging. 
   
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org