[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.

2023-06-29 Thread via GitHub


danny0405 commented on code in PR #9058:
URL: https://github.com/apache/hudi/pull/9058#discussion_r1247377008


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##
@@ -283,6 +285,8 @@ public HoodieMetadataPayload(Option 
recordOpt) {
 
Integer.parseInt(recordIndexRecord.get(RECORD_INDEX_FIELD_FILE_INDEX).toString()),
 
Long.parseLong(recordIndexRecord.get(RECORD_INDEX_FIELD_INSTANT_TIME).toString()));
   }
+} else {
+  this.isDeletedRecord = true;

Review Comment:
   The key is not set when the avro data is not set, so I'm wondering whether 
the `key == null` actually represents it is a delete record.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.

2023-06-29 Thread via GitHub


danny0405 commented on code in PR #9058:
URL: https://github.com/apache/hudi/pull/9058#discussion_r1246019736


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##
@@ -283,6 +285,8 @@ public HoodieMetadataPayload(Option 
recordOpt) {
 
Integer.parseInt(recordIndexRecord.get(RECORD_INDEX_FIELD_FILE_INDEX).toString()),
 
Long.parseLong(recordIndexRecord.get(RECORD_INDEX_FIELD_INSTANT_TIME).toString()));
   }
+} else {
+  this.isDeletedRecord = true;

Review Comment:
   BTW, I don't like how the deletes are handling with existing column stats, 
even though we already do that does not mean it is correct, maintaining 
preCombining logic for every meta info is disgusting and inefficient.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.

2023-06-28 Thread via GitHub


danny0405 commented on code in PR #9058:
URL: https://github.com/apache/hudi/pull/9058#discussion_r1246019736


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##
@@ -283,6 +285,8 @@ public HoodieMetadataPayload(Option 
recordOpt) {
 
Integer.parseInt(recordIndexRecord.get(RECORD_INDEX_FIELD_FILE_INDEX).toString()),
 
Long.parseLong(recordIndexRecord.get(RECORD_INDEX_FIELD_INSTANT_TIME).toString()));
   }
+} else {
+  this.isDeletedRecord = true;

Review Comment:
   BTW, I don't like how the deletes are handling with existing column stats, 
even though we already do that does not mean it is correct, maintaining 
preCombining logic for every meta info is disgusting.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.

2023-06-28 Thread via GitHub


danny0405 commented on code in PR #9058:
URL: https://github.com/apache/hudi/pull/9058#discussion_r1246016087


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##
@@ -283,6 +285,8 @@ public HoodieMetadataPayload(Option 
recordOpt) {
 
Integer.parseInt(recordIndexRecord.get(RECORD_INDEX_FIELD_FILE_INDEX).toString()),
 
Long.parseLong(recordIndexRecord.get(RECORD_INDEX_FIELD_INSTANT_TIME).toString()));
   }
+} else {
+  this.isDeletedRecord = true;

Review Comment:
   @prashantwason I think @xushiyan and @nsivabalan are addressing that many 
metadata POJOs already take a `isDeleted` flag except the recordIndexMetadata, 
during the convertion from write status to RLI metadata payload in 
`HoodieMetadataPayload#createRecordIndexUpdate`, there is a choice is to mark 
the recordIndexMetadata itself as deleted, which keeps the behavior in line 
with the other payloads.
   
   But still, in `HoodieMetadataPayload#preCombine`, the merging of RLI infos 
should be fixed with the `isDeleted` flag just like others.
   
   If we move to this direction, the changes in `HoodieBackedTableMetadata` 
should be unnecessary.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.

2023-06-28 Thread via GitHub


danny0405 commented on code in PR #9058:
URL: https://github.com/apache/hudi/pull/9058#discussion_r1246016087


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##
@@ -283,6 +285,8 @@ public HoodieMetadataPayload(Option 
recordOpt) {
 
Integer.parseInt(recordIndexRecord.get(RECORD_INDEX_FIELD_FILE_INDEX).toString()),
 
Long.parseLong(recordIndexRecord.get(RECORD_INDEX_FIELD_INSTANT_TIME).toString()));
   }
+} else {
+  this.isDeletedRecord = true;

Review Comment:
   @prashantwason I think @xushiyan and @nsivabalan are addressing that many 
metadata POJOs already take a `isDeleted` flag except the recordIndexMetadata, 
during the convertion from write status to RLI metadata payload in 
`HoodieMetadataPayload#createRecordIndexUpdate`, there is a choice is to mark 
the recordIndexMetadata itself as deleted, which keeps the behavior in line 
with the other payloads.
   
   But still, in `HoodieMetadataPayload#preCombine`, the merging of RLI infos 
should be fixed with the `isDeleted` falg just like others.
   
   If we move to this direction, the changes in `HoodieBackedTableMetadata` 
should be unnecessary.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.

2023-06-28 Thread via GitHub


danny0405 commented on code in PR #9058:
URL: https://github.com/apache/hudi/pull/9058#discussion_r1246016087


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##
@@ -283,6 +285,8 @@ public HoodieMetadataPayload(Option 
recordOpt) {
 
Integer.parseInt(recordIndexRecord.get(RECORD_INDEX_FIELD_FILE_INDEX).toString()),
 
Long.parseLong(recordIndexRecord.get(RECORD_INDEX_FIELD_INSTANT_TIME).toString()));
   }
+} else {
+  this.isDeletedRecord = true;

Review Comment:
   @prashantwason I think @xushiyan and @nsivabalan are addressing that many 
metadata POJOs already take a `isDeleted` flag except the recordIndexMetadata, 
during the convertion from write status to RLI metadata payload in 
`HoodieMetadataPayload#createRecordIndexUpdate`, there is a choice is to mark 
the recordIndexMetadata itself as deleted, which keeps the behavior in line 
with the other payloads.
   
   But still, in `HoodieMetadataPayload#preCombine`, the merging of RLI infos 
should be fixed with the `isDeleted` falg just like others.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.

2023-06-28 Thread via GitHub


danny0405 commented on code in PR #9058:
URL: https://github.com/apache/hudi/pull/9058#discussion_r1245996840


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieIndex.java:
##
@@ -749,6 +749,67 @@ public void testRecordIndexTagLocationAndUpdate(boolean 
populateMetaFields) thro
 assertEquals(newInsertsCount, recordLocations.filter(entry -> 
newPartitionPath.equalsIgnoreCase(entry._1.getPartitionPath())).count());
   }
 
+  @ParameterizedTest
+  @ValueSource(strings = "INMEMORY")

Review Comment:
   Then just use `@Test` should be fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.

2023-06-27 Thread via GitHub


danny0405 commented on code in PR #9058:
URL: https://github.com/apache/hudi/pull/9058#discussion_r1244567699


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -314,6 +314,12 @@ private boolean writeRecord(HoodieRecord newRecord, 
Option comb
 recordsWritten++;
   } else {
 recordsDeleted++;
+// Clear the new location which is set for all records in init()
+if (needsUpdateLocation()) {
+  newRecord.unseal();
+  newRecord.clearNewLocation();

Review Comment:
   Flink can also impl RLI with deletes, there is no need to add this check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.

2023-06-27 Thread via GitHub


danny0405 commented on code in PR #9058:
URL: https://github.com/apache/hudi/pull/9058#discussion_r1243491259


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -314,6 +314,12 @@ private boolean writeRecord(HoodieRecord newRecord, 
Option comb
 recordsWritten++;
   } else {
 recordsDeleted++;
+// Clear the new location which is set for all records in init()
+if (needsUpdateLocation()) {
+  newRecord.unseal();
+  newRecord.clearNewLocation();

Review Comment:
   The check `needsUpdateLocation()` may be not necessary because Flink also 
needs it. And instead, I think we should switch to another check:
   
   ```java
   if (RLI is enabled ...) {
 xxx
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.

2023-06-27 Thread via GitHub


danny0405 commented on code in PR #9058:
URL: https://github.com/apache/hudi/pull/9058#discussion_r1243491259


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -314,6 +314,12 @@ private boolean writeRecord(HoodieRecord newRecord, 
Option comb
 recordsWritten++;
   } else {
 recordsDeleted++;
+// Clear the new location which is set for all records in init()
+if (needsUpdateLocation()) {
+  newRecord.unseal();
+  newRecord.clearNewLocation();

Review Comment:
   The check `needsUpdateLocation()` may be not necessary because Flink also 
needs it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.

2023-06-27 Thread via GitHub


danny0405 commented on code in PR #9058:
URL: https://github.com/apache/hudi/pull/9058#discussion_r1243490522


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieIndex.java:
##
@@ -749,6 +749,67 @@ public void testRecordIndexTagLocationAndUpdate(boolean 
populateMetaFields) thro
 assertEquals(newInsertsCount, recordLocations.filter(entry -> 
newPartitionPath.equalsIgnoreCase(entry._1.getPartitionPath())).count());
   }
 
+  @ParameterizedTest
+  @ValueSource(strings = "INMEMORY")

Review Comment:
   Why parameterizing the test with only one value?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.

2023-06-27 Thread via GitHub


danny0405 commented on code in PR #9058:
URL: https://github.com/apache/hudi/pull/9058#discussion_r1243487643


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieBackedMetadata.java:
##
@@ -3002,6 +3004,99 @@ public void testOutOfOrderCommits() throws Exception {
 validateMetadata(client);
   }
 
+  @Test
+  public void testDeleteWithRecordIndex() throws Exception {
+init(HoodieTableType.COPY_ON_WRITE, true);
+HoodieSparkEngineContext engineContext = new HoodieSparkEngineContext(jsc);
+HoodieWriteConfig writeConfig = getWriteConfigBuilder(true, true, false)
+
.withMetadataConfig(HoodieMetadataConfig.newBuilder().withEnableRecordIndex(true).withMaxNumDeltaCommitsBeforeCompaction(1).build())
+// In memory index is required for the writestatus to track the 
written records
+
.withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.INMEMORY).build())

Review Comment:
   These 2 options to enable the RLI looks weird, for Spark engine the default 
index type is `SIMPLE`, that means in order to enable RLI, people always needs 
to configure these two, shouldn't RLI be a new index type then?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org