[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
danny0405 commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1247377008 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java: ## @@ -283,6 +285,8 @@ public HoodieMetadataPayload(Option recordOpt) { Integer.parseInt(recordIndexRecord.get(RECORD_INDEX_FIELD_FILE_INDEX).toString()), Long.parseLong(recordIndexRecord.get(RECORD_INDEX_FIELD_INSTANT_TIME).toString())); } +} else { + this.isDeletedRecord = true; Review Comment: The key is not set when the avro data is not set, so I'm wondering whether the `key == null` actually represents it is a delete record. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
danny0405 commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1246019736 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java: ## @@ -283,6 +285,8 @@ public HoodieMetadataPayload(Option recordOpt) { Integer.parseInt(recordIndexRecord.get(RECORD_INDEX_FIELD_FILE_INDEX).toString()), Long.parseLong(recordIndexRecord.get(RECORD_INDEX_FIELD_INSTANT_TIME).toString())); } +} else { + this.isDeletedRecord = true; Review Comment: BTW, I don't like how the deletes are handling with existing column stats, even though we already do that does not mean it is correct, maintaining preCombining logic for every meta info is disgusting and inefficient. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
danny0405 commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1246019736 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java: ## @@ -283,6 +285,8 @@ public HoodieMetadataPayload(Option recordOpt) { Integer.parseInt(recordIndexRecord.get(RECORD_INDEX_FIELD_FILE_INDEX).toString()), Long.parseLong(recordIndexRecord.get(RECORD_INDEX_FIELD_INSTANT_TIME).toString())); } +} else { + this.isDeletedRecord = true; Review Comment: BTW, I don't like how the deletes are handling with existing column stats, even though we already do that does not mean it is correct, maintaining preCombining logic for every meta info is disgusting. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
danny0405 commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1246016087 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java: ## @@ -283,6 +285,8 @@ public HoodieMetadataPayload(Option recordOpt) { Integer.parseInt(recordIndexRecord.get(RECORD_INDEX_FIELD_FILE_INDEX).toString()), Long.parseLong(recordIndexRecord.get(RECORD_INDEX_FIELD_INSTANT_TIME).toString())); } +} else { + this.isDeletedRecord = true; Review Comment: @prashantwason I think @xushiyan and @nsivabalan are addressing that many metadata POJOs already take a `isDeleted` flag except the recordIndexMetadata, during the convertion from write status to RLI metadata payload in `HoodieMetadataPayload#createRecordIndexUpdate`, there is a choice is to mark the recordIndexMetadata itself as deleted, which keeps the behavior in line with the other payloads. But still, in `HoodieMetadataPayload#preCombine`, the merging of RLI infos should be fixed with the `isDeleted` flag just like others. If we move to this direction, the changes in `HoodieBackedTableMetadata` should be unnecessary. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
danny0405 commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1246016087 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java: ## @@ -283,6 +285,8 @@ public HoodieMetadataPayload(Option recordOpt) { Integer.parseInt(recordIndexRecord.get(RECORD_INDEX_FIELD_FILE_INDEX).toString()), Long.parseLong(recordIndexRecord.get(RECORD_INDEX_FIELD_INSTANT_TIME).toString())); } +} else { + this.isDeletedRecord = true; Review Comment: @prashantwason I think @xushiyan and @nsivabalan are addressing that many metadata POJOs already take a `isDeleted` flag except the recordIndexMetadata, during the convertion from write status to RLI metadata payload in `HoodieMetadataPayload#createRecordIndexUpdate`, there is a choice is to mark the recordIndexMetadata itself as deleted, which keeps the behavior in line with the other payloads. But still, in `HoodieMetadataPayload#preCombine`, the merging of RLI infos should be fixed with the `isDeleted` falg just like others. If we move to this direction, the changes in `HoodieBackedTableMetadata` should be unnecessary. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
danny0405 commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1246016087 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java: ## @@ -283,6 +285,8 @@ public HoodieMetadataPayload(Option recordOpt) { Integer.parseInt(recordIndexRecord.get(RECORD_INDEX_FIELD_FILE_INDEX).toString()), Long.parseLong(recordIndexRecord.get(RECORD_INDEX_FIELD_INSTANT_TIME).toString())); } +} else { + this.isDeletedRecord = true; Review Comment: @prashantwason I think @xushiyan and @nsivabalan are addressing that many metadata POJOs already take a `isDeleted` flag except the recordIndexMetadata, during the convertion from write status to RLI metadata payload in `HoodieMetadataPayload#createRecordIndexUpdate`, there is a choice is to mark the recordIndexMetadata itself as deleted, which keeps the behavior in line with the other payloads. But still, in `HoodieMetadataPayload#preCombine`, the merging of RLI infos should be fixed with the `isDeleted` falg just like others. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
danny0405 commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1245996840 ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieIndex.java: ## @@ -749,6 +749,67 @@ public void testRecordIndexTagLocationAndUpdate(boolean populateMetaFields) thro assertEquals(newInsertsCount, recordLocations.filter(entry -> newPartitionPath.equalsIgnoreCase(entry._1.getPartitionPath())).count()); } + @ParameterizedTest + @ValueSource(strings = "INMEMORY") Review Comment: Then just use `@Test` should be fine. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
danny0405 commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1244567699 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java: ## @@ -314,6 +314,12 @@ private boolean writeRecord(HoodieRecord newRecord, Option comb recordsWritten++; } else { recordsDeleted++; +// Clear the new location which is set for all records in init() +if (needsUpdateLocation()) { + newRecord.unseal(); + newRecord.clearNewLocation(); Review Comment: Flink can also impl RLI with deletes, there is no need to add this check. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
danny0405 commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1243491259 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java: ## @@ -314,6 +314,12 @@ private boolean writeRecord(HoodieRecord newRecord, Option comb recordsWritten++; } else { recordsDeleted++; +// Clear the new location which is set for all records in init() +if (needsUpdateLocation()) { + newRecord.unseal(); + newRecord.clearNewLocation(); Review Comment: The check `needsUpdateLocation()` may be not necessary because Flink also needs it. And instead, I think we should switch to another check: ```java if (RLI is enabled ...) { xxx } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
danny0405 commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1243491259 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java: ## @@ -314,6 +314,12 @@ private boolean writeRecord(HoodieRecord newRecord, Option comb recordsWritten++; } else { recordsDeleted++; +// Clear the new location which is set for all records in init() +if (needsUpdateLocation()) { + newRecord.unseal(); + newRecord.clearNewLocation(); Review Comment: The check `needsUpdateLocation()` may be not necessary because Flink also needs it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
danny0405 commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1243490522 ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieIndex.java: ## @@ -749,6 +749,67 @@ public void testRecordIndexTagLocationAndUpdate(boolean populateMetaFields) thro assertEquals(newInsertsCount, recordLocations.filter(entry -> newPartitionPath.equalsIgnoreCase(entry._1.getPartitionPath())).count()); } + @ParameterizedTest + @ValueSource(strings = "INMEMORY") Review Comment: Why parameterizing the test with only one value? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
danny0405 commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1243487643 ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieBackedMetadata.java: ## @@ -3002,6 +3004,99 @@ public void testOutOfOrderCommits() throws Exception { validateMetadata(client); } + @Test + public void testDeleteWithRecordIndex() throws Exception { +init(HoodieTableType.COPY_ON_WRITE, true); +HoodieSparkEngineContext engineContext = new HoodieSparkEngineContext(jsc); +HoodieWriteConfig writeConfig = getWriteConfigBuilder(true, true, false) + .withMetadataConfig(HoodieMetadataConfig.newBuilder().withEnableRecordIndex(true).withMaxNumDeltaCommitsBeforeCompaction(1).build()) +// In memory index is required for the writestatus to track the written records + .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.INMEMORY).build()) Review Comment: These 2 options to enable the RLI looks weird, for Spark engine the default index type is `SIMPLE`, that means in order to enable RLI, people always needs to configure these two, shouldn't RLI be a new index type then? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org