vinothchandar commented on code in PR #9581:
URL: https://github.com/apache/hudi/pull/9581#discussion_r1323803839


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##########
@@ -173,18 +173,18 @@ public static <R> HoodieRecord<R> tagRecord(HoodieRecord<R> record, HoodieRecord
    *
    * @param filePath            - File to filter keys from
    * @param candidateRecordKeys - Candidate keys to filter
-   * @return List of candidate keys that are available in the file
+   * @return List of pairs of candidate keys and positions that are available in the file
    */
-  public static List<String> filterKeysFromFile(Path filePath, List<String> candidateRecordKeys,
-                                                Configuration configuration) throws HoodieIndexException {
+  public static List<Pair<String, Long>> filterKeysFromFile(Path filePath, List<String> candidateRecordKeys,
+                                                            Configuration configuration) throws HoodieIndexException {
     ValidationUtils.checkArgument(FSUtils.isBaseFile(filePath));
-    List<String> foundRecordKeys = new ArrayList<>();
+    List<Pair<String, Long>> foundRecordKeys = new ArrayList<>();
     try (HoodieFileReader fileReader = HoodieFileReaderFactory.getReaderFactory(HoodieRecordType.AVRO)
         .getFileReader(configuration, filePath)) {
       // Load all rowKeys from the file, to double-confirm
       if (!candidateRecordKeys.isEmpty()) {
         HoodieTimer timer = HoodieTimer.start();
-        Set<String> fileRowKeys = fileReader.filterRowKeys(new TreeSet<>(candidateRecordKeys));
+        Set<Pair<String, Long>> fileRowKeys = fileReader.filterRowKeys(new TreeSet<>(candidateRecordKeys));

Review Comment:
   Use `TreeSet` here as well. Does the ordering inside `fileRowKeys` matter? If so, it would be better to use a `SortedSet` here, or a list.
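
   A minimal standalone sketch of the alternatives being weighed here; the `Map.Entry` stand-in for the key/position pairs is illustrative only, not the type used in the PR:

   ```java
   import java.util.AbstractMap.SimpleImmutableEntry;
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;
   import java.util.SortedSet;
   import java.util.TreeSet;

   public class KeyOrderingSketch {
     public static void main(String[] args) {
       // Illustrative stand-in for the (recordKey, position) pairs returned by filterRowKeys.
       List<Map.Entry<String, Long>> found = new ArrayList<>();
       found.add(new SimpleImmutableEntry<>("key-3", 42L));
       found.add(new SimpleImmutableEntry<>("key-1", 7L));

       // Option 1: a SortedSet with an explicit ordering (here: by position).
       SortedSet<Map.Entry<String, Long>> byPosition =
           new TreeSet<Map.Entry<String, Long>>((a, b) -> Long.compare(a.getValue(), b.getValue()));
       byPosition.addAll(found);

       // Option 2: a plain List sorted once, if duplicate positions must survive.
       found.sort((a, b) -> Long.compare(a.getValue(), b.getValue()));

       System.out.println(byPosition.first()); // key-1=7
     }
   }
   ```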



##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieStorageConfig.java:
##########
@@ -106,6 +106,15 @@ public class HoodieStorageConfig extends HoodieConfig {
           + "to be appended to a log file. This helps to make sure the data 
appended to the log file is broken up "
           + "into sizable blocks to prevent from OOM errors. This size should 
be greater than the JVM memory.");
 
+  public static final ConfigProperty<Boolean> LOGFILE_WRITE_RECORD_POSITIONS = 
ConfigProperty

Review Comment:
   This is not really a storage config, is it? i.e. it's about higher-layer functionality around updates. Could we move this to where the core write configs are and call it `WRITE_RECORD_POSITIONS`?
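
   A rough sketch of what that suggestion could look like, reusing the `ConfigProperty` builder pattern already shown in the diff above; the key string, default value, and doc text below are assumptions for illustration, not the PR's actual change:

   ```java
   import org.apache.hudi.common.config.ConfigProperty;

   public class WriteRecordPositionsConfigSketch {
     // Hypothetical upleveled property, living next to the core write configs
     // instead of the log-file storage configs.
     public static final ConfigProperty<Boolean> WRITE_RECORD_POSITIONS = ConfigProperty
         .key("hoodie.write.record.positions")
         .defaultValue(false)
         .withDocumentation("When enabled, record positions are written to the data/delete "
             + "log block headers for updates and deletes.");
   }
   ```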



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##########
@@ -89,10 +90,11 @@ public class HoodieAppendHandle<T, I, K, O> extends HoodieWriteHandle<T, I, K, O
   private static final int NUMBER_OF_RECORDS_TO_ESTIMATE_RECORD_SIZE = 100;
 
   protected final String fileId;
+  private final boolean writeRecordPositions;

Review Comment:
   I like this better than even `LOG_RECORD_POSITIONS` as the config name.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##########
@@ -89,10 +90,11 @@ public class HoodieAppendHandle<T, I, K, O> extends HoodieWriteHandle<T, I, K, O
   private static final int NUMBER_OF_RECORDS_TO_ESTIMATE_RECORD_SIZE = 100;
 
   protected final String fileId;
+  private final boolean writeRecordPositions;
   // Buffer for holding records in memory before they are flushed to disk
   private final List<HoodieRecord> recordList = new ArrayList<>();
   // Buffer for holding records (to be deleted) in memory before they are flushed to disk
-  private final List<DeleteRecord> recordsToDelete = new ArrayList<>();
+  private final List<Pair<DeleteRecord, Long>> recordsToDelete = new ArrayList<>();

Review Comment:
   Rename this variable to reflect the new pair type.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##########
@@ -461,11 +464,13 @@ protected void appendDataAndDeleteBlocks(Map<HeaderMetadataType, String> header,
             ? HoodieRecord.RECORD_KEY_METADATA_FIELD
             : hoodieTable.getMetaClient().getTableConfig().getRecordKeyFieldProp();
 
-        blocks.add(getBlock(config, pickLogDataBlockFormat(), recordList, getUpdatedHeader(header, blockSequenceNumber++, taskContextSupplier.getAttemptIdSupplier().get()), keyField));
+        blocks.add(getBlock(config, pickLogDataBlockFormat(), recordList, writeRecordPositions,

Review Comment:
   Name booleans with either a `should` or `is` prefix, e.g. `shouldWriteRecordPositions`.
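
   A tiny self-contained example of the convention; the class and method names are illustrative, not the PR's code:

   ```java
   public class AppendHandleNamingSketch {
     // "should"/"is" prefixed booleans read as predicates at the call site.
     private final boolean shouldWriteRecordPositions;

     public AppendHandleNamingSketch(boolean shouldWriteRecordPositions) {
       this.shouldWriteRecordPositions = shouldWriteRecordPositions;
     }

     public void appendBlocks() {
       if (shouldWriteRecordPositions) {
         // build log blocks with record positions attached (illustrative only)
       }
     }
   }
   ```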



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieKeyLocationFetchHandle.java:
##########
@@ -51,26 +51,28 @@ public HoodieKeyLocationFetchHandle(HoodieWriteConfig config, HoodieTable<T, I,
     this.keyGeneratorOpt = keyGeneratorOpt;
   }
 
-  private List<HoodieKey> fetchHoodieKeys(HoodieBaseFile baseFile) {
+  private List<Pair<HoodieKey, Long>> fetchHoodieKeysAndPositions(HoodieBaseFile baseFile) {

Review Comment:
   While you are at it, rename this to `fetchRecordKeysWithPositions`.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -2029,6 +2029,10 @@ public long getLogFileDataBlockMaxSize() {
     return getLong(HoodieStorageConfig.LOGFILE_DATA_BLOCK_MAX_SIZE);
   }
 
+  public boolean shouldLogFileWriteRecordPositions() {
+    return getBoolean(HoodieStorageConfig.LOGFILE_WRITE_RECORD_POSITIONS);

Review Comment:
   This is more of a core write config than a log file config. Can we uplevel it to `LOG_RECORD_POSITIONS`, i.e. whether or not we should log record positions for updates/deletes?
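
   For illustration, the upleveled accessor inside `HoodieWriteConfig` might look roughly like this; the constant name is a placeholder following the naming discussion above, not an existing config, and only `getBoolean` is taken from the diff as shown:

   ```java
   // Hypothetical accessor reading a write-level flag rather than a log-file storage config.
   public boolean shouldWriteRecordPositions() {
     return getBoolean(WRITE_RECORD_POSITIONS); // or LOG_RECORD_POSITIONS, per the naming discussion
   }
   ```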



##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieDataBlock.java:
##########
@@ -70,14 +75,31 @@ public abstract class HoodieDataBlock extends HoodieLogBlock {
    * NOTE: This ctor is used on the write-path (ie when records ought to be written into the log)
    */
   public HoodieDataBlock(List<HoodieRecord> records,
+                         boolean writeRecordPositions,
                          Map<HeaderMetadataType, String> header,
                          Map<HeaderMetadataType, String> footer,
                          String keyFieldName) {
     super(header, footer, Option.empty(), Option.empty(), null, false);
+    if (writeRecordPositions) {
+      records.sort((o1, o2) -> {

Review Comment:
   +1, valid scenarios to think through. Let's also consider clustering and other scenarios where the record position is accurate.



##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieDeleteBlock.java:
##########
@@ -65,17 +69,44 @@ public class HoodieDeleteBlock extends HoodieLogBlock {
   private static final Lazy<HoodieDeleteRecord.Builder> HOODIE_DELETE_RECORD_BUILDER_STUB =
       Lazy.lazily(HoodieDeleteRecord::newBuilder);
 
+  private final boolean writeRecordPositions;
+  // Records to delete, sorted based on the record position if writing record position to the log block header
   private DeleteRecord[] recordsToDelete;
 
-  public HoodieDeleteBlock(DeleteRecord[] recordsToDelete, Map<HeaderMetadataType, String> header) {
-    this(Option.empty(), null, false, Option.empty(), header, new HashMap<>());
-    this.recordsToDelete = recordsToDelete;
+  public HoodieDeleteBlock(List<Pair<DeleteRecord, Long>> recordsToDelete,
+                           boolean writeRecordPositions,
+                           Map<HeaderMetadataType, String> header) {
+    this(Option.empty(), null, false, Option.empty(), header, new HashMap<>(), writeRecordPositions);
+    if (writeRecordPositions) {
+      recordsToDelete.sort((o1, o2) -> {
+        long v1 = o1.getRight();
+        long v2 = o2.getRight();
+        return Long.compare(v1, v2);
+      });
+      if (recordsToDelete.get(0).getRight() > -1L) {

Review Comment:
   If you want to check that it's not invalid, can we use the constant you defined before and call a helper like `HoodieRecordLocation.isValidPosition()` or something?
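
   A standalone sketch of the kind of helper being proposed; `INVALID_POSITION` and `isValidPosition` are the suggested (hypothetical) names, not API confirmed anywhere in this thread:

   ```java
   public final class RecordPositionSketch {
     // Sentinel for "position unknown", matching the -1L check in the constructor above.
     public static final long INVALID_POSITION = -1L;

     private RecordPositionSketch() {
     }

     // Proposed helper that centralizes the validity check.
     public static boolean isValidPosition(long position) {
       return position > INVALID_POSITION;
     }

     public static void main(String[] args) {
       System.out.println(isValidPosition(0L));               // true
       System.out.println(isValidPosition(INVALID_POSITION)); // false
     }
   }
   ```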


