Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-06-23 Thread via GitHub


skyshineb commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-2184920196

   hi @codope! I planned to test this MDT implementation and the Lucene(one 
which I took from previous SI attempt and finished myself). And figure out is 
it profitable to use Lucene or not. But why this PR got closed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-06-11 Thread via GitHub


codope closed pull request #10625:  [HUDI-7384] Secondary index support
URL: https://github.com/apache/hudi/pull/10625


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-06-11 Thread via GitHub


codope commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-2161277447

   Hi @skyshineb , we do plan to add more index types. If you are interested in 
contributing to lucene based secondary index, I can help you to get started 
with multi-modal indexing framework. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-06-11 Thread via GitHub


bhat-vinay commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-2160742331

   > Hi @bhat-vinay! Is this design of secondary index through MDT is the only 
one to be implemented or there plans to make some other Index Types? As I 
remember there was RFC for Lucene Index and maybe some other types in future?
   
   Please get in touch with @codope for the latest update on this. AFAIK, 
lucene based secondary index is not planned at this time and MDT based 
secondary index is the one being developed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-06-08 Thread via GitHub


skyshineb commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-2155988087

   Hi @bhat-vinay! Is this design of secondary index through MDT is the only 
one to be implemented or there plans to make some other Index Types? As I 
remember there was RFC for Lucene Index and maybe some other types in future?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-03-05 Thread via GitHub


hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1978614908

   
   ## CI report:
   
   * 32d4469a9f0f7aaa87b510a936a5b74c3d734711 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22793)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-03-05 Thread via GitHub


hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1978427717

   
   ## CI report:
   
   * 890599aab5c4658b4d0013539184727e073bba42 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22660)
 
   * 32d4469a9f0f7aaa87b510a936a5b74c3d734711 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22793)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-03-05 Thread via GitHub


hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1978413454

   
   ## CI report:
   
   * 890599aab5c4658b4d0013539184727e073bba42 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22660)
 
   * 32d4469a9f0f7aaa87b510a936a5b74c3d734711 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-02-28 Thread via GitHub


bhat-vinay commented on code in PR #10625:
URL: https://github.com/apache/hudi/pull/10625#discussion_r1505573458


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -546,6 +562,39 @@ private HoodieFunctionalIndexDefinition 
getFunctionalIndexDefinition(String inde
 }
   }
 
+  private Pair> 
initializeSecondaryIndexPartition() throws IOException {
+HoodieMetadataFileSystemView fsView = new 
HoodieMetadataFileSystemView(dataMetaClient, 
dataMetaClient.getActiveTimeline(), metadata);
+// Collect the list of latest file slices present in each partition
+List partitions = metadata.getAllPartitionPaths();
+fsView.loadAllPartitions();
+
+List> partitionFileSlicePairs = new ArrayList<>();
+partitions.forEach(partition -> {

Review Comment:
   This is actually borrowed from how the existing indexes/partitions are built 
in metadata table (like RLI partion in `initializeRecordIndexPartition()`)



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java:
##
@@ -82,6 +88,9 @@ public class HoodieMergedLogRecordScanner extends 
AbstractHoodieLogRecordReader
   public final HoodieTimer timer = HoodieTimer.create();
   // Map of compacted/merged records
   private final ExternalSpillableMap records;
+
+  private final ExternalSpillableMap> 
nonUniqueKeyRecords;

Review Comment:
   Guava and Chronicle supports persistent multi-map structures (just from the 
documentation of the library)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-02-28 Thread via GitHub


bhat-vinay commented on code in PR #10625:
URL: https://github.com/apache/hudi/pull/10625#discussion_r1505537600


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java:
##
@@ -111,6 +123,16 @@ private HoodieMergedLogRecordScanner(FileSystem fs, String 
basePath, List(maxMemorySizeInBytes, 
spillableMapBasePath, new DefaultSizeEstimator(),
   new HoodieRecordSizeEstimator(readerSchema), diskMapType, 
isBitCaskDiskMapCompressionEnabled);
+  this.nonUniqueKeyRecords = new 
ExternalSpillableMap<>(maxMemorySizeInBytes, spillableMapBasePath, new 
DefaultSizeEstimator(),
+  new HoodieRecordSizeEstimator(readerSchema), diskMapType, 
isBitCaskDiskMapCompressionEnabled);
+
+  if (logFilePaths.size() > 0 && 
HoodieTableMetadata.isMetadataTableSecondaryIndexPartition(basePath, 
partitionName)) {

Review Comment:
   Can this all be hidden inside a method (still in this layer) - there needs 
to be some way to determine if the logs can have non-unique-keys. Initial 
implementation  had it one layer above (i.e the callers instantiating 
`HoodieMergedLogRecordScanner` passing in the flag), but having it here looked 
cleaner.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-02-28 Thread via GitHub


bhat-vinay commented on code in PR #10625:
URL: https://github.com/apache/hudi/pull/10625#discussion_r1505530807


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java:
##
@@ -82,6 +88,9 @@ public class HoodieMergedLogRecordScanner extends 
AbstractHoodieLogRecordReader
   public final HoodieTimer timer = HoodieTimer.create();
   // Map of compacted/merged records
   private final ExternalSpillableMap records;
+
+  private final ExternalSpillableMap> 
nonUniqueKeyRecords;

Review Comment:
   `ExternalSpillableMap` needs to support a disk based map (or need a disk 
backed multi-map that you pointed out). have not really looked into 
`ExternalSpillableMap` implementation yet.



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieFileSliceReader.java:
##
@@ -63,15 +61,21 @@ public HoodieFileSliceReader(Option 
baseFileReader,
 }
 this.props = props;
 this.simpleKeyGenFieldsOpt = simpleKeyGenFieldsOpt;
-this.records = scanner.getRecords();
   }
 
   private boolean hasNextInternal() {
 while (baseFileIterator.isPresent() && baseFileIterator.get().hasNext()) {
   try {
 HoodieRecord currentRecord = 
baseFileIterator.get().next().wrapIntoHoodieRecordPayloadWithParams(schema, 
props,
 simpleKeyGenFieldsOpt, scanner.isWithOperationField(), 
scanner.getPartitionNameOverride(), false, Option.empty());
-Option logRecord = 
removeLogRecord(currentRecord.getRecordKey());
+
+if (!scanner.hasKey(currentRecord.getRecordKey())) {

Review Comment:
   Will remove. Leftover code from a previous iteration (which is subsumed by 
the changes below)



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1005,6 +1056,73 @@ private HoodieData 
getFunctionalIndexUpdates(HoodieCommitMetadata
 return getFunctionalIndexRecords(partitionFileSlicePairs, indexDefinition, 
dataMetaClient, parallelism, readerSchema, hadoopConf);
   }
 
+  private void updateSecondaryIndexIfPresent(HoodieCommitMetadata 
commitMetadata, Map> 
partitionToRecordMap, HoodieData writeStatus) {
+dataMetaClient.getTableConfig().getMetadataPartitions()
+.stream()
+.filter(partition -> 
partition.startsWith(HoodieTableMetadataUtil.PARTITION_NAME_SECONDARY_INDEX_PREFIX))
+.forEach(partition -> {
+  HoodieData secondaryIndexRecords;
+  try {
+secondaryIndexRecords = getSecondaryIndexUpdates(commitMetadata, 
writeStatus);
+  } catch (Exception e) {
+throw new HoodieMetadataException("Failed to get secondary index 
updates for partition " + partition, e);
+  }
+  partitionToRecordMap.put(SECONDARY_INDEX, secondaryIndexRecords);
+});
+  }
+
+  private HoodieData 
getSecondaryIndexUpdates(HoodieCommitMetadata commitMetadata, 
HoodieData writeStatus) throws Exception {
+// Build a list of basefiles+delta-log-files for every partition that this 
commit touches
+// {
+//   {
+// "partition1", { {"baseFile11", {"logFile11", "logFile12"}}, 
{"baseFile12", {"logFile11"} } },
+//   },
+//   {
+// "partition2", { {"baseFile21", {"logFile21", "logFile22"}}, 
{"baseFile22", {"logFile21"} } }
+//   }
+// }
+List>>> partitionFilePairs = new 
ArrayList<>();
+commitMetadata.getPartitionToWriteStats().forEach((dataPartition, 
writeStats) -> {
+  writeStats.forEach(writeStat -> {
+if (writeStat instanceof HoodieDeltaWriteStat) {
+  partitionFilePairs.add(Pair.of(dataPartition, 
Pair.of(((HoodieDeltaWriteStat) writeStat).getBaseFile(), 
((HoodieDeltaWriteStat) writeStat).getLogFiles(;
+} else {
+  partitionFilePairs.add(Pair.of(dataPartition, 
Pair.of(writeStat.getPath(), new ArrayList<>(;
+}
+  });
+});
+
+// Build a list of keys that need to be removed. A 'delete' record will be 
emitted into the respective FileGroup of
+// the secondary index partition for each of these keys. For a commit 
which is deleting/updating a lot of records, this
+// operation is going to be expensive (in CPU, memory and IO)
+List keysToRemove = new ArrayList<>();
+writeStatus.collectAsList().forEach(status -> {

Review Comment:
   Yes, the logic here is probably going to change if one uses the 
`WriteStatus` to hold the (old-secondary-key, new-secondary-key) pair. Hence 
did not think of optimising here yet.



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java:
##
@@ -111,6 +123,16 @@ private HoodieMergedLogRecordScanner(FileSystem fs, String 
basePath, List(maxMemorySizeInBytes, 
spillableMapBasePath, new DefaultSizeEstimator(),
   new HoodieRecordSizeEstimator(readerSchema), diskMapType, 
isBitCaskDiskMapCompressionEnabled);
+  this.nonUniqueKeyRecords = new 
ExternalSpill

Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-02-27 Thread via GitHub


vinothchandar commented on code in PR #10625:
URL: https://github.com/apache/hudi/pull/10625#discussion_r1504979741


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -223,6 +228,11 @@ private void enablePartitions() {
 }
 if (dataWriteConfig.isRecordIndexEnabled() || 
dataMetaClient.getTableConfig().isMetadataPartitionAvailable(RECORD_INDEX)) {
   this.enabledPartitionTypes.add(RECORD_INDEX);
+
+  // Enable secondary index only iff record index is enabled
+  if (dataWriteConfig.isSecondaryIndexEnabled() || 
dataMetaClient.getTableConfig().isMetadataPartitionAvailable(SECONDARY_INDEX)) {
+this.enabledPartitionTypes.add(SECONDARY_INDEX);

Review Comment:
   nts: rename `SECONDARY_INDEX` ? technically the record index itself is an 
unique secondary index. This is just a non-unique secondary index.  



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -546,6 +562,39 @@ private HoodieFunctionalIndexDefinition 
getFunctionalIndexDefinition(String inde
 }
   }
 
+  private Pair> 
initializeSecondaryIndexPartition() throws IOException {
+HoodieMetadataFileSystemView fsView = new 
HoodieMetadataFileSystemView(dataMetaClient, 
dataMetaClient.getActiveTimeline(), metadata);
+// Collect the list of latest file slices present in each partition
+List partitions = metadata.getAllPartitionPaths();
+fsView.loadAllPartitions();
+
+List> partitionFileSlicePairs = new ArrayList<>();
+partitions.forEach(partition -> {

Review Comment:
   would n't this take a long time, if done on the driver?



##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##
@@ -131,6 +131,12 @@ public class HoodieTableConfig extends HoodieConfig {
   .withDocumentation("Columns used to uniquely identify the table. 
Concatenated values of these fields are used as "
   + " the record key component of HoodieKey.");
 
+  public static final ConfigProperty SECONDARYKEY_FIELDS = 
ConfigProperty

Review Comment:
   I'd like for this to be absorbed into how we are tracking functional index 
flow and not introduce any specific table configs here.



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java:
##
@@ -111,6 +123,16 @@ private HoodieMergedLogRecordScanner(FileSystem fs, String 
basePath, List(maxMemorySizeInBytes, 
spillableMapBasePath, new DefaultSizeEstimator(),
   new HoodieRecordSizeEstimator(readerSchema), diskMapType, 
isBitCaskDiskMapCompressionEnabled);
+  this.nonUniqueKeyRecords = new 
ExternalSpillableMap<>(maxMemorySizeInBytes, spillableMapBasePath, new 
DefaultSizeEstimator(),
+  new HoodieRecordSizeEstimator(readerSchema), diskMapType, 
isBitCaskDiskMapCompressionEnabled);
+
+  if (logFilePaths.size() > 0 && 
HoodieTableMetadata.isMetadataTableSecondaryIndexPartition(basePath, 
partitionName)) {

Review Comment:
   this layer cannot be aware of metadata table. 



##
hudi-common/src/main/java/org/apache/hudi/keygen/constant/KeyGeneratorOptions.java:
##
@@ -54,6 +54,13 @@ public class KeyGeneratorOptions extends HoodieConfig {
   + "Actual value will be obtained by invoking .toString() on the 
field value. Nested fields can be specified using\n"
   + "the dot notation eg: `a.b.c`");
 
+  public static final ConfigProperty SECONDARYKEY_FIELD_NAME = 
ConfigProperty

Review Comment:
   +1



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -2493,6 +2493,10 @@ public boolean isLogCompactionEnabledOnMetadata() {
 return 
getBoolean(HoodieMetadataConfig.ENABLE_LOG_COMPACTION_ON_METADATA_TABLE);
   }
 
+  public boolean isSecondaryIndexEnabled() {

Review Comment:
   nts: revisit if this is needed or we need to consolidate using the record 
level index flag



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -546,6 +562,39 @@ private HoodieFunctionalIndexDefinition 
getFunctionalIndexDefinition(String inde
 }
   }
 
+  private Pair> 
initializeSecondaryIndexPartition() throws IOException {
+HoodieMetadataFileSystemView fsView = new 
HoodieMetadataFileSystemView(dataMetaClient, 
dataMetaClient.getActiveTimeline(), metadata);
+// Collect the list of latest file slices present in each partition
+List partitions = metadata.getAllPartitionPaths();
+fsView.loadAllPartitions();
+
+List> partitionFileSlicePairs = new ArrayList<>();
+partitions.forEach(partition -> {
+  fsView.getLatestFileSlices(partition).forEach(fs -> {
+partitionFileSlicePairs.add(Pair.of(partition, fs));
+  });
+});
+
+// Reuse record index parallelism config to build secondary index
+   

Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-02-27 Thread via GitHub


hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1966044087

   
   ## CI report:
   
   * 890599aab5c4658b4d0013539184727e073bba42 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22660)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-02-26 Thread via GitHub


hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1965899879

   
   ## CI report:
   
   * a65a927d36374dbd1cbdc818cfdfa7b3cac2f94c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22566)
 
   * 890599aab5c4658b4d0013539184727e073bba42 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22660)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-02-26 Thread via GitHub


hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1965891900

   
   ## CI report:
   
   * a65a927d36374dbd1cbdc818cfdfa7b3cac2f94c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22566)
 
   * 890599aab5c4658b4d0013539184727e073bba42 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-02-22 Thread via GitHub


hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1960425736

   
   ## CI report:
   
   * a65a927d36374dbd1cbdc818cfdfa7b3cac2f94c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22566)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-02-22 Thread via GitHub


hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1960262454

   
   ## CI report:
   
   * 2656b3eb1564f312a56756679514a670f581daa9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22557)
 
   * a65a927d36374dbd1cbdc818cfdfa7b3cac2f94c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22566)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-02-22 Thread via GitHub


hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1960244329

   
   ## CI report:
   
   * 2656b3eb1564f312a56756679514a670f581daa9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22557)
 
   * a65a927d36374dbd1cbdc818cfdfa7b3cac2f94c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-02-22 Thread via GitHub


bhat-vinay commented on code in PR #10625:
URL: https://github.com/apache/hudi/pull/10625#discussion_r1499861590


##
hudi-common/src/main/java/org/apache/hudi/keygen/constant/KeyGeneratorOptions.java:
##
@@ -54,6 +54,13 @@ public class KeyGeneratorOptions extends HoodieConfig {
   + "Actual value will be obtained by invoking .toString() on the 
field value. Nested fields can be specified using\n"
   + "the dot notation eg: `a.b.c`");
 
+  public static final ConfigProperty SECONDARYKEY_FIELD_NAME = 
ConfigProperty

Review Comment:
   One of the current limitations - only supports one secondary index (per 
table). Will remove  the limitation once the functionality is working end-end. 
Thinking of using the same approach as functional index (different partition 
for different secondary index based on a config file/json)



##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##
@@ -360,6 +415,112 @@ private Map> 
lookupKeysFromFileSlice
 }
   }
 
+  /**
+   * Lookup list of keys from a single file slice.
+   *
+   * @param partitionName Name of the partition
+   * @param secondaryKeys The list of secondary keys to lookup
+   * @param fileSlice The file slice to read
+   * @return A {@code Map} of secondary-key to list of {@code HoodieRecord} 
for the secondary-keys which were found in the file slice
+   */
+  private Map>> 
lookupSecondaryKeysFromFileSlice(String partitionName, List 
secondaryKeys, FileSlice fileSlice) {
+Map> logRecordsMap = new HashMap<>();
+
+Pair, HoodieMetadataLogRecordReader> readers = 
getOrCreateReaders(partitionName, fileSlice, Option.of(true));
+try {
+  List timings = new ArrayList<>(1);
+  HoodieSeekingFileReader baseFileReader = readers.getKey();
+  HoodieMetadataLogRecordReader logRecordScanner = readers.getRight();
+  if (baseFileReader == null && logRecordScanner == null) {
+return Collections.emptyMap();
+  }
+
+  // Sort it here once so that we don't need to sort individually for base 
file and for each individual log files.
+  Set secondaryKeySet = new HashSet<>(secondaryKeys.size());
+  List sortedSecondaryKeys = new ArrayList<>(secondaryKeys);
+  Collections.sort(sortedSecondaryKeys);
+  secondaryKeySet.addAll(sortedSecondaryKeys);
+
+  // TODO: Look at using scanByFullKeys() which pushes down the 
'filtering' keys and
+  //  only buffers matching records
+  logRecordScanner.scan();
+
+  Map> nonUniqueRecordsMap = 
logRecordScanner.getNonUniqueRecordsMap();
+  nonUniqueRecordsMap.forEach((secondaryKey, recordsMap) -> {
+if (secondaryKeySet.contains(secondaryKey) && 
!logRecordsMap.containsKey(secondaryKey)) {
+  logRecordsMap.put(secondaryKey, recordsMap);
+}
+  });
+
+  return readNonUniqueRecordsAndMergeWithLogRecords(baseFileReader, 
sortedSecondaryKeys, logRecordsMap, timings, partitionName);
+} catch (IOException ioe) {
+  throw new HoodieIOException("Error merging records from metadata table 
for  " + secondaryKeys.size() + " key : ", ioe);
+} finally {
+  if (!reuse) {
+closeReader(readers);
+  }
+}
+  }
+
+  private void reverseLookupSecondaryKeys(String partitionName, List 
recordKeys, FileSlice fileSlice, Map recordKeyMap) {

Review Comment:
   Please ignore for now. This may not be needed if `WriteStatus` is changed to 
carry `(record-key, Optional,Optional)` 
as discussed earlier



##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##
@@ -548,6 +787,31 @@ private Map>> fetchBaseFileAllR
 .collect(Collectors.groupingBy(Pair::getKey, 
Collectors.mapping(Pair::getValue, Collectors.toList(;
   }
 
+  private Map> 
fetchBaseFileAllRecordsByPayload(HoodieSeekingFileReader reader,

Review Comment:
   Ignore for now. This is again related to the reverse-scanning to aid the 
writer (to emit tombstone records) - will be change based on the design choice.



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieUnMergedLogRecordScanner.java:
##
@@ -33,6 +33,7 @@
 import org.apache.hadoop.fs.FileSystem;
 

Review Comment:
   Please ignore the changes in this file. Will revert in the next upload. 



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java:
##
@@ -240,6 +255,11 @@ public static HoodieMergedLogRecordScanner.Builder 
newBuilder() {
 
   @Override
   public  void processNextRecord(HoodieRecord newRecord) throws 
IOException {
+if (logContainsNonUniqueKeys) {

Review Comment:
   If the log files are for partitions that can have non-unique keys, then this 
logic makes use of the new map to buffer the scanned records.



##
hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java:
##
@@ -305,6 +305,52 @@ public Map> 
readRecordIndex(List
+   * If 

Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-02-22 Thread via GitHub


bhat-vinay commented on code in PR #10625:
URL: https://github.com/apache/hudi/pull/10625#discussion_r1499840631


##
hudi-common/src/main/avro/HoodieMetadata.avsc:
##
@@ -432,6 +432,34 @@
 }
 ],
 "default" : null
+},
+{
+"name": "SecondaryIndexMetadata",
+"doc": "Metadata Index that contains information about secondary 
keys and the corresponding record keys in the dataset",
+"type": [
+"null",
+{
+"type": "record",
+"name": "HoodieSecondaryIndexInfo",
+"fields": [
+{
+"name": "recordKey",
+"type": [
+"null",
+"string"
+],
+"default": null,
+"doc": "Refers to the record key that this 
secondary key maps to"
+},
+{
+"name": "isDeleted",

Review Comment:
   This field acts as. tombstone marker



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384] Secondary index support [hudi]

2024-02-22 Thread via GitHub


bhat-vinay commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1960130686

   Moved away from using `HoodieUnMergedLogRecordScanner`.  Added new buffer in 
`HoodieMergedLogRecordScanner` (based on SpillableDiskMap) to handle non-unique 
keys (secondary keys)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org