Re: [PR] fix(mdt): Propagate bloom filter config to HFile reader correctly [hudi]

via GitHub Mon, 08 Jun 2026 02:10:45 -0700


hudi-agent commented on code in PR #18936:
URL: https://github.com/apache/hudi/pull/18936#discussion_r3371974516



##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java:
##########
@@ -733,12 +732,7 @@ public static void checkKeygenGenerator(boolean 
isComplexHoodieKey, Configuratio
    * @return HoodieMetadataConfig constructed from flink configuration.
    */
   public static HoodieMetadataConfig 
metadataConfig(org.apache.flink.configuration.Configuration conf) {
-    Properties properties = new Properties();
-
-    // set up metadata.enabled=true in table DDL to enable metadata listing
-    properties.put(HoodieMetadataConfig.ENABLE.key(), 
conf.get(FlinkOptions.METADATA_ENABLED));
-
-    return 
HoodieMetadataConfig.newBuilder().fromProperties(properties).build();
+    return FlinkWriteClients.getHoodieClientConfig(conf).getMetadataConfig();

Review Comment:
   🤖 This is a fairly heavy replacement for what used to be a 4-line method — 
`FlinkWriteClients.getHoodieClientConfig(conf)` builds the entire write config 
(clustering, cleaning, archival, compaction, schema, indexes, merger classes, 
locks, etc.) just to extract the metadata sub-config. Two concerns: (1) 
`getHoodieClientConfig` calls `getSourceSchema(conf)` which throws if neither 
`SOURCE_AVRO_SCHEMA_PATH` nor `SOURCE_AVRO_SCHEMA` is set — the old impl had no 
schema dependency, so this is a new precondition for all read-path callers 
(FileIndex, FileStatsIndex, RecordLevelIndex, etc.); (2) the per-call cost is 
substantial for callers in hot read paths. Would something like 
`HoodieMetadataConfig.newBuilder().fromProperties(flinkConf2TypedProperties(conf)).enable(conf.get(FlinkOptions.METADATA_ENABLED)).build()`
 get you the bloom filter propagation you need without the extra dependencies?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieNativeAvroHFileReader.java:
##########
@@ -333,8 +333,8 @@ public ClosableIterator<IndexedRecord> 
getIndexedRecordsByKeyPrefixIterator(List
 
   private static BloomFilter readBloomFilter(HFileReader reader) throws 
HoodieException {
     try {
-      ByteBuffer byteBuffer = 
reader.getMetaBlock(KEY_BLOOM_FILTER_META_BLOCK).get();
-      return BloomFilterFactory.fromByteBuffer(byteBuffer,
+      ByteBuffer byteBuffer = 
reader.getMetaBlock(KEY_BLOOM_FILTER_META_BLOCK).orElse(null);
+      return byteBuffer == null ? null : 
BloomFilterFactory.fromByteBuffer(byteBuffer,

Review Comment:
   🤖 With `readBloomFilter()` now returning null (instead of throwing) when the 
meta block is absent, callers that didn't previously expect null may NPE. For 
example, `HoodieKeyLookupHandle.getBloomFilter()` returns the null directly, 
and `HoodieKeyLookupHandle.addKey()` calls 
`bloomFilter.mightContain(recordKey)` without a null check. This would only 
trigger if someone uses HFile as a base file format with the bloom index AND 
sets `hoodie.hfile.bloom.filter.enabled=false`, which is niche — but worth 
confirming this path is intentionally unhandled, or adding a null guard.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java:
##########
@@ -57,7 +57,7 @@
  * base file format.

Review Comment:
   🤖 nit: `paramsMap` is pretty generic for a class field — the old 
`compressionCodec` was specific about what it held. Something like 
`hfileWriterParams` would make it clear at a glance that these are write-time 
HFile configuration values, especially now that the field has grown from a 
single codec option into a full config map.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-hadoop-common/src/test/java/org/apache/hudi/io/hadoop/TestHoodieAvroFileReaderFactory.java:
##########
@@ -70,4 +84,56 @@ public void testGetFileReader() throws IOException {
         .getFileReader(DEFAULT_HUDI_CONFIG_FOR_READER, orcPath);
     assertTrue(orcReader instanceof HoodieAvroOrcReader);
   }
+
+  @ParameterizedTest
+  @ValueSource(booleans = {true, false})
+  public void testHFileReaderPassesBloomFilterConfig(boolean 
bloomFilterEnabled)
+      throws IOException, ReflectiveOperationException {
+    HoodieStorage storage = HoodieTestUtils.getDefaultStorage();
+    HoodieFileReaderFactory readerFactory = 
HoodieIOFactory.getIOFactory(storage)
+        .getReaderFactory(HoodieRecordType.AVRO);
+    HoodieMetadataConfig metadataConfig = HoodieMetadataConfig.newBuilder()
+        .withProperties(DEFAULT_HUDI_CONFIG_FOR_READER.getProps())
+        .withProperties(bloomFilterProps(bloomFilterEnabled))
+        .build();
+
+    assertBloomFilterConfig(readerFactory.getFileReader(metadataConfig, 
HFILE_PATH),
+        bloomFilterEnabled);
+    assertBloomFilterConfig(readerFactory.getFileReader(metadataConfig,
+        new StoragePathInfo(HFILE_PATH, 100, false, (short) 0, 0, 0), 
HoodieFileFormat.HFILE,
+        Option.<HoodieSchema>empty()), bloomFilterEnabled);
+    assertBloomFilterConfig(readerFactory.getContentReader(metadataConfig, 
HFILE_PATH, HoodieFileFormat.HFILE,
+        storage, new byte[0], Option.<HoodieSchema>empty()),
+        bloomFilterEnabled);
+  }
+
+  @Test
+  public void testHFileReaderDoesNotUseBloomFilterByDefault()
+      throws IOException, ReflectiveOperationException {
+    HoodieStorage storage = HoodieTestUtils.getDefaultStorage();
+    HoodieFileReaderFactory readerFactory = 
HoodieIOFactory.getIOFactory(storage)
+        .getReaderFactory(HoodieRecordType.AVRO);
+
+    
assertBloomFilterConfig(readerFactory.getFileReader(DEFAULT_HUDI_CONFIG_FOR_READER,
 HFILE_PATH), false);
+    
assertBloomFilterConfig(readerFactory.getFileReader(DEFAULT_HUDI_CONFIG_FOR_READER,
+        new StoragePathInfo(HFILE_PATH, 100, false, (short) 0, 0, 0), 
HoodieFileFormat.HFILE,
+        Option.<HoodieSchema>empty()), false);
+    
assertBloomFilterConfig(readerFactory.getContentReader(DEFAULT_HUDI_CONFIG_FOR_READER,
 HFILE_PATH,
+        HoodieFileFormat.HFILE, storage, new byte[0], 
Option.<HoodieSchema>empty()),
+        false);
+  }
+
+  private static TypedProperties bloomFilterProps(boolean enableBloomFilter) {
+    TypedProperties properties = new TypedProperties();
+    properties.setProperty(HoodieMetadataConfig.BLOOM_FILTER_ENABLE.key(), 
Boolean.toString(enableBloomFilter));
+    return properties;
+  }
+
+  private static void assertBloomFilterConfig(HoodieFileReader reader, boolean 
expectedBloomFilterEnabled)
+      throws ReflectiveOperationException {
+    HoodieNativeAvroHFileReader hfileReader = 
assertInstanceOf(HoodieNativeAvroHFileReader.class, reader);

Review Comment:
   🤖 nit: accessing `useBloomFilter` by string name via reflection means if the 
field is ever renamed the test will throw `NoSuchFieldException` at runtime 
rather than fail at compile time. Could you instead verify observable behaviour 
— e.g. write a small HFile without bloom filter metadata, open it with the 
reader, and assert that `readBloomFilter()` returns null — so the contract 
rather than the implementation detail is what's tested?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java:
##########


Review Comment:
   🤖 **Line 180:** `lookupRecords` here still uses 
`ConfigUtils.DEFAULT_HUDI_CONFIG_FOR_READER`, while `lookupEngineRecords` below 
was updated to use `readerContext.getHoodieReaderConfig()` — so point lookups 
via the `HoodieRecord` path won't pick up `BLOOM_FILTER_ENABLE` from the reader 
context and won't benefit from the new bloom filter optimization. Is the intent 
to only optimize the engine-record path (since that's the file-group-reader 
path used by MDT lookups), or should this be updated too for consistency?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix(mdt): Propagate bloom filter config to HFile reader correctly [hudi]

Reply via email to