hudi-agent commented on code in PR #18936:
URL: https://github.com/apache/hudi/pull/18936#discussion_r3371974516
##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java:
##########
@@ -733,12 +732,7 @@ public static void checkKeygenGenerator(boolean
isComplexHoodieKey, Configuratio
* @return HoodieMetadataConfig constructed from flink configuration.
*/
public static HoodieMetadataConfig
metadataConfig(org.apache.flink.configuration.Configuration conf) {
- Properties properties = new Properties();
-
- // set up metadata.enabled=true in table DDL to enable metadata listing
- properties.put(HoodieMetadataConfig.ENABLE.key(),
conf.get(FlinkOptions.METADATA_ENABLED));
-
- return
HoodieMetadataConfig.newBuilder().fromProperties(properties).build();
+ return FlinkWriteClients.getHoodieClientConfig(conf).getMetadataConfig();
Review Comment:
🤖 This is a fairly heavy replacement for what used to be a 4-line method —
`FlinkWriteClients.getHoodieClientConfig(conf)` builds the entire write config
(clustering, cleaning, archival, compaction, schema, indexes, merger classes,
locks, etc.) just to extract the metadata sub-config. Two concerns: (1)
`getHoodieClientConfig` calls `getSourceSchema(conf)` which throws if neither
`SOURCE_AVRO_SCHEMA_PATH` nor `SOURCE_AVRO_SCHEMA` is set — the old impl had no
schema dependency, so this is a new precondition for all read-path callers
(FileIndex, FileStatsIndex, RecordLevelIndex, etc.); (2) the per-call cost is
substantial for callers in hot read paths. Would something like
`HoodieMetadataConfig.newBuilder().fromProperties(flinkConf2TypedProperties(conf)).enable(conf.get(FlinkOptions.METADATA_ENABLED)).build()`
get you the bloom filter propagation you need without the extra dependencies?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieNativeAvroHFileReader.java:
##########
@@ -333,8 +333,8 @@ public ClosableIterator<IndexedRecord>
getIndexedRecordsByKeyPrefixIterator(List
private static BloomFilter readBloomFilter(HFileReader reader) throws
HoodieException {
try {
- ByteBuffer byteBuffer =
reader.getMetaBlock(KEY_BLOOM_FILTER_META_BLOCK).get();
- return BloomFilterFactory.fromByteBuffer(byteBuffer,
+ ByteBuffer byteBuffer =
reader.getMetaBlock(KEY_BLOOM_FILTER_META_BLOCK).orElse(null);
+ return byteBuffer == null ? null :
BloomFilterFactory.fromByteBuffer(byteBuffer,
Review Comment:
🤖 With `readBloomFilter()` now returning null (instead of throwing) when the
meta block is absent, callers that didn't previously expect null may NPE. For
example, `HoodieKeyLookupHandle.getBloomFilter()` returns the null directly,
and `HoodieKeyLookupHandle.addKey()` calls
`bloomFilter.mightContain(recordKey)` without a null check. This would only
trigger if someone uses HFile as a base file format with the bloom index AND
sets `hoodie.hfile.bloom.filter.enabled=false`, which is niche — but worth
confirming this path is intentionally unhandled, or adding a null guard.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java:
##########
@@ -57,7 +57,7 @@
* base file format.
Review Comment:
🤖 nit: `paramsMap` is pretty generic for a class field — the old
`compressionCodec` was specific about what it held. Something like
`hfileWriterParams` would make it clear at a glance that these are write-time
HFile configuration values, especially now that the field has grown from a
single codec option into a full config map.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-hadoop-common/src/test/java/org/apache/hudi/io/hadoop/TestHoodieAvroFileReaderFactory.java:
##########
@@ -70,4 +84,56 @@ public void testGetFileReader() throws IOException {
.getFileReader(DEFAULT_HUDI_CONFIG_FOR_READER, orcPath);
assertTrue(orcReader instanceof HoodieAvroOrcReader);
}
+
+ @ParameterizedTest
+ @ValueSource(booleans = {true, false})
+ public void testHFileReaderPassesBloomFilterConfig(boolean
bloomFilterEnabled)
+ throws IOException, ReflectiveOperationException {
+ HoodieStorage storage = HoodieTestUtils.getDefaultStorage();
+ HoodieFileReaderFactory readerFactory =
HoodieIOFactory.getIOFactory(storage)
+ .getReaderFactory(HoodieRecordType.AVRO);
+ HoodieMetadataConfig metadataConfig = HoodieMetadataConfig.newBuilder()
+ .withProperties(DEFAULT_HUDI_CONFIG_FOR_READER.getProps())
+ .withProperties(bloomFilterProps(bloomFilterEnabled))
+ .build();
+
+ assertBloomFilterConfig(readerFactory.getFileReader(metadataConfig,
HFILE_PATH),
+ bloomFilterEnabled);
+ assertBloomFilterConfig(readerFactory.getFileReader(metadataConfig,
+ new StoragePathInfo(HFILE_PATH, 100, false, (short) 0, 0, 0),
HoodieFileFormat.HFILE,
+ Option.<HoodieSchema>empty()), bloomFilterEnabled);
+ assertBloomFilterConfig(readerFactory.getContentReader(metadataConfig,
HFILE_PATH, HoodieFileFormat.HFILE,
+ storage, new byte[0], Option.<HoodieSchema>empty()),
+ bloomFilterEnabled);
+ }
+
+ @Test
+ public void testHFileReaderDoesNotUseBloomFilterByDefault()
+ throws IOException, ReflectiveOperationException {
+ HoodieStorage storage = HoodieTestUtils.getDefaultStorage();
+ HoodieFileReaderFactory readerFactory =
HoodieIOFactory.getIOFactory(storage)
+ .getReaderFactory(HoodieRecordType.AVRO);
+
+
assertBloomFilterConfig(readerFactory.getFileReader(DEFAULT_HUDI_CONFIG_FOR_READER,
HFILE_PATH), false);
+
assertBloomFilterConfig(readerFactory.getFileReader(DEFAULT_HUDI_CONFIG_FOR_READER,
+ new StoragePathInfo(HFILE_PATH, 100, false, (short) 0, 0, 0),
HoodieFileFormat.HFILE,
+ Option.<HoodieSchema>empty()), false);
+
assertBloomFilterConfig(readerFactory.getContentReader(DEFAULT_HUDI_CONFIG_FOR_READER,
HFILE_PATH,
+ HoodieFileFormat.HFILE, storage, new byte[0],
Option.<HoodieSchema>empty()),
+ false);
+ }
+
+ private static TypedProperties bloomFilterProps(boolean enableBloomFilter) {
+ TypedProperties properties = new TypedProperties();
+ properties.setProperty(HoodieMetadataConfig.BLOOM_FILTER_ENABLE.key(),
Boolean.toString(enableBloomFilter));
+ return properties;
+ }
+
+ private static void assertBloomFilterConfig(HoodieFileReader reader, boolean
expectedBloomFilterEnabled)
+ throws ReflectiveOperationException {
+ HoodieNativeAvroHFileReader hfileReader =
assertInstanceOf(HoodieNativeAvroHFileReader.class, reader);
Review Comment:
🤖 nit: accessing `useBloomFilter` by string name via reflection means if the
field is ever renamed the test will throw `NoSuchFieldException` at runtime
rather than fail at compile time. Could you instead verify observable behaviour
— e.g. write a small HFile without bloom filter metadata, open it with the
reader, and assert that `readBloomFilter()` returns null — so the contract
rather than the implementation detail is what's tested?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java:
##########
Review Comment:
🤖 **Line 180:** `lookupRecords` here still uses
`ConfigUtils.DEFAULT_HUDI_CONFIG_FOR_READER`, while `lookupEngineRecords` below
was updated to use `readerContext.getHoodieReaderConfig()` — so point lookups
via the `HoodieRecord` path won't pick up `BLOOM_FILTER_ENABLE` from the reader
context and won't benefit from the new bloom filter optimization. Is the intent
to only optimize the engine-record path (since that's the file-group-reader
path used by MDT lookups), or should this be updated too for consistency?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]