Re: [PR] PARQUET-2373: Improve I/O performance with bloom_filter_length [parquet-mr]

via GitHub Sun, 26 Nov 2023 22:40:46 -0800


zhangjiashen commented on code in PR #1184:
URL: https://github.com/apache/parquet-mr/pull/1184#discussion_r1396702452



##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -1347,11 +1348,24 @@ public BloomFilter readBloomFilter(ColumnChunkMetaData meta) throws IOException
       }
     }
 
-    // Read Bloom filter data header.
+    // Seek to Bloom filter offset.
     f.seek(bloomFilterOffset);
+
+    // Read Bloom filter length.
+    int bloomFilterLength = meta.getBloomFilterLength();
+
+    // If it is set, read Bloom filter header and bitset together.
+    // Otherwise, read Bloom filter header first and then bitset.
+    InputStream in = null;
+    if (bloomFilterLength > 0) {
+      byte[] headerAndBitSet = new byte[bloomFilterLength];
+      f.readFully(headerAndBitSet);
+      in = new ByteArrayInputStream(headerAndBitSet);
+    }
+
     BloomFilterHeader bloomFilterHeader;
     try {
-      bloomFilterHeader = Util.readBloomFilterHeader(f, bloomFilterDecryptor, bloomFilterHeaderAAD);
+      bloomFilterHeader = Util.readBloomFilterHeader(in != null ? in : f, bloomFilterDecryptor, bloomFilterHeaderAAD);

Review Comment:
   Splitting this into two methods would make the code harder to read. 
I changed the code a little to avoid several of the checks; please check 
whether it makes sense.
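
For reference, the single-stream shape could look roughly like the sketch below. It is only an illustration under stated assumptions: `BloomFilterReadSketch` and `openBloomFilterStream` are hypothetical names that do not exist in parquet-mr, a plain `java.io.InputStream` stands in for the `SeekableInputStream` used in `ParquetFileReader`, and a non-positive length is treated as "not set", mirroring the `bloomFilterLength > 0` check in the diff.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch, not parquet-mr code: choose the stream once, so the
// header and bitset readers do not each need an "in != null ? in : f" check.
final class BloomFilterReadSketch {

  private BloomFilterReadSketch() {}

  /**
   * Returns the stream that the header and the bitset should both be read from.
   *
   * @param file stream already positioned at the Bloom filter offset
   * @param bloomFilterLength total length of header plus bitset, or a
   *        non-positive value when the length is not recorded (assumption)
   */
  static InputStream openBloomFilterStream(InputStream file, int bloomFilterLength)
      throws IOException {
    if (bloomFilterLength > 0) {
      // Length is known: fetch header and bitset in one read and serve
      // both from memory.
      byte[] headerAndBitSet = new byte[bloomFilterLength];
      new DataInputStream(file).readFully(headerAndBitSet);
      return new ByteArrayInputStream(headerAndBitSet);
    }
    // Length is unknown: fall back to reading directly from the file,
    // header first and then bitset.
    return file;
  }
}
```

With a helper of this kind, the caller would pass the returned stream to both the header read and the bitset read, so the fallback decision is made in one place.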



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
