Re: [PR] PARQUET-2373: Improve I/O performance with bloom_filter_length [parquet-mr]

via GitHub Thu, 16 Nov 2023 20:51:26 -0800


zhangjiashen commented on code in PR #1184:
URL: https://github.com/apache/parquet-mr/pull/1184#discussion_r1396702452



##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -1347,11 +1348,24 @@ public BloomFilter readBloomFilter(ColumnChunkMetaData 
meta) throws IOException
       }
     }
 
-    // Read Bloom filter data header.
+    // Seek to Bloom filter offset.
     f.seek(bloomFilterOffset);
+
+    // Read Bloom filter length.
+    int bloomFilterLength = meta.getBloomFilterLength();
+
+    // If it is set, read Bloom filter header and bitset together.
+    // Otherwise, read Bloom filter header first and then bitset.
+    InputStream in = null;
+    if (bloomFilterLength > 0) {
+      byte[] headerAndBitSet = new byte[bloomFilterLength];
+      f.readFully(headerAndBitSet);
+      in = new ByteArrayInputStream(headerAndBitSet);
+    }
+
     BloomFilterHeader bloomFilterHeader;
     try {
-      bloomFilterHeader = Util.readBloomFilterHeader(f, bloomFilterDecryptor, 
bloomFilterHeaderAAD);
+      bloomFilterHeader = Util.readBloomFilterHeader(in != null ? in : f, 
bloomFilterDecryptor, bloomFilterHeaderAAD);

Review Comment:
   It would make code unreadable if we separate these into two methods. Changed 
code little bit to avoid sereral checks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] PARQUET-2373: Improve I/O performance with bloom_filter_length [parquet-mr]

Reply via email to