[
https://issues.apache.org/jira/browse/PARQUET-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782957#comment-17782957
]
ASF GitHub Bot commented on PARQUET-2373:
-----------------------------------------
wgtmac commented on code in PR #1184:
URL: https://github.com/apache/parquet-mr/pull/1184#discussion_r1382523507
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -1347,11 +1348,24 @@ public BloomFilter readBloomFilter(ColumnChunkMetaData meta) throws IOException
}
}
- // Read Bloom filter data header.
+ // Seek to Bloom filter offset.
f.seek(bloomFilterOffset);
+
+ // Read Bloom filter length.
+ int bloomFilterLength = meta.getBloomFilterLength();
+
+ // If it is set, read Bloom filter header and bitset together.
+ // Otherwise, read Bloom filter header first and then bitset.
+ InputStream in = null;
+ if (bloomFilterLength > 0) {
+ byte[] headerAndBitSet = new byte[bloomFilterLength];
+ f.readFully(headerAndBitSet);
+ in = new ByteArrayInputStream(headerAndBitSet);
+ }
+
BloomFilterHeader bloomFilterHeader;
try {
-      bloomFilterHeader = Util.readBloomFilterHeader(f, bloomFilterDecryptor, bloomFilterHeaderAAD);
+      bloomFilterHeader = Util.readBloomFilterHeader(in != null ? in : f, bloomFilterDecryptor, bloomFilterHeaderAAD);
Review Comment:
Should we separate these two cases into two methods? Here and below there are
several checks like `in != null ? in : f`.
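For illustration only, the split might look roughly like the sketch below. The helper names are made up, the surrounding ParquetFileReader imports and fields are assumed, and encryption is skipped by passing a null decryptor/AAD to the same `Util.readBloomFilterHeader` signature the diff already uses; this is a sketch of the idea, not the PR's implementation.

```java
// Sketch only (hypothetical helper names): one path for footers that carry
// bloom_filter_length, one for older files without it, so callers no longer
// need the `in != null ? in : f` checks.
private BloomFilter readBloomFilterWithLength(SeekableInputStream f, long offset, int length)
    throws IOException {
  // bloom_filter_length is known: fetch header + bitset with a single readFully.
  f.seek(offset);
  byte[] headerAndBitset = new byte[length];
  f.readFully(headerAndBitset);
  InputStream in = new ByteArrayInputStream(headerAndBitset);
  BloomFilterHeader header = Util.readBloomFilterHeader(in, null, null); // encryption elided
  byte[] bitset = new byte[header.getNumBytes()];
  if (in.read(bitset) != bitset.length) {
    throw new IOException("Bloom filter bitset is shorter than bloom_filter_length implies");
  }
  return new BlockSplitBloomFilter(bitset);
}

private BloomFilter readBloomFilterLegacy(SeekableInputStream f, long offset) throws IOException {
  // No bloom_filter_length in the footer: read the header first, then the bitset.
  f.seek(offset);
  BloomFilterHeader header = Util.readBloomFilterHeader(f, null, null); // encryption elided
  byte[] bitset = new byte[header.getNumBytes()];
  f.readFully(bitset);
  return new BlockSplitBloomFilter(bitset);
}
```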
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java:
##########
@@ -341,6 +351,15 @@ public long getBloomFilterOffset() {
return bloomFilterOffset;
}
+ /**
+ * @return the length to the Bloom filter or {@code -1} if there is no bloom filter for this column chunk
Review Comment:
```suggestion
* @return the length to the Bloom filter or {@code -1} if there is no bloom filter length for this column chunk
```
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java:
##########
@@ -341,6 +351,15 @@ public long getBloomFilterOffset() {
return bloomFilterOffset;
}
+ /**
+ * @return the length to the Bloom filter or {@code -1} if there is no bloom filter for this column chunk
+ */
+ @Private
+ public int getBloomFilterLength() {
Review Comment:
BTW, an e2e test case will be helpful to guarantee we have written the
length correctly.
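As a rough illustration of such an end-to-end check (the path, column name, and record count are arbitrary, and it assumes the new `getBloomFilterLength()` accessor added by this PR plus the existing example writer API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.values.bloomfilter.BloomFilter;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class BloomFilterLengthRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path("/tmp/bloom_filter_length_test.parquet");
    MessageType schema = MessageTypeParser.parseMessageType(
        "message test { required binary name (UTF8); }");

    // Write a file with a Bloom filter enabled on the "name" column.
    try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(file)
        .withConf(conf)
        .withType(schema)
        .withBloomFilterEnabled("name", true)
        .build()) {
      SimpleGroupFactory factory = new SimpleGroupFactory(schema);
      for (int i = 0; i < 1000; i++) {
        writer.write(factory.newGroup().append("name", "value-" + i));
      }
    }

    // Reopen the file and check that the footer carries a positive
    // bloom_filter_length and that the filter is still readable.
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
      ColumnChunkMetaData column =
          reader.getFooter().getBlocks().get(0).getColumns().get(0);
      if (column.getBloomFilterLength() <= 0) {
        throw new AssertionError("bloom_filter_length was not written to the footer");
      }
      BloomFilter bloomFilter = reader.readBloomFilter(column);
      if (bloomFilter == null) {
        throw new AssertionError("Bloom filter could not be read back");
      }
    }
  }
}
```

A test along these lines would catch both a writer that stops populating bloom_filter_length and a regression in the single-shot read path.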
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java:
##########
@@ -341,6 +351,15 @@ public long getBloomFilterOffset() {
return bloomFilterOffset;
}
+ /**
+ * @return the length to the Bloom filter or {@code -1} if there is no bloom filter for this column chunk
+ */
+ @Private
+ public int getBloomFilterLength() {
Review Comment:
Return Optional<Int>?
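For comparison, a `java.util.OptionalInt`-based accessor could be shaped like the sketch below; this only illustrates the suggestion, not what the PR currently does, and it assumes a backing `bloomFilterLength` field where `-1` means the length is not set.

```java
// Hypothetical alternative to the int / -1 sentinel: expose the absence of
// bloom_filter_length through java.util.OptionalInt instead.
public OptionalInt getBloomFilterLength() {
  return bloomFilterLength < 0 ? OptionalInt.empty() : OptionalInt.of(bloomFilterLength);
}
```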
> Improve I/O performance with bloom_filter_length
> ------------------------------------------------
>
> Key: PARQUET-2373
> URL: https://issues.apache.org/jira/browse/PARQUET-2373
> Project: Parquet
> Issue Type: Improvement
> Reporter: Jiashen Zhang
> Priority: Minor
>
> The spec change in PARQUET-2257 added bloom_filter_length so that readers can
> load the bloom filter in a single shot. This change makes use of the
> 'bloom_filter_length' field to load the bloom filter (consisting of the header
> and bitset) in one read, improving I/O scheduling.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)