[
https://issues.apache.org/jira/browse/PARQUET-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782957#comment-17782957
]
ASF GitHub Bot commented on PARQUET-2373:
-----------------------------------------
wgtmac commented on code in PR #1184:
URL: https://github.com/apache/parquet-mr/pull/1184#discussion_r1382523507
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -1347,11 +1348,24 @@ public BloomFilter readBloomFilter(ColumnChunkMetaData meta) throws IOException
}
}
- // Read Bloom filter data header.
+ // Seek to Bloom filter offset.
f.seek(bloomFilterOffset);
+
+ // Read Bloom filter length.
+ int bloomFilterLength = meta.getBloomFilterLength();
+
+ // If it is set, read Bloom filter header and bitset together.
+ // Otherwise, read Bloom filter header first and then bitset.
+ InputStream in = null;
+ if (bloomFilterLength > 0) {
+ byte[] headerAndBitSet = new byte[bloomFilterLength];
+ f.readFully(headerAndBitSet);
+ in = new ByteArrayInputStream(headerAndBitSet);
+ }
+
BloomFilterHeader bloomFilterHeader;
try {
-      bloomFilterHeader = Util.readBloomFilterHeader(f, bloomFilterDecryptor, bloomFilterHeaderAAD);
+      bloomFilterHeader = Util.readBloomFilterHeader(in != null ? in : f, bloomFilterDecryptor, bloomFilterHeaderAAD);
Review Comment:
Should we separate these two cases into two methods? Here and below there are
several checks like `in != null ? in : f`.
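For illustration only, the split might look roughly like the sketch below. The helper names are made up, the surrounding ParquetFileReader imports and fields are assumed, and encryption is skipped by passing a null decryptor/AAD to the same `Util.readBloomFilterHeader` signature the diff already uses; this is a sketch of the idea, not the PR's implementation.

```java
// Sketch only (hypothetical helper names): one path for footers that carry
// bloom_filter_length, one for older files without it, so callers no longer
// need the `in != null ? in : f` checks.
private BloomFilter readBloomFilterWithLength(SeekableInputStream f, long offset, int length)
    throws IOException {
  // bloom_filter_length is known: fetch header + bitset with a single readFully.
  f.seek(offset);
  byte[] headerAndBitset = new byte[length];
  f.readFully(headerAndBitset);
  InputStream in = new ByteArrayInputStream(headerAndBitset);
  BloomFilterHeader header = Util.readBloomFilterHeader(in, null, null); // encryption elided
  byte[] bitset = new byte[header.getNumBytes()];
  if (in.read(bitset) != bitset.length) {
    throw new IOException("Bloom filter bitset is shorter than bloom_filter_length implies");
  }
  return new BlockSplitBloomFilter(bitset);
}

private BloomFilter readBloomFilterLegacy(SeekableInputStream f, long offset) throws IOException {
  // No bloom_filter_length in the footer: read the header first, then the bitset.
  f.seek(offset);
  BloomFilterHeader header = Util.readBloomFilterHeader(f, null, null); // encryption elided
  byte[] bitset = new byte[header.getNumBytes()];
  f.readFully(bitset);
  return new BlockSplitBloomFilter(bitset);
}
```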
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java:
##########
@@ -341,6 +351,15 @@ public long getBloomFilterOffset() {
return bloomFilterOffset;
}
+ /**
+ * @return the length to the Bloom filter or {@code -1} if there is no bloom filter for this column chunk
Review Comment:
```suggestion
* @return the length to the Bloom filter or {@code -1} if there is no bloom filter length for this column chunk
```
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java:
##########
@@ -341,6 +351,15 @@ public long getBloomFilterOffset() {
return bloomFilterOffset;
}
+ /**
+ * @return the length to the Bloom filter or {@code -1} if there is no bloom filter for this column chunk
+ */
+ @Private
+ public int getBloomFilterLength() {
Review Comment:
BTW, an e2e test case will be helpful to guarantee we have written the
length correctly.
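As a rough illustration of such an end-to-end check (the path, column name, and record count are arbitrary, and it assumes the new `getBloomFilterLength()` accessor added by this PR plus the existing example writer API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.values.bloomfilter.BloomFilter;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class BloomFilterLengthRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path("/tmp/bloom_filter_length_test.parquet");
    MessageType schema = MessageTypeParser.parseMessageType(
        "message test { required binary name (UTF8); }");

    // Write a file with a Bloom filter enabled on the "name" column.
    try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(file)
        .withConf(conf)
        .withType(schema)
        .withBloomFilterEnabled("name", true)
        .build()) {
      SimpleGroupFactory factory = new SimpleGroupFactory(schema);
      for (int i = 0; i < 1000; i++) {
        writer.write(factory.newGroup().append("name", "value-" + i));
      }
    }

    // Reopen the file and check that the footer carries a positive
    // bloom_filter_length and that the filter is still readable.
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
      ColumnChunkMetaData column =
          reader.getFooter().getBlocks().get(0).getColumns().get(0);
      if (column.getBloomFilterLength() <= 0) {
        throw new AssertionError("bloom_filter_length was not written to the footer");
      }
      BloomFilter bloomFilter = reader.readBloomFilter(column);
      if (bloomFilter == null) {
        throw new AssertionError("Bloom filter could not be read back");
      }
    }
  }
}
```

A test along these lines would catch both a writer that stops populating bloom_filter_length and a regression in the single-shot read path.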
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java:
##########
@@ -341,6 +351,15 @@ public long getBloomFilterOffset() {
return bloomFilterOffset;
}
+ /**
+ * @return the length to the Bloom filter or {@code -1} if there is no bloom filter for this column chunk
+ */
+ @Private
+ public int getBloomFilterLength() {
Review Comment:
Return Optional<Int>?
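For comparison, a `java.util.OptionalInt`-based accessor could be shaped like the sketch below; this only illustrates the suggestion, not what the PR currently does, and it assumes a backing `bloomFilterLength` field where `-1` means the length is not set.

```java
// Hypothetical alternative to the int / -1 sentinel: expose the absence of
// bloom_filter_length through java.util.OptionalInt instead.
public OptionalInt getBloomFilterLength() {
  return bloomFilterLength < 0 ? OptionalInt.empty() : OptionalInt.of(bloomFilterLength);
}
```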
> Improve I/O performance with bloom_filter_length
> ------------------------------------------------
>
> Key: PARQUET-2373
> URL: https://issues.apache.org/jira/browse/PARQUET-2373
> Project: Parquet
> Issue Type: Improvement
> Reporter: Jiashen Zhang
> Priority: Minor
>
> The spec change in PARQUET-2257 added bloom_filter_length so that readers can
> load the bloom filter in a single shot. This change makes use of the
> 'bloom_filter_length' field to load the bloom filter (consisting of the header
> and bitset) in one read, improving I/O scheduling.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)