[ https://issues.apache.org/jira/browse/PARQUET-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789926#comment-17789926 ]
ASF GitHub Bot commented on PARQUET-2373:
-----------------------------------------

zhangjiashen commented on code in PR #1184:
URL: https://github.com/apache/parquet-mr/pull/1184#discussion_r1396702452


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -1347,11 +1348,24 @@ public BloomFilter readBloomFilter(ColumnChunkMetaData meta) throws IOException
       }
     }

-    // Read Bloom filter data header.
+    // Seek to Bloom filter offset.
     f.seek(bloomFilterOffset);
+
+    // Read Bloom filter length.
+    int bloomFilterLength = meta.getBloomFilterLength();
+
+    // If it is set, read Bloom filter header and bitset together.
+    // Otherwise, read Bloom filter header first and then bitset.
+    InputStream in = null;
+    if (bloomFilterLength > 0) {
+      byte[] headerAndBitSet = new byte[bloomFilterLength];
+      f.readFully(headerAndBitSet);
+      in = new ByteArrayInputStream(headerAndBitSet);
+    }
+
     BloomFilterHeader bloomFilterHeader;
     try {
-      bloomFilterHeader = Util.readBloomFilterHeader(f, bloomFilterDecryptor, bloomFilterHeaderAAD);
+      bloomFilterHeader = Util.readBloomFilterHeader(in != null ? in : f, bloomFilterDecryptor, bloomFilterHeaderAAD);

Review Comment:
   Separating these into two methods would make the code harder to read. I changed the code a little to avoid several checks; please check whether it makes sense.



> Improve I/O performance with bloom_filter_length
> ------------------------------------------------
>
>                 Key: PARQUET-2373
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2373
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Jiashen Zhang
>            Priority: Minor
>
> The spec change in PARQUET-2257 added bloom_filter_length so that a reader
> can load the bloom filter in a single shot. This implementation uses the
> 'bloom_filter_length' field to load the bloom filter (header and bitset
> together) in one read, in order to improve I/O scheduling.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
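
For readers following the thread, here is a minimal, self-contained sketch of the single-shot idea discussed above. It is not the parquet-mr code: the class `BloomFilterReadSketch`, the `CountingStream` helper, and the toy header (a 4-byte big-endian bitset length) are all assumptions made for illustration, whereas the real header is a Thrift structure read through `Util.readBloomFilterHeader`. The sketch only shows that, when the total length is known, one bulk read from the file can serve both the header and the bitset, and that without it the header and bitset are read from the file separately.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BloomFilterReadSketch {

  /** Hypothetical stand-in for the file stream; counts read calls that reach the "file". */
  public static final class CountingStream extends InputStream {
    private final InputStream delegate;
    public int readCalls = 0;

    public CountingStream(byte[] data) {
      this.delegate = new ByteArrayInputStream(data);
    }

    @Override
    public int read() throws IOException {
      readCalls++;
      return delegate.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
      readCalls++;
      return delegate.read(b, off, len);
    }
  }

  /** Builds a toy "file": a 4-byte big-endian bitset length followed by the bitset (assumed layout). */
  public static byte[] sampleFile(byte[] bitset) {
    return ByteBuffer.allocate(4 + bitset.length).putInt(bitset.length).put(bitset).array();
  }

  /**
   * Reads the bitset. If bloomFilterLength is set (> 0), the header and bitset are
   * fetched from the file in one bulk read and parsed from memory; otherwise the
   * header is read from the file first and the bitset afterwards.
   */
  public static byte[] readBitset(CountingStream f, int bloomFilterLength) throws IOException {
    InputStream in = f;
    if (bloomFilterLength > 0) {
      byte[] headerAndBitset = new byte[bloomFilterLength];
      new DataInputStream(f).readFully(headerAndBitset);
      in = new ByteArrayInputStream(headerAndBitset);
    }
    DataInputStream data = new DataInputStream(in);
    int numBytes = data.readInt(); // toy header: just the bitset size
    byte[] bitset = new byte[numBytes];
    data.readFully(bitset);
    return bitset;
  }

  public static void main(String[] args) throws IOException {
    byte[] bitset = {1, 2, 3, 4, 5, 6, 7, 8};
    byte[] file = sampleFile(bitset);

    CountingStream singleShot = new CountingStream(file);
    byte[] a = readBitset(singleShot, file.length);
    System.out.println("single shot: " + singleShot.readCalls + " file read(s), ok=" + Arrays.equals(a, bitset));

    CountingStream fallback = new CountingStream(file);
    byte[] b = readBitset(fallback, -1); // length not set
    System.out.println("fallback: " + fallback.readCalls + " file read(s), ok=" + Arrays.equals(b, bitset));
  }
}
```

The fallback path is kept because files written before the field existed will not carry a length, which mirrors the `in != null ? in : f` choice in the diff above.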