wgtmac commented on code in PR #1174: URL: https://github.com/apache/parquet-mr/pull/1174#discussion_r1363041453
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -265,6 +265,10 @@ private void processBlocksFromReader() throws IOException {
       BlockMetaData blockMetaData = meta.getBlocks().get(blockId);
       List<ColumnChunkMetaData> columnsInOrder = blockMetaData.getColumns();
+      List<ColumnIndex> columnIndexes = readAllColumnIndexes(reader, columnsInOrder, descriptorsMap);

Review Comment:
   Thanks for adding this! The change looks reasonable to me. I would suggest adding a new class specifically to read and cache these indexes. The new class would have methods like `readBloomFilter()`, `readColumnIndex()`, and `readOffsetIndex()` for a specific column path, and could be configured to cache required columns in advance. With this new class, we can do more optimizations, including evicting consumed items from the cache and using async I/O to prefetch items. We can split these into separate patches; for the first one, we may simply add the new class without any caching (i.e. no behavior change). WDYT?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
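To make the suggestion concrete, here is a minimal sketch of what such a lazy, evictable per-row-group index cache could look like. The class name `IndexCacheSketch`, the `String` column-path key, and the loader `Function` are all hypothetical placeholders; in parquet-mr the key would be the column's `ColumnPath`, the value the real `ColumnIndex`/`OffsetIndex`/`BloomFilter` types, and the loader would delegate to `ParquetFileReader`. The sketch only illustrates the read-through-cache-then-evict pattern, with no prefetching.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the proposed index cache: reads an index on first
// access, serves repeats from the cache, and lets the caller evict consumed
// entries. Placeholder types stand in for the real parquet-mr classes.
public class IndexCacheSketch {
  private final Map<String, Object> columnIndexCache = new HashMap<>();
  private final Function<String, Object> columnIndexLoader;
  private int loads = 0; // counts actual reads, for illustration only

  public IndexCacheSketch(Function<String, Object> columnIndexLoader) {
    this.columnIndexLoader = columnIndexLoader;
  }

  // Returns the column index for the given path, reading it on first access.
  public Object readColumnIndex(String columnPath) {
    return columnIndexCache.computeIfAbsent(columnPath, p -> {
      loads++;
      return columnIndexLoader.apply(p);
    });
  }

  // Evict an entry once the rewriter has consumed it, bounding memory use.
  public void evict(String columnPath) {
    columnIndexCache.remove(columnPath);
  }

  public int loadCount() {
    return loads;
  }

  public static void main(String[] args) {
    IndexCacheSketch cache = new IndexCacheSketch(path -> "index-for-" + path);
    cache.readColumnIndex("a.b");
    cache.readColumnIndex("a.b"); // served from cache, no second read
    System.out.println(cache.loadCount()); // prints 1
    cache.evict("a.b");
    cache.readColumnIndex("a.b"); // re-read after eviction
    System.out.println(cache.loadCount()); // prints 2
  }
}
```

The same shape extends naturally to `readOffsetIndex()` and `readBloomFilter()` with their own maps, and a non-caching first patch would simply make `readColumnIndex()` call the loader directly.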