wgtmac commented on code in PR #1174: URL: https://github.com/apache/parquet-mr/pull/1174#discussion_r1363041453
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -265,6 +265,10 @@ private void processBlocksFromReader() throws IOException {
       BlockMetaData blockMetaData = meta.getBlocks().get(blockId);
       List<ColumnChunkMetaData> columnsInOrder = blockMetaData.getColumns();
+      List<ColumnIndex> columnIndexes = readAllColumnIndexes(reader, columnsInOrder, descriptorsMap);

Review Comment:
   Thanks for adding this! The change looks reasonable to me. I would suggest adding a new class specifically to read and cache these indexes. The new class would have methods like `readBloomFilter()`, `readColumnIndex()`, and `readOffsetIndex()` for a specific column path, and could be configured to cache required columns in advance. With this new class, we can do more optimizations, including evicting consumed items from the cache and using async I/O to prefetch items. We can split these into separate patches; for the first one, we may simply add the new class without any caching (i.e. no behavior change). WDYT?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
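To make the suggestion concrete, here is a minimal sketch of what such a lazy, evictable per-row-group index cache could look like. The class name `IndexCacheSketch`, the `String` column-path key, and the loader `Function` are all hypothetical placeholders; in parquet-mr the key would be the column's `ColumnPath`, the value the real `ColumnIndex`/`OffsetIndex`/`BloomFilter` types, and the loader would delegate to `ParquetFileReader`. The sketch only illustrates the read-through-cache-then-evict pattern, with no prefetching.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the proposed index cache: reads an index on first
// access, serves repeats from the cache, and lets the caller evict consumed
// entries. Placeholder types stand in for the real parquet-mr classes.
public class IndexCacheSketch {
  private final Map<String, Object> columnIndexCache = new HashMap<>();
  private final Function<String, Object> columnIndexLoader;
  private int loads = 0; // counts actual reads, for illustration only

  public IndexCacheSketch(Function<String, Object> columnIndexLoader) {
    this.columnIndexLoader = columnIndexLoader;
  }

  // Returns the column index for the given path, reading it on first access.
  public Object readColumnIndex(String columnPath) {
    return columnIndexCache.computeIfAbsent(columnPath, p -> {
      loads++;
      return columnIndexLoader.apply(p);
    });
  }

  // Evict an entry once the rewriter has consumed it, bounding memory use.
  public void evict(String columnPath) {
    columnIndexCache.remove(columnPath);
  }

  public int loadCount() {
    return loads;
  }

  public static void main(String[] args) {
    IndexCacheSketch cache = new IndexCacheSketch(path -> "index-for-" + path);
    cache.readColumnIndex("a.b");
    cache.readColumnIndex("a.b"); // served from cache, no second read
    System.out.println(cache.loadCount()); // prints 1
    cache.evict("a.b");
    cache.readColumnIndex("a.b"); // re-read after eviction
    System.out.println(cache.loadCount()); // prints 2
  }
}
```

The same shape extends naturally to `readOffsetIndex()` and `readBloomFilter()` with their own maps, and a non-caching first patch would simply make `readColumnIndex()` call the loader directly.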