[I] [C++][Parquet] Fast Random Rowgroup Reads [arrow]

via GitHub Wed, 17 Jan 2024 14:40:07 -0800


corwinjoy opened a new issue, #39676:
URL: https://github.com/apache/arrow/issues/39676


   ### Describe the enhancement requested
   
   - Background:
   For parquet files that have a large number of rowgroups and columns, reading 
the full file metadata is prohibitively expensive when you only want a sample 
from a table. (Our customers use parquet files via Arrow that contain more than 
10k columns and thousands of rowgroups.) For the case where you only want to 
read a few rowgroups and/or columns, we would like to have a fast random-access 
reader.
   
   - Idea:
   Read only the minimal metadata from the parquet file to establish columns 
and column types. Require that the file contain an 
[OffsetIndex](https://github.com/apache/parquet-format/blob/master/PageIndex.md)
 section and use the offset index to directly access the requested data pages 
and columns. Preliminary work indicates that this can give a 2x or 3x speedup 
with even a modest number of columns and rowgroups with the existing parquet 
format. With some minor parquet format changes, I believe this could be 100x 
faster.
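
   To make the offset-index lookup concrete, here is a minimal, self-contained 
sketch. It is not Arrow's reader API. The `PageLocation` struct mirrors the 
fields of the same name in the parquet-format Thrift definition, while 
`ByteRange` and `PagesForRows` are hypothetical names introduced only for this 
illustration. It assumes the page locations of one column chunk have already 
been decoded and are sorted by `first_row_index`, and it ignores dictionary 
pages.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Mirrors the PageLocation fields from the parquet-format Thrift definition.
struct PageLocation {
  int64_t offset;                // file offset of the page
  int32_t compressed_page_size;  // page size in bytes, header included
  int64_t first_row_index;       // first row of the page, relative to the rowgroup
};

// Hypothetical type: a contiguous byte range to fetch from the file.
struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Hypothetical helper: given the page locations of one column chunk (sorted by
// first_row_index), return the byte ranges of the pages that cover the rows
// [first_row, last_row] (inclusive, relative to the rowgroup).
std::vector<ByteRange> PagesForRows(const std::vector<PageLocation>& pages,
                                    int64_t first_row, int64_t last_row) {
  std::vector<ByteRange> ranges;
  if (pages.empty() || first_row > last_row) return ranges;

  // Find the first page that starts after first_row, then step back one page
  // to reach the page that actually contains first_row.
  auto it = std::upper_bound(
      pages.begin(), pages.end(), first_row,
      [](int64_t row, const PageLocation& p) { return row < p.first_row_index; });
  if (it != pages.begin()) --it;

  // Collect every page that starts at or before last_row.
  for (; it != pages.end() && it->first_row_index <= last_row; ++it) {
    ranges.push_back({it->offset, it->compressed_page_size});
  }
  return ranges;
}
```

   A reader built this way would issue one read per returned range (or 
coalesce adjacent ranges) instead of scanning the whole column chunk.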
   
   - Related Work:
   There has been some similar work in this direction, but I think that work 
is more at the interface level than direct performance tuning:
   https://github.com/apache/arrow/issues/39392
   https://github.com/apache/arrow/issues/38865
   [Jira: Selective reading of rows for parquet 
file](https://issues.apache.org/jira/browse/ARROW-13517)
   
   
   ### Component(s)
   
   C++


