corwinjoy opened a new issue, #39676: URL: https://github.com/apache/arrow/issues/39676
### Describe the enhancement requested

- Background: For Parquet files that have a large number of row groups and columns, reading the full file metadata is prohibitively expensive when you just want a sample from a table. (Our customers use Parquet files via Arrow that contain more than 10k columns and thousands of row groups.) For the case where you only want to read a few row groups and/or columns, we would like to have a fast random-access reader.
- Idea: Read only the minimal metadata from the Parquet file needed to establish the columns and column types. Require that the file contain an [OffsetIndex](https://github.com/apache/parquet-format/blob/master/PageIndex.md) section, and use the offset index to directly access the requested data pages and columns. Preliminary work indicates that this can give a 2x or 3x speedup with even a modest number of columns and row groups, using the existing Parquet format. With some minor Parquet format changes, I believe this could be 100x faster.
- Related Work: There has been some similar work in this direction, but I think it is more at the interface level than direct performance tuning:
  - https://github.com/apache/arrow/issues/39392
  - https://github.com/apache/arrow/issues/38865
  - [Jira: Selective reading of rows for parquet file](https://issues.apache.org/jira/browse/ARROW-13517)

### Component(s)

C++
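
To make the idea above concrete, here is a minimal sketch of how an offset index lets a reader go straight to the pages it needs. It is not Arrow or Parquet library code: the `PageLocation` fields mirror the ones defined for the OffsetIndex in the Parquet format (`offset`, `compressed_page_size`, `first_row_index`), while `pages_for_rows` and `byte_ranges` are hypothetical helper names used only for illustration, and the sample page locations are made up.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class PageLocation:
    """One entry of a column chunk's offset index (fields as in the Parquet format)."""
    offset: int                # file offset where the page starts
    compressed_page_size: int  # number of bytes to read for the page
    first_row_index: int       # first row in the page, relative to the row group


def pages_for_rows(locations, first_row, last_row):
    """Hypothetical helper: pages that cover rows [first_row, last_row], inclusive.

    Assumes `locations` is sorted by first_row_index, as in an offset index.
    """
    if not locations or first_row > last_row:
        return []
    starts = [p.first_row_index for p in locations]
    # Last page whose first row is <= first_row.
    lo = max(bisect_right(starts, first_row) - 1, 0)
    # One past the last page whose first row is <= last_row.
    hi = bisect_right(starts, last_row)
    return locations[lo:hi]


def byte_ranges(pages):
    """Hypothetical helper: (offset, length) pairs a reader would fetch."""
    return [(p.offset, p.compressed_page_size) for p in pages]


# Made-up offset index for one column chunk with three pages.
index = [
    PageLocation(offset=4, compressed_page_size=100, first_row_index=0),
    PageLocation(offset=104, compressed_page_size=120, first_row_index=1000),
    PageLocation(offset=224, compressed_page_size=90, first_row_index=2000),
]

# Rows 1500..2100 touch only the second and third pages.
print(byte_ranges(pages_for_rows(index, 1500, 2100)))
```

With this lookup, a sample of a few rows translates into a handful of byte-range reads per requested column, without touching the pages (or the per-page metadata) of anything else.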
