pitrou commented on issue #45287: URL: https://github.com/apache/arrow/issues/45287#issuecomment-2606773232
> Yeah, this is an extreme case just to show the repro. In practice the file has a couple thousand rows per file.

How many row groups per file (or rows per row group)? It turns out that much of the Parquet metadata memory consumption is in ColumnChunk entries. A Thrift-deserialized ColumnChunk is 640 bytes long, and there are O(C*R*F) ColumnChunks in your dataset, with C = number of columns, R = number of row groups per file, and F = number of files.
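As a rough sanity check, here is a sketch (using pyarrow; the file path and file count are placeholders, not values from this issue) that reads one file's metadata and estimates the total ColumnChunk footprint from the 640-bytes-per-entry figure above:

```python
import pyarrow.parquet as pq

# Inspect one file to see how many ColumnChunk entries it contributes.
# "part-0.parquet" is a placeholder path for illustration.
md = pq.ParquetFile("part-0.parquet").metadata
num_columns = md.num_columns          # C
num_row_groups = md.num_row_groups    # R (for this file)

# Assumed dataset size, for illustration only.
num_files = 1000                      # F

# ~640 bytes per Thrift-deserialized ColumnChunk, C * R * F entries total.
bytes_per_column_chunk = 640
estimate = num_columns * num_row_groups * num_files * bytes_per_column_chunk
print(f"~{estimate / 2**20:.1f} MiB of deserialized ColumnChunk metadata")
```

With many row groups per file, this product grows quickly even for modest column counts, which is why the row-group count matters here.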
