hi Maarten, I added dev@parquet.apache.org to this (if you are not subscribed to this list, you may want to subscribe)
I made a quick notebook to help illustrate:
https://gist.github.com/wesm/cabf684db3ce8fdd6df27cf782f7226e

Summary:
* Files with 1000+ columns can see the metadata-to-data ratio exceed 10%
  (in the example I made it's 15-20%).
* The time to deserialize whole files starts to balloon superlinearly with
  extremely wide files.

On Sat, May 9, 2020 at 4:28 PM Maarten Ballintijn <maart...@xs4all.nl> wrote:
>
> Wes,
>
> "Users would be well advised to not write files with large numbers (> 1000)
> of columns"
> You've mentioned this before, and as this is in my experience not an
> uncommon use-case, can you maybe expand a bit on the following related
> questions? (Use-cases include daily or minute data for a few 10's of
> thousands of items like stocks or other financial instruments, IoT
> sensors, etc.)
>
> Parquet Standard - Is the issue intrinsic to the Parquet standard, do you
> think? The ability to read a sub-set of the columns and/or row-groups, and
> compact storage through the use of RLE, categoricals, etc., all seem to
> point to the format being well suited for these use-cases.

Parquet files by design are pretty heavy on metadata -- which is fine when
the number of columns is small. When files have many columns, the costs
associated with dealing with the file metadata really add up because the
ratio of metadata to data in the file becomes skewed. Also, the common
FileMetaData must be entirely parsed even when you only want to read one
column.

> Parquet-C++ implementation - Is the issue with the current Parquet-C++
> implementation, or any of the dependencies? Is it something which could be
> fixed? Would a specialized implementation help? Is the problem related to
> going from Parquet -> Arrow -> Python/Pandas? E.g. would a Parquet -> numpy
> reader work better?

No, it's not an issue specific to the C++ implementation.

> Alternatives - What would you recommend as a superior solution? Store this
> data tall i.s.o. wide? Use another storage format?
It really depends on your particular use case. You can try other solutions
(e.g. Arrow IPC / Feather files, or row-oriented data formats) and see what
works best.

> Appreciate your (and others') insights.
>
> Cheers, Maarten.