hi Maarten, I added dev@parquet.apache.org to this (if you are not subscribed to this list, you may want to subscribe)
I made a quick notebook to help illustrate:
https://gist.github.com/wesm/cabf684db3ce8fdd6df27cf782f7226e

Summary:
* Files with 1000+ columns can see the metadata-to-data ratio exceed 10%
  (in the example I made it's 15-20%).
* The time to deserialize whole files starts to balloon superlinearly with
  extremely wide files.

On Sat, May 9, 2020 at 4:28 PM Maarten Ballintijn <maart...@xs4all.nl> wrote:
>
> Wes,
>
> "Users would be well advised to not write files with large numbers (> 1000)
> of columns"
> You've mentioned this before, and as this is in my experience not an
> uncommon use-case, can you maybe expand a bit on the following related
> questions? (Use-cases include daily or minute data for a few 10's of
> thousands of items like stocks or other financial instruments, IoT
> sensors, etc.)
>
> Parquet Standard - Is the issue intrinsic to the Parquet standard, do you
> think? The ability to read a sub-set of the columns and/or row-groups, and
> compact storage through the use of RLE, categoricals, etc., all seem to
> point to the format being well suited for these use-cases.

Parquet files by design are pretty heavy on metadata -- which is fine when
the number of columns is small. When files have many columns, the costs
associated with dealing with the file metadata really add up because the
ratio of metadata to data in the file becomes skewed. Also, the common
FileMetaData must be entirely parsed even when you only want to read one
column.

> Parquet-C++ implementation - Is the issue with the current Parquet-C++
> implementation, or any of the dependencies? Is it something which could be
> fixed? Would a specialized implementation help? Is the problem related to
> going from Parquet -> Arrow -> Python/Pandas? E.g. would a Parquet -> numpy
> reader work better?

No, it's not an issue specific to the C++ implementation.

> Alternatives - What would you recommend as a superior solution? Store this
> data tall i.s.o. wide? Use another storage format?
It really depends on your particular use case. You can try other solutions
(e.g. Arrow IPC / Feather files, or row-oriented data formats) and see what
works best.

> Appreciate your (and others') insights.
>
> Cheers, Maarten.