parquet performance for wide tables (many columns)

Joris Peeters Tue, 13 Jul 2021 09:12:35 -0700

Hello,

Sending to user@arrow, as that appears to be the best place for Parquet
questions at the moment, but feel free to redirect me.


My objective is to store financial data in Parquet files, and read it out
fast.
The columns represent stocks (roughly 10,000), and each row is a date
(roughly 8,000, e.g. 30 years). Values are e.g. settlement prices. I might
want to use short row groups of e.g. a year each, to get to smaller date
ranges quickly, or to query for a subset of columns (stocks).

The appeal of parquet is that I could store all of this stuff in one file,
and use the row-groups + column-select for slicing, rather than have a ton
of smaller files etc. Would also integrate well with various ML tech.

When doing some basic performance testing, with random data, I noticed that
the performance for tables with many columns seems fairly poor. I've
attached a little benchmark script - see output at the bottom.

Stylised conclusions:
- Reading/writing a "tall" (nrows >> ncols) dataframe is *much* more
performant than a "wide" dataframe.
- With the Arrow format (as opposed to Parquet), the difference is much
smaller.
- Similar results on Windows & Linux, and for Arrow's parquet vs
fastparquet.

Is there something pathological about the parquet format that manifests in
this regime, or is it rather that the code might not have been optimised
for this? Aware that ncols >> nrows is not ideal, but was hoping for less
of a cliff.

Happy to dig in, but polling experts first.

Best,
-J

>python benchmark.py
2021-07-13 16:31:54.786 INFO     Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow]
2021-07-13 16:31:55.123 INFO     Written.
2021-07-13 16:31:55.123 INFO     Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow]
2021-07-13 16:31:57.155 INFO     Written.
2021-07-13 16:31:57.155 INFO     Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_fpq.pq [FastParquet]
2021-07-13 16:31:57.789 INFO     Written.
2021-07-13 16:31:57.790 INFO     Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_fpq.pq [FastParquet]
2021-07-13 16:32:03.613 INFO     Written.
2021-07-13 16:32:03.613 INFO     Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow]
2021-07-13 16:32:03.890 INFO     Read.
2021-07-13 16:32:03.899 INFO     Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow]
2021-07-13 16:32:08.727 INFO     Read.
2021-07-13 16:32:08.737 INFO     Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [FastParquet]
2021-07-13 16:32:08.983 INFO     Read.
2021-07-13 16:32:08.991 INFO     Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [FastParquet]
2021-07-13 16:32:11.580 INFO     Read.
2021-07-13 16:32:11.589 INFO     Writing Arrow to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
2021-07-13 16:32:13.057 INFO     Arrow written.
2021-07-13 16:32:13.078 INFO     Writing Arrow to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
2021-07-13 16:32:13.425 INFO     Arrow written.
2021-07-13 16:32:13.434 INFO     Reading Arrow from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
2021-07-13 16:32:13.620 INFO     Read.
2021-07-13 16:32:13.637 INFO     Reading Arrow from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
2021-07-13 16:32:13.711 INFO     Read.
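To make the log easier to compare, the start/end timestamps can be turned
into durations with a few lines of standard-library Python. The timestamps
below are copied from the log; the labels and the seconds_between helper
are just names chosen for this sketch.

```python
from datetime import datetime

# (label, start, end) copied from the log timestamps above.
LOG = [
    ("parquet write tall [Arrow]", "16:31:54.786", "16:31:55.123"),
    ("parquet write wide [Arrow]", "16:31:55.123", "16:31:57.155"),
    ("parquet write tall [FastParquet]", "16:31:57.155", "16:31:57.789"),
    ("parquet write wide [FastParquet]", "16:31:57.790", "16:32:03.613"),
    ("parquet read tall [Arrow]", "16:32:03.613", "16:32:03.890"),
    ("parquet read wide [Arrow]", "16:32:03.899", "16:32:08.727"),
    ("parquet read tall [FastParquet]", "16:32:08.737", "16:32:08.983"),
    ("parquet read wide [FastParquet]", "16:32:08.991", "16:32:11.580"),
    ("arrow write wide", "16:32:11.589", "16:32:13.057"),
    ("arrow write tall", "16:32:13.078", "16:32:13.425"),
    ("arrow read wide", "16:32:13.434", "16:32:13.620"),
    ("arrow read tall", "16:32:13.637", "16:32:13.711"),
]


def seconds_between(start, end):
    """Elapsed seconds between two HH:MM:SS.mmm timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


durations = {label: seconds_between(s, e) for label, s, e in LOG}
for label, secs in durations.items():
    print(f"{label:34s} {secs:6.3f} s")
```

For example, the wide Parquet read with Arrow takes 4.828 s against 0.277 s
for the tall one, whereas the Arrow-format reads take 0.186 s (wide) and
0.074 s (tall).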

benchmark.py
Description: Binary data
