Hello, Sending to user@arrow, as that appears the best place for parquet questions atm, but feel free to redirect me.
My objective is to store financial data in Parquet files and read it out fast. The columns represent stocks (~10,000 or so), and each row is a date (~8,000, e.g. 30 years). Values are e.g. settlement prices. I might want to use short row groups of e.g. a year each, to get to smaller date ranges quickly, or to query for a subset of columns (stocks).

The appeal of Parquet is that I could store all of this in one file and use row groups + column selection for slicing, rather than have a ton of smaller files. It would also integrate well with various ML tech.

When doing some basic performance testing with random data, I noticed that performance for tables with many columns seems fairly poor. I've attached a little benchmark script; see the output at the bottom.

Stylised conclusions:

- Reading/writing a "tall" (nrows >> ncols) dataframe is *much* faster than reading/writing a "wide" dataframe.
- With the Arrow format (as opposed to Parquet), the difference is much smaller.
- Similar results on Windows & Linux, and for Arrow's Parquet vs fastparquet.

Is there something pathological about the Parquet format that manifests in this regime, or is it rather that the code might not have been optimised for this? I'm aware that ncols >> nrows is not ideal, but was hoping for less of a cliff.

Happy to dig in, but polling experts first.

Best,
-J

>python benchmark.py
2021-07-13 16:31:54.786 INFO Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow]
2021-07-13 16:31:55.123 INFO Written.
2021-07-13 16:31:55.123 INFO Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow]
2021-07-13 16:31:57.155 INFO Written.
2021-07-13 16:31:57.155 INFO Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_fpq.pq [FastParquet]
2021-07-13 16:31:57.789 INFO Written.
2021-07-13 16:31:57.790 INFO Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_fpq.pq [FastParquet]
2021-07-13 16:32:03.613 INFO Written.
2021-07-13 16:32:03.613 INFO Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow]
2021-07-13 16:32:03.890 INFO Read.
2021-07-13 16:32:03.899 INFO Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow]
2021-07-13 16:32:08.727 INFO Read.
2021-07-13 16:32:08.737 INFO Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [FastParquet]
2021-07-13 16:32:08.983 INFO Read.
2021-07-13 16:32:08.991 INFO Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [FastParquet]
2021-07-13 16:32:11.580 INFO Read.
2021-07-13 16:32:11.589 INFO Writing Arrow to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
2021-07-13 16:32:13.057 INFO Arrow written.
2021-07-13 16:32:13.078 INFO Writing Arrow to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
2021-07-13 16:32:13.425 INFO Arrow written.
2021-07-13 16:32:13.434 INFO Reading Arrow from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
2021-07-13 16:32:13.620 INFO Read.
2021-07-13 16:32:13.637 INFO Reading Arrow from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
2021-07-13 16:32:13.711 INFO Read.
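
For a sense of scale, here is a rough back-of-the-envelope sketch of the layout described above (about 10,000 stock columns, about 8,000 date rows, row groups of roughly a year). It is not taken from benchmark.py. It assumes that a Parquet file stores one column chunk per column in every row group, and that a "year" is about 267 rows (8,000 rows over 30 years); the helper name `layout` is made up for illustration.

```python
import math


def layout(n_rows, n_cols, rows_per_group):
    """Count row groups and column chunks for a table of the given shape.

    Assumes one column chunk per column per row group.
    """
    n_groups = math.ceil(n_rows / rows_per_group)
    return {
        "row_groups": n_groups,
        "column_chunks": n_groups * n_cols,
        # Upper bound on values per chunk (the last group may be shorter).
        "max_values_per_chunk": min(rows_per_group, n_rows),
    }


# Figures from the message; 267 rows per year is an assumption (8000 / 30).
wide_yearly = layout(n_rows=8_000, n_cols=10_000, rows_per_group=267)

# Same table written as a single row group, for comparison.
wide_single = layout(n_rows=8_000, n_cols=10_000, rows_per_group=8_000)

print("yearly row groups:", wide_yearly)
print("single row group: ", wide_single)
```

Under these assumptions, yearly row groups would multiply the number of column chunks thirtyfold compared with a single row group, with each chunk holding only a few hundred values. Whether that per-chunk overhead explains the timings above is exactly the open question.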
Attachment: benchmark.py (binary data)
