Re: parquet performance for wide tables (many columns)

Weston Pace Wed, 14 Jul 2021 05:01:38 -0700

My mistake, the attachment was there. I simply did not check for it
very thoroughly.


I played around with this a bit more and realized the difference was
simply that I was running on the latest build rather than 4.0.1.
Digging a bit further, I think you are seeing [1], and there is a bit
more discussion you may be interested in at [2].  The good news is
that you should see much better performance with 5.0.0 (releasing
soon).  The wide read still isn't quite the 15% slower that I was
seeing in my own benchmark; it is more like 3.5x slower than the tall
read.  Rerunning your benchmark script against the latest master I get
the following output:

```
2021-07-14 01:53:38.706 INFO     Writing parquet to /tmp/tmpk8zj9k1u/example_tall_apq.pq [Arrow]
2021-07-14 01:53:38.906 INFO     Written.
2021-07-14 01:53:38.906 INFO     Writing parquet to /tmp/tmpk8zj9k1u/example_wide_apq.pq [Arrow]
2021-07-14 01:53:39.688 INFO     Written.
2021-07-14 01:53:39.688 INFO     Writing parquet to /tmp/tmpk8zj9k1u/example_tall_fpq.pq [FastParquet]
2021-07-14 01:53:40.077 INFO     Written.
2021-07-14 01:53:40.077 INFO     Writing parquet to /tmp/tmpk8zj9k1u/example_wide_fpq.pq [FastParquet]
2021-07-14 01:53:43.376 INFO     Written.
2021-07-14 01:53:43.377 INFO     Reading parquet from /tmp/tmpk8zj9k1u/example_tall_apq.pq [Arrow]
2021-07-14 01:53:43.473 INFO     Read.
2021-07-14 01:53:43.474 INFO     Reading parquet from /tmp/tmpk8zj9k1u/example_wide_apq.pq [Arrow]
2021-07-14 01:53:43.825 INFO     Read.
2021-07-14 01:53:43.826 INFO     Reading parquet from /tmp/tmpk8zj9k1u/example_tall_apq.pq [FastParquet]
2021-07-14 01:53:43.975 INFO     Read.
2021-07-14 01:53:43.975 INFO     Reading parquet from /tmp/tmpk8zj9k1u/example_wide_apq.pq [FastParquet]
2021-07-14 01:53:45.012 INFO     Read.
2021-07-14 01:53:45.013 INFO     Writing Arrow to /tmp/tmpk8zj9k1u/example_wide.arrows
2021-07-14 01:53:45.581 INFO     Arrow written.
2021-07-14 01:53:45.585 INFO     Writing Arrow to /tmp/tmpk8zj9k1u/example_tall.arrows
2021-07-14 01:53:45.675 INFO     Arrow written.
2021-07-14 01:53:45.676 INFO     Reading Arrow from /tmp/tmpk8zj9k1u/example_wide.arrows
2021-07-14 01:53:45.780 INFO     Read.
2021-07-14 01:53:45.783 INFO     Reading Arrow from /tmp/tmpk8zj9k1u/example_tall.arrows
2021-07-14 01:53:45.796 INFO     Read.
```
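
For reference, the "3.5x" figure comes from the two Arrow parquet read
entries in the log above. A quick sketch of the arithmetic, with the
four timestamps copied by hand from that log (the helper name
`elapsed_seconds` is just for illustration):

```python
from datetime import datetime

# Start/end timestamps of the Arrow parquet reads, copied from the log above.
ARROW_READS = [
    ("tall", "2021-07-14 01:53:43.377", "2021-07-14 01:53:43.473"),
    ("wide", "2021-07-14 01:53:43.474", "2021-07-14 01:53:43.825"),
]

FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


durations = {label: elapsed_seconds(start, end) for label, start, end in ARROW_READS}
ratio = durations["wide"] / durations["tall"]

for label, seconds in durations.items():
    print(f"{label}: {seconds:.3f}s")
print(f"wide/tall read ratio: {ratio:.2f}x")
```

These are single-run wall-clock numbers from one machine, so the ratio
is only a rough indication.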

[1] https://issues.apache.org/jira/browse/ARROW-12736
[2] https://issues.apache.org/jira/browse/ARROW-11469

On Tue, Jul 13, 2021 at 11:06 PM Joris Peeters
<[email protected]> wrote:
>
> I added the script as `benchmark.py` in my original post. Maybe it got 
> filtered somewhere, but 
> https://lists.apache.org/api/email.lua?attachment=true&id=r2a7b4fe367184aabfb335fe0a5dc1d2a871ed52d51047b8130bf1fb5@%3Cuser.arrow.apache.org%3E&file=133b9b7a2693cd02fd841cee58b2bfab059f22f17d377df0c62c573d5aa09fb3
>  might be a stable link.
>
> So, interestingly, I am reproducing my own findings with your script. The 
> tall table takes about 0.8s to load (as you also found), but the wide one 
> takes 6.4s. I'm surprised you see ~=0.8s for the wide read as well. 
> Throughout both my own benchmarks and yours, reading a wide one was always 
> significantly slower, on different machines and Windows/Linux - and even 
> across different parquet implementations (Arrow's <-> fastparquet).
>
> On Tue, Jul 13, 2021 at 7:11 PM Weston Pace <[email protected]> wrote:
>>
>> The short answer is no, there is nothing "pathological" about parquet,
>> it should be more or less as suited for wide columns as arrow's IPC
>> format.  Both formats will require additional metadata when there are
>> more columns and compressibility may differ (although .arrows data is
>> often uncompressed).
>>
>> Can you provide your test script?  I don't get quite the same results.
>> For my test I created two tables, one that was 10,000 columns by 8,000
>> rows and one that was 80,000,000 rows in 1 column.  There is simply
>> more metadata when you have 10k columns and less opportunity for
>> compression.  As a result the file sizes were 611M for the tall and
>> 739M for the wide so the wide requires about 20% more data.  Reading
>> times (hot-in-cache reads) were ~.73s for the tall and ~.84s for the
>> wide and so the wide takes about 15% more time to read.  This seems
>> about right to me.
>>
>> ## Writing script
>>
>> import pyarrow as pa
>> import pyarrow.parquet as pq
>> import numpy as np
>>
>> TALL_ROWS = 80_000_000
>> TALL_COLS = 1
>> WIDE_ROWS = 8_000
>> WIDE_COLS = 10_000
>>
>> tall_data = np.random.rand(TALL_COLS, TALL_ROWS)
>> wide_data = np.random.rand(WIDE_COLS, WIDE_ROWS)
>>
>> tall_table = pa.Table.from_arrays([tall_data[0]], names=["values"])
>> pq.write_table(tall_table, '/tmp/tall.pq')
>>
>> wide_names = [f'f{i}' for i in range(WIDE_COLS)]
>> wide_table = pa.Table.from_arrays(wide_data, names=wide_names)
>> pq.write_table(wide_table, '/tmp/wide.pq')
>>
>> ## Reading script
>>
>> import pyarrow.parquet as pq
>>
>> table = pq.read_table('/tmp/tall.pq')
>> print(table.num_rows)
>> print(table.num_columns)
>>
>> On Tue, Jul 13, 2021 at 6:23 AM Martin Percossi <[email protected]> wrote:
>> >
>> > An alternative representation would be to have a single settlement price 
>> > column, and add a stock_id column. Instead of a single row for each time 
>> > step, you would now have, say, 10K rows - one for each stock.
>> >
>> > I think this will yield better performance.
>> >
>> > On Tue, 13 Jul 2021, 18:12 Joris Peeters, <[email protected]> 
>> > wrote:
>> >>
>> >> Hello,
>> >>
>> >> Sending to user@arrow, as that appears the best place for parquet 
>> >> questions atm, but feel free to redirect me.
>> >>
>> >> My objective is to store financial data in Parquet files, and read it out 
>> >> fast.
>> >> The columns represent stocks (~= 10,000 or so), and each row is a date 
>> >> (~= 8000, e.g. 30 years). Values are e.g. settlement prices. I might want 
>> >> to use short row groups of e.g. a year each, for quickly getting to 
>> >> smaller date ranges, or query for a subset of columns (stocks).
>> >>
>> >> The appeal of parquet is that I could store all of this stuff in one 
>> >> file, and use the row-groups + column-select for slicing, rather than 
>> >> have a ton of smaller files etc. Would also integrate well with various 
>> >> ML tech.
>> >>
>> >> When doing some basic performance testing, with random data, I noticed 
>> >> that the performance for tables with many columns seems fairly poor. I've 
>> >> attached a little benchmark script - see output at the bottom.
>> >>
>> >> Stylised conclusions,
>> >> - Reading/writing a "tall" (nrows >> ncols) dataframe is much more 
>> >> performant than a "wide" dataframe.
>> >> - with the Arrow format (as opposed to parquet), the difference is much 
>> >> smaller.
>> >> - Similar results on Windows & Linux, and for Arrow's parquet vs 
>> >> fastparquet.
>> >>
>> >> Is there something pathological about the parquet format that manifests 
>> >> in this regime, or is it rather that the code might not have been 
>> >> optimised for this? Aware that ncols >> nrows is not ideal, but was 
>> >> hoping for less of a cliff.
>> >>
>> >> Happy to dig in, but polling experts first.
>> >>
>> >> Best,
>> >> -J
>> >>
>> >> >python benchmark.py
>> >> 2021-07-13 16:31:54.786 INFO     Writing parquet to 
>> >> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq 
>> >> [Arrow]
>> >> 2021-07-13 16:31:55.123 INFO     Written.
>> >> 2021-07-13 16:31:55.123 INFO     Writing parquet to 
>> >> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq 
>> >> [Arrow]
>> >> 2021-07-13 16:31:57.155 INFO     Written.
>> >> 2021-07-13 16:31:57.155 INFO     Writing parquet to 
>> >> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_fpq.pq 
>> >> [FastParquet]
>> >> 2021-07-13 16:31:57.789 INFO     Written.
>> >> 2021-07-13 16:31:57.790 INFO     Writing parquet to 
>> >> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_fpq.pq 
>> >> [FastParquet]
>> >> 2021-07-13 16:32:03.613 INFO     Written.
>> >> 2021-07-13 16:32:03.613 INFO     Reading parquet from 
>> >> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq 
>> >> [Arrow]
>> >> 2021-07-13 16:32:03.890 INFO     Read.
>> >> 2021-07-13 16:32:03.899 INFO     Reading parquet from 
>> >> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq 
>> >> [Arrow]
>> >> 2021-07-13 16:32:08.727 INFO     Read.
>> >> 2021-07-13 16:32:08.737 INFO     Reading parquet from 
>> >> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq 
>> >> [FastParquet]
>> >> 2021-07-13 16:32:08.983 INFO     Read.
>> >> 2021-07-13 16:32:08.991 INFO     Reading parquet from 
>> >> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq 
>> >> [FastParquet]
>> >> 2021-07-13 16:32:11.580 INFO     Read.
>> >> 2021-07-13 16:32:11.589 INFO     Writing Arrow to 
>> >> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
>> >> 2021-07-13 16:32:13.057 INFO     Arrow written.
>> >> 2021-07-13 16:32:13.078 INFO     Writing Arrow to 
>> >> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
>> >> 2021-07-13 16:32:13.425 INFO     Arrow written.
>> >> 2021-07-13 16:32:13.434 INFO     Reading Arrow from 
>> >> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
>> >> 2021-07-13 16:32:13.620 INFO     Read.
>> >> 2021-07-13 16:32:13.637 INFO     Reading Arrow from 
>> >> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
>> >> 2021-07-13 16:32:13.711 INFO     Read.
>> >>
>> >>
