That's awesome. I've compiled master locally and can indeed confirm the huge performance improvement. Thanks! -J
On Wed, Jul 14, 2021 at 1:01 PM Weston Pace <[email protected]> wrote: > My mistake, the attachment was there. I simply did not check for it > very thoroughly. > > I played around with this a bit more and realized the difference was > simply that I was running on the latest build and not using 4.0.1. > Digging a bit further I think you are seeing [1] and there is a bit > more discussion you may be interested in at [2]. The good news is > that you should see much better performance with 5.0.0 (releasing > soon). It isn't quite the 15% I was seeing in my own benchmark, more > like 3.5x slower for reading. Rerunning your benchmark script against > the latest master I get the following output: > > ``` > 2021-07-14 01:53:38.706 INFO Writing parquet to > /tmp/tmpk8zj9k1u/example_tall_apq.pq [Arrow] > 2021-07-14 01:53:38.906 INFO Written. > 2021-07-14 01:53:38.906 INFO Writing parquet to > /tmp/tmpk8zj9k1u/example_wide_apq.pq [Arrow] > 2021-07-14 01:53:39.688 INFO Written. > 2021-07-14 01:53:39.688 INFO Writing parquet to > /tmp/tmpk8zj9k1u/example_tall_fpq.pq [FastParquet] > 2021-07-14 01:53:40.077 INFO Written. > 2021-07-14 01:53:40.077 INFO Writing parquet to > /tmp/tmpk8zj9k1u/example_wide_fpq.pq [FastParquet] > 2021-07-14 01:53:43.376 INFO Written. > 2021-07-14 01:53:43.377 INFO Reading parquet from > /tmp/tmpk8zj9k1u/example_tall_apq.pq [Arrow] > 2021-07-14 01:53:43.473 INFO Read. > 2021-07-14 01:53:43.474 INFO Reading parquet from > /tmp/tmpk8zj9k1u/example_wide_apq.pq [Arrow] > 2021-07-14 01:53:43.825 INFO Read. > 2021-07-14 01:53:43.826 INFO Reading parquet from > /tmp/tmpk8zj9k1u/example_tall_apq.pq [FastParquet] > 2021-07-14 01:53:43.975 INFO Read. > 2021-07-14 01:53:43.975 INFO Reading parquet from > /tmp/tmpk8zj9k1u/example_wide_apq.pq [FastParquet] > 2021-07-14 01:53:45.012 INFO Read. > 2021-07-14 01:53:45.013 INFO Writing Arrow to > /tmp/tmpk8zj9k1u/example_wide.arrows > 2021-07-14 01:53:45.581 INFO Arrow written. > 2021-07-14 01:53:45.585 INFO Writing Arrow to > /tmp/tmpk8zj9k1u/example_tall.arrows > 2021-07-14 01:53:45.675 INFO Arrow written. > 2021-07-14 01:53:45.676 INFO Reading Arrow from > /tmp/tmpk8zj9k1u/example_wide.arrows > 2021-07-14 01:53:45.780 INFO Read. > 2021-07-14 01:53:45.783 INFO Reading Arrow from > /tmp/tmpk8zj9k1u/example_tall.arrows > 2021-07-14 01:53:45.796 INFO Read. > ``` > > [1] https://issues.apache.org/jira/browse/ARROW-12736 > [2] https://issues.apache.org/jira/browse/ARROW-11469 > > On Tue, Jul 13, 2021 at 11:06 PM Joris Peeters > <[email protected]> wrote: > > > > I added the script as `benchmark.py` in my original post. Maybe it got > filtered somewhere, but > https://lists.apache.org/api/email.lua?attachment=true&id=r2a7b4fe367184aabfb335fe0a5dc1d2a871ed52d51047b8130bf1fb5@%3Cuser.arrow.apache.org%3E&file=133b9b7a2693cd02fd841cee58b2bfab059f22f17d377df0c62c573d5aa09fb3 > might be a stable link. > > > > So, interestingly, I am reproducing my own findings with your script. > The tall table takes about 0.8s to load (as you also found), but the wide > one takes 6.4s. I'm surprised you see ~=0.8s for the wide read as well. > Throughout both my own benchmarks and yours, reading a wide one was always > significantly slower, on different machines and Windows/Linux - and even > across different parquet implementations (Arrow's <-> fastparquet). > > > > On Tue, Jul 13, 2021 at 7:11 PM Weston Pace <[email protected]> > wrote: > >> > >> The short answer is no, there is nothing "pathological" about parquet, > >> it should be more or less as suited for wide columns as arrow's IPC > >> format. Both formats will require additional metadata when there are > >> more columns and compressibility may differ (although .arrows data is > >> often uncompressed). > >> > >> Can you provide your test script? I don't get quite the same results. > >> For my test I created two tables, one that was 10,000 columns by 8,000 > >> rows and one that was 80,000,000 rows in 1 column. There is simply > >> more metadata when you have 10k rows and less opportunity for > >> compression. As a result the file sizes were 611M for the tall and > >> 739M for the wide so the wide requires about 20% more data. Reading > >> times (hot-in-cache reads) were ~.73s for the tall and ~.84s for the > >> wide and so the wide takes about 15% more time to read. This seems > >> about right to me. > >> > >> ## Writing script > >> > >> import pyarrow as pa > >> import pyarrow.parquet as pq > >> import numpy as np > >> > >> TALL_ROWS = 80_000_000 > >> TALL_COLS = 1 > >> WIDE_ROWS = 8_000 > >> WIDE_COLS = 10_000 > >> > >> tall_data = np.random.rand(TALL_COLS, TALL_ROWS) > >> wide_data = np.random.rand(WIDE_COLS, WIDE_ROWS) > >> > >> tall_table = pa.Table.from_arrays([tall_data[0]], names=["values"]) > >> pq.write_table(tall_table, '/tmp/tall.pq') > >> > >> wide_names = [f'f{i}' for i in range(WIDE_COLS)] > >> wide_table = pa.Table.from_arrays(wide_data, names=wide_names) > >> pq.write_table(wide_table, '/tmp/wide.pq') > >> > >> ## Reading script > >> > >> import pyarrow.parquet as pq > >> > >> table = pq.read_table('/tmp/tall.pq') > >> print(table.num_rows) > >> print(table.num_columns) > >> > >> On Tue, Jul 13, 2021 at 6:23 AM Martin Percossi <[email protected]> > wrote: > >> > > >> > An alternative representation would be to have a single settlement > price column, and add a stock_id column. Instead of a single row for each > time step, you would now have, say, 10K rows - one for each stock. > >> > > >> > I think this will yield better performance. > >> > > >> > On Tue, 13 Jul 2021, 18:12 Joris Peeters, <[email protected]> > wrote: > >> >> > >> >> Hello, > >> >> > >> >> Sending to user@arrow, as that appears the best place for parquet > questions atm, but feel free to redirect me. > >> >> > >> >> My objective is to store financial data in Parquet files, and read > it out fast. > >> >> The columns represent stocks (~= 10,000 or so), and each row is a > date (~= 8000, e.g. 30 years). Values are e.g. settlement prices. I might > want to use short row groups of e.g. a year each, for quickly getting to > smaller date ranges, or query for a subset of columns (stocks). > >> >> > >> >> The appeal of parquet is that I could store all of this stuff in one > file, and use the row-groups + column-select for slicing, rather than have > a ton of smaller files etc. Would also integrate well with various ML tech. > >> >> > >> >> When doing some basic performance testing, with random data, I > noticed that the performance for tables with many columns seems fairly > poor. I've attached a little benchmark script - see output at the bottom. > >> >> > >> >> Stylised conslusions, > >> >> - Reading/writing a "tall" (nrows >> ncols) dataframe is much more > performant than a "wide" dataframe. > >> >> - with the Arrow format (as opposed to parquet), the difference is > much smaller. > >> >> - Similar results on Windows & Linux, and for Arrow's parquet vs > fastparquet. > >> >> > >> >> Is there something pathological about the parquet format that > manifests in this regime, or is it rather that the code might not have been > optimised for this? Aware that ncols >> nrows is not ideal, but was hoping > for less of a cliff. > >> >> > >> >> Happy to dig in, but polling experts first. > >> >> > >> >> Best, > >> >> -J > >> >> > >> >> >python benchmark.py > >> >> 2021-07-13 16:31:54.786 INFO Writing parquet to > C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow] > >> >> 2021-07-13 16:31:55.123 INFO Written. > >> >> 2021-07-13 16:31:55.123 INFO Writing parquet to > C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow] > >> >> 2021-07-13 16:31:57.155 INFO Written. > >> >> 2021-07-13 16:31:57.155 INFO Writing parquet to > C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_fpq.pq > [FastParquet] > >> >> 2021-07-13 16:31:57.789 INFO Written. > >> >> 2021-07-13 16:31:57.790 INFO Writing parquet to > C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_fpq.pq > [FastParquet] > >> >> 2021-07-13 16:32:03.613 INFO Written. > >> >> 2021-07-13 16:32:03.613 INFO Reading parquet from > C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow] > >> >> 2021-07-13 16:32:03.890 INFO Read. > >> >> 2021-07-13 16:32:03.899 INFO Reading parquet from > C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow] > >> >> 2021-07-13 16:32:08.727 INFO Read. > >> >> 2021-07-13 16:32:08.737 INFO Reading parquet from > C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq > [FastParquet] > >> >> 2021-07-13 16:32:08.983 INFO Read. > >> >> 2021-07-13 16:32:08.991 INFO Reading parquet from > C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq > [FastParquet] > >> >> 2021-07-13 16:32:11.580 INFO Read. > >> >> 2021-07-13 16:32:11.589 INFO Writing Arrow to > C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows > >> >> 2021-07-13 16:32:13.057 INFO Arrow written. > >> >> 2021-07-13 16:32:13.078 INFO Writing Arrow to > C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows > >> >> 2021-07-13 16:32:13.425 INFO Arrow written. > >> >> 2021-07-13 16:32:13.434 INFO Reading Arrow from > C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows > >> >> 2021-07-13 16:32:13.620 INFO Read. > >> >> 2021-07-13 16:32:13.637 INFO Reading Arrow from > C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows > >> >> 2021-07-13 16:32:13.711 INFO Read. > >> >> > >> >> >
