Re: [Parquet][C++][Python] Maximum Row Group Length Default

2021-11-22 Thread Weston Pace
Somewhat, though maybe not as bad. The Arrow format only lists the schema once and the per-batch data is just lengths. For disk size I ran an experiment on 100k rows x 10k columns of float64 and got:
-rw-rw-r-- 1 pace pace 8005840602 Nov 22 09:35 10_batches.arrow
-rw-rw-r-- 1 pace pace
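As a rough sanity check on the file size quoted above, here is a minimal Python sketch. It assumes the file holds exactly 100k x 10k uncompressed float64 values and, going only by the file name, that it was written as 10 record batches; neither assumption is confirmed in the message, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the 10_batches.arrow size quoted in the message.
# Assumptions (not stated in the thread): no compression, and exactly 10 batches.
ROWS = 100_000
COLS = 10_000
BYTES_PER_FLOAT64 = 8
ASSUMED_BATCHES = 10  # inferred from the file name only

# Size of the raw values alone.
raw_bytes = ROWS * COLS * BYTES_PER_FLOAT64

# Size reported by `ls` in the message.
file_bytes = 8_005_840_602

# Everything that is not raw values (schema, per-batch metadata, padding, ...).
overhead_bytes = file_bytes - raw_bytes
overhead_fraction = overhead_bytes / raw_bytes

# Average overhead per column per batch under the 10-batch assumption.
overhead_per_column_batch = overhead_bytes / (COLS * ASSUMED_BATCHES)

print(f"raw values:      {raw_bytes:,} bytes")
print(f"overhead:        {overhead_bytes:,} bytes ({overhead_fraction:.4%})")
print(f"per column/batch: {overhead_per_column_batch:.1f} bytes")
```

Under these assumptions the non-data portion of the file is well below 0.1% of the total, which is in line with the "maybe not as bad" remark.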

Re: [Parquet][C++][Python] Maximum Row Group Length Default

2021-11-22 Thread Aldrin
Hi Weston, This is slightly off-topic, but I'm curious whether what you mentioned about the large metadata blocks (inlined below) also applies to the IPC format. I am working with matrices and representing them as tables that can have hundreds of thousands of columns, but I'm splitting them into row

Re: [Parquet][C++][Python] Maximum Row Group Length Default

2021-11-17 Thread Weston Pace
> What are the tradeoffs between a small and a large row group size? I can give some perspective from the C++ work I've been doing. I believe this inspired some of the recommendations Jon is referring to. At the moment we have a number of limitations that aren't limitations in the format but

Re: [Parquet][C++][Python] Maximum Row Group Length Default

2021-11-17 Thread Jorge Cardoso Leitão
What are the tradeoffs between a small and a large row group size? Is it that a small value allows for quicker random access (as we can seek row groups based on the number of rows they have), while a larger value allows for higher dict-encoding and compression ratios? Best, Jorge On Wed, Nov

Re: [Parquet][C++][Python] Maximum Row Group Length Default

2021-11-17 Thread Jonathan Keane
This doesn't address the ticket that was raised about the large number of row groups, but for some visibility: there is some work to change the row group sizing to be based on the size of the data instead of a static number of rows [1], as well as exposing a few more knobs to tune [2]. There is a bit of prior art in

Re: [Parquet][C++][Python] Maximum Row Group Length Default

2021-11-17 Thread Joris Van den Bossche
In addition, would it be useful to be able to change this max_row_group_length from Python? Currently that writer property can't be changed from Python. You can only specify the row_group_size (chunk_size in C++) when writing a table, but that's currently only useful to set it to something that is
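To illustrate how the two settings mentioned above might interact, here is a small pure-Python sketch. It assumes the writer caps the requested row_group_size (chunk_size) at max_row_group_length and then splits the table into groups of that length. That capping behavior is an assumption made for illustration, and the helper names are hypothetical, not part of the Arrow or Parquet API.

```python
import math

# Default quoted in the thread: 67108864 rows, i.e. 64 * 1024 * 1024.
DEFAULT_MAX_ROW_GROUP_LENGTH = 64 * 1024 * 1024


def effective_row_group_length(row_group_size, max_row_group_length=DEFAULT_MAX_ROW_GROUP_LENGTH):
    """Hypothetical helper: rows per group, assuming the request is capped at the maximum."""
    return min(row_group_size, max_row_group_length)


def row_group_count(num_rows, row_group_size, max_row_group_length=DEFAULT_MAX_ROW_GROUP_LENGTH):
    """Hypothetical helper: number of row groups needed to hold num_rows rows."""
    length = effective_row_group_length(row_group_size, max_row_group_length)
    return math.ceil(num_rows / length)


# A request below the maximum takes effect as given.
print(row_group_count(100_000_000, 1_000_000))
# A request above the maximum is capped (under the stated assumption).
print(row_group_count(100_000_000, 1_000_000_000))
```

Under this assumption, a row_group_size larger than the maximum has no additional effect, which is why being able to adjust max_row_group_length itself could matter.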

Re: [Parquet][C++][Python] Maximum Row Group Length Default

2021-11-16 Thread Sarah Gilmore
> I was wondering if anyone could elaborate on why the default maximum row group length is set to 67108864 <https://github.com/apache/arrow/blob/5c936560c1da003baf714d67dc92f25670730c84/cpp/src/parquet/properties.h#L

Re: [Parquet][C++][Python] Maximum Row Group Length Default

2021-11-15 Thread Micah Kornfield
> I was wondering if anyone could elaborate on why the default maximum row group length is set to 67108864 <https://github.com/apache/arrow/blob/5c936560c1da003baf714d67dc92f25670730c84/cpp/src/parquet/properties.h#L97>. From Apache Parquet's documentation, the recommended row group size

[Parquet][C++][Python] Maximum Row Group Length Default

2021-11-15 Thread Sarah Gilmore
Hi all, I was wondering if anyone could elaborate on why the default maximum row group length is set to 67108864. From Apache Parquet's documentation, the recommended row group size