Thanks Ted. I will echo these comments and recommend running tests on
larger and preferably "real" datasets rather than randomly generated
ones. The more repetition and less entropy in a dataset, the better
Parquet performs relative to other storage options. Web-scale datasets
often exhibit these characteristics.
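
For a quick way to see that effect, here is a rough sketch using pandas
and pyarrow (the column names, sizes, and codec below are arbitrary,
not taken from the benchmark):

    import os
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    n = 1_000_000

    # High-entropy column: uniform random floats leave little for the
    # encodings or the codec to exploit.
    random_df = pd.DataFrame({"x": np.random.uniform(size=n)})

    # Low-entropy column: heavily repeated values, closer to what a lot
    # of real-world measurement or log data looks like.
    repeated_df = pd.DataFrame({"x": np.random.choice([0.1, 0.2, 0.3], size=n)})

    for name, df in [("random", random_df), ("repeated", repeated_df)]:
        path = f"/tmp/{name}.parquet"
        pq.write_table(pa.Table.from_pandas(df), path, compression="snappy")
        print(name, os.path.getsize(path), "bytes")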

If you can publish your benchmarking code that would also be helpful!

best
Wes

On Wed, Jan 24, 2018 at 1:21 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Simba
>
> Nice summary. I think there may be some issues with your tests. In
> particular, you are storing essentially uniform random values. That
> might be a viable test in some situations, but there are many others
> where there is considerably less entropy in the data being stored. For
> instance, if you store measurements, it is very typical to have very
> strong correlations. Likewise if the rows are, say, the time evolution
> of an optimization. You also have a very small number of rows, which
> can penalize systems that expect to amortize column metadata over more
> data.
>
> This test might match your situation, but I would be leery of drawing
> overly broad conclusions from this single data point.
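
Ted's row-count point is also easy to check directly. A rough sketch
(again pandas + pyarrow, with arbitrary shapes) that prints bytes per
row, which should drop as the fixed footer and column-metadata cost
gets amortized over more rows:

    import os
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # The footer and per-column-chunk metadata are a roughly fixed cost,
    # so bytes per row should fall noticeably as the row count grows.
    for n_rows in [100, 10_000, 1_000_000]:
        df = pd.DataFrame(np.random.uniform(size=(n_rows, 10)),
                          columns=[f"c{i}" for i in range(10)])
        path = f"/tmp/rows_{n_rows}.parquet"
        pq.write_table(pa.Table.from_pandas(df), path, compression="snappy")
        size = os.path.getsize(path)
        print(n_rows, "rows:", size, "bytes,",
              round(size / n_rows, 2), "bytes/row")
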
>
>
>
> On Jan 24, 2018 5:44 AM, "simba nyatsanga" <simnyatsa...@gmail.com> wrote:
>
>> Hi Uwe, thanks.
>>
>> I've attached a Google Sheet link
>>
>> https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0
>>
>> Kind Regards
>> Simba
>>
>> On Wed, 24 Jan 2018 at 15:07 Uwe L. Korn <uw...@xhochy.com> wrote:
>>
>> > Hello Simba,
>> >
>> > your plots did not come through. Try uploading them somewhere and
>> > linking to them in your mails. Attachments are always stripped on
>> > Apache mailing lists.
>> > Uwe
>> >
>> >
>> > On Wed, Jan 24, 2018, at 1:48 PM, simba nyatsanga wrote:
>> > > Hi Everyone,
>> > >
>> > > I did some benchmarking to compare the on-disk size when writing
>> > > Pandas DataFrames to Parquet files using Snappy and Brotli
>> > > compression. I then compared these numbers with those of my current
>> > > file storage solution.
>> > >
>> > > In my current (non-Arrow+Parquet) solution, every column in a
>> > > DataFrame is extracted as a NumPy array, then compressed with blosc
>> > > and stored as a binary file. Additionally, there's a small
>> > > accompanying JSON file with some metadata. Attached are my results
>> > > for several long and wide DataFrames:
>> > >
>> > > Screen Shot 2018-01-24 at 14.40.48.png
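
For anyone wanting to reproduce the comparison, a rough sketch of the
two setups as described above (the blosc codec, DataFrame shape, and
paths are placeholders, not the actual benchmark code):

    import json
    import os

    import blosc  # pip install blosc
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame(np.random.uniform(size=(1_000, 1_000)),
                      columns=[f"c{i}" for i in range(1_000)])

    # Approach A: one blosc-compressed binary blob per column plus a
    # small JSON metadata file (codec and level are placeholders).
    meta = {}
    os.makedirs("/tmp/blosc_cols", exist_ok=True)
    for col in df.columns:
        arr = df[col].values
        blob = blosc.compress(arr.tobytes(), typesize=arr.dtype.itemsize,
                              cname="lz4")
        with open(f"/tmp/blosc_cols/{col}.bin", "wb") as f:
            f.write(blob)
        meta[col] = {"dtype": str(arr.dtype), "length": len(arr)}
    with open("/tmp/blosc_cols/meta.json", "w") as f:
        json.dump(meta, f)

    # Approach B: a single Parquet file written via Arrow, with Snappy
    # or Brotli compression.
    pq.write_table(pa.Table.from_pandas(df), "/tmp/wide.parquet",
                   compression="brotli")
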
>> > >
>> > > I was also able to corroborate this finding by looking at the
>> > > number of allocated blocks:
>> > >
>> > > Screen Shot 2018-01-24 at 14.45.29.png
>> > >
>> > > From what I gather, Brotli and Snappy perform significantly better
>> > > for wide DataFrames. However, the reverse is true for long
>> > > DataFrames.
>> > >
>> > > The DataFrames used in the benchmark are entirely composed of
>> > > floats, and my understanding is that type-specific encoding is
>> > > employed in the Parquet file. Additionally, the compression codecs
>> > > are applied to individual segments of the Parquet file.
>> > >
>> > > I'd like to get a better understanding of this disk size disparity,
>> > > specifically whether there are any additional encoding/compression
>> > > headers added to the Parquet file in the long DataFrames case.
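
One way to dig into that disparity is to inspect the file's own
metadata, which exposes the encodings, codec, and compressed vs.
uncompressed size of each column chunk, plus the size of the footer
itself. A minimal sketch with pyarrow (the path is a placeholder):

    import pyarrow.parquet as pq

    # Per-column-chunk encodings, codec, compressed vs. uncompressed
    # sizes, and the serialized footer size.
    meta = pq.ParquetFile("/tmp/long.parquet").metadata
    print("row groups:", meta.num_row_groups,
          "| columns:", meta.num_columns,
          "| footer bytes:", meta.serialized_size)

    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for c in range(meta.num_columns):
            col = row_group.column(c)
            print(rg, col.path_in_schema, col.encodings, col.compression,
                  col.total_compressed_size, "/", col.total_uncompressed_size)
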
>> > > Kind Regards
>> > > Simba
>> >
>> >
>>
