Hi Uwe, thanks.

I've put the results in a Google Sheet; here's the link:

https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0

Kind Regards
Simba

On Wed, 24 Jan 2018 at 15:07 Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Simba,
>
> Your plots did not come through. Try uploading them somewhere and
> linking to them in the emails. Attachments are always stripped on
> Apache mailing lists.
> Uwe
>
>
> On Wed, Jan 24, 2018, at 1:48 PM, simba nyatsanga wrote:
> > Hi Everyone,
> >
> > I did some benchmarking to compare the on-disk size when writing
> > Pandas DataFrames to Parquet files using Snappy and Brotli
> > compression. I then compared these numbers with those of my current
> > file storage solution.
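> >
> > For reference, the Parquet write path I'm comparing is roughly the
> > following (a minimal sketch; the file names, frame shape and exact
> > pyarrow calls are illustrative rather than my actual benchmark
> > script, and it assumes pyarrow was built with Snappy and Brotli
> > support):
> >
> >     import os
> >
> >     import numpy as np
> >     import pandas as pd
> >     import pyarrow as pa
> >     import pyarrow.parquet as pq
> >
> >     # A "long" frame: many rows, few float columns (shape is arbitrary).
> >     df = pd.DataFrame(np.random.randn(1000000, 4),
> >                       columns=["c0", "c1", "c2", "c3"])
> >
> >     # Write the same data with each codec and compare the file sizes.
> >     table = pa.Table.from_pandas(df)
> >     for codec in ("snappy", "brotli"):
> >         path = "bench_{}.parquet".format(codec)
> >         pq.write_table(table, path, compression=codec)
> >         print(codec, os.path.getsize(path), "bytes")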
> >
> > In my current (non-Arrow+Parquet) solution, every column in a
> > DataFrame is extracted as a NumPy array, compressed with blosc and
> > stored as a binary file. Additionally, there's a small accompanying
> > JSON file with some metadata. Attached are my results for several
> > long and wide DataFrames:
> > [Attachment stripped: Screen Shot 2018-01-24 at 14.40.48.png]
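> >
> > The current storage path looks roughly like this (again a sketch;
> > the blosc defaults and file naming are illustrative):
> >
> >     import json
> >
> >     import blosc
> >     import numpy as np
> >
> >     def save_frame(df, prefix):
> >         # One blosc-compressed binary file per column, plus a small
> >         # JSON metadata file describing dtypes and lengths.
> >         meta = {"columns": {}}
> >         for name in df.columns:
> >             arr = np.ascontiguousarray(df[name].values)
> >             payload = blosc.compress(arr.tobytes(),
> >                                      typesize=arr.dtype.itemsize)
> >             with open("{}_{}.bin".format(prefix, name), "wb") as f:
> >                 f.write(payload)
> >             meta["columns"][name] = {"dtype": str(arr.dtype),
> >                                      "length": len(arr)}
> >         with open("{}.json".format(prefix), "w") as f:
> >             json.dump(meta, f)
> >
> >     # usage: save_frame(df, "bench_blosc")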
> >
> > I was also able to corroborate this finding by looking at the number
> > of allocated blocks:
> > [Attachment stripped: Screen Shot 2018-01-24 at 14.45.29.png]
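> >
> > For reference, one way to get the allocated block count (a sketch;
> > it assumes st_blocks is in the usual 512-byte POSIX units and reuses
> > the illustrative file name from the earlier example):
> >
> >     import os
> >
> >     st = os.stat("bench_brotli.parquet")
> >     # st_blocks is the number of 512-byte blocks allocated on disk,
> >     # which can differ from the apparent file size.
> >     print(st.st_blocks, "blocks,", st.st_blocks * 512, "bytes allocated")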
> >
> > From what I gather, Brotli and Snappy perform significantly better
> > than my current solution for wide DataFrames; however, the reverse
> > is true for long DataFrames.
> >
> > The DataFrames used in the benchmark are entirely composed of
> > floats, and my understanding is that type-specific encoding is
> > employed in the Parquet file. Additionally, the compression codecs
> > are applied to individual segments (the data pages within each
> > column chunk) of the Parquet file.
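> >
> > One way to see which encodings, codecs and compressed sizes end up
> > in each column chunk is something like the following (a sketch; it
> > assumes a pyarrow version that exposes the column-chunk metadata
> > attributes used below, and reuses the earlier illustrative file
> > name):
> >
> >     import pyarrow.parquet as pq
> >
> >     md = pq.ParquetFile("bench_brotli.parquet").metadata
> >     print(md.num_row_groups, "row groups,", md.num_columns, "columns")
> >     for rg in range(md.num_row_groups):
> >         for col in range(md.num_columns):
> >             chunk = md.row_group(rg).column(col)
> >             # Per-column-chunk encodings, codec and sizes.
> >             print(chunk.path_in_schema, chunk.encodings,
> >                   chunk.compression, chunk.total_compressed_size,
> >                   chunk.total_uncompressed_size)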
> >
> > I'd like to get a better understanding of this disk-size disparity,
> > specifically whether any additional encoding/compression headers are
> > added to the Parquet file in the long-DataFrame case.
> >
> > Kind Regards
> > Simba
>
>
