Simba

Nice summary. I think there may be some issues with your tests. In
particular, you are storing essentially uniform random values. That might
be a viable test in some situations, but there are many where the data
being stored has considerably less entropy. For instance, if you store
measurements, it is very typical to have very strong correlations;
likewise if the rows are, say, the time evolution of an optimization. You
also have a very small number of rows, which can penalize systems that
expect to amortize column metadata over more data.

This test might match your situation, but I would be leery of drawing
overly broad conclusions from this single data point.
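
If it helps, here is a rough sketch of the kind of contrast I mean (pandas
plus pyarrow; the row count, column count, drift scale, and rounding are
arbitrary choices on my part, not a claim about your data). It writes one
frame of essentially uniform random values and one of slowly drifting,
limited-precision "measurements" with Snappy and Brotli, then prints the
resulting file sizes; the larger row count also lets the per-column
metadata amortize:

    import os
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    n_rows, n_cols = 1000000, 16
    cols = ["c%d" % i for i in range(n_cols)]

    # Case 1: essentially uniform random values (high entropy).
    uniform = pd.DataFrame(np.random.random((n_rows, n_cols)), columns=cols)

    # Case 2: slowly drifting measurements, rounded to a fixed precision
    # (strong row-to-row correlation, much lower entropy).
    drift = np.cumsum(np.random.normal(scale=1e-3, size=(n_rows, n_cols)), axis=0)
    correlated = pd.DataFrame(np.round(drift, 3), columns=cols)

    for name, df in [("uniform", uniform), ("correlated", correlated)]:
        for codec in ["snappy", "brotli"]:
            path = "%s_%s.parquet" % (name, codec)
            pq.write_table(pa.Table.from_pandas(df), path, compression=codec)
            print(name, codec, os.path.getsize(path), "bytes")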



On Jan 24, 2018 5:44 AM, "simba nyatsanga" <simnyatsa...@gmail.com> wrote:

> Hi Uwe, thanks.
>
> I've attached a Google Sheet link
>
> https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0
>
> Kind Regards
> Simba
>
> On Wed, 24 Jan 2018 at 15:07 Uwe L. Korn <uw...@xhochy.com> wrote:
>
> > Hello Simba,
> >
> > your plots did not come through. Try uploading them somewhere and link
> > to them in the mails. Attachments are always stripped on Apache
> > mailing lists.
> > Uwe
> >
> >
> > On Wed, Jan 24, 2018, at 1:48 PM, simba nyatsanga wrote:
> > > Hi Everyone,
> > >
> > > I did some benchmarking to compare the disk size performance when
> > > writing Pandas DataFrames to parquet files using Snappy and Brotli
> > > compression. I then compared these numbers with those of my current
> > > file storage solution.
> > >
> > > In my current (non Arrow+Parquet) solution, every column in a
> > > DataFrame is extracted as a NumPy array, then compressed with blosc
> > > and stored as a binary file. Additionally there's a small
> > > accompanying json file with some metadata. Attached are my results
> > > for several long and wide DataFrames:
> > >
> > > Screen Shot 2018-01-24 at 14.40.48.png
> > >
> > > I was also able to corroborate this finding by looking at the number
> > > of allocated blocks:
> > >
> > > Screen Shot 2018-01-24 at 14.45.29.png
> > >
> > > From what I gather, Brotli and Snappy perform significantly better
> > > for wide DataFrames. However, the reverse is true for long
> > > DataFrames.
> > >
> > > The DataFrames used in the benchmark are entirely composed of
> > > floats, and my understanding is that there is type-specific encoding
> > > employed on the parquet file. Additionally, the compression codecs
> > > are applied to individual segments of the parquet file.
> > >
> > > I'd like to get a better understanding of this disk size disparity,
> > > specifically whether there are any additional encoding/compression
> > > headers added to the parquet file in the long DataFrames case.
> > >
> > > Kind Regards
> > > Simba
> >
> >
>
