Hi Uwe, thanks. I've put the results in a Google Sheet, linked here:
https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0

Kind Regards
Simba

On Wed, 24 Jan 2018 at 15:07 Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Simba,
>
> your plots did not come through. Try uploading them somewhere and link
> to them in the mails. Attachments are always stripped on Apache
> mailing lists.
>
> Uwe
>
> On Wed, Jan 24, 2018, at 1:48 PM, simba nyatsanga wrote:
> > Hi Everyone,
> >
> > I did some benchmarking to compare the on-disk sizes produced when
> > writing Pandas DataFrames to Parquet files with Snappy and Brotli
> > compression. I then compared these numbers with those of my current
> > file storage solution.
> >
> > In my current (non Arrow+Parquet) solution, every column in a
> > DataFrame is extracted as a NumPy array, compressed with blosc and
> > stored as a binary file, alongside a small accompanying JSON file
> > with some metadata. Attached are my results for several long and
> > wide DataFrames:
> >
> > [Attachment: Screen Shot 2018-01-24 at 14.40.48.png]
> >
> > I was also able to corroborate this finding by looking at the number
> > of allocated blocks:
> >
> > [Attachment: Screen Shot 2018-01-24 at 14.45.29.png]
> >
> > From what I gather, Brotli and Snappy perform significantly better
> > for wide DataFrames, but the reverse is true for long DataFrames.
> >
> > The DataFrames used in the benchmark are composed entirely of
> > floats, and my understanding is that type-specific encoding is
> > applied to the Parquet file, and that the compression codecs are
> > applied to individual segments of the file.
> >
> > I'd like to get a better understanding of this disk size disparity,
> > specifically whether any additional encoding/compression headers are
> > added to the Parquet file in the long-DataFrame case.
> >
> > Kind Regards
> > Simba
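For anyone who wants to reproduce this locally, below is a minimal sketch of the kind of comparison described in the quoted mail, assuming pandas, pyarrow, numpy and python-blosc are installed. The file names, DataFrame shape and blosc settings are illustrative only, not the exact benchmark setup; the last few lines show how to inspect the per-row-group / per-column-chunk metadata that the Parquet encodings and codecs operate on.

```python
# Sketch: Parquet (Snappy/Brotli) vs. blosc-per-column on-disk size.
# Assumptions: pandas, pyarrow, numpy, python-blosc; shapes and file
# names are made up for illustration.
import os

import blosc
import numpy as np
import pandas as pd
import pyarrow.parquet as pq

# A wide, float-only DataFrame, similar in spirit to the benchmark.
df = pd.DataFrame(np.random.randn(1_000, 500),
                  columns=[f"c{i}" for i in range(500)])

# Parquet with Snappy and Brotli compression (pyarrow engine).
df.to_parquet("wide_snappy.parquet", compression="snappy")
df.to_parquet("wide_brotli.parquet", compression="brotli")

# Blosc-per-column baseline: compress each column's NumPy array
# separately, roughly mirroring the non-Arrow storage layout.
blosc_bytes = 0
for name in df.columns:
    arr = np.ascontiguousarray(df[name].to_numpy())
    blosc_bytes += len(
        blosc.compress(arr.tobytes(), typesize=arr.dtype.itemsize))

for path in ("wide_snappy.parquet", "wide_brotli.parquet"):
    st = os.stat(path)
    # st_blocks is the number of 512-byte blocks allocated on disk (POSIX).
    print(path, st.st_size, "bytes,", st.st_blocks, "blocks")
print("blosc per-column total:", blosc_bytes, "bytes")

# Parquet metadata: row groups and column chunks are the segments the
# type-specific encodings and compression codecs are applied to.
meta = pq.ParquetFile("wide_snappy.parquet").metadata
print(meta.num_row_groups, "row group(s)")
print(meta.row_group(0).column(0))  # encodings, compressed/uncompressed sizes
```

Printing the column-chunk metadata for the long and wide cases side by side should make it easier to see where the extra bytes in the long-DataFrame files are going (e.g. per-page and per-chunk overhead versus the actual encoded data).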