Hello Simba,

Your plots did not come through. Try uploading them somewhere and
linking to them in the mails. Attachments are always stripped on Apache
mailing lists.
Uwe


On Wed, Jan 24, 2018, at 1:48 PM, simba nyatsanga wrote:
> Hi Everyone,
> 
> I did some benchmarking to compare the on-disk size when writing
> Pandas DataFrames to Parquet files using Snappy and Brotli
> compression. I then compared these numbers with those of my current
> file storage solution.
>
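> A minimal sketch of the Parquet side of the comparison (the pyarrow
> engine is assumed, and the example frame and file names are
> illustrative rather than the exact benchmark setup):
>
>     import os
>
>     import numpy as np
>     import pandas as pd
>
>     # Float-only example frame, as in the benchmark.
>     df = pd.DataFrame(np.random.randn(1000000, 16),
>                       columns=["c%d" % i for i in range(16)])
>
>     # Write the same frame with each codec and compare the file sizes.
>     for codec in ("snappy", "brotli"):
>         path = "bench_%s.parquet" % codec
>         df.to_parquet(path, engine="pyarrow", compression=codec)
>         print(codec, os.path.getsize(path), "bytes on disk")
>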
> In my current (non Arrow+Parquet) solution, every column in a
> DataFrame is extracted as a NumPy array, then compressed with blosc
> and stored as a binary file. Additionally, there's a small
> accompanying JSON file with some metadata (sketched below). Attached
> are my results for several long and wide DataFrames:
>
> [Attachment: Screen Shot 2018-01-24 at 14.40.48.png]
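>
> For comparison, the per-column path described above looks roughly like
> the following (blosc and the JSON sidecar are as described; the helper
> name and on-disk layout are only illustrative):
>
>     import json
>
>     import blosc
>     import numpy as np
>
>     def write_blosc(df, prefix):
>         # One blosc-compressed binary file per column ...
>         for name in df.columns:
>             arr = np.ascontiguousarray(df[name].values)
>             packed = blosc.compress(arr.tobytes(),
>                                     typesize=arr.dtype.itemsize)
>             with open("%s.%s.bin" % (prefix, name), "wb") as f:
>                 f.write(packed)
>         # ... plus a small accompanying JSON file with metadata.
>         meta = {"columns": list(df.columns),
>                 "dtypes": [str(t) for t in df.dtypes]}
>         with open("%s.meta.json" % prefix, "w") as f:
>             json.dump(meta, f)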
> 
> I was also able to corroborate this finding by looking at the number
> of allocated blocks:
>
> [Attachment: Screen Shot 2018-01-24 at 14.45.29.png]
> 
> From what I gather, Brotli and Snappy perform significantly better
> for wide DataFrames. However, the reverse is true for long DataFrames.
>
> The DataFrames used in the benchmark are entirely composed of floats,
> and my understanding is that type-specific encoding is employed in
> the Parquet file. Additionally, the compression codecs are applied to
> individual segments of the Parquet file.
>
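> For reference, a sketch of how the per-row-group column-chunk sizes
> and encodings can be inspected from the footer metadata (assuming
> pyarrow; the file name is illustrative):
>
>     import pyarrow.parquet as pq
>
>     meta = pq.ParquetFile("bench_snappy.parquet").metadata
>     for rg in range(meta.num_row_groups):
>         col = meta.row_group(rg).column(0)  # first column chunk only
>         print(rg, col.compression, col.encodings,
>               col.total_compressed_size, col.total_uncompressed_size)
>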
> I'd like to get a better understanding of this disk size disparity,
> specifically whether any additional encoding/compression headers are
> added to the Parquet file in the long DataFrames case.
>
> Kind Regards
> Simba
