Hi Everyone,

I did some benchmarking to compare on-disk size when writing Pandas
DataFrames to Parquet files using Snappy and Brotli compression. I then
compared these numbers with those of my current file storage solution.
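
The Parquet side of the benchmark boils down to something like the
following (a minimal sketch; the shapes and the make_frame helper are
placeholders rather than my exact setup, and pyarrow is assumed as the
engine):

import os

import numpy as np
import pandas as pd


def make_frame(n_rows, n_cols):
    # All-float DataFrame, matching the benchmark data.
    data = np.random.random((n_rows, n_cols))
    return pd.DataFrame(data, columns=["c%d" % i for i in range(n_cols)])


def parquet_size(df, codec):
    # Write with the given codec and report the file size in bytes.
    path = "/tmp/bench_%s.parquet" % codec
    df.to_parquet(path, engine="pyarrow", compression=codec)
    return os.path.getsize(path)


for n_rows, n_cols in [(1000000, 10), (1000, 10000)]:  # long vs. wide
    df = make_frame(n_rows, n_cols)
    for codec in ["snappy", "brotli"]:
        print(n_rows, n_cols, codec, parquet_size(df, codec))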

In my current (non Arrow+Parquet) solution, every column in a DataFrame
is extracted as a NumPy array, then compressed with Blosc and stored as
a binary file. Additionally, there is a small accompanying JSON file
with some metadata (a simplified sketch follows the first screenshot
below). Attached are my results for several long and wide DataFrames:

[image: Screen Shot 2018-01-24 at 14.40.48.png]
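
For completeness, the current per-column Blosc path is roughly the
following (a simplified sketch using Blosc's defaults and a made-up
file layout; the real code differs in the details):

import json
import os

import blosc
import numpy as np


def store_frame(df, directory):
    # One Blosc-compressed binary file per column, plus a small JSON
    # metadata file describing column names and dtypes.
    os.makedirs(directory, exist_ok=True)
    meta = {"columns": list(df.columns),
            "dtypes": [str(dt) for dt in df.dtypes]}
    for col in df.columns:
        arr = np.ascontiguousarray(df[col].values)
        payload = blosc.compress(arr.tobytes(), typesize=arr.dtype.itemsize)
        with open(os.path.join(directory, "%s.blosc" % col), "wb") as f:
            f.write(payload)
    with open(os.path.join(directory, "meta.json"), "w") as f:
        json.dump(meta, f)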

I was also able to corroborate this finding by looking at the number of
allocated blocks:

[image: Screen Shot 2018-01-24 at 14.45.29.png]

From what I gather, Brotli and Snappy perform significantly better for
wide DataFrames. However, the reverse is true for long DataFrames.

The DataFrames used in the benchmark are composed entirely of floats,
and my understanding is that type-specific encoding is employed in the
Parquet file. Additionally, the compression codecs are applied to
individual segments of the Parquet file.
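
If it helps the discussion, the per-column-chunk encodings and
compressed/uncompressed sizes can be read back from the file metadata
with pyarrow, along these lines (the path is a placeholder):

import pyarrow.parquet as pq

# Placeholder path; point this at one of the benchmark files.
meta = pq.ParquetFile("/tmp/bench_brotli.parquet").metadata
print("row groups:", meta.num_row_groups)
for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for i in range(row_group.num_columns):
        chunk = row_group.column(i)
        # Encodings, codec, and on-disk vs. in-memory size per chunk.
        print(chunk.path_in_schema, chunk.encodings, chunk.compression,
              chunk.total_compressed_size, chunk.total_uncompressed_size)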

I'd like to get a better understanding of this disk size disparity,
specifically whether there are any additional encoding/compression
headers added to the Parquet file in the long DataFrames case.

Kind Regards
Simba
