Thanks Ted. I will echo these comments and recommend running the tests on larger and preferably "real" datasets rather than randomly generated ones. The more repetition and the less entropy in a dataset, the better Parquet performs relative to other storage options. Web-scale datasets often exhibit these characteristics.
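To make the point concrete, here is a rough sketch of the effect (assuming pandas with the pyarrow engine installed; exact sizes will vary with versions and settings). It writes one low-entropy column and one uniform-random column to Parquet with Snappy and compares the resulting file sizes:

import os
import numpy as np
import pandas as pd

n = 1000000

# Low-entropy column: a small set of values repeated many times,
# similar to strongly correlated measurements.
repetitive = pd.DataFrame(
    {"value": np.tile(np.arange(1000, dtype="float64"), n // 1000)})

# High-entropy column: uniform random floats, essentially incompressible.
random_vals = pd.DataFrame({"value": np.random.uniform(size=n)})

repetitive.to_parquet("repetitive.parquet", compression="snappy")
random_vals.to_parquet("random.parquet", compression="snappy")

print("repetitive:", os.path.getsize("repetitive.parquet"), "bytes")
print("random:    ", os.path.getsize("random.parquet"), "bytes")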
If you can publish your benchmarking code, that would also be helpful!

best
Wes

On Wed, Jan 24, 2018 at 1:21 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Simba
>
> Nice summary. I think that there may be some issues with your tests. In particular, you are storing essentially uniform random values. That might be a viable test in some situations, but there are many where there is considerably less entropy in the data being stored. For instance, if you store measurements, it is very typical to have very strong correlations. Likewise if the rows are, say, the time evolution of an optimization. You also have a very small number of rows, which can penalize systems that expect to amortize column metadata over more data.
>
> This test might match your situation, but I would be leery of drawing overly broad conclusions from this single data point.
>
>
> On Jan 24, 2018 5:44 AM, "simba nyatsanga" <simnyatsa...@gmail.com> wrote:
>
>> Hi Uwe, thanks.
>>
>> I've attached a Google Sheet link
>>
>> https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0
>>
>> Kind Regards
>> Simba
>>
>> On Wed, 24 Jan 2018 at 15:07 Uwe L. Korn <uw...@xhochy.com> wrote:
>>
>> > Hello Simba,
>> >
>> > your plots did not come through. Try uploading them somewhere and link to them in the mails. Attachments are always stripped on Apache mailing lists.
>> > Uwe
>> >
>> >
>> > On Wed, Jan 24, 2018, at 1:48 PM, simba nyatsanga wrote:
>> > > Hi Everyone,
>> > >
>> > > I did some benchmarking to compare the disk size performance when writing Pandas DataFrames to Parquet files using Snappy and Brotli compression. I then compared these numbers with those of my current file storage solution.
>> > >
>> > > In my current (non Arrow+Parquet) solution, every column in a DataFrame is extracted as a NumPy array, then compressed with blosc and stored as a binary file. Additionally, there's a small accompanying json file with some metadata. Attached are my results for several long and wide DataFrames:
>> > >
>> > > Screen Shot 2018-01-24 at 14.40.48.png
>> > >
>> > > I was also able to corroborate this finding by looking at the number of allocated blocks:
>> > >
>> > > Screen Shot 2018-01-24 at 14.45.29.png
>> > >
>> > > From what I gather, Brotli and Snappy perform significantly better for wide DataFrames. However, the reverse is true for long DataFrames.
>> > >
>> > > The DataFrames used in the benchmark are entirely composed of floats, and my understanding is that there's type-specific encoding employed in the Parquet file. Additionally, the compression codecs are applied to individual segments of the Parquet file.
>> > >
>> > > I'd like to get a better understanding of this disk size disparity, specifically whether there are any additional encoding/compression headers added to the Parquet file in the long-DataFrame case.
>> > >
>> > > Kind Regards
>> > > Simba
>> >
>> >
>>
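As a starting point for publishing the benchmark, a self-contained script along the lines of the comparison described above could be as small as the sketch below (an illustration only, assuming pandas with the pyarrow engine; the "long" and "wide" shapes are placeholders, not the ones from the spreadsheet):

import os
import numpy as np
import pandas as pd

def parquet_size(df, codec, path="bench.parquet"):
    # Write the frame with the given codec and report the on-disk size.
    df.to_parquet(path, compression=codec)
    return os.path.getsize(path)

# Placeholder shapes: a "long" frame (many rows, few columns) and a
# "wide" frame (few rows, many columns), both uniform random floats.
long_df = pd.DataFrame(np.random.uniform(size=(100000, 10)),
                       columns=["c%d" % i for i in range(10)])
wide_df = pd.DataFrame(np.random.uniform(size=(100, 10000)),
                       columns=["c%d" % i for i in range(10000)])

for name, df in [("long", long_df), ("wide", wide_df)]:
    for codec in ("snappy", "brotli"):
        print(name, codec, parquet_size(df, codec), "bytes")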