Hi Everyone,
Just an update on the above questions. I've updated the numbers in the Google
Sheet using data with less entropy here:
https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0
I've also got the benchmarking code. Although some of the data examples mi
Thanks all for the great feedback!
Thanks Daniel for the sample data sets. I loaded them up and they're quite
comparable in size to some of the data I'm dealing with. In my case the
shapes range from 150 to ~100 million rows. Column-wise, they range from 2-3
to ~500,000 columns.
Thanks Wes
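(For anyone curious, here is a rough sketch of how a disk-size benchmark could be
parameterised over shapes in that range. The shapes, column generator, and file
names below are my own illustrative assumptions, not the actual benchmark code.)

    import os
    import numpy as np
    import pandas as pd

    def make_frame(n_rows, n_cols):
        # Illustrative data: integers drawn from a small vocabulary, so the
        # frame has some repetition rather than being pure random noise.
        rng = np.random.default_rng(0)
        return pd.DataFrame({f"c{i}": rng.integers(0, 1000, size=n_rows)
                             for i in range(n_cols)})

    # A few shapes in the spirit of the ranges above, kept small so the sketch runs quickly.
    shapes = [(150, 3), (100_000, 50), (1_000_000, 10)]

    for n_rows, n_cols in shapes:
        df = make_frame(n_rows, n_cols)
        path = f"bench_{n_rows}x{n_cols}.parquet"
        df.to_parquet(path, compression="snappy")  # needs pyarrow or fastparquet installed
        print(n_rows, n_cols, os.path.getsize(path), "bytes")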
Here are some realistic tabular data sets...
https://github.com/lemire/RealisticTabularDataSets
They are small by modern standards but they are also one GitHub clone away.
- Daniel
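(If anyone wants to feed these into the benchmark, a rough sketch after cloning the
repo; I'm assuming the datasets are plain or gzipped CSVs, so the exact layout and
parse options should be checked against the repo itself.)

    # after: git clone https://github.com/lemire/RealisticTabularDataSets.git
    import glob
    import os
    import pandas as pd

    repo_dir = "RealisticTabularDataSets"  # path to the local clone (assumption)

    # pandas reads .csv and .csv.gz transparently
    for path in sorted(glob.glob(os.path.join(repo_dir, "**", "*.csv*"), recursive=True)):
        df = pd.read_csv(path)
        print(path, df.shape)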
On Wed, Jan 24, 2018 at 2:26 PM, Wes McKinney wrote:
Thanks Ted. I will echo these comments and recommend running tests on
larger and preferably "real" datasets rather than randomly generated
ones. The more repetition and less entropy in a dataset, the better
Parquet performs relative to other storage options. Web-scale datasets
often exhibit these characteristics.
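(To make that concrete, a small sketch of my own, not from this thread, comparing
the Parquet footprint of a high-entropy column against a low-entropy one of the
same length; the values and file names are just illustrative.)

    import os
    import numpy as np
    import pandas as pd

    n = 1_000_000
    rng = np.random.default_rng(0)

    # High entropy: essentially unique random floats.
    random_df = pd.DataFrame({"x": rng.random(n)})

    # Low entropy: the same row count drawn from a handful of repeated values,
    # which Parquet can dictionary- and run-length-encode very effectively.
    repeated_df = pd.DataFrame({"x": rng.choice(["red", "green", "blue", "yellow"], size=n)})

    for name, df in [("random", random_df), ("repeated", repeated_df)]:
        path = f"{name}.parquet"
        df.to_parquet(path, compression="snappy")
        print(name, os.path.getsize(path), "bytes")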
Simba,
Nice summary. I think that there may be some issues with your tests. In
particular, you are storing essentially uniform random values. That might
be a viable test in some situations, but there are many where there is
considerably less entropy in the data being stored. For instance, if you
store
Hi Uwe, thanks.
I've attached a Google Sheet link
https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0
Kind Regards
Simba
On Wed, 24 Jan 2018 at 15:07 Uwe L. Korn wrote:
Hello Simba,
Your plots did not come through. Try uploading them somewhere and link
to them in your emails. Attachments are always stripped on Apache
mailing lists.
Uwe
On Wed, Jan 24, 2018, at 1:48 PM, simba nyatsanga wrote:
Hi Everyone,
I did some benchmarking to compare on-disk size when writing
Pandas DataFrames to Parquet files using Snappy and Brotli compression. I
then compared these numbers with those of my current file storage solution.
In my current (non-Arrow+Parquet) solution, every column in
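(For anyone who wants to reproduce the comparison, a minimal sketch of the kind of
measurement described above; the DataFrame contents, engine, and file names are my
own assumptions rather than Simba's actual code.)

    import os
    import numpy as np
    import pandas as pd

    # Illustrative frame; the real data is much larger and has different dtypes.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "ints": rng.integers(0, 1000, size=1_000_000),
        "floats": rng.random(1_000_000),
        "labels": rng.choice(["a", "b", "c"], size=1_000_000),
    })

    for codec in ("snappy", "brotli"):
        path = f"frame_{codec}.parquet"
        df.to_parquet(path, engine="pyarrow", compression=codec)
        print(codec, os.path.getsize(path), "bytes")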