Hi Everyone,
Just an update on the above questions. I've updated the numbers in the Google
Sheet using data with less entropy here:
https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0
I've also got the benchmarking code. Although some of the data examples mi
Thanks all for the great feedback!
Thanks Daniel for the sample data sets. I loaded them up and they're quite
comparable in size to some of the data I'm dealing with. In my case the
shapes range from 150 to ~100 million rows. Column-wise, they range from 2-3
to ~500,000 columns.
Thanks Wes
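(For anyone curious, here is a rough sketch of how a disk-size benchmark could be
parameterised over shapes in that range. The shapes, column generator, and file
names below are my own illustrative assumptions, not the actual benchmark code.)

    import os
    import numpy as np
    import pandas as pd

    def make_frame(n_rows, n_cols):
        # Illustrative data: integers drawn from a small vocabulary, so the
        # frame has some repetition rather than being pure random noise.
        rng = np.random.default_rng(0)
        return pd.DataFrame({f"c{i}": rng.integers(0, 1000, size=n_rows)
                             for i in range(n_cols)})

    # A few shapes in the spirit of the ranges above, kept small so the sketch runs quickly.
    shapes = [(150, 3), (100_000, 50), (1_000_000, 10)]

    for n_rows, n_cols in shapes:
        df = make_frame(n_rows, n_cols)
        path = f"bench_{n_rows}x{n_cols}.parquet"
        df.to_parquet(path, compression="snappy")  # needs pyarrow or fastparquet installed
        print(n_rows, n_cols, os.path.getsize(path), "bytes")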
Here are some realistic tabular data sets...
https://github.com/lemire/RealisticTabularDataSets
They are small by modern standards but they are also one GitHub clone away.
- Daniel
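(If anyone wants to feed these into the benchmark, a rough sketch after cloning the
repo; I'm assuming the datasets are plain or gzipped CSVs, so the exact layout and
parse options should be checked against the repo itself.)

    # after: git clone https://github.com/lemire/RealisticTabularDataSets.git
    import glob
    import os
    import pandas as pd

    repo_dir = "RealisticTabularDataSets"  # path to the local clone (assumption)

    # pandas reads .csv and .csv.gz transparently
    for path in sorted(glob.glob(os.path.join(repo_dir, "**", "*.csv*"), recursive=True)):
        df = pd.read_csv(path)
        print(path, df.shape)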
On Wed, Jan 24, 2018 at 2:26 PM, Wes McKinney wrote:
Thanks Ted. I will echo these comments and recommend running tests on
larger and preferably "real" datasets rather than randomly generated
ones. The more repetition and less entropy in a dataset, the better
Parquet performs relative to other storage options. Web-scale datasets
often exhibit these characteristics.
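(To make that concrete, a small sketch of my own, not from this thread, comparing
the Parquet footprint of a high-entropy column against a low-entropy one of the
same length; the values and file names are just illustrative.)

    import os
    import numpy as np
    import pandas as pd

    n = 1_000_000
    rng = np.random.default_rng(0)

    # High entropy: essentially unique random floats.
    random_df = pd.DataFrame({"x": rng.random(n)})

    # Low entropy: the same row count drawn from a handful of repeated values,
    # which Parquet can dictionary- and run-length-encode very effectively.
    repeated_df = pd.DataFrame({"x": rng.choice(["red", "green", "blue", "yellow"], size=n)})

    for name, df in [("random", random_df), ("repeated", repeated_df)]:
        path = f"{name}.parquet"
        df.to_parquet(path, compression="snappy")
        print(name, os.path.getsize(path), "bytes")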
Simba,
Nice summary. I think that there may be some issues with your tests. In
particular, you are storing essentially uniform random values. That might
be a viable test in some situations, but there are many where there is
considerably less entropy in the data being stored. For instance, if you
store
Hi Uwe, thanks.
I've attached a Google Sheet link
https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0
Kind Regards
Simba
On Wed, 24 Jan 2018 at 15:07 Uwe L. Korn wrote:
Hello Simba,
Your plots did not come through. Try uploading them somewhere and link
to them in your emails. Attachments are always stripped on Apache
mailing lists.
Uwe
On Wed, Jan 24, 2018, at 1:48 PM, simba nyatsanga wrote:
Hi Everyone,
I did some benchmarking to compare on-disk size when writing
Pandas DataFrames to Parquet files using Snappy and Brotli compression. I
then compared these numbers with those of my current file storage solution.
In my current (non-Arrow+Parquet) solution, every column in
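(For anyone who wants to reproduce the comparison, a minimal sketch of the kind of
measurement described above; the DataFrame contents, engine, and file names are my
own assumptions rather than Simba's actual code.)

    import os
    import numpy as np
    import pandas as pd

    # Illustrative frame; the real data is much larger and has different dtypes.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "ints": rng.integers(0, 1000, size=1_000_000),
        "floats": rng.random(1_000_000),
        "labels": rng.choice(["a", "b", "c"], size=1_000_000),
    })

    for codec in ("snappy", "brotli"):
        path = f"frame_{codec}.parquet"
        df.to_parquet(path, engine="pyarrow", compression=codec)
        print(codec, os.path.getsize(path), "bytes")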