[sparksql] sparse floating point data compression in sparksql cache

2015-06-24 Thread Nikita Dolgov
When my 22M Parquet test file ended up taking 3G when cached in memory, I looked closer at how column compression works in 1.4.0. My test dataset was 1,000 columns * 800,000 rows of mostly empty floating point columns with a few dense long columns. I was surprised to see that no real
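
[A minimal sketch of the scenario described above, using the Spark 1.4-era API. The file name "events.parquet" and the SparkConf setup are illustrative assumptions, not details from the original post.]

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CacheSizeRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-size-repro"))
    val sqlContext = new SQLContext(sc)

    // Parquet's encodings keep the mostly-empty float columns small on disk.
    val df = sqlContext.read.parquet("events.parquet")

    // Pull the data into Spark SQL's in-memory columnar cache; the cached
    // size shows up under the "Storage" tab of the Spark UI.
    df.cache()
    df.count() // force materialization of the cache
  }
}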

Re: [sparksql] sparse floating point data compression in sparksql cache

2015-06-24 Thread Michael Armbrust
Have you considered instead using the mllib SparseVector type (which is supported in Spark SQL)? On Wed, Jun 24, 2015 at 1:31 PM, Nikita Dolgov n...@beckon.com wrote: When my 22M Parquet test file ended up taking 3G when cached in memory, I looked closer at how column compression works in
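
[A hedged sketch of that suggestion: pack the ~1,000 mostly-empty float columns into a single mllib SparseVector column, which Spark SQL can store through its vector UDT. The schema, row count, and non-zero entries below are illustrative assumptions, not the poster's actual data.]

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.SQLContext

// One row: the sparse float columns collapsed into a single vector field.
case class SparseRow(id: Long, features: Vector)

object SparseVectorCache {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sparse-vector-cache"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // 1,000 logical float columns, only a few populated per row,
    // stored as (index, value) pairs instead of 1,000 dense cells.
    val rows = sc.parallelize(0L until 800000L).map { id =>
      SparseRow(id, Vectors.sparse(1000, Seq((3, 1.5), (417, 2.0))))
    }

    val df = rows.toDF()
    df.cache()
    df.count() // materialize; only the non-zero entries of each vector are kept
  }
}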