When my 22M Parquet test file ended up taking 3G when cached in memory, I
looked closer at how column compression works in 1.4.0. My test dataset was
1,000 columns * 800,000 rows: mostly empty floating-point columns with a
few dense long columns.
I was surprised to see that no real compression was applied to the cached columns.
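Roughly, this is how I measured it in the 1.4.0 spark-shell (the Parquet path is a placeholder and sc/sqlContext are the shell's built-ins):

  // Hypothetical reproduction; "/tmp/test.parquet" stands in for the real file.
  val df = sqlContext.read.parquet("/tmp/test.parquet")   // ~22M on disk
  df.cache()
  df.count()                                               // force the in-memory columnar cache to build

  // Total memory used by cached RDDs, which is where the ~3G figure comes from.
  println(sc.getRDDStorageInfo.map(_.memSize).sum / (1024 * 1024) + " MB")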
Have you considered instead using the MLlib SparseVector type (which is
supported in Spark SQL)?
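Roughly, the idea would be something like the sketch below, folding the mostly-empty float columns into one sparse vector per row (the class and column names are made up, not from your dataset; sqlContext is the shell's built-in):

  import org.apache.spark.mllib.linalg.{Vector, Vectors}

  // Hypothetical schema: keep the dense long columns as regular columns and
  // pack the ~1,000 mostly-empty float columns into a single SparseVector,
  // so only non-zero entries are stored.
  case class Record(id: Long, features: Vector)

  val row = Record(1L, Vectors.sparse(1000, Seq((3, 0.5), (917, 1.2)))) // size, (index, value) pairs
  val df = sqlContext.createDataFrame(Seq(row)) // the vector is stored via its UDT
  df.cache()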
On Wed, Jun 24, 2015 at 1:31 PM, Nikita Dolgov n...@beckon.com wrote: