Have you considered using the MLlib SparseVector type instead (which is supported in Spark SQL)?
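Something along these lines should work in 1.4 (an untested sketch; the record shape, the 1,000-element vector size, and the particular non-zero indices are just placeholders for your data):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// assumes an existing SQLContext named sqlContext with its implicits imported
import sqlContext.implicits._

// each row of the wide table becomes an id plus a single sparse vector column
case class Record(id: Long, features: Vector)

val df = sc.parallelize(0L until 800000L).map { id =>
  // only the few dense columns carry values; the rest are implicit zeros
  Record(id, Vectors.sparse(1000, Array(3, 17), Array(1.0, 2.0)))
}.toDF()

df.cache()
df.count() // materializes the cache; only the non-zero indices and values are stored per row

One trade-off: SparseVector stores doubles rather than floats, and the absent entries have to be zeros rather than nulls.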
On Wed, Jun 24, 2015 at 1:31 PM, Nikita Dolgov <n...@beckon.com> wrote:

> When my 22M Parquet test file ended up taking 3G when cached in-memory I
> looked closer at how column compression works in 1.4.0. My test dataset was
> 1,000 columns * 800,000 rows of mostly empty floating point columns with a
> few dense long columns.
>
> I was surprised to see that no real
> "org.apache.spark.sql.columnar.compression.CompressionScheme" supports
> floating-point types and so conversion falls back to the no-op
> "PassThrough" implementation. In addition, the way
> "org.apache.spark.sql.columnar.NullableColumnBuilder" encodes null values
> (with four bytes for each of them) seems to be heavily biased against
> sparse data.
>
> It would be interesting to know if sparse floating point datasets were
> neglected for a reason other than some obscure historical accident. Is
> there anything in the Tungsten roadmap which would allow, for example,
> https://drill.apache.org/docs/value-vectors/ -style efficiency for this
> kind of data?
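For what it's worth, the blow-up you saw is roughly what that null encoding predicts. A back-of-envelope sketch, assuming ~4 bytes per recorded null position (as you describe for NullableColumnBuilder), pass-through 4-byte floats for the non-null cells, and a guessed 1% non-null density (the density is my assumption, not from your numbers):

val rows = 800000L
val cols = 1000L
val density = 0.01 // assumed fraction of non-null float cells

val cells = rows * cols
val nullCells = (cells * (1 - density)).toLong
val denseCells = cells - nullCells

// ~4 bytes per null position marker + 4 bytes per pass-through float value
val estimatedBytes = nullCells * 4L + denseCells * 4L
println(estimatedBytes / 1e9) // roughly 3.2, on the order of the 3G you observed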