Have you considered using the mllib SparseVector type instead (which is
supported in Spark SQL)?
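
A minimal, untested sketch of that idea, assuming a Spark 1.4 spark-shell
(where sqlContext is predefined); the column names are placeholders:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import sqlContext.implicits._

    // Store the 1,000 mostly-empty float columns as a single SparseVector
    // column: only the non-zero (index, value) pairs are kept per row.
    val rows: Seq[(Long, Vector)] = Seq(
      (1L, Vectors.sparse(1000, Array(3, 17), Array(0.5, 2.0))),
      (2L, Vectors.sparse(1000, Array(42), Array(1.25)))
    )

    val df = rows.toDF("id", "features")
    df.cache()
    df.count()  // materialize the cache, then check its size on the Storage tab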

On Wed, Jun 24, 2015 at 1:31 PM, Nikita Dolgov <n...@beckon.com> wrote:

> When my 22 MB Parquet test file ended up taking 3 GB when cached in memory, I
> looked closer at how column compression works in 1.4.0. My test dataset was
> 1,000 columns * 800,000 rows of mostly empty floating-point columns with a
> few dense long columns.
>
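[For reference, a rough sketch of how that comparison can be made in 1.4;
the Parquet path below is only a placeholder:]

    // Read the wide, mostly-null Parquet file and cache it.
    val df = sqlContext.read.parquet("/tmp/wide_sparse_test.parquet")
    df.cache()
    df.count()  // forces the in-memory columnar cache to be built
    // Compare the on-disk file size with the in-memory size reported on
    // the web UI's Storage tab.
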
> I was surprised to see that no real
> "org.apache.spark.sql.columnar.compression.CompressionScheme" implementation
> supports floating-point types, so the conversion falls back to the no-op
> "PassThrough" scheme. In addition, the way
> "org.apache.spark.sql.columnar.NullableColumnBuilder" encodes null values
> (four bytes for each of them) seems heavily biased against sparse data.
>
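[A back-of-the-envelope estimate of that overhead; the 99% null ratio below
is an assumption, since the exact sparsity isn't stated above:]

    // Roughly 99% of 1,000 * 800,000 float cells assumed null, with a
    // 4-byte position stored per null in NullableColumnBuilder's encoding.
    val rows      = 800000L
    val cols      = 1000L
    val nullRatio = 0.99
    val nullCells = (rows * cols * nullRatio).toLong
    val nullBytes = nullCells * 4L
    println(f"~${nullBytes / 1e9}%.2f GB for null positions alone")  // ~3.17 GB
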
> It would be interesting to know whether sparse floating-point datasets were
> neglected for a reason other than some obscure historical accident. Is there
> anything on the Tungsten roadmap that would allow value-vector-style
> efficiency (https://drill.apache.org/docs/value-vectors/) for this kind of
> data?
>
