Yeah, Tachyon does sound like a good option here. Especially if you have nested data, it's likely that Parquet in Tachyon will always be better supported.
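
Something like this might work (just a sketch, not something I've run against your setup - it assumes the Spark 1.2-era SQLContext/SchemaRDD API and a Tachyon master at tachyon-master:19998; all paths and table names below are placeholders, not your actual layout):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-on-tachyon"))
val sqlContext = new SQLContext(sc)

// Copy one date partition of the existing Parquet data into Tachyon.
// The files stay in Parquet form, so the columnar encoding/compression is
// preserved and the memory footprint stays close to the on-disk size.
val partition = sqlContext.parquetFile("hdfs:///warehouse/source_table/date=20141218")
partition.saveAsParquetFile("tachyon://tachyon-master:19998/cache/source_table/date=20141218")

// Query the Tachyon copy directly instead of caching a separate table.
val inTachyon = sqlContext.parquetFile("tachyon://tachyon-master:19998/cache/source_table/date=20141218")
inTachyon.registerTempTable("table_tachyon")
sqlContext.sql("SELECT COUNT(*) FROM table_tachyon").collect()

Reading straight off Tachyon memory like this keeps Parquet's compression without maintaining a second cached table per partition.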
On Fri, Dec 19, 2014 at 2:17 PM, Sadhan Sood <sadhan.s...@gmail.com> wrote:
>
> Hey Michael,
>
> Thank you for clarifying that. Is Tachyon the right way to get compressed
> data in memory, or should we explore the option of adding compression to
> cached data? This is because our uncompressed data set is too big to fit in
> memory right now. I see the benefit of Tachyon not just in storing
> compressed data in memory, but also in not having to create a separate
> table for caching some partitions, like 'cache table table_cached as
> select * from table where date = 201412XX' - the way we are doing right now.
>
>
> On Thu, Dec 18, 2014 at 6:46 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>>
>> There is only column-level encoding (run-length encoding, delta encoding,
>> dictionary encoding) and no generic compression.
>>
>> On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood <sadhan.s...@gmail.com>
>> wrote:
>>>
>>> Hi All,
>>>
>>> Wondering, when caching a table backed by lzo-compressed Parquet data,
>>> whether Spark also compresses it (using lzo/gzip/snappy) along with the
>>> column-level encoding, or only does the column-level encoding when
>>> "spark.sql.inMemoryColumnarStorage.compressed" is set to true. I ask
>>> because when I try to cache the data, I notice the memory being used is
>>> almost as much as the uncompressed size of the data.
>>>
>>> Thanks!
>>
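
For reference, the current approach from the quoted thread looks roughly like this (again only a sketch: Spark 1.2-era HiveContext, with source_table and the date value as placeholders; the compressed flag only enables the column-level encodings Michael mentioned, not generic lzo/gzip/snappy compression):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("cache-partition"))
val hiveContext = new HiveContext(sc)

// Enables RLE/delta/dictionary encoding for the in-memory columnar store;
// this is not generic compression of the cached bytes.
hiveContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

// Cache one date partition as a separate table, as described above.
hiveContext.sql("CACHE TABLE table_cached AS SELECT * FROM source_table WHERE date = 20141218")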