Re: does spark sql support columnar compression with encoding when caching tables

2014-12-22 Thread Sadhan Sood
Thanks Cheng, Michael - that was super helpful.

On Sun, Dec 21, 2014 at 7:27 AM, Cheng Lian lian.cs@gmail.com wrote:

 Would like to add that the compression schemes built into the in-memory columnar
 storage only support primitive columns (int, string, etc.); complex types
 like array, map, and struct are not supported.
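
To make that concrete, here is a minimal sketch assuming a Spark 1.2-style SQLContext; the table names, field names, and row counts are made up for illustration. Only the table whose columns are all primitive can benefit from the built-in encodings; the table with an array column is cached without them.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical schemas: one with only primitive columns, one with an array column.
    case class Flat(id: Int, country: String)
    case class Nested(id: Int, tags: Seq[String])

    object PrimitiveVsComplexCache {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cache-encoding-sketch"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD (Spark 1.x)

        sc.parallelize(1 to 100000).map(i => Flat(i, "US")).registerTempTable("flat")
        sc.parallelize(1 to 100000).map(i => Nested(i, Seq("a", "b"))).registerTempTable("nested")

        sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
        sqlContext.cacheTable("flat")    // int/string columns: RLE/dictionary/delta can apply
        sqlContext.cacheTable("nested")  // the array column is stored without those encodings

        // Caching is materialized on first use; compare the two tables' sizes
        // afterwards on the web UI's Storage tab.
        sqlContext.sql("SELECT COUNT(*) FROM flat").collect()
        sqlContext.sql("SELECT COUNT(*) FROM nested").collect()
      }
    }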


 On 12/20/14 6:17 AM, Sadhan Sood wrote:

  Hey Michael,

 Thank you for clarifying that. Is Tachyon the right way to get compressed
 data in memory, or should we explore adding compression to cached data? I ask
 because our uncompressed data set is too big to fit in memory right now. I see
 the benefit of Tachyon not just for storing compressed data in memory, but also
 because we wouldn't have to create a separate table to cache some partitions,
 e.g. 'cache table table_cached as select * from table where date = 201412XX',
 which is how we are doing it right now.


 On Thu, Dec 18, 2014 at 6:46 PM, Michael Armbrust mich...@databricks.com
 wrote:

 There is only column-level encoding (run-length encoding, delta encoding,
 dictionary encoding) and no generic compression.

 On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood sadhan.s...@gmail.com
 wrote:

 Hi All,

 Wondering whether, when caching a table backed by LZO-compressed Parquet data,
 Spark also compresses it (using lzo/gzip/snappy) along with the column-level
 encoding, or whether it only does the column-level encoding when
 *spark.sql.inMemoryColumnarStorage.compressed* is set to true. I ask because
 when I try to cache the data, I notice the memory being used is almost as much
 as the uncompressed size of the data.

  Thanks!





Re: does spark sql support columnar compression with encoding when caching tables

2014-12-19 Thread Sadhan Sood
Hey Michael,

Thank you for clarifying that. Is Tachyon the right way to get compressed
data in memory, or should we explore adding compression to cached data? I ask
because our uncompressed data set is too big to fit in memory right now. I see
the benefit of Tachyon not just for storing compressed data in memory, but also
because we wouldn't have to create a separate table to cache some partitions,
e.g. 'cache table table_cached as select * from table where date = 201412XX',
which is how we are doing it right now.
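
For concreteness, a minimal sketch of that workaround, assuming Spark 1.2 with a HiveContext and a partitioned source table in the metastore; the table names and the date literal are illustrative stand-ins (the '201412XX' above is a placeholder):

    // In spark-shell (Spark 1.2), where sc is predefined.
    import org.apache.spark.sql.hive.HiveContext
    val sqlContext = new HiveContext(sc)

    // Materialize just the partition of interest into a cached in-memory table.
    // 20141218 stands in for the '201412XX' placeholder used above.
    sqlContext.sql(
      "CACHE TABLE table_cached AS SELECT * FROM table WHERE date = 20141218")

    // Queries against table_cached now hit the in-memory columnar store.
    sqlContext.sql("SELECT COUNT(*) FROM table_cached").collect()

    // Drop the cached copy when that partition is no longer needed.
    sqlContext.sql("UNCACHE TABLE table_cached")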


On Thu, Dec 18, 2014 at 6:46 PM, Michael Armbrust mich...@databricks.com
wrote:

 There is only column-level encoding (run-length encoding, delta encoding,
 dictionary encoding) and no generic compression.

 On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood sadhan.s...@gmail.com
 wrote:

 Hi All,

 Wondering whether, when caching a table backed by LZO-compressed Parquet data,
 Spark also compresses it (using lzo/gzip/snappy) along with the column-level
 encoding, or whether it only does the column-level encoding when
 *spark.sql.inMemoryColumnarStorage.compressed* is set to true. I ask because
 when I try to cache the data, I notice the memory being used is almost as much
 as the uncompressed size of the data.

 Thanks!




Re: does spark sql support columnar compression with encoding when caching tables

2014-12-19 Thread Michael Armbrust
Yeah, Tachyon does sound like a good option here. Especially if you have
nested data, it's likely that Parquet in Tachyon will always be better
supported.
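
For reference, a rough sketch of that setup, assuming Spark 1.2, a Tachyon master at tachyon://master:19998 (a hypothetical address), and the Tachyon client on Spark's classpath; paths and table names are illustrative:

    // In spark-shell (Spark 1.2), where sc is predefined.
    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)

    // One-time step: copy the existing Parquet data into Tachyon so it is
    // served from memory. The data stays in Parquet format, so it remains
    // compressed and column-encoded rather than fully expanded.
    val events = sqlContext.parquetFile("hdfs:///warehouse/events")
    events.saveAsParquetFile("tachyon://master:19998/warehouse/events")

    // Query the Parquet data straight out of Tachyon; no separate
    // CACHE TABLE copy per partition is needed.
    val inMem = sqlContext.parquetFile("tachyon://master:19998/warehouse/events")
    inMem.registerTempTable("events_tachyon")
    sqlContext.sql("SELECT COUNT(*) FROM events_tachyon").collect()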

On Fri, Dec 19, 2014 at 2:17 PM, Sadhan Sood sadhan.s...@gmail.com wrote:

 Hey Michael,

 Thank you for clarifying that. Is Tachyon the right way to get compressed
 data in memory, or should we explore adding compression to cached data? I ask
 because our uncompressed data set is too big to fit in memory right now. I see
 the benefit of Tachyon not just for storing compressed data in memory, but also
 because we wouldn't have to create a separate table to cache some partitions,
 e.g. 'cache table table_cached as select * from table where date = 201412XX',
 which is how we are doing it right now.


 On Thu, Dec 18, 2014 at 6:46 PM, Michael Armbrust mich...@databricks.com
 wrote:

 There is only column-level encoding (run-length encoding, delta encoding,
 dictionary encoding) and no generic compression.

 On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood sadhan.s...@gmail.com
 wrote:

 Hi All,

 Wondering whether, when caching a table backed by LZO-compressed Parquet data,
 Spark also compresses it (using lzo/gzip/snappy) along with the column-level
 encoding, or whether it only does the column-level encoding when
 *spark.sql.inMemoryColumnarStorage.compressed* is set to true. I ask because
 when I try to cache the data, I notice the memory being used is almost as much
 as the uncompressed size of the data.

 Thanks!




Re: does spark sql support columnar compression with encoding when caching tables

2014-12-19 Thread Sadhan Sood
Thanks Michael, that makes sense.

On Fri, Dec 19, 2014 at 3:13 PM, Michael Armbrust mich...@databricks.com
wrote:

 Yeah, Tachyon does sound like a good option here. Especially if you have
 nested data, it's likely that Parquet in Tachyon will always be better
 supported.

 On Fri, Dec 19, 2014 at 2:17 PM, Sadhan Sood sadhan.s...@gmail.com
 wrote:

 Hey Michael,

 Thank you for clarifying that. Is Tachyon the right way to get compressed
 data in memory, or should we explore adding compression to cached data? I ask
 because our uncompressed data set is too big to fit in memory right now. I see
 the benefit of Tachyon not just for storing compressed data in memory, but also
 because we wouldn't have to create a separate table to cache some partitions,
 e.g. 'cache table table_cached as select * from table where date = 201412XX',
 which is how we are doing it right now.


 On Thu, Dec 18, 2014 at 6:46 PM, Michael Armbrust mich...@databricks.com
  wrote:

 There is only column-level encoding (run-length encoding, delta
 encoding, dictionary encoding) and no generic compression.

 On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood sadhan.s...@gmail.com
 wrote:

 Hi All,

 Wondering whether, when caching a table backed by LZO-compressed Parquet data,
 Spark also compresses it (using lzo/gzip/snappy) along with the column-level
 encoding, or whether it only does the column-level encoding when
 *spark.sql.inMemoryColumnarStorage.compressed* is set to true. I ask because
 when I try to cache the data, I notice the memory being used is almost as much
 as the uncompressed size of the data.

 Thanks!




does spark sql support columnar compression with encoding when caching tables

2014-12-18 Thread Sadhan Sood
Hi All,

Wondering whether, when caching a table backed by LZO-compressed Parquet data,
Spark also compresses it (using lzo/gzip/snappy) along with the column-level
encoding, or whether it only does the column-level encoding when
*spark.sql.inMemoryColumnarStorage.compressed* is set to true. I ask because
when I try to cache the data, I notice the memory being used is almost as much
as the uncompressed size of the data.

Thanks!
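
A minimal sketch of the setup being described, assuming Spark 1.2 and a made-up path to the LZO-compressed Parquet data; the cached table's in-memory size can then be compared to the on-disk size on the web UI's Storage tab:

    // In spark-shell (Spark 1.2), where sc is predefined.
    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)

    // Enables the column-level encodings (run-length, delta, dictionary) for the
    // in-memory columnar cache; per the replies in this thread it does not apply
    // a generic codec such as lzo/gzip/snappy on top.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

    val table = sqlContext.parquetFile("hdfs:///data/events_lzo_parquet")  // illustrative path
    table.registerTempTable("events")
    sqlContext.cacheTable("events")

    // Caching is materialized on first use: run a query, then check the
    // in-memory table's size on the web UI's Storage tab.
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()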


Re: does spark sql support columnar compression with encoding when caching tables

2014-12-18 Thread Michael Armbrust
There is only column-level encoding (run-length encoding, delta encoding,
dictionary encoding) and no generic compression.
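
As a toy illustration of why that distinction matters (this is not Spark's actual implementation): run-length encoding collapses a column with long runs of repeated values, but does essentially nothing for a high-cardinality column, which is consistent with a cached table staying close to its uncompressed size.

    // Toy run-length encoder, illustration only.
    def rle[A](column: Seq[A]): Seq[(A, Int)] = {
      val runs = scala.collection.mutable.ArrayBuffer.empty[(A, Int)]
      for (x <- column) {
        if (runs.nonEmpty && runs.last._1 == x) {
          val (v, n) = runs.remove(runs.length - 1)
          runs += ((v, n + 1))
        } else {
          runs += ((x, 1))
        }
      }
      runs.toSeq
    }

    println(rle(Seq.fill(1000)("US")).size)         // 1 run: compresses very well
    println(rle((1 to 1000).map(_.toString)).size)  // 1000 runs: no saving at all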

On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood sadhan.s...@gmail.com wrote:

 Hi All,

 Wondering whether, when caching a table backed by LZO-compressed Parquet data,
 Spark also compresses it (using lzo/gzip/snappy) along with the column-level
 encoding, or whether it only does the column-level encoding when
 *spark.sql.inMemoryColumnarStorage.compressed* is set to true. I ask because
 when I try to cache the data, I notice the memory being used is almost as much
 as the uncompressed size of the data.

 Thanks!