Re: Caching tables at column level

2015-02-13 Thread Mick Davies
Thanks - we have tried this and it works nicely.






Caching tables at column level

2015-02-01 Thread Mick Davies
I have been working a lot recently with denormalised tables with a lot of
columns, nearly 600. We are using this structure to avoid joins.

I have tried CACHE TABLE with this data, but it proves too expensive, as it
seems to cache all of the data in the table.
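
For concreteness, here is roughly what we tried, as a minimal Scala sketch
(the SQLContext and table names are made up):

    // Assuming an existing SQLContext named sqlContext and a registered
    // temporary table named fullTable with nearly 600 columns:
    sqlContext.cacheTable("fullTable")                 // marks the whole table for caching
    sqlContext.sql("SELECT * FROM fullTable").count()  // first action materialises every column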

For data sets such as the one I am using, you find that certain columns will
be hot (referenced frequently in queries), while others will be used very
infrequently.

Therefore it would be great if caches could be column-based. I realise that
this may not be optimal for all use cases, but I think it could be quite a
common need. Has something like this been considered?
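
Purely to illustrate the idea, something like the following is what I have in
mind (this overload of cacheTable does not exist today, it is entirely
hypothetical):

    // Hypothetical column-level caching API, for illustration only:
    sqlContext.cacheTable("fullTable", columns = Seq("hotColumn1", "hotColumn2"))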

Thanks Mick






Re: Caching tables at column level

2015-02-01 Thread Michael Armbrust
It's not completely transparent, but you can do something like the following
today:

CACHE TABLE hotData AS SELECT columns, I, care, about FROM fullTable
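
The programmatic route is roughly equivalent, sketched here in Scala assuming
a SQLContext named sqlContext (if I remember right, SQL CACHE TABLE is eager
as of 1.2, while cacheTable below is lazy until the first action):

    // Register the hot-column projection and cache it
    // (column names are the placeholder ones from the statement above):
    val hotData = sqlContext.sql("SELECT columns, I, care, about FROM fullTable")
    hotData.registerTempTable("hotData")
    sqlContext.cacheTable("hotData")  // lazy: materialised on first action
    hotData.count()                   // force materialisation now

Subsequent queries against hotData then read only the cached columns.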

On Sun, Feb 1, 2015 at 3:03 AM, Mick Davies michael.belldav...@gmail.com
wrote:

 [snip]
 Therefore it would be great if caches could be column-based. I realise that
 this may not be optimal for all use cases, but I think it could be quite a
 common need. Has something like this been considered?