Re: Caching tables at column level
Thanks - we have tried this and it works nicely.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Caching-tables-at-column-level-tp10377p10618.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Caching tables at column level
I have been working a lot recently with denormalised tables with lots of columns, nearly 600. We are using this form to avoid joins.

I have tried to use CACHE TABLE with this data, but it proves too expensive, as it seems to try to cache all the data in the table. For data sets such as the one I am using, certain columns are hot (referenced frequently in queries), while others are used very infrequently. Therefore it would be great if caches could be column based. I realise that this may not be optimal for all use cases, but I think it could be quite a common need.

Has something like this been considered?

Thanks
Mick
Re: Caching tables at column level
It's not completely transparent, but you can do something like the following today:

    CACHE TABLE hotData AS SELECT columns, I, care, about FROM fullTable

On Sun, Feb 1, 2015 at 3:03 AM, Mick Davies michael.belldav...@gmail.com wrote:
> I have been working a lot recently with denormalised tables with lots of columns, nearly 600. We are using this form to avoid joins. I have tried to use cache table with this data, but it proves too expensive as it seems to try to cache all the data in the table. For data sets such as the one I am using you find that certain columns will be hot, referenced frequently in queries, others will be used very infrequently. Therefore it would be great if caches could be column based. I realise that this may not be optimal for all use cases, but I think it could be quite a common need. Has something like this been considered?
>
> Thanks
> Mick
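
The CACHE TABLE ... AS SELECT statement suggested in the reply above can be generated from a list of hot columns. This is a minimal sketch, not something from the thread: the helper names and the column names (customer_id, order_date, amount) are hypothetical, and the table names are the placeholders used in the reply. The code only builds the SQL text. Running it means passing the string to whatever SQL entry point your Spark version provides. Queries would presumably have to reference hotData rather than fullTable to benefit from the cache, which is the sense in which the approach is "not completely transparent".

```python
def cache_hot_columns_sql(source_table, cache_name, hot_columns):
    """Build a CACHE TABLE ... AS SELECT statement covering only the hot columns."""
    if not hot_columns:
        raise ValueError("at least one hot column is required")
    column_list = ", ".join(hot_columns)
    return "CACHE TABLE {} AS SELECT {} FROM {}".format(
        cache_name, column_list, source_table
    )


def uncache_sql(cache_name):
    """Build the statement that releases the cached subset again."""
    return "UNCACHE TABLE {}".format(cache_name)


# Hypothetical hot columns picked out of the ~600-column table.
statement = cache_hot_columns_sql(
    "fullTable", "hotData", ["customer_id", "order_date", "amount"]
)
print(statement)
print(uncache_sql("hotData"))
```

Keeping the hot-column list in one place makes it easy to rebuild the cached subset when the set of frequently queried columns changes.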