Re: Issue with repartition and cache

2016-10-31 Thread ankits
Hi, Did you ever figure this one out? I'm seeing the same behavior: calling cache() after a repartition() makes Spark cache the version of the RDD from BEFORE the repartition, which means a shuffle every time it is accessed. However, calling cache() before the repartition() seems to work fine; the cach…
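
For anyone else hitting this, a minimal spark-shell sketch of the two orderings being compared; the input path and partition count are placeholders:

    val base = sc.textFile("data.txt")  // placeholder input

    // Ordering 1: repartition, then cache. The intent is that the
    // post-shuffle RDD is what ends up in memory.
    val a = base.repartition(8).cache()
    a.count()  // action that materializes the cache

    // Ordering 2: cache, then repartition. Only the pre-shuffle RDD
    // is marked for caching; repartition() returns a new, uncached
    // RDD, so its shuffle output is not stored.
    val b = base.cache().repartition(8)
    b.count()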

Why are all spark deps not shaded to avoid dependency hell?

2015-07-08 Thread ankits
I frequently encounter problems when building against Spark as a dependency in Java projects because of version conflicts with other dependencies. Usually there will be two different versions of a library on the classpath, and we'll see an AbstractMethodError, an invalid signature, etc. So far I've seen it happen with jackson, s…
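
Until the Spark artifacts shade these themselves, one workaround is to shade the conflicting library in your own build instead. A sketch for an sbt build using the sbt-assembly plugin; the jackson pattern and the "my.shaded" prefix here are just examples:

    // build.sbt (requires the sbt-assembly plugin in project/plugins.sbt)
    assemblyShadeRules in assembly := Seq(
      // Relocate our copy of jackson so it cannot clash with the
      // version Spark brings onto the classpath.
      ShadeRule.rename("com.fasterxml.jackson.**" -> "my.shaded.jackson.@1").inAll
    )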

Re: Get size of rdd in memory

2015-02-02 Thread ankits
Great, thank you very much. I was confused because this is in the docs (https://spark.apache.org/docs/1.2.0/sql-programming-guide.html) and on the "branch-1.2" branch (https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md): "Note that if you call schemaRDD.cache() rather tha…
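
For context, a spark-shell sketch (Spark 1.2-era API) of the two caching paths that docs note contrasts; the KV case class is an assumption, not something from the docs:

    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[Product] => SchemaRDD

    case class KV(k: Int, v: String)
    val kvs = sc.parallelize(1 to 1024).map(i => KV(i, i.toString))

    // Path 1: cacheTable stores the table in the in-memory columnar format.
    kvs.registerTempTable("kv")
    sqlContext.cacheTable("kv")

    // Path 2: per the branch-1.2 note quoted above, schemaRDD.cache()
    // cached the rows like any other RDD, without the columnar format.
    kvs.toSchemaRDD.cache()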

Re: Get size of rdd in memory

2015-02-02 Thread ankits
Thanks for your response. So AFAICT, calling parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.cache().count() will let me see the size of the SchemaRDD in memory, and parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count() will show me the size of a regular RDD. But…
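
Those two snippets, made runnable in a Spark 1.2 spark-shell; the KV case class is not shown in the thread, so this definition is an assumption:

    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    case class KV(k: Int, v: String)

    // SchemaRDD variant: count() is an action, so it populates the
    // cache and the size shows up under the web UI's Storage tab.
    sc.parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.cache().count()

    // Plain RDD variant, for comparison.
    sc.parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count()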

Get size of rdd in memory

2015-01-30 Thread ankits
Hi, I want to benchmark the memory savings from using in-memory columnar storage for SchemaRDDs (via cacheTable) versus caching the SchemaRDD directly. It would be really helpful to be able to query this from the spark-shell or from jobs directly. Could a dev point me to the way to do this? From what I…
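
One way to read the cached sizes back from the shell, rather than eyeballing the web UI's Storage tab, is sc.getRDDStorageInfo. It is marked @DeveloperApi, so treat this as a sketch against an unstable interface:

    // After forcing each cache with an action (e.g. count()), print
    // the per-RDD memory and disk footprint that Spark has recorded.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"RDD ${info.id} '${info.name}': " +
        s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
    }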