Hi,
Did you ever figure this one out? I'm seeing the same behavior:
Calling cache() after a repartition() makes Spark cache the version of the
RDD BEFORE the repartition, which means a shuffle every time it is accessed.
However, calling cache() before the repartition() seems to work fine, the
cach
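For reference, roughly what the two orderings being compared look like in the shell (an untested sketch; the data and partition count are made up, and sc is the spark-shell's context):

  // Illustrative data; the partition count is arbitrary.
  val base = sc.parallelize(1 to 1000000)

  // Ordering A: repartition first, then mark the shuffled RDD for caching.
  val a = base.repartition(16).cache()
  a.count()   // first action runs the shuffle and populates the cache
  a.count()   // later actions are expected to hit the cached, shuffled data

  // Ordering B: mark the original RDD for caching, then repartition.
  // Here cache() applies to base, not to the shuffled output.
  val b = base.cache().repartition(16)
  b.count()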
I frequently encounter problems building Spark as a dependency in Java
projects because of version conflicts with other dependencies. Usually there
will be two different versions of a library and we'll see an
AbstractMethodError, an invalid signature, etc.
So far, I've seen it happen with jackson, s
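A common workaround (a sketch only, assuming an sbt build; the conflicting artifact and the library pulling it in are illustrative, not a diagnosis of this particular conflict) is to exclude the transitive copy so only one version ends up on the classpath:

  // build.sbt sketch: keep a single Jackson on the classpath by excluding
  // the copy pulled in transitively by another dependency (names illustrative).
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.2.0",
    ("com.example" % "some-other-lib" % "1.0")
      .exclude("com.fasterxml.jackson.core", "jackson-databind")
  )

The same exclusion can be expressed in a Maven pom with an <exclusions> block on the offending dependency.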
Great, thank you very much. I was confused because this is in the docs:
https://spark.apache.org/docs/1.2.0/sql-programming-guide.html, and on the
"branch-1.2" branch,
https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md
"Note that if you call schemaRDD.cache() rather tha
Thanks for your response. So AFAICT
calling parallelize(1 to 1024).map(i => KV(i,
i.toString)).toSchemaRDD.cache().count() will allow me to see the size of
the SchemaRDD in memory,
and parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count() will
show me the size of a regular RDD.
But
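One small thing that can make the comparison easier (a sketch, reusing the KV case class from the snippet above; setName just labels the cached entries so they are easy to tell apart when you look their sizes up in the Storage tab of the web UI or via the storage info API):

  // Same two experiments, with the RDDs named before caching so the
  // cached entries are distinguishable when checking their sizes.
  val columnar = sc.parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD
  columnar.setName("kv-schemardd").cache().count()

  val plain = sc.parallelize(1 to 1024).map(i => KV(i, i.toString))
  plain.setName("kv-plain").cache().count()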
Hi,
I want to benchmark the memory savings from using the in-memory columnar
storage for SchemaRDDs (via cacheTable) vs. caching the SchemaRDD directly.
It would be really helpful to be able to query this from the spark-shell or
from jobs directly. Could a dev point me to the way to do this? From what I
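Not sure what was eventually used here, but one way to read the cached sizes from the spark-shell or from a job is the storage info exposed on the SparkContext (a sketch; getRDDStorageInfo is a developer-level API, so treat the exact fields as subject to change across versions):

  // Print the in-memory footprint of every cached RDD as seen by the driver.
  sc.getRDDStorageInfo.foreach { info =>
    println(s"id=${info.id} name=${info.name} " +
            s"cachedPartitions=${info.numCachedPartitions} memSize=${info.memSize} bytes")
  }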