The goal of rdd.persist is to create a cached RDD that short-circuits the DAG
lineage: computations *in the same job* that use that RDD can reuse the
intermediate result instead of recomputing it, but it is not meant to survive
between job runs.
For example (assuming the shell's `sc`; the input path is illustrative):

    import org.apache.spark.storage.StorageLevel
    val baseData = sc.textFile("hdfs:///data/events").persist(StorageLevel.MEMORY_AND_DISK)
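Later actions in the same application then reuse the cached partitions instead
of recomputing from the input. Continuing the sketch above (the operations
themselves are illustrative):

    val total    = baseData.count()                    // first action: computes and caches
    val nonEmpty = baseData.filter(_.nonEmpty).count() // served from the cached partitions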
Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want
for your use case. As for Parquet support, that has newly arrived in Spark
1.0.0 together with SparkSQL, so continue to watch this space.
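A sketch of the save-and-reload round trip with #saveAsObjectFile() -- the
aggregate computation and the output path here are stand-ins, not from the
thread:

    import org.apache.spark.SparkContext._   // brings in pair-RDD operations

    val agg = baseData.map(line => (line.split(",")(0), 1L)).reduceByKey(_ + _)
    agg.saveAsObjectFile("hdfs:///agg/daily")

    // A later, separate job reads it back. Note that whole records are
    // deserialized on load -- object files are row-oriented, not columnar.
    val reloaded = sc.objectFile[(String, Long)]("hdfs:///agg/daily")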
Gerard's suggestion to look at JobServer, which you can generalize as
building a long-running service that keeps a single SparkContext (and with it
the cached RDDs) alive across jobs, is also worth considering.
RE:
Given that our agg sizes will exceed memory, we expect to cache them to
disk, so save-as-object (assuming there are no out-of-the-ordinary
performance issues) may solve the problem, but I was hoping to store the
data in a column-oriented format. However, I think this is in general not
possible.
Hi,
On 06/12/2014 05:47 PM, Toby Douglass wrote:
In these future jobs, when I come to load the aggregated RDD, will Spark
load only the columns being accessed by the query? Or will it load
everything, convert it into an internal representation, and then execute
the query?
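Concretely, the scenario being asked about would look roughly like this with
the new SparkSQL Parquet support -- a sketch against the 1.0.0 API, where the
schema, paths, and query are all illustrative:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    // Illustrative schema for the pre-computed aggregate.
    case class Agg(key: String, total: Long)

    // Job 1: write the aggregate in a columnar format.
    val agg = sc.parallelize(Seq(Agg("a", 1L), Agg("b", 2L)))
    agg.saveAsParquetFile("hdfs:///agg/daily.parquet")

    // Job 2 (later): load the Parquet files and query one column.
    val loaded = sqlContext.parquetFile("hdfs:///agg/daily.parquet")
    loaded.registerAsTable("agg")
    val sum = sqlContext.sql("SELECT SUM(total) FROM agg").collect()

Because Parquet lays the data out column by column, a query that touches only
`total` can skip the other columns' chunks on disk rather than deserializing
whole rows.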