Re: initial basic question from new user

2014-06-12 Thread Gerard Maas
The goal of rdd.persist is to create a cached RDD that breaks the DAG lineage. Therefore, computations *in the same job* that use that RDD can reuse that intermediate result, but it's not meant to survive between job runs. For example: val baseData =
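A minimal sketch of the pattern Gerard describes, assuming Spark's Scala API circa 1.0; the input path, parsing, and aggregations are hypothetical placeholders, not code from the thread:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext("local[*]", "persist-example")

    // Parse once and persist, so later actions in this same job reuse the
    // cached partitions instead of re-reading and re-parsing the input.
    val baseData = sc.textFile("hdfs:///data/events.tsv")
      .map(_.split("\t"))
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Both computations below reuse the cached baseData; the cache does not
    // outlive this application.
    val countByDay  = baseData.map(r => (r(0), 1L)).reduceByKey(_ + _)
    val countByUser = baseData.map(r => (r(1), 1L)).reduceByKey(_ + _)
    println(countByDay.count() + " " + countByUser.count())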

Re: initial basic question from new user

2014-06-12 Thread Christopher Nguyen
Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want for your use case. As for Parquet support, that's newly arrived in Spark 1.0.0 together with SparkSQL so continue to watch this space. Gerard's suggestion to look at JobServer, which you can generalize as building a
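A small sketch of those two save calls, assuming a SparkContext sc is already available; the paths and the RDD contents are made-up placeholders:

    // Human-readable output, one element per line.
    val aggregates = sc.parallelize(Seq(("2014-06-12", 42L), ("2014-06-13", 17L)))
    aggregates.saveAsTextFile("hdfs:///output/agg-text")

    // Java-serialized SequenceFile output, meant to be read back with objectFile.
    aggregates.saveAsObjectFile("hdfs:///output/agg-object")

    // In a later job, reload the persisted aggregate:
    val reloaded = sc.objectFile[(String, Long)]("hdfs:///output/agg-object")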

Re: initial basic question from new user

2014-06-12 Thread FRANK AUSTIN NOTHAFT
RE: Given that our agg sizes will exceed memory, we expect to cache them to disk, so save-as-object (assuming there are no out-of-the-ordinary performance issues) may solve the problem, but I was hoping to store data in a column-oriented format. However, I think this in general is not possible
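For the column-oriented case, a hedged sketch of writing an aggregate as Parquet with the Spark 1.0 SQL API; the case class, paths, and the sc variable are assumptions, not from the thread:

    import org.apache.spark.sql.SQLContext

    // Hypothetical row type for the aggregate.
    case class AggRow(key: String, total: Long)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[AggRow] => SchemaRDD

    val aggRdd = sc.parallelize(Seq(AggRow("a", 10L), AggRow("b", 20L)))
    aggRdd.saveAsParquetFile("hdfs:///output/agg.parquet")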

Re: initial basic question from new user

2014-06-12 Thread Andre Schumacher
Hi, On 06/12/2014 05:47 PM, Toby Douglass wrote: In these future jobs, when I come to load the aggregated RDD, will Spark load and only load the columns being accessed by the query? Or will Spark load everything, convert it into an internal representation, and then execute the query?
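A sketch of reading that Parquet output back and querying only some columns, reusing the hypothetical AggRow layout above; since Parquet is columnar, a query touching only the total column should not need to read the others from disk:

    // Load the Parquet file as a SchemaRDD and register it for SQL queries.
    val parquetRdd = sqlContext.parquetFile("hdfs:///output/agg.parquet")
    parquetRdd.registerAsTable("agg")

    // Only the total column is referenced here.
    val totals = sqlContext.sql("SELECT SUM(total) FROM agg")
    totals.collect().foreach(println)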

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 4:48 PM, Andre Schumacher schum...@icsi.berkeley.edu wrote: On 06/12/2014 05:47 PM, Toby Douglass wrote: In these future jobs, when I come to load the aggregted RDD, will Spark load and only load the columns being accessed by the query? or will Spark load