Re: cached data between jobs

2015-09-02 Thread Eric Walker
Hi Jeff, I think I see what you're saying. I was thinking more of a whole Spark job, where `spark-submit` is run once to completion and then started up again, rather than a "job" as seen in the Spark UI. I take it there is no implicit caching of results between `spark-submit` runs. (In the …
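That matches how Spark works: cached RDDs and shuffle files do not outlive the driver JVM, so reuse across separate `spark-submit` runs has to be explicit. A minimal sketch of one way to do that (the HDFS path is hypothetical):

```scala
// Run 1: compute once and write the result out; in-memory caching
// does not survive the driver JVM, so persist to storage instead.
val rdd1 = sc.parallelize(1 to 10).map(e => (e, e))
rdd1.groupByKey().saveAsObjectFile("hdfs:///tmp/rdd2-snapshot") // hypothetical path

// Run 2 (a separate spark-submit): reload instead of recomputing.
val rdd2 = sc.objectFile[(Int, Iterable[Int])]("hdfs:///tmp/rdd2-snapshot")
```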

Re: cached data between jobs

2015-09-01 Thread Jeff Zhang
Hi Eric, if the two jobs share the same parent stages, those stages can be skipped for the second job. Here's one simple example:

val rdd1 = sc.parallelize(1 to 10).map(e => (e, e))
val rdd2 = rdd1.groupByKey()
rdd2.map(e => e._1).collect().foreach(println)
rdd2.map(e => (e._1, e._2.size)).collect()
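A side note not from the original thread: the shuffle dependency that makes the skip possible is visible in the RDD lineage, which the standard `toDebugString` method prints. After the first action runs, the shuffle map output it produced is what lets the second job mark the parent stages as "skipped" in the UI.

```scala
// Prints rdd2's lineage, including the ShuffledRDD whose map output
// is reused by any later job that depends on the same shuffle.
println(rdd2.toDebugString)
```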

cached data between jobs

2015-09-01 Thread Eric Walker
Hi, I'm noticing that a 30-minute job that was IO-bound on its first run may no longer be IO-bound on subsequent runs. Is there some kind of between-job caching, in Spark or in Linux, that outlives jobs and might be making subsequent runs faster? If so, is there a way to avoid the caching in …
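One way to test this, as a sketch (the input path is hypothetical): time the IO-bound action and run the job twice via separate `spark-submit` invocations. Spark itself caches nothing across runs, so if the second run is much faster with no code change, the speedup is coming from below Spark, most likely the Linux page cache on the datanodes.

```scala
// Time an IO-bound action; run twice via separate spark-submit
// invocations. Any cross-run speedup comes from outside Spark
// (e.g., the OS page cache holding the input blocks).
val start = System.nanoTime()
val n = sc.textFile("hdfs:///data/input").count() // hypothetical path
println(f"count=$n took ${(System.nanoTime() - start) / 1e9}%.1f s")
```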