Glenn Strycker created SPARK-8666:
-------------------------------------

             Summary: checkpointing does not take advantage of persisted/cached RDDs
                 Key: SPARK-8666
                 URL: https://issues.apache.org/jira/browse/SPARK-8666
             Project: Spark
          Issue Type: New Feature
            Reporter: Glenn Strycker

I have noticed that when I checkpoint an RDD, all of its operations run TWICE. For example, when I run the following code and watch the stages...

{noformat}
val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist()
newRDD.checkpoint()
print(newRDD.count())
{noformat}

I see the distinct and count operations appear TWICE, and the shuffle disk writes and reads (from the distinct) also occur TWICE.

My newRDD is persisted to memory, so why can't the checkpoint simply save those partitions to disk once the first operations have completed?
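
For anyone who wants to watch the stages themselves, here is a minimal standalone sketch of the same sequence of calls. It is only an assumption about a reasonable setup, not the reporter's actual job: the `local[2]` master, the temporary checkpoint directory, the object name `CheckpointRepro`, and the four-element stand-in for prevRDD are all hypothetical.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}

object CheckpointRepro {
  // Runs the persist + checkpoint + count sequence from the report and
  // returns (count, isCheckpointed) so a caller can inspect the outcome.
  def run(): (Long, Boolean) = {
    // Hypothetical local setup; the report does not say how the job was run.
    val conf = new SparkConf().setAppName("SPARK-8666-repro").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // checkpoint() needs a checkpoint directory; a temp directory is assumed here.
      sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

      // Hypothetical stand-in for prevRDD from the report.
      val prevRDD = sc.parallelize(Seq(("a", 10), ("a", 20), ("b", 30), ("c", 40)))

      val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist()
      newRDD.checkpoint()      // only marks the RDD for checkpointing
      val n = newRDD.count()   // first action; the checkpoint is written as part of this call

      // The lineage printed here shows whether the RDD now reads from the checkpoint.
      println(newRDD.toDebugString)
      (n, newRDD.isCheckpointed)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (n, checkpointed) = run()
    println(s"count = $n, checkpointed = $checkpointed")
  }
}
```

Running this with the driver log at INFO level (or with the web UI kept open by pausing before `sc.stop()`) should show how many jobs and stages the single `count()` call triggers, which is the behavior described above.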