Re: Custom persist or cache of RDD?
But that requires an (unnecessary) load from disk. I have run into the same issue, where we want to save intermediate results but continue processing. Spark's cache/persist feature doesn't seem to be designed for this case. Unfortunately, I'm not aware of a better solution in the current version of Spark.

On Mon, Nov 10, 2014 at 5:15 PM, Sean Owen so...@cloudera.com wrote:
> Well you can always create C by loading B from disk, and likewise for E / D. No need for any custom procedure.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

--
Daniel Siegmann, Software Developer
Velos
E: daniel.siegm...@velos.io W: www.velos.io
Custom persist or cache of RDD?
When I have a multi-step process flow like this:

A -> B -> C -> D -> E -> F

I need to store B's and D's results in Parquet files:

B.saveAsParquetFile
D.saveAsParquetFile

If I don't cache/persist any step, Spark might recompute A, B, C, D, and E if something goes wrong in F.

Of course, I'd better cache all steps if I have enough memory to avoid this recomputation, or persist the results to disk. But persisting B and D seems to duplicate saving B and D as Parquet files.

I'm wondering if Spark can restore B and D from the Parquet files using a customized persist and restore procedure?
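
For readers following along, the "persist and also save" option described above might look roughly like the sketch below. It is only an illustration under assumptions: it uses the newer DataFrame API (`SparkSession`, `write.parquet`) rather than the `saveAsParquetFile` call from the message, and the path, column names, and transformations are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

object PersistAndSave {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("persist-and-save")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()

    // Hypothetical output location for step B
    val bPath = "/tmp/pipeline/B.parquet"

    // A: stand-in for the real source data
    val a = spark.range(0, 1000).toDF("id")

    // B: mark the intermediate result for reuse before any action runs
    val b = a.withColumn("doubled", col("id") * 2)
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Saving B is an action that computes it; with persist set,
    // the computed partitions are expected to be kept for later reuse
    b.write.mode("overwrite").parquet(bPath)

    // C is derived from the persisted B rather than recomputed from A
    val c = b.filter(col("doubled") % 3 === 0)
    println(c.count())

    // Release the persisted copy once downstream steps no longer need it
    b.unpersist()
    spark.stop()
  }
}
```

This is the duplication the question points at: B ends up both in Spark's storage and in the Parquet files.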
Re: Custom persist or cache of RDD?
Well, you can always create C by loading B from disk, and likewise for E / D. No need for any custom procedure.

On Mon, Nov 10, 2014 at 7:33 PM, Benyi Wang bewang.t...@gmail.com wrote:
> [...] I'm wondering if spark can restore B and D from the parquet files using a customized persist and restore procedure?
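
A minimal sketch of the suggestion in this reply (write B out, then build C from the files rather than from B's lineage) could look like the following. It assumes the newer DataFrame API instead of the `saveAsParquetFile` call used in the thread, and the path, column names, and transformations are hypothetical stand-ins for the real steps.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ReloadFromParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("reload-from-parquet")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()

    // Hypothetical output location for step B
    val bPath = "/tmp/pipeline/B.parquet"

    // A: stand-in for the real source data
    val a = spark.range(0, 1000).toDF("id")

    // B: computed from A and saved as Parquet
    val b = a.withColumn("doubled", col("id") * 2)
    b.write.mode("overwrite").parquet(bPath)

    // Reload B from disk; this DataFrame's plan starts at the files,
    // so a failure downstream does not require recomputing A -> B
    val bReloaded = spark.read.parquet(bPath)

    // C is built from the reloaded B instead of the original b
    val c = bReloaded.filter(col("doubled") % 3 === 0)
    println(c.count())

    spark.stop()
  }
}
```

The cost is the extra read of B from disk, which is the overhead the other reply in this thread objects to.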