Re: Custom persist or cache of RDD?

2014-11-11 Thread Daniel Siegmann
But that requires an (unnecessary) load from disk.

I have run into this same issue, where we want to save intermediate results
but continue processing. The cache / persist feature of Spark doesn't seem
designed for this case. Unfortunately I'm not aware of a better solution
with the current version of Spark.
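
For reference, the persist-then-save pattern that the original question alludes to can be sketched as below. This is only an illustration under assumptions: it uses the current PySpark DataFrame calls (`persist`, `write.parquet`) rather than the `saveAsParquetFile` call from the thread, and `save_and_continue` is a hypothetical helper name. It avoids the reload from disk, but the data then lives both in Spark's cache and in the Parquet files, which is the duplication the question raises.

```python
# Sketch only: `df` is assumed to be a PySpark DataFrame;
# `save_and_continue` is a hypothetical helper, not a Spark API.
def save_and_continue(df, path):
    """Persist df, write it to Parquet, and return the persisted DataFrame."""
    # persist() only marks df for caching; nothing is computed yet.
    persisted = df.persist()
    # The Parquet write is the action that computes df once. Later steps
    # that build on `persisted` can then use the cached partitions.
    persisted.write.mode("overwrite").parquet(path)
    return persisted
```

Downstream steps would be built from the returned DataFrame, and `unpersist()` can be called on it once they have finished.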

On Mon, Nov 10, 2014 at 5:15 PM, Sean Owen so...@cloudera.com wrote:

 Well you can always create C by loading B from disk, and likewise for
 E / D. No need for any custom procedure.

 On Mon, Nov 10, 2014 at 7:33 PM, Benyi Wang bewang.t...@gmail.com wrote:
  When I have a multi-step process flow like this:
 
  A - B - C - D - E - F
 
  I need to store B and D's results into parquet files
 
  B.saveAsParquetFile
  D.saveAsParquetFile
 
  If I don't cache/persist any step, spark might recompute from A,B,C,D and E
  if something is wrong in F.
 
  Of course, I'd better cache all steps if I have enough memory to avoid this
  re-computation, or persist result to disk. But persisting B and D seems
  duplicate with saving B and D as parquet files.
 
  I'm wondering if spark can restore B and D from the parquet files using a
  customized persist and restore procedure?
 
 
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Custom persist or cache of RDD?

2014-11-10 Thread Benyi Wang
I have a multi-step processing flow like this:

A -> B -> C -> D -> E -> F

I need to store the results of B and D in Parquet files:

B.saveAsParquetFile
D.saveAsParquetFile

If I don't cache/persist any step, Spark might recompute A, B, C, D and E
if something goes wrong in F.

Of course, I could cache every step to avoid this recomputation if I had
enough memory, or persist the results to disk. But persisting B and D seems
to duplicate saving B and D as Parquet files.

I'm wondering whether Spark can restore B and D from the Parquet files using
a customized persist-and-restore procedure?


Re: Custom persist or cache of RDD?

2014-11-10 Thread Sean Owen
Well, you can always create C by loading B back from disk, and likewise
create E by loading D. No need for any custom procedure.
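
A minimal sketch of this save-then-reload approach, under stated assumptions: it uses the current PySpark DataFrame calls (`write.parquet`, `read.parquet`) rather than the `saveAsParquetFile` call from the thread, `spark` is assumed to be a SparkSession, and the helper names, step functions and paths are hypothetical.

```python
# Sketch only: `save_and_reload` and `run_pipeline` are hypothetical helpers.
def save_and_reload(spark, df, path):
    """Write df to Parquet, then return a new DataFrame read from that path."""
    # The write is an action: it computes df once and stores the result.
    df.write.mode("overwrite").parquet(path)
    # The returned DataFrame is defined by the Parquet files, so later
    # steps read them instead of recomputing the steps that produced df.
    return spark.read.parquet(path)


def run_pipeline(spark, a, to_b, to_c, to_d, to_e, to_f, b_path, d_path):
    """Run A -> B -> C -> D -> E -> F, saving and reloading B and D."""
    b = save_and_reload(spark, to_b(a), b_path)
    c = to_c(b)
    d = save_and_reload(spark, to_d(c), d_path)
    e = to_e(d)
    return to_f(e)
```

With this shape, a failure in F should at most trigger a re-read of D's files and a recomputation of E, at the cost of one extra read of each saved result.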
