But that requires an (unnecessary) load from disk.
I have run into this same issue, where we want to save intermediate results
but continue processing. The cache / persist feature of Spark doesn't seem
designed for this case. Unfortunately I'm not aware of a better solution
with the current
When I have a multi-step process flow like this:
A -> B -> C -> D -> E -> F
I need to store B's and D's results into Parquet files:
B.saveAsParquetFile
D.saveAsParquetFile
If I don't cache/persist any step, Spark might recompute everything from A
through E if something goes wrong in F.
Of course, I'd better
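To make that flow concrete, here is a rough spark-shell-style sketch against the
Spark 1.x SchemaRDD API (the one that has saveAsParquetFile). The paths and the
Record case class are placeholders made up for illustration:

  // assumes a spark-shell session, where sc is already defined
  import org.apache.spark.sql.{SQLContext, SchemaRDD}
  import org.apache.spark.storage.StorageLevel

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD           // implicit: RDD of case classes -> SchemaRDD

  case class Record(id: Long, value: String)  // placeholder schema

  val A = sc.textFile("hdfs:///input/a")                              // step A
  val B: SchemaRDD = A.map(line => Record(line.length.toLong, line))  // step B
  B.persist(StorageLevel.MEMORY_AND_DISK)       // keep B so a later failure doesn't recompute A
  B.saveAsParquetFile("hdfs:///tmp/b.parquet")  // intermediate result on disk

  val C = B.filter(row => row.getLong(0) > 0)   // step C builds on the persisted B
  // ... D (persist + saveAsParquetFile again), then E and F continue from here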
Well, you can always create C by loading B back from disk, and likewise create
E by loading D. No need for any custom procedure.
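For instance, with the same placeholder paths as in the sketch above, C can be
re-rooted at the Parquet copy of B:

  // read B back and derive C from the on-disk copy instead of B's lineage,
  // so nothing upstream of B can ever be recomputed
  val bFromDisk = sqlContext.parquetFile("hdfs:///tmp/b.parquet")
  val C = bFromDisk.filter(row => row.getLong(0) > 0)  // same step C, now reading from disk
  // likewise, build E from sqlContext.parquetFile of D's output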
On Mon, Nov 10, 2014 at 7:33 PM, Benyi Wang bewang.t...@gmail.com wrote: