When I have a multi-step process flow like this: A -> B -> C -> D -> E -> F
I need to store B's and D's results as parquet files:

B.saveAsParquetFile
D.saveAsParquetFile

If I don't cache or persist any step, Spark may recompute A, B, C, D, and E all over again if something goes wrong in F. Of course, given enough memory I could cache every step (or persist the results to disk) to avoid this recomputation. But persisting B and D seems redundant with saving them as parquet files anyway.

I'm wondering: can Spark restore B and D from the parquet files, using some customized persist-and-restore procedure?