Github user rezasafi commented on the issue: https://github.com/apache/spark/pull/19848

@steveloughran Thank you very much for your detailed comment. I really appreciate it. I think in the list above, when you reach step 6, Stage2 will have a different jobId, and it cannot be zero with the current fix. That is because the jobId is rdd.id, and within a SparkContext each new RDD gets a fresh id (nextRddId.getAndIncrement()). Across different executions (with different SparkContexts), however, we may still hit the same jobId with this fix. From your detailed analysis, I understand we have two options to resolve that:

1) Check whether the same jobId has already been committed; if so, remove the existing files and commit again.
2) Use a UUID so that each run generates a unique jobId, even across different executions.

Option 2 can be problematic, since we may not want to keep copies of an RDD from different points in time; we probably just want the latest one. So the first option is likely better.
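To illustrate the collision across executions, here is a minimal Java sketch. The class names (MiniContext, newRddId) are hypothetical stand-ins, not Spark's actual API; the point is only that a per-context AtomicInteger counter, like SparkContext's nextRddId, restarts at zero in every new context, so the first RDD of each run gets the same id (and hence the same jobId under the current fix):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class JobIdCollision {
    // Hypothetical stand-in for a SparkContext's per-context RDD id counter.
    static class MiniContext {
        private final AtomicInteger nextRddId = new AtomicInteger(0);
        int newRddId() { return nextRddId.getAndIncrement(); }
    }

    public static void main(String[] args) {
        MiniContext run1 = new MiniContext(); // first execution
        MiniContext run2 = new MiniContext(); // a later, separate execution
        int jobId1 = run1.newRddId(); // 0
        int jobId2 = run2.newRddId(); // 0 again: the counter restarts per context
        System.out.println("collision: " + (jobId1 == jobId2)); // prints "collision: true"
    }
}
```

Ids are unique only within one context, which is why option 1 (detect an already-committed jobId and replace its output) or option 2 (a UUID-based jobId) would be needed for cross-run uniqueness.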