YanTang Zhai created SPARK-5316:
-----------------------------------

             Summary: DAGScheduler may leak shuffleToMapStage entries if 
getParentStages fails
                 Key: SPARK-5316
                 URL: https://issues.apache.org/jira/browse/SPARK-5316
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
            Reporter: YanTang Zhai
            Priority: Minor


DAGScheduler may leak shuffleToMapStage entries if getParentStages fails.
If getParentStages throws an exception (for example, because an input path 
does not exist), DAGScheduler fails to handle the job submission, but 
getParentStages may already have put records into shuffleToMapStage. Those 
records are never cleaned up.
A simple job as follows:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.PairRDDFunctions

val inputFile1 = ... // Input path does not exist when this job submits
val inputFile2 = ...
val outputFile = ...
val conf = new SparkConf()
val sc = new SparkContext(conf)
val rdd1 = sc.textFile(inputFile1)
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
val rdd2 = sc.textFile(inputFile2)
  .flatMap(line => line.split(","))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
try {
  val rdd3 = new PairRDDFunctions(rdd1).join(rdd2, 1)
  rdd3.saveAsTextFile(outputFile)
} catch {
  case e: Exception =>
    // assumes the enclosing class mixes in org.apache.spark.Logging
    logError("Job submission failed", e)
}
// Print DAGScheduler's shuffleToMapStage here to check
// whether it still holds uncleaned records.
...
{code}
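The leak pattern can be modeled without Spark at all. The following is a minimal, self-contained sketch (not Spark code; all names are hypothetical stand-ins) of what the description above claims: a map is populated while parent stages are being resolved, the resolution fails partway through, the failure handler aborts the submission, and nothing removes the entries that were already added.

{code:java}
import scala.collection.mutable.HashMap

// Toy model of the leak described in this issue.
object ShuffleMapLeakSketch {
  // Stands in for DAGScheduler.shuffleToMapStage (shuffleId -> stage).
  val shuffleToMapStage = new HashMap[Int, String]

  // Stands in for getParentStages: it registers shuffle stages first,
  // then fails because an input path does not exist.
  def getParentStages(): Unit = {
    shuffleToMapStage(0) = "ShuffleMapStage for rdd1"
    shuffleToMapStage(1) = "ShuffleMapStage for rdd2"
    throw new RuntimeException("Input path does not exist")
  }

  // Stands in for job submission: the exception is caught and the job
  // fails, but the catch block never cleans shuffleToMapStage.
  def submitJob(): Unit =
    try getParentStages()
    catch {
      case e: Exception =>
        println(s"Job failed: ${e.getMessage}")
    }

  def main(args: Array[String]): Unit = {
    submitJob()
    // The two records added before the failure are still present.
    println(s"Uncleaned records: ${shuffleToMapStage.size}")
  }
}
{code}

In this model the fix would be for the failure path to remove the entries it added before rethrowing, which is the cleanup the report says DAGScheduler omits.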



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
