[ https://issues.apache.org/jira/browse/SPARK-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-5316:
-----------------------------
    Component/s:     (was: Spark Core)
                 Scheduler

> DAGScheduler may make shuffleToMapStage leak if getParentStages fails
> ----------------------------------------------------------------------
>
>                 Key: SPARK-5316
>                 URL: https://issues.apache.org/jira/browse/SPARK-5316
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: YanTang Zhai
>            Priority: Minor
>
> DAGScheduler may leak entries in shuffleToMapStage if getParentStages fails.
> If getParentStages throws an exception, for example because an input path
> does not exist, DAGScheduler fails to handle the job submission. By that
> point getParentStages may already have put some records into
> shuffleToMapStage, and these records are never cleaned up.
> A simple job that reproduces this:
> {code:java}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.rdd.PairRDDFunctions
>
> val inputFile1 = ... // Input path does not exist when this job is submitted
> val inputFile2 = ...
> val outputFile = ...
> val conf = new SparkConf()
> val sc = new SparkContext(conf)
> val rdd1 = sc.textFile(inputFile1)
>   .flatMap(line => line.split(" "))
>   .map(word => (word, 1))
>   .reduceByKey(_ + _, 1)
> val rdd2 = sc.textFile(inputFile2)
>   .flatMap(line => line.split(","))
>   .map(word => (word, 1))
>   .reduceByKey(_ + _, 1)
> try {
>   val rdd3 = new PairRDDFunctions(rdd1).join(rdd2, 1)
>   rdd3.saveAsTextFile(outputFile)
> } catch {
>   case e: Exception =>
>     // Job submission fails here because inputFile1 does not exist
>     logError("Job submission failed", e)
> }
> // Print the contents of DAGScheduler's shuffleToMapStage to check
> // whether it still has uncleaned records.
> ...
> {code}
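
The leak pattern described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions: ToyScheduler, registerParents, submitLeaky and submitWithCleanup are hypothetical names invented for illustration. It is not Spark's DAGScheduler code and not necessarily the fix adopted for this issue. Only the map name shuffleToMapStage is taken from the report.

```scala
import java.io.IOException
import scala.collection.mutable

// Hypothetical toy model of the bookkeeping described in the report.
// It does not use or reproduce Spark's real DAGScheduler.
final class ToyScheduler {
  // shuffleId -> name of the map stage registered for that shuffle
  val shuffleToMapStage = mutable.HashMap.empty[Int, String]

  // Stand-in for getParentStages: registers one entry per shuffle and
  // throws partway through when an input is missing, so entries added
  // before the failure are already in the map.
  private def registerParents(shuffles: Seq[(Int, Boolean)]): Unit =
    shuffles.foreach { case (shuffleId, inputExists) =>
      if (!inputExists)
        throw new IOException(s"Input path does not exist for shuffle $shuffleId")
      shuffleToMapStage(shuffleId) = s"mapStage-$shuffleId"
    }

  // Leaky submission: the failure is swallowed, but entries registered
  // before the exception stay in shuffleToMapStage.
  def submitLeaky(shuffles: Seq[(Int, Boolean)]): Boolean =
    try { registerParents(shuffles); true }
    catch { case _: IOException => false }

  // Submission with cleanup: on failure, remove only the entries that
  // this submission added.
  def submitWithCleanup(shuffles: Seq[(Int, Boolean)]): Boolean = {
    val before: Set[Int] = shuffleToMapStage.keySet.toSet
    try { registerParents(shuffles); true }
    catch {
      case _: IOException =>
        val added = shuffleToMapStage.keys.toList.filterNot(before)
        added.foreach(id => shuffleToMapStage.remove(id))
        false
    }
  }
}
```

With the input Seq((0, true), (1, false)), submitLeaky returns false and leaves the entry for shuffle 0 in the map, whereas submitWithCleanup returns false and leaves the map empty. Snapshotting the key set before registration is one possible design choice; it keeps entries that belong to earlier, successful submissions.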