Prabhu Joseph created SPARK-13181:
-------------------------------------

             Summary: Spark delay in task scheduling within executor
                 Key: SPARK-13181
                 URL: https://issues.apache.org/jira/browse/SPARK-13181
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.5.2
            Reporter: Prabhu Joseph
             Fix For: 1.5.2


When a Spark job has some RDD partitions cached in memory and others on Hadoop (HDFS), the executor starts the tasks that read from memory in parallel, but the task that reads from Hadoop starts only after a delay.

Repro: 

    A logFile of 1.25 GB is given as input (5 partitions of 256 MB each).

    val logData = sc.textFile(logFile, 2).cache()                   // cache the input after the first read
    var numAs = logData.filter(line => line.contains("a")).count()  // Stage A: reads from Hadoop, populates the cache
    var numBs = logData.filter(line => line.contains("b")).count()  // Stage B: reads mostly from the cache

Run the Spark job with 1 executor (6 GB memory, 12 cores).
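
A minimal driver-side sketch of that executor sizing (assumption: the app name is illustrative and the repro may equally be run from spark-shell with equivalent settings):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: executor sizing equivalent to the repro run
    val conf = new SparkConf()
      .setAppName("Repro")                  // illustrative app name
      .set("spark.executor.memory", "6g")   // 6 GB executor memory
      .set("spark.executor.cores", "12")    // 12 cores per executor
    val sc = new SparkContext(conf)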

Stage A (counting lines with "a") - the executor starts 5 tasks in parallel, and all of them read data from Hadoop.

Stage B (counting lines with "b") - as the data is now cached (4 partitions in memory, 1 on Hadoop), the executor starts 4 tasks in parallel and only after a delay of about 4 seconds starts the last task, which reads from Hadoop.

On running the same Spark job with 12 GB of executor memory, all 5 partitions fit in memory and all 5 tasks in Stage B start in parallel.

On running the job with 2 GB of executor memory, all 5 partitions are read from Hadoop and all 5 tasks in Stage B start in parallel.

The task delay happens only when some partitions are in memory and some on Hadoop.
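
This looks consistent with Spark's delay scheduling: the scheduler waits up to spark.locality.wait (default 3s) for a slot at a better locality level for the cached partitions before launching the remaining task at a lower locality level for the Hadoop read, which would match the ~4 second gap. A sketch of one way to test that hypothesis (assumption, not verified in this report): disable the locality wait and rerun.

    // Assumption: if delay scheduling causes the gap, removing the
    // locality wait should let all 5 Stage B tasks start in parallel.
    val conf = new SparkConf().set("spark.locality.wait", "0s")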

Check the attached image.
