"But learned that it is better not to reduce it to 0." could you explain a bit more this sentence?
Thanks,
Alonso Isidoro Roman

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming must be the process of putting them in." - Edsger Dijkstra
"If you pay peanuts you get monkeys."

2016-02-04 11:33 GMT+01:00 Prabhu Joseph <prabhujose.ga...@gmail.com>:

> Okay, the reason for the task delay within the executor when some RDD partitions are in
> memory and some are in Hadoop, i.e. multiple locality levels NODE_LOCAL and ANY, is that
> the scheduler waits for *spark.locality.wait* (3 seconds by default). During this period,
> the scheduler waits to launch a data-local task before giving up and launching it on a
> less-local node.
>
> So after setting it to 0, all tasks started in parallel. But I learned that it is better
> not to reduce it to 0.
>
>
> On Mon, Feb 1, 2016 at 2:02 PM, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
>
>> Hi All,
>>
>> Here is a sample Spark application which reads a logfile from Hadoop (1.2GB; 5 RDD
>> partitions created, each approx. 250MB of data) and runs two jobs. Job A counts the
>> lines containing "a" and Job B counts the lines containing "b". The application is run
>> multiple times, each time with a different executor memory and with the cache() call
>> enabled or disabled. Job A performance is the same in all runs, as it has to read the
>> entire data from disk the first time.
>>
>> Spark cluster: standalone mode with a Spark Master and a single worker node (12 cores,
>> 16GB memory)
>>
>>     val logData = sc.textFile(logFile, 2)
>>     var numAs = logData.filter(line => line.contains("a")).count()
>>     var numBs = logData.filter(line => line.contains("b")).count()
>>
>> *Job B (which has 5 tasks) results below:*
>>
>> *Run 1:* 1 executor with 2GB memory, 12 cores; took 2 seconds [ran1 image]
>>
>>     Since logData is not cached, Job B has to read the 1.2GB of data from Hadoop into
>> memory again. All 5 tasks started in parallel, each took 2 seconds (29ms of GC), and
>> the overall job completed in 2 seconds.
>>
>> *Run 2:* 1 executor with 2GB memory, 12 cores, logData cached; took 4 seconds
>> [ran2 image, ran2_cache image]
>>
>>     val logData = sc.textFile(logFile, 2).cache()
>>
>>     The executor does not have enough memory to cache the data and hence again needs
>> to read the entire 1.2GB from Hadoop into memory. But since cache() is used, this
>> leads to a lot of GC pauses, slowing down task completion. Each task started in
>> parallel and completed in 4 seconds (more than 1 second of GC).
>>
>> *Run 3:* 1 executor with 6GB memory, 12 cores, logData cached; took 10 seconds
>> [ran3 image]
>>
>>     The executor has enough memory to fit 4 RDD partitions, but the 5th partition has
>> to be read from Hadoop. 4 tasks started in parallel and completed in 0.3 seconds
>> without GC. But the 5th task, which has to read its partition from disk, started only
>> after 4 seconds and completed in 2 seconds. I am analysing why the 5th task did not
>> start in parallel with the other tasks, or at least immediately after the others
>> completed.
>>
>> *Run 4:* 1 executor with 16GB memory, 12 cores, logData cached; took 0.3 seconds
>> [ran4 image]
>>
>>     The executor has enough memory to cache all 5 RDD partitions. All 5 tasks started
>> in parallel and completed within 0.3 seconds.
>>
>>
>> So Spark performs well when the input data is entirely in memory or not at all. In the
>> case where some partitions are in memory and some are read from disk, there is a delay
>> in scheduling the fifth task. Is this expected behavior or a possible bug?
>>
>>
>> Thanks,
>> Prabhu Joseph
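
A minimal sketch of how the locality wait discussed above could be set for this sample job, assuming Scala and SparkConf; the master URL, executor memory, log file path, and the 500ms value are illustrative placeholders rather than values taken from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("LocalityWaitExample")
      .setMaster("spark://master:7077")       // placeholder master URL
      .set("spark.executor.memory", "6g")     // placeholder executor memory
      // How long the scheduler waits for a data-local slot before falling back to a
      // less-local level (e.g. NODE_LOCAL -> ANY). 0 removes the wait entirely, which
      // is why all five tasks start in parallel, but it also gives up data locality.
      .set("spark.locality.wait", "500ms")    // placeholder value; default is 3s

    val sc = new SparkContext(conf)
    val logFile = "hdfs:///data/sample.log"   // placeholder path
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, lines with b: $numBs")
    sc.stop()

A small non-zero wait keeps some preference for data-local scheduling while shortening the delay seen for the fifth task; the same setting can also be passed to spark-submit with --conf spark.locality.wait=...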
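For the partial-cache situation in Run 2 and Run 3, one possible variant (not something tested in the thread) is to persist with MEMORY_AND_DISK_SER instead of cache(), assuming the same sc and logFile as in the sketch above:

    import org.apache.spark.storage.StorageLevel

    // cache() is MEMORY_ONLY: partitions that do not fit are dropped and re-read from
    // Hadoop. MEMORY_AND_DISK_SER spills them, serialized, to the executor's local
    // disk instead, and the serialized form also reduces GC pressure.
    val logData2 = sc.textFile(logFile, 2).persist(StorageLevel.MEMORY_AND_DISK_SER)
    val numAs2 = logData2.filter(line => line.contains("a")).count()
    val numBs2 = logData2.filter(line => line.contains("b")).count()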