"But learned that it is better not to reduce it to 0." could you explain a bit more this sentence?
Thanks,
Alonso Isidoro Roman

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming must be the process of putting them in." - Edsger Dijkstra
"If you pay peanuts you get monkeys."

2016-02-04 11:33 GMT+01:00 Prabhu Joseph <prabhujose.ga...@gmail.com>:

> Okay, the reason for the task delay within the executor when some RDD partitions are in
> memory and some are in Hadoop, i.e. multiple locality levels NODE_LOCAL and ANY, is that
> the scheduler waits for *spark.locality.wait* (3 seconds by default). During this period,
> the scheduler waits to launch a data-local task before giving up and launching it on a
> less-local node.
>
> So after setting it to 0, all tasks started in parallel. But I learned that it is better
> not to reduce it to 0.
>
>
> On Mon, Feb 1, 2016 at 2:02 PM, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
>
>> Hi All,
>>
>> Here is a sample Spark application which reads a logfile from Hadoop (1.2GB; 5 RDD
>> partitions created, each approx. 250MB of data) and runs two jobs. Job A counts the
>> lines containing "a" and Job B counts the lines containing "b". The application is run
>> multiple times, each time with a different executor memory and with the cache() call
>> enabled or disabled. Job A performance is the same in all runs, as it has to read the
>> entire data from disk the first time.
>>
>> Spark cluster: standalone mode with a Spark Master and a single worker node (12 cores,
>> 16GB memory)
>>
>>     val logData = sc.textFile(logFile, 2)
>>     var numAs = logData.filter(line => line.contains("a")).count()
>>     var numBs = logData.filter(line => line.contains("b")).count()
>>
>> *Job B (which has 5 tasks) results below:*
>>
>> *Run 1:* 1 executor with 2GB memory, 12 cores; took 2 seconds [ran1 image]
>>
>>     Since logData is not cached, Job B has to read the 1.2GB of data from Hadoop into
>> memory again. All 5 tasks started in parallel, each took 2 seconds (29ms of GC), and
>> the overall job completed in 2 seconds.
>>
>> *Run 2:* 1 executor with 2GB memory, 12 cores, logData cached; took 4 seconds
>> [ran2 image, ran2_cache image]
>>
>>     val logData = sc.textFile(logFile, 2).cache()
>>
>>     The executor does not have enough memory to cache the data and hence again needs
>> to read the entire 1.2GB from Hadoop into memory. But since cache() is used, this
>> leads to a lot of GC pauses, slowing down task completion. Each task started in
>> parallel and completed in 4 seconds (more than 1 second of GC).
>>
>> *Run 3:* 1 executor with 6GB memory, 12 cores, logData cached; took 10 seconds
>> [ran3 image]
>>
>>     The executor has enough memory to fit 4 RDD partitions, but the 5th partition has
>> to be read from Hadoop. 4 tasks started in parallel and completed in 0.3 seconds
>> without GC. But the 5th task, which has to read its partition from disk, started only
>> after 4 seconds and completed in 2 seconds. I am analysing why the 5th task did not
>> start in parallel with the other tasks, or at least immediately after the others
>> completed.
>>
>> *Run 4:* 1 executor with 16GB memory, 12 cores, logData cached; took 0.3 seconds
>> [ran4 image]
>>
>>     The executor has enough memory to cache all 5 RDD partitions. All 5 tasks started
>> in parallel and completed within 0.3 seconds.
>>
>>
>> So Spark performs well when the input data is entirely in memory or not at all. In the
>> case where some partitions are in memory and some are read from disk, there is a delay
>> in scheduling the fifth task. Is this expected behavior or a possible bug?
>>
>>
>> Thanks,
>> Prabhu Joseph
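
A minimal sketch of how the locality wait discussed above could be set for this sample job, assuming Scala and SparkConf; the master URL, executor memory, log file path, and the 500ms value are illustrative placeholders rather than values taken from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("LocalityWaitExample")
      .setMaster("spark://master:7077")       // placeholder master URL
      .set("spark.executor.memory", "6g")     // placeholder executor memory
      // How long the scheduler waits for a data-local slot before falling back to a
      // less-local level (e.g. NODE_LOCAL -> ANY). 0 removes the wait entirely, which
      // is why all five tasks start in parallel, but it also gives up data locality.
      .set("spark.locality.wait", "500ms")    // placeholder value; default is 3s

    val sc = new SparkContext(conf)
    val logFile = "hdfs:///data/sample.log"   // placeholder path
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, lines with b: $numBs")
    sc.stop()

A small non-zero wait keeps some preference for data-local scheduling while shortening the delay seen for the fifth task; the same setting can also be passed to spark-submit with --conf spark.locality.wait=...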
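For the partial-cache situation in Run 2 and Run 3, one possible variant (not something tested in the thread) is to persist with MEMORY_AND_DISK_SER instead of cache(), assuming the same sc and logFile as in the sketch above:

    import org.apache.spark.storage.StorageLevel

    // cache() is MEMORY_ONLY: partitions that do not fit are dropped and re-read from
    // Hadoop. MEMORY_AND_DISK_SER spills them, serialized, to the executor's local
    // disk instead, and the serialized form also reduces GC pressure.
    val logData2 = sc.textFile(logFile, 2).persist(StorageLevel.MEMORY_AND_DISK_SER)
    val numAs2 = logData2.filter(line => line.contains("a")).count()
    val numBs2 = logData2.filter(line => line.contains("b")).count()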