Hi. 

Have you ruled out that this may just be I/O time?
Word count is a very light-weight task for the CPU, but you still need to
read the initial data from whatever storage device your HDFS is running
on.
Since you have 3 machines with 22 cores each, but perhaps only one or a
few HDDs / SSDs / a NAS, the 22 cores may saturate your I/O capacity, in
which case I/O determines the running time of your task.
If it is some form of NAS storage, you may be saturating the network
capacity instead.
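A quick back-of-envelope check makes the saturation argument concrete. The numbers below (one HDD per node at ~150 MB/s sequential throughput, 256 MB blocks) are assumptions for illustration, not measurements from your cluster:

```python
# Illustrative I/O saturation estimate: can the storage keep 22
# concurrent map-tasks on one node busy? All numbers are assumptions.

cores_per_node = 22
block_mb = 256          # HDFS block size
disk_mb_per_s = 150.0   # assumed sequential throughput of one HDD
num_disks = 1           # assumed disks per node

# Aggregate disk bandwidth split evenly across concurrent readers.
per_task_mb_per_s = num_disks * disk_mb_per_s / cores_per_node
seconds_per_block = block_mb / per_task_mb_per_s

print(f"Each task gets ~{per_task_mb_per_s:.1f} MB/s, so reading one "
      f"{block_mb} MB block takes ~{seconds_per_block:.0f} s of wall time.")
```

If a number like that dominates your per-task times, the job is I/O-bound and adding CPU cores will not help.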

If this is the case, that would also explain the fluctuations in the
observed running times: a given map-task may have been lucky, reading its
data while the I/O was idle, or unlucky, with many cores (map-tasks) on
the same machine starting a new block at about the same time.

Also, 22 * 256 MB = 5632 MB: that is the RAM needed per machine just to
buffer one block of data for each map-task running in parallel.
Depending on how much RAM you have per node, you may want to re-block the
data on HDFS (i.e. rewrite it with a different block size) for optimal
performance.
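To see how the block size drives that per-node buffer requirement, here is the same arithmetic for a few candidate sizes (22 cores per node as in your setup; the alternative block sizes are just examples):

```python
# Per-node RAM needed to buffer one HDFS block per parallel map-task,
# for a few candidate block sizes. 22 cores per node as in the question.

cores_per_node = 22

for block_mb in (256, 128, 64):
    print(f"{block_mb:>3} MB blocks -> {cores_per_node * block_mb} MB "
          f"of buffer space per node")
```

One common way to re-block existing data is to copy it with a different `dfs.blocksize` setting (e.g. via `hdfs dfs -D dfs.blocksize=... -cp src dst`); whether that is worthwhile depends on how much RAM your nodes actually have.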

Hope this helps, 
   Gylfi. 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-the-length-of-each-task-varies-tp24008p24014.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
