Hi. 

I may not be the best person to answer this, as my mapper is not written in
Java, but here are my two cents' worth. 

screen wrote:
> 
> 1. Performance tuning/optimization: any good suggestions or links?
> 
With only 10 tasks and a heavy calculation in map(), tuning Hadoop is not
your first priority. 
However, if you plan to run more tasks on a cluster of machines, performance
still depends on your code, your data, your I/O profile, etc. 
I recommend the book "Hadoop: The Definitive Guide", published by O'Reilly. 
It does not answer all my questions, but it is handy to have on your desk. 


screen wrote:
> 
> 2. Logging - If I do any logging in a map/reduce class, where will the
> logging or System.out information be written?
> 
I am no expert here, but anything a task writes to System.out or to a
logger ends up in that task attempt's log files (stdout, stderr, and
syslog) under the TaskTracker's userlogs directory; the easiest way to read
them is through the job's page in the web UI. 
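For example (a rough, untested sketch using the old org.apache.hadoop.mapred
API; the class name and log messages are just for illustration), both kinds
of output show up in those per-task logs:

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LoggingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final Log LOG = LogFactory.getLog(LoggingMapper.class);

  public void map(LongWritable key, Text value,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    // Ends up in the task attempt's stdout log file
    System.out.println("processing record at offset " + key);
    // Ends up in the task attempt's syslog file via log4j
    LOG.info("value length: " + value.getLength());
    output.collect(value, new LongWritable(1));
  }
}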
 

screen wrote:
> 
> 3. How do we reuse the JVM? Map task creation takes time.
> 
Starting a JVM costs about a second on the Hadoop side, so reusing JVMs
will only benefit you if you have many very short tasks, or if you have
lengthy initialization on your side.
"mapred.job.reuse.jvm.num.tasks can be used to specify the maximum number of
tasks to run for a given job for each JVM" (from the book recommended
above); -1 means no limit. 
It can also be set on a per-job basis with JobConf.setNumTasksToExecutePerJvm(). 
Again, I recommend the book.
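Something like this should do it (untested sketch; the class name is made
up):

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Per-job equivalent of mapred.job.reuse.jvm.num.tasks;
    // -1 means reuse each JVM for an unlimited number of this job's tasks
    conf.setNumTasksToExecutePerJvm(-1);
  }
}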


screen wrote:
> 
> 4. Different types of spills - how do we avoid them?
> 
Not sure exactly what you mean here. In general it is better to have many
short tasks than to manually partition the work so that it fits your CPUs. 
Why? Well, if something goes wrong (let's say one CPU acts up or is busy
doing something else), your manual plan fails. Second, the number of map
tasks you set should be much higher than that, since it is only a hint to
Hadoop on how many to spawn; you do not control map spawning (see the
sketch below).
If you want that kind of accurate control, I would think doing the work in
the reducers is better, since you do control exactly how many of those run. 
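To be concrete (again an untested sketch, class name made up): the map
count is only a hint, while the reduce count is honored exactly:

import org.apache.hadoop.mapred.JobConf;

public class TaskCountExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Only a hint: the real number of map tasks follows from the input splits
    conf.setNumMapTasks(100);
    // Honored exactly: you decide how many reduce tasks run
    conf.setNumReduceTasks(7);
  }
}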
Also, you could consider speculative execution, i.e. allowing two CPUs to
do the same work at the same time. This is useful when some workers are not
keeping up and, towards the end, idle workers are left waiting for the last
pieces of work to finish. 
In your case you have 10 tasks and 7 workers. Let's say that for some
reason one of the CPUs is much slower than the others; speculative
execution could then have the free CPUs run the last 3 tasks, and even
restart unfinished attempts from the first 7, just in case they are very,
very slow. 
However, this is usually much more useful in the multi-machine case, where
the differences can be inherent (not the same hardware, etc.). 
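Speculative execution is on by default in the versions I have used, but if
you want to set it explicitly per job, something like this should work
(untested sketch):

import org.apache.hadoop.mapred.JobConf;

public class SpeculativeExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Allow backup attempts for straggling map and reduce tasks
    // (mapred.map.tasks.speculative.execution /
    //  mapred.reduce.tasks.speculative.execution)
    conf.setMapSpeculativeExecution(true);
    conf.setReduceSpeculativeExecution(true);
  }
}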

Hope you find some of this useful. 
Regards 
    Gorgo