On Thu, Jul 22, 2010 at 1:07 AM, Allen Wittenauer
<awittena...@linkedin.com> wrote:

>
> On Jul 21, 2010, at 9:17 AM, Vitaliy Semochkin wrote:
> > might I ask how you came to such a result?
> >
> > In my cluster I set the number of mappers and reducers to twice the number
> > of cpu*cores I have.
>
> This is probably a sign that your data is in too many small files.
>
> > How did I come to this solution: first I noticed that in top the average
> > load is very low (3-4%) and that the CPU spends a lot of time in WA (IO
> > wait).
> >
> > After several experiments I found that having the number of mappers and
> > reducers TWICE the number of cpu*cores gives the best result (the result
> > was almost TWICE BETTER).
>
> But that should put more strain on the IO system since now more tasks are
> waiting for input... so chances are good that your wait isn't IO, but
> context switching...  Another good sign that you have too many files in too
> many blocks.
>
If it were context switching, would increasing the number of mappers/reducers
lead to a performance improvement?
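
For reference, the per-node slot counts I am tuning live in mapred-site.xml;
a minimal sketch of what I mean, with purely illustrative values for a 4-core
node (not my exact config):

  <property>
    <!-- 2 x cores: map slots per TaskTracker -->
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
  <property>
    <!-- 2 x cores: reduce slots per TaskTracker -->
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>8</value>
  </property>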


>
> > That I can explain by the fact that I do relatively simple log counting
> > (counting the number of visitors, hits, etc.),
> > and in this case I have a relatively huge amount of IO (the logs are huge)
> > and a small amount of computation.
> > I also use mapred.job.reuse.jvm.num.tasks=-1
>
> How many files, what is your block count, and how large is the average
> file?  'huge' is fairly relative. :)

I have one log file, ~140GB, and I use the default HDFS block size (64MB).
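(If my arithmetic is right, that comes to roughly 140 * 1024 / 64 ≈ 2240
blocks, so on the order of 2240 map tasks for the job.)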

I also set dfs.replication=1.
Am I right that the higher dfs.replication is, the faster MapReduce will work,
because the probability that a split will be on a local node will be equal
to 1?
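(My rough reasoning: with replication factor r on an N-node cluster and even
block placement, the chance that a given split has a replica on a particular
node is roughly r/N, so with r = N it would be 1.)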

Also, is it correct that it will slow down put operations? (Technically the
put operations will run in parallel, so I'm not sure whether it will hurt
performance or not.)



> > What I do not understand is why mapred.child.java.opts=-Xmx256m boosts
> > performance in comparison to -Xmx160m. How can a bigger amount of RAM give
> > me any benefit if I don't get out-of-memory errors with smaller -Xmx
> > values?!
>
> More memory means that Hadoop doesn't have to spill to disk as often due to
> being able to use a larger buffer in RAM.
>
Does Hadoop check whether it has enough memory for such an operation?
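
For reference, I assume the buffer being discussed is io.sort.mb; a minimal
sketch of how I understand the two settings relate in mapred-site.xml, with
illustrative values only:

  <property>
    <!-- heap for each map/reduce child JVM -->
    <name>mapred.child.java.opts</name>
    <value>-Xmx256m</value>
  </property>
  <property>
    <!-- in-memory buffer for sorting map output; as far as I understand it
         has to fit inside the child heap above -->
    <name>io.sort.mb</name>
    <value>100</value>
  </property>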

Regards,
Vitaliy S
