Map-Reduce in memory

2011-10-28 Thread N.N. Gesli
Hello,

We have a 12-node Hadoop cluster running Hadoop 0.20.2-cdh3u0. Each
node has 8 cores and 144GB RAM (don't ask). So, I want to take advantage of
this huge RAM and run the map-reduce jobs mostly in memory, with no spill if
possible. We use Hive for most of our processing. I have set:
mapred.tasktracker.map.tasks.maximum = 16
mapred.tasktracker.reduce.tasks.maximum = 8
mapred.child.java.opts = 6144
mapred.reduce.parallel.copies = 20
mapred.job.shuffle.merge.percent = 1.0
mapred.job.reduce.input.buffer.percent = 0.25
mapred.inmem.merge.threshold = 0
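For scale, the slot and heap settings above add up as follows (a rough tally, assuming every slot is occupied and each child JVM grows to its full heap):

```python
# Worst-case memory committed by the slot settings above (a rough
# sketch; assumes all slots are busy and each child JVM is at full heap).
map_slots = 16        # mapred.tasktracker.map.tasks.maximum
reduce_slots = 8      # mapred.tasktracker.reduce.tasks.maximum
child_heap_mb = 6144  # mapred.child.java.opts, read as a heap size in MB

total_gb = (map_slots + reduce_slots) * child_heap_mb / 1024
print(total_gb)  # 144.0 -- the node's entire RAM, with nothing left over
```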

One of my Hive queries produces a 6-stage map-reduce job. In the third
stage, when it reads from a 200GB table, the last 14 reducers hang. I
changed mapred.task.timeout to 0 to see if they really hang. It has been 5
hours, so something is terribly wrong in my setup. Parts of the log are below.
My questions:
* What should my configuration be to make the reducers run in memory?
* Why does it keep waiting for map outputs?
* What does "dup hosts" mean?

Thank you!
N.Gesli



2011-10-27 16:35:24,304 WARN org.apache.hadoop.util.NativeCodeLoader: Unable
to load native-hadoop library for your platform... using builtin-java
classes where applicable

2011-10-27 16:35:24,611 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=SHUFFLE, sessionId=
2011-10-27 16:35:24,722 INFO org.apache.hadoop.mapred.ReduceTask:
ShuffleRamManager: MemoryLimit=1503238528,
MaxSingleShuffleLimit=375809632
2011-10-27 16:35:24,733 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Thread started: Thread for
merging on-disk files
2011-10-27 16:35:24,733 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Thread waiting: Thread for
merging on-disk files
2011-10-27 16:35:24,734 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Thread started: Thread for
merging in memory files
2011-10-27 16:35:24,735 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Thread started: Thread for
polling Map Completion Events
2011-10-27 16:35:24,735 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Need another 1308 map output(s)
where 0 is already in progress
2011-10-27 16:35:24,736 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Scheduled 0 outputs (0 slow hosts
and0 dup hosts)
2011-10-27 16:35:29,738 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Scheduled 12 outputs (0 slow
hosts and0 dup hosts)
2011-10-27 16:35:30,364 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Scheduled 5 outputs (0 slow hosts
and753 dup hosts)
2011-10-27 16:35:30,367 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Scheduled 1 outputs (0 slow hosts
and1182 dup hosts)
2011-10-27 16:35:30,368 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Scheduled 1 outputs (0 slow hosts
and1184 dup hosts)
2011-10-27 16:35:30,371 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Scheduled 2 outputs (0 slow hosts
and1073 dup hosts)

...

2011-10-27 16:36:04,284 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Scheduled 1 outputs (0 slow hosts
and958 dup hosts)

2011-10-27 16:36:04,310 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Scheduled 1 outputs (0 slow hosts
and958 dup hosts)
2011-10-27 16:36:07,721 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Scheduled 1 outputs (0 slow hosts
and950 dup hosts)
2011-10-27 16:36:16,455 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Scheduled 1 outputs (0 slow hosts
and948 dup hosts)
2011-10-27 16:36:16,464 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Scheduled 1 outputs (0 slow hosts
and948 dup hosts)
2011-10-27 16:36:24,736 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Need another 1061 map output(s)
where 12 is already in progress
2011-10-27 16:36:24,736 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Scheduled 0 outputs (0 slow hosts
and1049 dup hosts)
2011-10-27 16:37:24,737 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Need another 1061 map output(s)
where 12 is already in progress
2011-10-27 16:37:24,737 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Scheduled 0 outputs (0 slow hosts
and1049 dup hosts)
2011-10-27 16:38:24,738 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Need another 1061 map output(s)
where 12 is already in progress
2011-10-27 16:38:24,738 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_07_1 Scheduled 0 outputs (0 slow hosts
and1049 dup hosts)
2011-10-27 16:39:24,739 INFO org.apache.had

Re: Map-Reduce in memory

2011-10-28 Thread Harsh J
Hey N.N. Gesli,

(Inline)

On Fri, Oct 28, 2011 at 12:38 PM, N.N. Gesli  wrote:
> Hello,
>
> We have 12 node Hadoop Cluster that is running Hadoop 0.20.2-cdh3u0. Each
> node has 8 core and 144GB RAM (don't ask). So, I want to take advantage of
> this huge RAM and run the map-reduce jobs mostly in memory with no spill, if
> possible. We use Hive for most of the processes. I have set:
> mapred.tasktracker.map.tasks.maximum = 16
> mapred.tasktracker.reduce.tasks.maximum = 8

This is *crazy* for an 8-core machine. Try to keep total M+R slots well
below 8 instead - you're probably CPU-thrashed in this setup once a
large number of tasks boots up.

> mapred.child.java.opts = 6144

You can also raise io.sort.mb to 2000, and tweak io.sort.factor.

Raising the child opts to ~6 GB looks a bit unnecessary, since most of
your tasks work on a per-record basis and would not care much about
total RAM. Perhaps use all that RAM for a service like HBase, which can
leverage caching nicely!
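One sanity check worth making on the io.sort.mb suggestion (a sketch; the values are the ones discussed above): the map-side sort buffer is allocated inside the child JVM heap, so it has to leave room for the task itself.

```python
# io.sort.mb lives inside the child JVM heap, so the suggested value
# must leave headroom for the map task's own objects (a rough check).
child_heap_mb = 6144  # mapred.child.java.opts, read as MB
io_sort_mb = 2000     # suggested sort-buffer size

headroom_mb = child_heap_mb - io_sort_mb
print(headroom_mb)  # 4144 MB left for the task itself
```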

> One of my Hive queries is producing 6 stage map-reduce jobs. On the third
> stage when it queries from a 200GB table, the last 14 reducers hang. I
> changed mapred.task.timeout to 0 to see if they really hang. It has been 5
> hours, so something terribly wrong in my setup. Parts of the log is below.

It is probably just your slot settings. You may be massively
over-subscribing your CPU resources with 16 map task slots + 8 reduce
task slots. In the worst case, that means 24 JVMs competing over
8 available physical processors. Doesn't make sense to me at least -
make it more like 7 M / 2 R or so :)
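The over-subscription argument above, as arithmetic (a sketch; worst case means every slot occupied at once):

```python
# Worst-case JVM count versus physical cores for the current slot
# settings (a sketch of the over-subscription argument above).
physical_cores = 8
map_slots, reduce_slots = 16, 8

jvms = map_slots + reduce_slots
print(jvms, jvms / physical_cores)  # 24 3.0 -- three JVMs per core
```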

> My questions:
> * What should be my configurations to make reducers to run in the memory?
> * Why it keeps waiting for map outputs?

It has to fetch map outputs to get some data to start with. And it
pulls the map outputs a few at a time, so as not to overload the network
during the shuffle phases of several reducers across the cluster.

> * What does it mean "dup hosts"?

Duplicate hosts: hosts it already knows about and has already
scheduled fetch work on.
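A toy illustration of that accounting (not Hadoop's actual code; the hostnames are invented): a host that already has a fetch in flight is skipped and counted as a "dup host", which is why the log lines above show large dup-host counts while only a few outputs are scheduled.

```python
# Toy model of the shuffle fetch scheduler's "dup hosts" counter
# (an illustration only, not Hadoop's real code; hostnames invented).
pending = {"hostA": 3, "hostB": 2, "hostC": 1}  # map outputs left per host
in_flight = {"hostA"}  # hosts we are already copying from

scheduled = dup_hosts = 0
for host in pending:
    if host in in_flight:
        dup_hosts += 1      # already fetching from it: skip, count as dup
    else:
        in_flight.add(host)
        scheduled += 1
print(f"Scheduled {scheduled} outputs (0 slow hosts and {dup_hosts} dup hosts)")
```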



-- 
Harsh J


Re: Map-Reduce in memory

2011-10-28 Thread Michel Segel
Uhm...
He has plenty of memory... Depending on what sort of M/R tasks, he could push
it.
He didn't say how much disk...

I wouldn't start that high... Try 10 mappers and 2 reducers. Granted, it is a
bit asymmetric, and you can bump up the reducers...

Watch your jobs in Ganglia and see what is happening...

Harsh, assuming he is using Intel, each core is hyper-threaded, so the box sees
them as 2x CPUs.
8 cores look like 16.


Sent from a remote device. Please excuse any typos...

Mike Segel

On Oct 28, 2011, at 3:08 AM, Harsh J  wrote:

> Hey N.N. Gesli,
> 
> (Inline)
> 
> On Fri, Oct 28, 2011 at 12:38 PM, N.N. Gesli  wrote:
>> Hello,
>> 
>> We have 12 node Hadoop Cluster that is running Hadoop 0.20.2-cdh3u0. Each
>> node has 8 core and 144GB RAM (don't ask). So, I want to take advantage of
>> this huge RAM and run the map-reduce jobs mostly in memory with no spill, if
>> possible. We use Hive for most of the processes. I have set:
>> mapred.tasktracker.map.tasks.maximum = 16
>> mapred.tasktracker.reduce.tasks.maximum = 8
> 
> This is *crazy* for an 8 core machine. Try to keep M+R slots well
> below 8 instead - You're probably CPU-thrashed in this setup once
> large number of tasks get booted.
> 
>> mapred.child.java.opts = 6144
> 
> You can also raise io.sort.mb to 2000, and tweak io.sort.factor.
> 
> The child opts raise to 6~ GB looks a bit unnecessary since most of
> your tasks work on record basis and would not care much about total
> RAM. Perhaps use all that RAM for a service like HBase which can
> leverage caching nicely!
> 
>> One of my Hive queries is producing 6 stage map-reduce jobs. On the third
>> stage when it queries from a 200GB table, the last 14 reducers hang. I
>> changed mapred.task.timeout to 0 to see if they really hang. It has been 5
>> hours, so something terribly wrong in my setup. Parts of the log is below.
> 
> It is probably just your slot settings. You may be massively
> over-subscribing your CPU resources with 16 map task slots + 8 reduce
> tasks slots. At worst case, it would mean 24 total JVMs competing over
> 8 available physical processors. Doesn't make sense to me at least -
> Make it more like 7 M / 2 R or so :)
> 
>> My questions:
>> * What should be my configurations to make reducers to run in the memory?
>> * Why it keeps waiting for map outputs?
> 
> It has to fetch map outputs to get some data to start with. And it
> pulls the map outputs a few at a time - to not overload the network
> during shuffle phases of several reducers across the cluster.
> 
>> * What does it mean "dup hosts"?
> 
> Duplicate hosts. Hosts it already knows about and has already
> scheduled fetch work upon.
> 
> 
> 
> -- 
> Harsh J
> 


Re: Map-Reduce in memory

2011-11-04 Thread Michel Segel
Hi, 
First, you have 8 physical cores. Hyper-threading makes the machine think
it has 16. The trouble is that you really don't have 16 cores, so you need to be
a little more conservative.

You don't mention HBase, so I'm going to assume you don't have it
installed.
So in terms of tasks, allocate a core each to the DataNode and TaskTracker,
leaving 6 cores, or 12 hyper-threaded cores. This leaves a little headroom
for the other Linux processes...

Now you can split the remaining cores however you want.
You can even overlap a bit, since you are not going to be running all of your
reducers at the same time.
So let's say 10 mappers and 4 reducers to start.

Since you have all that memory, you can bump up your DN and TT allocations.

With respect to your tuning... you need to change settings one at a time...

Sent from a remote device. Please excuse any typos...

Mike Segel

On Nov 4, 2011, at 1:46 AM, "N.N. Gesli"  wrote:

> Thank you very much for your replies.
> 
> Michel, disk is 3TB (6x550GB; 50GB from each disk is reserved for local, 
> basically for mapred.local.dir). You are right about the CPU; it is 8 cores but 
> shows as 16. Does that mean it can handle 16 JVMs at a time? The CPU is a little 
> overloaded, but that is not a huge problem at this point.
> 
> I made io.sort.factor 200 and io.sort.mb 2000. Still got the same 
> error/timeout. I played with all related conf settings one by one. In the end, 
> changing mapred.job.shuffle.merge.percent from 1.0 back to 0.66 solved the 
> problem.
> 
> However, the job is still taking a long time. There are 84 reducers, but only 
> one of them takes a very long time. I attached the log file of that reduce 
> task. The majority of the data gets spilled to disk. Even though I set 
> mapred.child.java.opts to 6144, the reduce task log shows
> ShuffleRamManager: MemoryLimit=1503238528, MaxSingleShuffleLimit=375809632
> as if memory were 2GB (70% of 2GB = 1503238528b). Later in the same log file 
> there is also this line:
> INFO ExecReducer: maximum memory = 6414139392
> I am not using memory monitoring. The tasktrackers have this line in the log:
> TaskTracker's totalMemoryAllottedForTasks is -1. TaskMemoryManager is 
> disabled.
> Why is ShuffleRamManager finding that number, as if the max memory were 2GB?
> Why am I still getting that much spill, even with these aggressive memory 
> settings?
> Why is only one reducer taking that long?
> What else can I change to make this job process in memory and finish 
> faster?
> 
> Thank you.
> -N.N.Gesli
> 
> On Fri, Oct 28, 2011 at 2:14 AM, Michel Segel  
> wrote:
> Uhm...
> He has plenty of memory... Depending on what sort of m/r tasks... He could 
> push it.
> Didn't say how much disk...
> 
> I wouldn't start that high... Try 10 mappers and 2. Reducers. Granted it is a 
> bit asymmetric and you can bump up the reducers...
> 
> Watch your jobs in ganglia and see what is happening...
> 
> Harsh, assuming he is using intel, each core is hyper threaded so the box 
> sees this as 2x CPUs.
> 8 cores looks like 16.
> 
> 
> Sent from a remote device. Please excuse any typos...
> 
> Mike Segel
> 
> On Oct 28, 2011, at 3:08 AM, Harsh J  wrote:
> 
> > Hey N.N. Gesli,
> >
> > (Inline)
> >
> > On Fri, Oct 28, 2011 at 12:38 PM, N.N. Gesli  wrote:
> >> Hello,
> >>
> >> We have 12 node Hadoop Cluster that is running Hadoop 0.20.2-cdh3u0. Each
> >> node has 8 core and 144GB RAM (don't ask). So, I want to take advantage of
> >> this huge RAM and run the map-reduce jobs mostly in memory with no spill, 
> >> if
> >> possible. We use Hive for most of the processes. I have set:
> >> mapred.tasktracker.map.tasks.maximum = 16
> >> mapred.tasktracker.reduce.tasks.maximum = 8
> >
> > This is *crazy* for an 8 core machine. Try to keep M+R slots well
> > below 8 instead - You're probably CPU-thrashed in this setup once
> > large number of tasks get booted.
> >
> >> mapred.child.java.opts = 6144
> >
> > You can also raise io.sort.mb to 2000, and tweak io.sort.factor.
> >
> > The child opts raise to 6~ GB looks a bit unnecessary since most of
> > your tasks work on record basis and would not care much about total
> > RAM. Perhaps use all that RAM for a service like HBase which can
> > leverage caching nicely!
> >
> >> One of my Hive queries is producing 6 stage map-reduce jobs. On the third
> >> stage when it queries from a 200GB table, the last 14 reducers hang. I
> >> changed mapred.task.timeout to 0 to see if they really hang. It has been 5
> >> hours, so something terribly wrong in my setup. Parts of the log is below.
> >
> > It is probably just your slot settings. You may be massively
> > over-subscribing your CPU resources with 16 map task slots + 8 reduce
> > tasks slots. At worst case, it would mean 24 total JVMs competing over
> > 8 available physical processors. Doesn't make sense to me at least -
> > Make it more like 7 M / 2 R or so :)
> >
> >> My questions:
> >> * What should be my configurations to make reducers to run in the memory?
> >> * Why it keeps wa