Map-Reduce in memory
Hello,

We have a 12-node Hadoop cluster running Hadoop 0.20.2-cdh3u0. Each node has 8 cores and 144 GB RAM (don't ask). So, I want to take advantage of this huge RAM and run the map-reduce jobs mostly in memory, with no spill if possible. We use Hive for most of our processing. I have set:

mapred.tasktracker.map.tasks.maximum = 16
mapred.tasktracker.reduce.tasks.maximum = 8
mapred.child.java.opts = 6144
mapred.reduce.parallel.copies = 20
mapred.job.shuffle.merge.percent = 1.0
mapred.job.reduce.input.buffer.percent = 0.25
mapred.inmem.merge.threshold = 0

One of my Hive queries produces a 6-stage map-reduce job. On the third stage, when it queries a 200 GB table, the last 14 reducers hang. I changed mapred.task.timeout to 0 to see if they really hang. It has been 5 hours, so something is terribly wrong in my setup. Parts of the log are below.

My questions:
* What should my configuration be to make the reducers run in memory?
* Why does it keep waiting for map outputs?
* What does "dup hosts" mean?

Thank you!
N.Gesli

2011-10-27 16:35:24,304 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2011-10-27 16:35:24,611 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=SHUFFLE, sessionId=
2011-10-27 16:35:24,722 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=1503238528, MaxSingleShuffleLimit=375809632
2011-10-27 16:35:24,733 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Thread started: Thread for merging on-disk files
2011-10-27 16:35:24,733 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Thread waiting: Thread for merging on-disk files
2011-10-27 16:35:24,734 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Thread started: Thread for merging in memory files
2011-10-27 16:35:24,735 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Thread started: Thread for polling Map Completion Events
2011-10-27 16:35:24,735 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Need another 1308 map output(s) where 0 is already in progress
2011-10-27 16:35:24,736 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2011-10-27 16:35:29,738 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Scheduled 12 outputs (0 slow hosts and0 dup hosts)
2011-10-27 16:35:30,364 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Scheduled 5 outputs (0 slow hosts and753 dup hosts)
2011-10-27 16:35:30,367 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Scheduled 1 outputs (0 slow hosts and1182 dup hosts)
2011-10-27 16:35:30,368 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Scheduled 1 outputs (0 slow hosts and1184 dup hosts)
2011-10-27 16:35:30,371 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Scheduled 2 outputs (0 slow hosts and1073 dup hosts)
...
2011-10-27 16:36:04,284 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Scheduled 1 outputs (0 slow hosts and958 dup hosts)
2011-10-27 16:36:04,310 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Scheduled 1 outputs (0 slow hosts and958 dup hosts)
2011-10-27 16:36:07,721 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Scheduled 1 outputs (0 slow hosts and950 dup hosts)
2011-10-27 16:36:16,455 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Scheduled 1 outputs (0 slow hosts and948 dup hosts)
2011-10-27 16:36:16,464 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Scheduled 1 outputs (0 slow hosts and948 dup hosts)
2011-10-27 16:36:24,736 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Need another 1061 map output(s) where 12 is already in progress
2011-10-27 16:36:24,736 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Scheduled 0 outputs (0 slow hosts and1049 dup hosts)
2011-10-27 16:37:24,737 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Need another 1061 map output(s) where 12 is already in progress
2011-10-27 16:37:24,737 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Scheduled 0 outputs (0 slow hosts and1049 dup hosts)
2011-10-27 16:38:24,738 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Need another 1061 map output(s) where 12 is already in progress
2011-10-27 16:38:24,738 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201110201507_0061_r_07_1 Scheduled 0 outputs (0 slow hosts and1049 dup hosts)
...
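For reference, the settings listed in the message above would live in mapred-site.xml on each node. A sketch of how they would look (property names as given in the message; note that mapred.child.java.opts takes a JVM option string, so a bare "6144" is presumably shorthand for -Xmx6144m):

```xml
<!-- Sketch of the settings quoted above as mapred-site.xml entries. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>16</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <!-- must be a JVM option string, not a bare megabyte count -->
  <value>-Xmx6144m</value>
</property>
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>20</value>
</property>
<property>
  <name>mapred.job.shuffle.merge.percent</name>
  <value>1.0</value>
</property>
<property>
  <name>mapred.job.reduce.input.buffer.percent</name>
  <value>0.25</value>
</property>
<property>
  <name>mapred.inmem.merge.threshold</name>
  <value>0</value>
</property>
```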
Re: Map-Reduce in memory
Hey N.N. Gesli,

(Inline)

On Fri, Oct 28, 2011 at 12:38 PM, N.N. Gesli wrote:
> Hello,
>
> We have a 12-node Hadoop cluster running Hadoop 0.20.2-cdh3u0. Each node
> has 8 cores and 144 GB RAM (don't ask). So, I want to take advantage of
> this huge RAM and run the map-reduce jobs mostly in memory, with no spill
> if possible. We use Hive for most of our processing. I have set:
> mapred.tasktracker.map.tasks.maximum = 16
> mapred.tasktracker.reduce.tasks.maximum = 8

This is *crazy* for an 8-core machine. Try to keep M+R slots well below 8 instead - you're probably CPU-thrashed in this setup once a large number of tasks get booted.

> mapred.child.java.opts = 6144

You can also raise io.sort.mb to 2000 and tweak io.sort.factor.

Raising the child opts to ~6 GB looks a bit unnecessary, since most of your tasks work on a per-record basis and would not care much about total RAM. Perhaps use all that RAM for a service like HBase, which can leverage caching nicely!

> One of my Hive queries produces a 6-stage map-reduce job. On the third
> stage, when it queries a 200 GB table, the last 14 reducers hang. I
> changed mapred.task.timeout to 0 to see if they really hang. It has been
> 5 hours, so something is terribly wrong in my setup. Parts of the log are
> below.

It is probably just your slot settings. You may be massively over-subscribing your CPU resources with 16 map task slots + 8 reduce task slots. In the worst case, that means 24 total JVMs competing over 8 available physical processors. Doesn't make sense to me at least - make it more like 7 M / 2 R or so :)

> My questions:
> * What should my configuration be to make the reducers run in memory?
> * Why does it keep waiting for map outputs?

It has to fetch map outputs to get some data to start with. And it pulls the map outputs a few at a time, so as not to overload the network during the shuffle phases of several reducers across the cluster.

> * What does "dup hosts" mean?

Duplicate hosts - hosts it already knows about and has already scheduled fetch work upon.

--
Harsh J
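Harsh's over-subscription point can be made concrete with quick arithmetic, using the numbers from this thread (a sketch; it assumes the worst case of every slot being occupied at once, with each child JVM sized to the full 6144 MB heap):

```python
# Back-of-the-envelope check of the slot settings discussed in this thread.
# Assumption: all slots fill simultaneously and each child JVM uses its
# full configured heap (mapred.child.java.opts).
map_slots = 16       # mapred.tasktracker.map.tasks.maximum
reduce_slots = 8     # mapred.tasktracker.reduce.tasks.maximum
child_heap_mb = 6144
cores = 8
ram_gb = 144

total_jvms = map_slots + reduce_slots                # child JVMs per node
jvms_per_core = total_jvms / cores                   # JVMs contending per core
total_heap_gb = total_jvms * child_heap_mb / 1024    # heap demand per node

print(total_jvms, jvms_per_core, total_heap_gb)
# → 24 3.0 144.0
# 24 JVMs over 8 physical cores, and 24 x 6 GB = 144 GB of heap -- the
# node's entire RAM, with nothing left over for the DataNode, the
# TaskTracker, or the OS page cache.
```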
Re: Map-Reduce in memory
Uhm... he has plenty of memory... Depending on what sort of M/R tasks, he could push it. He didn't say how much disk...

I wouldn't start that high... try 10 mappers and 2 reducers. Granted, it is a bit asymmetric, and you can bump up the reducers...

Watch your jobs in Ganglia and see what is happening...

Harsh, assuming he is using Intel, each core is hyper-threaded, so the box sees them as 2x CPUs. 8 cores look like 16.

Sent from a remote device. Please excuse any typos...

Mike Segel
Re: Map-Reduce in memory
Hi,

First, you have 8 physical cores. Hyper-threading makes the machine think it has 16. The trouble is that you really don't have 16 cores, so you need to be a little more conservative.

You don't mention HBase, so I'm going to assume that you don't have it installed. So in terms of tasks, allocate a core each to the DN and TT, leaving 6 cores, or 12 hyper-threaded cores. This leaves a little headroom for the other Linux processes... Now you can split the remaining cores however you want. You can even overlap a bit, since you are not going to be running all of your reducers at the same time. So let's say 10 mappers and 4 reducers to start. Since you have all that memory, you can bump up your DN and TT allocations.

With respect to your tuning... you need to change the settings one at a time...

Sent from a remote device. Please excuse any typos...

Mike Segel

On Nov 4, 2011, at 1:46 AM, "N.N. Gesli" wrote:
> Thank you very much for your replies.
>
> Michel, disk is 3 TB (6x550 GB; 50 GB from each disk is reserved for local,
> basically for mapred.local.dir). You are right on the CPU; it is 8 cores
> but shows as 16. Does that mean it can handle 16 JVMs at a time? CPU is a
> little overloaded, but that is not a huge problem at this point.
>
> I made io.sort.factor 200 and io.sort.mb 2000. Still got the same
> error/timeout. I played with all related conf settings one by one. At
> last, changing mapred.job.shuffle.merge.percent from 1.0 back to 0.66
> solved the problem.
>
> However, the job is still taking a long time. There are 84 reducers, but
> only one of them takes a very long time. I attached the log file of that
> reduce task. The majority of the data gets spilled to disk. Even though I
> set mapred.child.java.opts to 6144, the reduce task log shows
> ShuffleRamManager: MemoryLimit=1503238528, MaxSingleShuffleLimit=375809632
> as if memory were 2 GB (70% of 2 GB = 1503238528 bytes). Later in the same
> log file there is also this line:
> INFO ExecReducer: maximum memory = 6414139392
> I am not using memory monitoring. Tasktrackers have this line in the log:
> TaskTracker's totalMemoryAllottedForTasks is -1. TaskMemoryManager is
> disabled.
> Why is ShuffleRamManager finding that number, as if the max memory were
> 2 GB?
> Why am I still getting that much spill, even with these aggressive memory
> settings?
> Why is only one reducer taking that long?
> What else can I change to make this job process in memory and finish
> faster?
>
> Thank you.
> -N.N.Gesli
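On the ShuffleRamManager question quoted above: as far as I can tell, the 0.20.x shuffle sizes its in-memory buffer by first capping the JVM's max memory at Integer.MAX_VALUE (the buffer is int-indexed in Java, so it cannot exceed ~2 GB) and then multiplying by mapred.job.shuffle.input.buffer.percent (default 0.70, held as a Java float). That cap, not the 6 GB heap, would produce the logged numbers. A sketch re-deriving them (this is a hypothetical reconstruction of the arithmetic, not the actual Java code; the float32 rounding is the assumption being checked):

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE-754 float32 (Java float)."""
    return struct.unpack('f', struct.pack('f', x))[0]

INT_MAX = 2**31 - 1
heap_bytes = 6414139392  # "maximum memory" from the ExecReducer log line

# Hypothesized 0.20.x behavior: cap max memory at Integer.MAX_VALUE
# (int-indexed buffer), then scale by the shuffle buffer percent in
# Java float arithmetic. 0.70f and the int->float conversion both round.
capped = min(heap_bytes, INT_MAX)
percent = f32(0.70)                               # 0.699999988...
memory_limit = int(f32(f32(capped) * percent))    # float math, as in Java
single_shuffle_limit = int(f32(memory_limit * f32(0.25)))

print(memory_limit, single_shuffle_limit)
# → 1503238528 375809632 -- exactly the logged MemoryLimit and
# MaxSingleShuffleLimit, regardless of how large the heap actually is.
```

If this reconstruction is right, raising mapred.child.java.opts beyond ~3 GB cannot grow the shuffle buffer past this ~1.4 GB limit, which would explain the spills despite the aggressive memory settings.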