Reducer Out of Memory
Hi all, I am running a data-intensive job on 18 nodes on EC2, each with just 1.7GB of memory. The input size is 50GB, and as a result the input is automatically split into 786 map tasks. These run fine. However, I am setting the number of reduce tasks to 18, and this is where I get a Java heap out-of-memory error:

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:216)
    at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
    at java.nio.CharBuffer.toString(CharBuffer.java:1157)
    at org.apache.hadoop.io.Text.decode(Text.java:350)
    at org.apache.hadoop.io.Text.decode(Text.java:327)
    at org.apache.hadoop.io.Text.toString(Text.java:254)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
Re: Reducer Out of Memory
Maybe you need to allocate a larger JVM heap by using the parameter -Xmx1024m.

On Thu, Feb 12, 2009 at 10:56 AM, Kris Jirapinyo <kjirapi...@biz360.com> wrote:
> Hi all, I am running a data-intensive job on 18 nodes on EC2, each with just
> 1.7GB of memory. The input size is 50GB ...
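In case it helps, here is a minimal sketch of how that flag usually gets handed to the child task JVMs from the job driver with the old mapred API shown in your stack trace (the driver class name and paths here are made up, not from your job); mapred.child.java.opts is the property that carries the child JVM options:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class HeapTuningDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(HeapTuningDriver.class);
            conf.setJobName("heap-tuning-sketch");
            // Every spawned map/reduce child JVM is launched with these options.
            conf.set("mapred.child.java.opts", "-Xmx1024m");
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf); // identity map/reduce by default
        }
    }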
Re: Reducer Out of Memory
Darn that send button. Anyway, I was wondering if my understanding is correct: there will be exactly as many output files as the number of reduce tasks I set, so in my output directory from the reducer I should always see only 18 files.

If that is right, then when I call output.collect() in my reducer, does the output only get flushed at the end, when that particular reduce task finishes? If so, it seems that as my input grows, 18 reducers will not be able to handle the sheer volume of my data, since the collector will keep having to take on more and more. So I guess this is the question: do I have to keep increasing the number of reduce tasks so that each reducer takes a smaller bite out of the chunk? That is, if I'm running out of Java heap space and I don't want to add more nodes, do I need to set my reduce task count to, say, 36, and so on? (See the sketch at the bottom of this mail for what I mean.)

It just seems like I'm missing something. Of course, I could always add more nodes or upgrade to a larger instance so I get more memory, but that's the obvious solution (I just hope it's not the only one). What I'm saying is that I thought the reducer would be smart enough to know it's taking too big a bite out of the whole chunk (like the mapper does) and readjust itself, since I don't really care how many output files I get in the end, just that the results from the reducer stay under one directory.

On Wed, Feb 11, 2009 at 6:56 PM, Kris Jirapinyo <kjirapi...@biz360.com> wrote:
> Hi all, I am running a data-intensive job on 18 nodes on EC2, each with just
> 1.7GB of memory. The input size is 50GB ...
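To be concrete, this is the knob I mean; a minimal sketch with the old mapred API (the driver class name is made up):

    import org.apache.hadoop.mapred.JobConf;

    public class ReduceCountSketch {
        public static JobConf configure() {
            JobConf conf = new JobConf(ReduceCountSketch.class);
            // One output file per reduce task: part-00000 .. part-00035,
            // all under the single job output directory.
            conf.setNumReduceTasks(36);
            return conf;
        }
    }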
Re: Reducer Out of Memory
I tried that, but with 1.7GB that will not allow me to run 1 mapper and 1 reducer concurrently (I think that when you use -Xmx1024m it tries to reserve that much physical memory?). So, to be safe, I set it to -Xmx768m. The error I get when I use 1024m is this:

java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
    at org.apache.hadoop.util.Shell.run(Shell.java:134)
    at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
    at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:160)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.createKVIterator(ReduceTask.java:2079)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$400(ReduceTask.java:457)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
    at java.lang.ProcessImpl.start(ProcessImpl.java:65)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
    ... 10 more

On Wed, Feb 11, 2009 at 7:02 PM, Rocks Lei Wang <beyiw...@gmail.com> wrote:
> Maybe you need to allocate a larger JVM heap by using the parameter
> -Xmx1024m. ...
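Doing the rough arithmetic on why 1024m falls over (the daemon heap figures below are my guesses, not measured):

    2 child JVMs x 1024 MB (-Xmx)           = 2048 MB
    TaskTracker + DataNode daemon heaps     = a few hundred MB more
    physical RAM on the node                = ~1700 MB

On top of the over-commit, the "Cannot run program bash" frame comes from the df check in DF.getAvailable: to launch bash, fork() first has to duplicate the reduce task's whole JVM address space, so with memory already oversubscribed and little or no swap, the fork itself fails with error=12 (ENOMEM) even though the JVM is still running.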