> I looked into the implementation and noticed that all the intermediate
> key-value pairs are collected in memory for the entire duration of any
> single MapRunner instance. As I understand from reading the code, the
> MapRunner keeps calling the user-defined map() method for all the
> key-value pairs assigned to it by the MapTask. The MapTask checks
> whether it should be dumping the intermediate key-value pairs to the
> disk only after the MapRunner.run() method has returned.
This is not true. While a sort/spill-to-disk is done at the end of the map task, intermediate sorts/spills to disk are also done based on the amount of memory consumed so far by the in-memory buffer. The memory is capped at the io.sort.mb config value.

> Now, I was facing problems because, due to the nature of this
> application, I ended up emitting too many intermediate key-value pairs
> for some set of the input data getting allocated to a single MapRunner
> instance. This was leading to the JVM going OutOfMemory.

While I agree that the memory model can be refined further (and I will submit a patch for HADOOP-875), you should not see these exceptions under normal circumstances. I would recommend increasing the heap size that a child JVM uses by tweaking the value of mapred.child.java.opts (e.g., you can try setting it to -Xmx512m or higher).

> If my understanding of the implementation is correct, then I am
> wondering if there is any particular reason to take this approach. A
> better approach (and I may be wrong here) would be to let MapRunner keep
> track of the memory it has been utilizing, and if the allocations run
> too high then it should:
>
> 1) Either dump the intermediate key-value pairs to disk itself. OR
> 2) A better option would be to call a (new) API provided by the MapTask
> that would dump the key-value pairs to the disk and then pass control
> back to the MapRunner. MapRunner will simply resume the task and
> ultimately return in the normal way.

This already exists: the MapTask keeps track of memory usage.

> -----Original Message-----
> From: Gaurav Agarwal [mailto:[EMAIL PROTECTED]
> Sent: Thursday, March 15, 2007 4:07 AM
> To: [email protected]
> Subject: Why does MapRunner collect all intermediate key-value in memory?
>
> Hi all,
>
> I have started using Hadoop for a few of my Natural Language Processing
> applications.
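For reference, both knobs can be set in hadoop-site.xml (or per job via JobConf). The values below are only illustrative; tune them for your jobs and hardware:

```xml
<!-- Illustrative values only, not recommendations. -->
<property>
  <name>mapred.child.java.opts</name>
  <!-- JVM options for child (task) processes; raises the max heap. -->
  <value>-Xmx512m</value>
</property>
<property>
  <name>io.sort.mb</name>
  <!-- Cap, in MB, on the in-memory buffer used while sorting map output;
       a spill to disk is triggered once this much memory is consumed. -->
  <value>100</value>
</property>
```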
> I was facing a problem due to my programs throwing an OutOfMemory
> exception during the Map phase.
>
> I looked into the implementation and noticed that all the intermediate
> key-value pairs are collected in memory for the entire duration of any
> single MapRunner instance. As I understand from reading the code, the
> MapRunner keeps calling the user-defined map() method for all the
> key-value pairs assigned to it by the MapTask. The MapTask checks
> whether it should be dumping the intermediate key-value pairs to the
> disk only after the MapRunner.run() method has returned.
>
> Now, I was facing problems because, due to the nature of this
> application, I ended up emitting too many intermediate key-value pairs
> for some set of the input data getting allocated to a single MapRunner
> instance. This was leading to the JVM going OutOfMemory.
>
> If my understanding of the implementation is correct, then I am
> wondering if there is any particular reason to take this approach. A
> better approach (and I may be wrong here) would be to let MapRunner keep
> track of the memory it has been utilizing, and if the allocations run
> too high then it should:
>
> 1) Either dump the intermediate key-value pairs to disk itself. OR
> 2) A better option would be to call a (new) API provided by the MapTask
> that would dump the key-value pairs to the disk and then pass control
> back to the MapRunner. MapRunner will simply resume the task and
> ultimately return in the normal way.
>
> I am suggesting this approach as there are other applications too which
> may benefit from not being restricted by this limitation.
>
> Please let me know your opinions on this. If this is not incorporated
> into the main Hadoop release, then I intend to add it as a patch for my
> applications. Do you see any obvious loopholes which I might have
> overlooked?
>
> Thanks in advance for the help!
>
> Regards
> Gaurav
> --
> View this message in context:
> http://www.nabble.com/Why-does-MapRunner-collect-all-intermediate-key-value-in-memory--tf3405027.html#a9484185
> Sent from the Hadoop Dev mailing list archive at Nabble.com.
