Hi Andrzej,

Thanks for your quick response.
Please find my comments below.

> Perhaps you are running with too large heap, as strange as it may
> sound ... If I understand this message correctly, the JVM complains
> that GC is taking too many resources.

I started with the defaults, a 200m heap and maxBufferedDocs at 100,
but I got a "too many open files" error. Then I increased
maxBufferedDocs to 2000 and got an OOM. So I went through a series of
changes and arrived at the conclusion that, irrespective of the
configuration, one reduce fails.

> This may be also related to ulimit on this account ...

I checked, and it has a limit of 1024. The number of segments generated
was around 500 for 1 million docs in each part.

> I think that with this configuration you could increase the number of
> reduces, to decrease the amount of data each reduce task has to
> handle.

Ideally I want a partition of 10-15 million docs per reduce, since I
want to index 100 million. I can try with 10 or 12 reduces, but even
with 8, one fails, and in isolation that same reduce works fine with
the same settings.

> In your current config you run at most 2 reduces per machine.

True. Why do you say so? I've set 4 tasks/node, but I was at 8 too and
faced the same issue.

> You can also use IsolationRunner to re-run individual tasks under a
> debugger and see where they fail.

I tried with mapred.job.tracker = local and things fly without errors.
I also tried the same with a single slave and that works too. Locally
on Windows using Cygwin, it works as well.

Any thoughts are greatly appreciated. I'm doing a proof-of-concept and
this is really a big hurdle. I've appended a couple of rough sketches
of my indexer and job settings at the very bottom of this mail, below
the quoted message, in case they help.

Thanks,
Venkat

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Venkat Seeth wrote:
> > Hi there,
> >
> > Howdy. I've been using hadoop to parse and index XML documents.
> > It's a 2-step process similar to Nutch. I parse the XML and create
> > field-value tuples written to a file.
> >
> > I read this file and index the field-value pairs in the next step.
> >
> > Everything works fine, but always one reduce out of N fails in the
> > last step when merging segments. It fails with one or more of the
> > following:
> > - Task failed to report status for 608 seconds. Killing.
> > - java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> Perhaps you are running with too large heap, as strange as it may
> sound ... If I understand this message correctly, the JVM complains
> that GC is taking too many resources.
>
> This may be also related to ulimit on this account ...
>
> > Configuration:
> > I have about 128 maps and 8 reduces, so I get to create 8
> > partitions of my index. It runs on a 4-node cluster with 4
> > dual-proc, 64 GB machines.
>
> I think that with this configuration you could increase the number of
> reduces, to decrease the amount of data each reduce task has to
> handle. In your current config you run at most 2 reduces per machine.
>
> > Number of documents: 1.65 million, each about 10K in size.
> >
> > I ran with 4 or 8 task trackers per node with a 4 GB heap for the
> > job tracker, task trackers and the child JVMs.
> >
> > mergeFactor set to 50 and maxBufferedDocs at 1000.
> >
> > I fail to understand what's going on. When I run the job
> > individually, it works with the same settings.
> >
> > Why would all the jobs work while only one fails?
>
> You can also use IsolationRunner to re-run individual tasks under a
> debugger and see where they fail.
>
> --
> Best regards,
> Andrzej Bialecki <><
> http://www.sigram.com  Contact: info at sigram dot com
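
P.S. In case it helps to see this concretely, here is a simplified
sketch of how the reduce-side indexer sets up the Lucene IndexWriter.
The class and field names below (XmlDocIndexer, "body", the index path)
are just placeholders for this mail, not the real code, but the
mergeFactor and maxBufferedDocs settings are the ones I described above.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Placeholder class name; the real reducer builds one index partition
// from the field-value tuples produced by the parse step.
public class XmlDocIndexer {

    public static void main(String[] args) throws Exception {
        // One index directory per reduce partition (placeholder path).
        IndexWriter writer = new IndexWriter("/tmp/index-part-0",
                                             new StandardAnalyzer(),
                                             true);

        // The two knobs I have been tuning:
        // - mergeFactor controls how many segments pile up before a
        //   merge, so higher values mean more open files at merge time.
        // - maxBufferedDocs controls how many docs are buffered in RAM
        //   before a segment is flushed, so higher values mean more heap.
        writer.setMergeFactor(50);
        writer.setMaxBufferedDocs(1000);

        // In the real job this loop reads the field-value tuples from
        // the map output; a single dummy document stands in for it here.
        Document doc = new Document();
        doc.add(new Field("body", "sample text",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);

        // The failures always happen around here, when the segments of
        // the partition get merged.
        writer.optimize();
        writer.close();
    }
}

With maxBufferedDocs at 100 each partition ends up with hundreds of
small segments (hence, I suspect, the "too many open files" against the
1024 ulimit), and at 2000 the buffered documents plus the merge seem to
exhaust the 4 GB child heap, which is why I suspect the final merge
step itself.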
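
And this is roughly how the job itself is configured. Again this is
only a sketch: the driver class name is a placeholder and the identity
mapper/reducer stand in for the real parse and index classes, but the
reduce count, child heap and the local-runner switch I used for
debugging are the ones mentioned above.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Placeholder driver; only the settings discussed above matter here.
public class IndexJobDriver {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(IndexJobDriver.class);
        conf.setJobName("xml-index");

        conf.setInputPath(new Path(args[0]));
        conf.setOutputPath(new Path(args[1]));
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        // 8 reduces -> 8 index partitions. Raising this, as you
        // suggest, would give each reduce fewer documents to merge.
        conf.setNumReduceTasks(8);

        // Heap for the child JVMs (task trackers and children run with
        // 4 GB, as mentioned above).
        conf.set("mapred.child.java.opts", "-Xmx4096m");

        // For debugging I switched to the local runner, and the very
        // same job then completes without errors:
        // conf.set("mapred.job.tracker", "local");

        JobClient.runJob(conf);
    }
}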