Hi Andrzej,

A quick question on your suggestion.
>> Configuration:
>> I have about 128 maps and 8 reduces so I get to create 8 partitions of my index.
>
> I think that with this configuration you could increase the number of
> reduces, to decrease the amount of data each reduce task has to handle.
> In your current config you run at most 2 reduces per machine.

You suggested increasing the number of reduces. I did come up with 8 partitions for my index, each containing about 10 million documents. Are you saying I could create 32 partitions and then later merge them into a smaller number of partitions?

If I have a huge number of partitions, I do not know how that will affect federating search across so many indexes and merging the results from those searches.

Any thoughts are greatly appreciated.

Thanks,
Venkat

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Venkat Seeth wrote:
> > Hi there,
> >
> > Howdy. I've been using Hadoop to parse and index XML documents. It's a
> > two-step process similar to Nutch. I parse the XML and create
> > field-value tuples written to a file.
> >
> > I read this file and index the field-value pairs in the next step.
> >
> > Everything works fine, but one reduce out of N always fails in the last
> > step when merging segments. It fails with one or more of the following:
> > - Task failed to report status for 608 seconds. Killing.
> > - java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> Perhaps you are running with too large a heap, as strange as it may
> sound... If I understand this message correctly, the JVM complains that
> GC is taking too many resources.
>
> This may also be related to the ulimit on this account...
>
> > Configuration:
> > I have about 128 maps and 8 reduces so I get to create 8 partitions of
> > my index. It runs on a 4-node cluster of dual-proc, 64 GB machines.
>
> I think that with this configuration you could increase the number of
> reduces, to decrease the amount of data each reduce task has to handle.
> In your current config you run at most 2 reduces per machine.
>
> > Number of documents: 1.65 million, each about 10 KB in size.
> >
> > I ran with 4 or 8 task trackers per node, with a 4 GB heap for the job
> > tracker, the task trackers and the child JVMs.
> >
> > mergeFactor is set to 50 and maxBufferedDocs to 1000.
> >
> > I fail to understand what's going on. When I run the job individually,
> > it works with the same settings.
> >
> > Why would all jobs work and only one fail?
>
> You can also use IsolationRunner to re-run individual tasks under a
> debugger and see where they fail.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
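To make sure I understand the suggestion, here is a rough sketch of what I have in mind for 32 reduces plus a later offline merge. It assumes the old org.apache.hadoop.mapred JobConf API and a Lucene 2.x-era IndexWriter with addIndexes(Directory[]); the class names and paths below are placeholders, not our actual code:

import org.apache.hadoop.mapred.JobConf;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class PartitionMergeSketch {

  // Step 1: run the indexing job with more reduces (e.g. 32 instead of 8)
  // so each reduce task builds a smaller partial index.
  static JobConf withMoreReduces(JobConf conf) {
    conf.setNumReduceTasks(32);          // was 8
    return conf;
  }

  // Step 2: after the job finishes, merge a group of partial indexes
  // offline into one of the final partitions (e.g. parts 0-3 -> merged-0).
  static void mergePartials(String mergedPath, String[] partPaths) throws Exception {
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory(mergedPath), new StandardAnalyzer(), true);
    writer.setMergeFactor(10);           // keep fewer segments open during the merge
    Directory[] dirs = new Directory[partPaths.length];
    for (int i = 0; i < partPaths.length; i++) {
      dirs[i] = FSDirectory.getDirectory(partPaths[i]);
    }
    writer.addIndexes(dirs);             // merges (and, in older Lucene, optimizes) the inputs
    writer.close();
  }
}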
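On the query side, this is roughly how I picture federating search across the partitions today - again only a sketch, assuming the Lucene 2.x-era MultiSearcher/Hits classes, with placeholder partition paths:

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class FederatedSearchSketch {

  // Open one IndexSearcher per partition and wrap them all in a
  // MultiSearcher, which runs the query against every partition and
  // merges the hit lists into a single ranked result.
  static Hits searchAllPartitions(String[] partitionPaths, Query query) throws Exception {
    Searchable[] searchables = new Searchable[partitionPaths.length];
    for (int i = 0; i < partitionPaths.length; i++) {
      searchables[i] = new IndexSearcher(partitionPaths[i]);
    }
    MultiSearcher searcher = new MultiSearcher(searchables);
    return searcher.search(query);
  }
}

With 32 partitions instead of 8 this simply means four times as many searchers to open and merge per query, which is the overhead I am unsure about. (ParallelMultiSearcher could presumably be swapped in to query the partitions concurrently.)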
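Finally, on the "GC overhead limit exceeded" error: if I read your heap comment correctly, these are the knobs I would try first. Sketch only - the values are illustrative, not measured recommendations - using mapred.child.java.opts and the IndexWriter setters from that Hadoop/Lucene generation:

import org.apache.hadoop.mapred.JobConf;
import org.apache.lucene.index.IndexWriter;

public class MemoryTuningSketch {

  // A smaller, explicit heap for the child JVMs is one way to test the
  // "too large heap" theory; 2000m here is just an example value.
  static void tuneChildJvm(JobConf conf) {
    conf.set("mapred.child.java.opts", "-Xmx2000m");
  }

  // Keep the in-memory buffer and merge factor moderate so the final
  // segment merge in the failing reduce holds less in RAM at once.
  static void tuneWriter(IndexWriter writer) {
    writer.setMaxBufferedDocs(1000);
    writer.setMergeFactor(10);
  }
}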