Hi Andrzej,

A quick question on your suggestion.

>> Configuration:
>> I have about 128 maps and 8 reduces so I get to
>> create 8 partitions of my index.

> I think that with this configuration you could
> increase the number of reduces, to decrease the amount
> of data each reduce task has to handle. In your current
> config you run at most 2 reduces per machine.

You suggested increasing the number of reduces. I had
settled on 8 partitions for my index, each containing
about 10 million documents.
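(For reference, my job setup boils down to roughly the sketch below,
using the org.apache.hadoop.mapred API; the class name and paths are
made up for illustration, and the path-setting calls may differ by
Hadoop version. Each reduce writes one index partition, so the
partition count simply follows setNumReduceTasks.)

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class IndexJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(IndexJob.class);
        conf.setJobName("index-field-value-tuples");

        // Hypothetical input (parsed tuples) and output (index partitions).
        FileInputFormat.setInputPaths(conf, new Path("tuples"));
        FileOutputFormat.setOutputPath(conf, new Path("indexes"));

        // One index partition per reduce task: 8 reduces -> 8 partitions.
        conf.setNumReduceTasks(8);

        JobClient.runJob(conf);
      }
    }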

Are you saying I could create, say, 32 partitions now
and then later merge them into a smaller number of
partitions?
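
(To make the second half of that concrete: by "merge" I mean something
like the plain-Lucene sketch below, run after the job completes, with
hypothetical local directory names; I have not tried this yet.)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergePartitions {
      public static void main(String[] args) throws Exception {
        // Hypothetical layout: 32 reduce outputs, each a Lucene index.
        Directory[] parts = new Directory[32];
        for (int i = 0; i < parts.length; i++) {
          parts[i] = FSDirectory.getDirectory(String.format("indexes/part-%05d", i), false);
        }

        // Merge the 32 small partitions into one larger index.
        IndexWriter writer = new IndexWriter("merged-index", new StandardAnalyzer(), true);
        writer.addIndexes(parts);
        writer.optimize();
        writer.close();
      }
    }

(If the target were, say, 8 larger partitions rather than one, the same
call would just be run once per group of 4 inputs.)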

If I end up with a huge number of partitions, I'm not
sure how that will affect federating search across so
many indexes and merging the results from those
searches.
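
(Concretely, what I have in mind on the search side is roughly the
Lucene MultiSearcher sketch below, with hypothetical partition paths
and field name; it's unclear to me how well this holds up as the
number of partitions grows.)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    public class FederatedSearch {
      public static void main(String[] args) throws Exception {
        // Hypothetical partition directories, one searcher per index partition.
        String[] partitions = { "indexes/part-00000", "indexes/part-00001" /* ... */ };
        Searchable[] searchables = new Searchable[partitions.length];
        for (int i = 0; i < partitions.length; i++) {
          searchables[i] = new IndexSearcher(partitions[i]);
        }

        // MultiSearcher runs the query against every partition and merges
        // the hits into one ranked list.
        MultiSearcher searcher = new MultiSearcher(searchables);
        Query query = new QueryParser("contents", new StandardAnalyzer()).parse("hadoop");
        Hits hits = searcher.search(query);
        System.out.println("Total hits: " + hits.length());
        searcher.close();
      }
    }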

Any thoughts are greatly appreciated.

Thanks,
Venkat

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Venkat Seeth wrote:
> > Hi there,
> >
> > Howdy. I've been using hadoop to parse and index XML
> > documents. It's a 2-step process similar to Nutch. I
> > parse the XML and create field-value tuples written to
> > a file.
> >
> > I read this file and index the field-value pairs in
> > the next step.
> >
> > Everything works fine, but one reduce out of N always
> > fails in the last step when merging segments. It fails
> > with one or more of the following:
> > - Task failed to report status for 608 seconds. Killing.
> > - java.lang.OutOfMemoryError: GC overhead limit exceeded
> 
> Perhaps you are running with too large a heap, as
> strange as it may sound ... If I understand this message
> correctly, the JVM is complaining that GC is consuming
> too many resources.
> 
> This may also be related to the ulimit on this account
> ...
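
(For reference, the 4 GB child heap mentioned below is set along the
lines of this minimal sketch; the property is the stock
mapred.child.java.opts, and the exact -Xmx value is only illustrative.
Shrinking it, or the number of tasks per node, would be the first
thing I try against the GC overhead error.)

    import org.apache.hadoop.mapred.JobConf;

    public class ChildHeap {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // JVM options passed to each child (map/reduce) task.
        // Currently roughly -Xmx4096m; lowering this, or running fewer
        // reduces per node, reduces pressure on the garbage collector.
        conf.set("mapred.child.java.opts", "-Xmx4096m");
        System.out.println(conf.get("mapred.child.java.opts"));
      }
    }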
> 
> 
> > Configuration:
> > I have about 128 maps and 8 reduces so I get to create
> > 8 partitions of my index. It runs on a 4-node cluster
> > with 4 dual-proc, 64 GB machines.
> 
> I think that with this configuration you could
> increase the number of reduces, to decrease the amount
> of data each reduce task has to handle. In your current
> config you run at most 2 reduces per machine.
> 
> > Number of documents: 1.65 million, each about 10K in
> > size.
> >
> > I ran with 4 or 8 task trackers per node, with a 4 GB
> > heap for the job tracker, task trackers and the child
> > JVMs.
> >
> > mergeFactor is set to 50 and maxBufferedDocs to 1000.
> >
> > I fail to understand what's going on. When I run the
> > job individually, it works with the same settings.
> >
> > Why would all jobs work when only one fails?
> 
> You can also use IsolationRunner to re-run individual
> tasks under a debugger and see where they fail.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 