Venkat Seeth wrote:
> Hi there,
>
> Howdy. I've been using Hadoop to parse and index XML
> documents. It's a two-step process, similar to Nutch: I
> parse the XML and write field-value tuples to a file,
> then read this file and index the field-value pairs in
> the next step.
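Roughly, that parse step could look like the sketch below (old
org.apache.hadoop.mapred API; the class name and the parseXml() helper are
hypothetical stand-ins, not the actual code):

  import java.io.IOException;
  import java.util.Map;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class XmlParseMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {

    public void map(Text docId, Text xml,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      // One field=value tuple per document field; the second job reads
      // these pairs back and feeds them to the indexer.
      for (Map.Entry<String, String> f : parseXml(xml.toString()).entrySet()) {
        out.collect(docId, new Text(f.getKey() + "=" + f.getValue()));
      }
      reporter.progress();   // keeps the task from timing out on big documents
    }

    // Placeholder for whatever XML parser is actually used.
    private Map<String, String> parseXml(String xml) {
      return java.util.Collections.emptyMap();
    }
  }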
> Everything works fine, but one reduce out of N always
> fails in the last step, when merging segments. It fails
> with one or more of the following:
> - Task failed to report status for 608 seconds.
>   Killing.
> - java.lang.OutOfMemoryError: GC overhead limit
>   exceeded
Perhaps you are running with too large a heap, strange as it may sound
... If I understand this message correctly, the JVM is complaining that
GC is consuming too many resources.
This may also be related to the ulimit on this account ...
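If the child heap turns out to be the culprit, the knobs live in the job
configuration. A minimal sketch with the old JobConf API (the numbers are
placeholders, not recommendations):

  import org.apache.hadoop.mapred.JobConf;

  public class TuneChildJvm {
    public static JobConf configure() {
      JobConf conf = new JobConf(TuneChildJvm.class);

      // Heap for the child JVMs that run the map/reduce tasks. An oversized
      // heap can push the JVM into the mode where it spends nearly all of
      // its time collecting ("GC overhead limit exceeded").
      conf.set("mapred.child.java.opts", "-Xmx2048m");

      // Give long segment merges more time before the framework kills the
      // task for not reporting status; milliseconds, default 600000.
      conf.set("mapred.task.timeout", "1800000");

      return conf;
    }
  }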
> Configuration:
> I have about 128 maps and 8 reduces, so I get to create
> 8 partitions of my index. It runs on a 4-node cluster of
> dual-proc machines with 64 GB of RAM each.
I think that with this configuration you could increase the number of
reduces, to decrease the amount of data each reduce task has to handle.
In your current config you run at most 2 reduces per machine.
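For instance (same JobConf sketch as above; 16 is just an illustrative
number):

  import org.apache.hadoop.mapred.JobConf;

  public class MoreReduces {
    public static JobConf configure() {
      JobConf conf = new JobConf(MoreReduces.class);

      // 8 reduces on 4 nodes means at most 2 per machine; more reduces give
      // each task a smaller slice of the 1.65M documents, at the cost of
      // producing more (smaller) index partitions.
      conf.setNumReduceTasks(16);

      return conf;
    }
  }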
> Number of documents: 1.65 million, each about 10 KB in
> size.
> I ran with 4 or 8 task trackers per node, with a 4 GB
> heap for the JobTracker, the TaskTrackers, and the child
> JVMs. mergeFactor is set to 50 and maxBufferedDocs to 1000.
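For reference, those two knobs map onto Lucene's IndexWriter roughly as in
the sketch below (Lucene 2.x-era API; the path and analyzer choice are
placeholders). A mergeFactor of 50 defers a lot of merge work to the final,
biggest merges, which is exactly where the reduce is dying:

  import java.io.IOException;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;

  public class WriterSettings {
    public static IndexWriter open(String path) throws IOException {
      // true = create a new index partition at this path
      IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);

      // How many segments get merged at once; higher values postpone merge
      // work and make the final merges much larger.
      writer.setMergeFactor(50);

      // How many documents are buffered in RAM before a segment is flushed.
      writer.setMaxBufferedDocs(1000);

      return writer;
    }
  }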
> I fail to understand what's going on. When I run the
> job individually, it works with the same settings.
> Why would all the tasks work when only one fails?
You can also use IsolationRunner to re-run individual tasks under a
debugger and see where they fail.
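Roughly (following the map/reduce tutorial of that era; the local path below
is a placeholder): keep the failed task's files around, then re-run just that
task on the node where it failed.

  import org.apache.hadoop.mapred.JobConf;

  public class KeepFailedTaskFiles {
    public static void configure(JobConf conf) {
      // Keep the intermediate files and job.xml of failed tasks on the
      // tasktracker's local disk instead of cleaning them up.
      conf.setKeepFailedTaskFiles(true);
    }
  }

  // Then, on the node where the reduce failed:
  //
  //   cd <local.dir>/taskTracker/<taskid>/work
  //   hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
  //
  // and attach a debugger to that JVM with the usual JPDA options if needed.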
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com