Hi,

I'm trying to crawl approximately 500,000 URLs. After inject and generate, I
started the fetchers with 6 map tasks and 3 reduce tasks. All of the map tasks
completed successfully, but every reduce task failed with an OutOfMemory
exception, thrown after the append phase, during the sort
phase. As far as I can tell, during a fetch each map task writes its output
to a temporary sequence file. During the reduce phase, each reducer copies all
of the map outputs to its local disk and appends them into a single sequence
file. The reducer then sorts that file and writes the sorted result to its
local disk, and finally a record writer is opened to write the sorted data
into the segment, which lives in DFS. If this scenario is correct, then all of
the reduce tasks are doing the same job: each one tries to sort the whole set
of map outputs, and the winner of that race is the one that gets to write to
DFS, so only one reducer ever writes to DFS. If that is the case, an
OutOfMemory exception is hardly surprising for 500,000+ URLs, since every
reducer is trying to sort a file larger than 1 GB.
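
To make the step I mean concrete, here is roughly how I picture the sort the
reducer performs. This is just a sketch with placeholder key/value classes and
paths; I have not checked it against the actual Fetcher code:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LocalSortSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // At this point the reducer is working against its local disk.
        FileSystem local = FileSystem.getLocal(conf);

        // args[0]: the single sequence file built by appending all copied
        // map outputs; args[1]: where the sorted file is written locally.
        Path appended = new Path(args[0]);
        Path sorted = new Path(args[1]);

        // SequenceFile.Sorter is what I assume does the sort; its in-memory
        // buffer is controlled by io.sort.mb. Text.class is a placeholder,
        // not the real fetcher key/value types.
        SequenceFile.Sorter sorter =
            new SequenceFile.Sorter(local, Text.class, Text.class, conf);
        sorter.sort(new Path[] { appended }, sorted, false);
    }
}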
Any comments
on this scenario are welcome. And how can I avoid these exceptions?
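
One thing I am planning to try is sketched below: give the reduce children a
bigger heap, make the sort spill to disk sooner, and spread the segment over
more reducers. The property names (mapred.child.java.opts, io.sort.mb,
io.sort.factor) are the stock Hadoop ones and I have not verified them against
my version, so treat this as an assumption rather than a tested fix:

import org.apache.hadoop.mapred.JobConf;

public class FetchJobTuning {
    // Assumed property names; they may differ between Hadoop versions.
    public static JobConf tune(JobConf job) {
        // Give each child JVM more heap than the default.
        job.set("mapred.child.java.opts", "-Xmx512m");
        // Smaller in-memory sort buffer, so the sort spills to disk earlier.
        job.setInt("io.sort.mb", 50);
        // Merge more spill files per pass when sorting on disk.
        job.setInt("io.sort.factor", 25);
        // More reducers means each one sorts a smaller share of the output.
        job.setNumReduceTasks(6);
        return job;
    }
}

I suppose the same properties could also go into hadoop-site.xml instead of
being set in code. Thanks,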

--
Hamza KAYA
