Hi,

So you have 3 boxes, since you run 3 reduce tasks?
What happens is that 3 splits of your data are sorted. In the end you will get as many output files as you have reduce tasks.
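For example (just an illustration, assuming Hadoop's usual part-NNNNN naming for reduce outputs), a job run with 3 reduce tasks leaves one file per reducer in its output directory:

  part-00000
  part-00001
  part-00002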
The sorting itself happens in memory.
Check hadoop-default.xml (it may be inside the hadoop jar) for
 <name>io.sort.factor</name>
and
  <name>io.sort.mb</name>
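If the sort runs out of heap, these are the knobs to look at. A minimal sketch of an override in hadoop-site.xml (the values below are only illustrative, not recommendations; tune them to the heap you give the task JVMs):

<configuration>
  <property>
    <name>io.sort.factor</name>
    <value>10</value>
    <!-- how many streams are merged at once during the sort/merge -->
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>100</value>
    <!-- buffer size in MB used while sorting; lowering it reduces memory pressure -->
  </property>
</configuration>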

HTH
Stefan


On 24.05.2006, at 11:13, Hamza Kaya wrote:

Hi,

I'm trying to crawl approx. 500,000 URLs. After inject and generate I
started the fetchers using 6 map tasks and 3 reduce tasks. All the map tasks
completed successfully, while all the reduce tasks got an OutOfMemory
exception. This exception was caught after the append phase (during the sort
phase). As far as I observed, during a fetch operation each map task writes
its output to a temporary sequence file. During the reduce operation, each
reducer copies all the map outputs to its local disk and appends them to a
single sequence file. After this, the reducer tries to sort that file and
writes the sorted file to its local disk. Then a record writer is opened to
write the sorted file to the segment, which is in DFS. If this scenario is
correct, then all the reduce tasks are supposed to do the same job: all of
them try to sort the whole map output, and the winner of this operation gets
to write to DFS. So only one reducer is expected to write to DFS. If that is
the case, then an OutOfMemory exception is not surprising for 500,000+ URLs,
since the reducers will try to sort a file bigger than 1 GB.
Any comments on this scenario are welcome. And how can I avoid these
exceptions? Thanx,

--
Hamza KAYA


