Hi All
I have a question regarding the parse step: the map part executes on one machine
only (the primary namenode) and is NOT spread across the other machines in the
cluster, while the reduce part is spread across the machines based on the
value of "mapred.reduce.tasks", which is set to 9 in the crawl script.

Here is more info about what I have:
------------------------------------------------
Nutch 1.9 on Hadoop 1.2.1. No Solr indexing. CentOS 6.5. My cluster has 9
machines.
The size of the data I'm working on so far is:
#of urls unfetched:    15,886,229
#of urls fetched:      2,316,187

The most important step, the fetch, was correctly spread across the 9 machines in
both the map part and the reduce part.
But when it comes to parsing, all processing goes to one machine during the map
phase, which causes the parse step to take over 5 hours, compared to 2.6
hours for fetching.

The number of map tasks completed on this single machine was 342, while the
number of reduce tasks was 9 (the reduces were correctly spread across the 9
machines).

Most of my configs are at their defaults; I have not altered the URL filters or anything else.
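For reference, the Hadoop 1.x properties involved would sit in mapred-site.xml and look roughly like the sketch below (property names are from Hadoop 1.2.1; the values shown are only illustrative, not what I am actually running):

```xml
<!-- mapred-site.xml sketch (Hadoop 1.x names; values illustrative) -->
<configuration>
  <!-- Hint for the number of map tasks per job. Note this is only a
       hint: the actual map count is driven by the input splits, so
       raising it alone may not spread the maps across nodes. -->
  <property>
    <name>mapred.map.tasks</name>
    <value>18</value>
  </property>

  <!-- Reduce tasks per job; the crawl script sets this to 9. -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>9</value>
  </property>

  <!-- Concurrent map slots per tasktracker node. -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
```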

Any tips to speed this up are appreciated!
Thanks!
Tamer
