Hi all, I have a question about the parse step: the map part executes on one machine only (the primary namenode) and is NOT spread across the other machines in the cluster, while the reduce part is spread across the machines based on the value of "mapred.reduce.tasks", which is set to 9 in the crawl script.
Here is more info about what I have:
------------------------------------------------
Nutch 1.9 on Hadoop 1.2.1. No Solr parsing. CentOS 6.5. My cluster has 9 machines.
The size of the data I'm working on so far:
# of urls unfetched: 15,886,229
# of urls fetched: 2,316,187

The most important step, the fetch, was correctly spread across the 9 machines in both the map part and the reduce part. But when it comes to parsing, all processing goes to one machine during the map phase, which causes the parse step to take over 5 hours, compared to 2.6 hours for fetching. The number of map tasks completed on this single machine was 342, and the number of reduce tasks was 9 (reduce was correctly spread across 9 machines). Most of my configs are the defaults; I have not altered the URL filters or anything else.

Any tips to speed this up are appreciated!

Thanks!
Tamer
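For context, here is roughly what I have looked at so far. My understanding is that in Hadoop 1.x the map-task count for a job follows the input splits of its input files, not the cluster size, while "mapred.reduce.tasks" only controls the reduce side. A sketch of what I ran (the segment path below is made up for illustration; I believe Nutch's parse command accepts generic -D options via ToolRunner, but I have not verified that):

```shell
# Inspect the fetch output that the parse job reads as input --
# the number/size/compression of these part files determines the splits,
# and therefore how many machines can run map tasks.
hadoop fs -ls crawl/segments/20140101000000/content

# Reduce-side parallelism only; does not spread the map phase.
# (Generic -D options must come before the other arguments.)
bin/nutch parse -D mapred.reduce.tasks=9 crawl/segments/20140101000000
```

So if the segment's content is stored as few (or unsplittable) files, would that explain all maps landing on one node?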

