I think I figured it out; it is distributed after all. I had assumed the server 
listed under the Status column was the one handling the task, but after digging 
deeper into each task I could see which machine was actually executing it.



-----Original Message-----
From: Tamer Yousef 
Sent: Thursday, January 15, 2015 9:29 AM
To: [email protected]
Subject: RE: Parse map step executes on one node only

Any feedback on this, guys? The last parse step for the next iteration had over 
700 tasks on the same machine and eventually failed. How can I distribute the 
parse tasks across these machines?

thanks

-----Original Message-----
From: Tamer Yousef 
Sent: Wednesday, January 14, 2015 9:50 AM
To: [email protected]
Subject: Parse map step executes on one node only

Hi All,
I have a question regarding the parse step: the map part executes on one machine 
only (the primary namenode) and is NOT spread across the other machines in the 
cluster, while the reduce part is spread across the machines based on the 
value of "mapred.reduce.tasks", which is set to 9 in the crawl script.
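For context, here is a minimal sketch of how that property is typically passed to a Nutch 1.x job on Hadoop 1.x; the segment path and the command-line form below are illustrative assumptions, not taken from this thread:

```shell
# Hedged sketch: reduce parallelism is an ordinary Hadoop property, so it can
# be passed to a Nutch 1.x job with -D (the segment path is a made-up example).
NUM_TASKS=9   # matches the 9-node cluster described below
echo "bin/nutch parse -D mapred.reduce.tasks=${NUM_TASKS} crawl/segments/20150115000000"
```

Note that mapred.reduce.tasks only controls the reduce side; in Hadoop 1.x the number of map tasks is determined by the input splits, which is why this setting alone does not spread a map phase.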

Here is more info about what I have:
------------------------------------------------
Nutch 1.9 on Hadoop 1.2.1. No Solr parsing. Centos 6.5. my cluster has 9 
machines.
The size of the data I'm working on so far is:
#of urls unfetched:    15,886,229
#of urls fetched:      2,316,187

The most important step, the fetch, was correctly spread across the 9 machines 
in both the map and reduce parts.
But when it comes to parsing, all processing goes to one machine during the map 
phase, which causes the parse step to take over 5 hours, compared to 2.6 hours 
for fetching.

The number of map tasks completed on this single machine was 342, while the 
number of reduce tasks was 9 (reduce was correctly spread across the 9 
machines).

Most of my configs are stock defaults; I have not altered the URL filters or anything else.

Any tips to speed this up are appreciated!
Thanks!
Tamer
