On 24/06/11 18:16, Niels Boldt wrote:
Hi,

I'm running Nutch in a pseudo-distributed cluster, i.e. all daemons are running on the same
server. I'm writing to the Hadoop list, as it looks like a problem related
to Hadoop.

Some of my jobs fail partially, and in the error log I get output like:

2011-06-24 08:45:05,765 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201106231520_0190_r_000000_0 Scheduled 1 outputs (0 slow hosts and 0
dup hosts)

2011-06-24 08:45:05,771 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_201106231520_0190_r_000000_0 copy failed:
attempt_201106231520_0190_m_000000_0 from worker1
2011-06-24 08:45:05,772 WARN org.apache.hadoop.mapred.ReduceTask:
java.net.UnknownHostException: worker1


The above basically says that my worker is unknown, but I can't really make
any sense of it. Other jobs running before, at the same time, or after
complete fine without any error messages and without any changes on the
server. Also, other reduce tasks in the same run have succeeded. So it looks
like my worker sometimes 'disappears' and can't be reached.

If the worker had "disappeared" off the net, you'd be more likely to see a NoRouteToHostException.
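For what it's worth, the two failure modes are easy to tell apart from the shell. A quick sketch (`nosuchhost.invalid` is just an example name; the `.invalid` TLD is reserved by RFC 6761 and is guaranteed not to resolve):

```shell
# A name that cannot be resolved fails at lookup time -- this is what
# Java surfaces as java.net.UnknownHostException.
getent hosts nosuchhost.invalid \
    || echo "lookup failed -> UnknownHostException territory"

# By contrast, a name that DOES resolve but whose host is down or
# unreachable fails later, at connect time -> NoRouteToHostException
# or a connection timeout instead.
```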

My current theory is that it only happens when there are a couple of jobs
running at the same time. Is that a plausible explanation?

Would anybody have some suggestions on how I could get more information from the
system, or point me in a direction where I should look? (I'm also quite new to
Hadoop.)

I'd assume that one machine in the cluster doesn't have an /etc/hosts entry for worker1, or that the DNS server is suffering under load. If you can, put the host list into the /etc/hosts table instead of relying on DNS. If you do it on all machines, it avoids having to work out which one is playing up. That said, some better logging of which host is trying to make the connection would be nice.
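A quick way to check this on each node (a sketch; `worker1` is the hostname from the logs above, and the IP in the comment is a placeholder):

```shell
# Run on every node in the cluster: does the worker's hostname resolve here?
if getent hosts worker1 > /dev/null; then
    echo "worker1 resolves on $(hostname)"
else
    echo "worker1 does NOT resolve on $(hostname)"
fi

# If it doesn't, add a static entry to /etc/hosts on every node, e.g.:
#   192.168.0.101   worker1
# (placeholder IP -- substitute the worker's real address)
```

Doing it on all machines at once, as suggested above, saves you from bisecting which node's resolver is misbehaving under load.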
