On 24/06/11 18:16, Niels Boldt wrote:
Hi,
I'm running Nutch in a pseudo-distributed cluster, i.e. all daemons are running on the same
server. I'm writing to the Hadoop list, as it looks like a problem related
to Hadoop.
Some of my jobs partially fail, and in the error log I get output like:
2011-06-24 08:45:05,765 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201106231520_0190_r_000000_0 Scheduled 1 outputs (0 slow hosts and 0
dup hosts)
2011-06-24 08:45:05,771 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_201106231520_0190_r_000000_0 copy failed:
attempt_201106231520_0190_m_000000_0 from worker1
2011-06-24 08:45:05,772 WARN org.apache.hadoop.mapred.ReduceTask:
java.net.UnknownHostException: worker1
The above basically says that my worker is unknown, but I can't really make
any sense of it. Other jobs running before, at the same time, or after
complete fine, without any error messages and without any changes on the
server. Also, other reduce tasks in the same run have succeeded. So it looks
like my worker sometimes 'disappears' and can't be reached.
If the worker had "disappeared" off the net, you'd be more likely to see
a NoRouteToHostException.
My current theory is that it only happens when there are a couple of jobs
running at the same time. Is that a plausible explanation?
Would anybody have some suggestions on how I could get more information from the
system, or point me in a direction where I should look? (I'm also quite new to
Hadoop.)
I'd assume that one machine in the cluster doesn't have an /etc/hosts
entry for worker1, or that the DNS server is suffering under load. If you
can, put the host list into the /etc/hosts table instead of relying on
DNS. If you do it on all machines, it avoids having to work out which
one is playing up. That said, better logging of which host is
trying to make the connection would be nice.
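A quick way to check the theory is to test resolution of the worker hostname on each machine; a minimal sketch (the hostname "worker1" is taken from the log above, and the IP in the comment is purely hypothetical):

```shell
#!/bin/sh
# Check whether this machine can resolve the worker hostname.
# getent consults the same NSS chain (/etc/hosts, then DNS) that
# the JVM's hostname lookup ultimately relies on.
if getent hosts worker1 > /dev/null; then
  echo "worker1 resolves"
else
  echo "worker1 does NOT resolve -- consider adding it to /etc/hosts"
fi

# Example /etc/hosts entry (IP address is hypothetical; use your worker's):
#   192.168.1.101   worker1
```

Running that on every node in a loop (or via ssh) would show whether one box is intermittently failing lookups, which would fit the "only under load" pattern.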