I'm having trouble with one node (host5) of a 5-node cluster. Map
tasks run successfully on all five nodes, but the reduce phase always
stalls on host5. The reduce task there throws a "Connection refused"
exception when it tries to connect to its own host to fetch the map
outputs. The only difference I can see between host5 and the other
hosts is that on host5 the hostname resolves to 127.0.0.1 instead of
the external IP address. I can't imagine that should prevent it from
connecting to itself, however. Has anyone else had a similar problem?
Is there a document somewhere that describes the host name resolution
requirements for nodes in a cluster?
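
For anyone who wants to compare hosts, here is a minimal sketch of how
the resolution could be checked from a JVM on each node. The class name
ResolveCheck and the helper resolvesToLoopback are made up for
illustration; the only library calls are the standard
java.net.InetAddress methods. It is not meant to show how Hadoop itself
resolves names, only what the local resolver hands back to Java.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper class, not part of Hadoop.
public class ResolveCheck {

    // Returns true if the given name (or literal IP) resolves to a
    // loopback address such as 127.0.0.1. Unresolvable names yield false.
    static boolean resolvesToLoopback(String host) {
        try {
            return InetAddress.getByName(host).isLoopbackAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) throws UnknownHostException {
        // Ask the JVM what it thinks the local host is called and
        // which address that name maps to.
        InetAddress local = InetAddress.getLocalHost();
        System.out.println("hostname: " + local.getHostName());
        System.out.println("address:  " + local.getHostAddress());
        System.out.println("loopback: " + local.isLoopbackAddress());

        // Optionally check any names passed on the command line,
        // e.g. "java ResolveCheck host5.test".
        for (String name : args) {
            System.out.println(name + " resolves to loopback: "
                    + resolvesToLoopback(name));
        }
    }
}
```

Running this on host5 and on one of the working hosts should show
whether "loopback" differs between them.
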
Thanks,
Brandon
Snippet of the log showing the reduce task failing to copy map output
from itself on host5.test:
2008-12-16 12:25:12,532 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200812161640_0003_r_000004_0: Got 2 new map-outputs & number of known map outputs is 2
2008-12-16 12:25:12,532 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200812161640_0003_r_000004_0 Scheduled 1 of 2 known outputs (0 slow hosts and 1 dup hosts)
2008-12-16 12:25:12,533 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200812161640_0003_r_000004_0 copy failed: attempt_200812161640_0003_m_000003_0 from host5.test
2008-12-16 12:25:12,534 WARN org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection refused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1296)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1290)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:944)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1143)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1084)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:997)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:946)
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
    at java.net.Socket.connect(Socket.java:519)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:152)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
    at sun.net.www.http.HttpClient.New(HttpClient.java:306)
    at sun.net.www.http.HttpClient.New(HttpClient.java:323)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:788)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:729)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:654)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:977)