i've been seeing a lot of jobs where large numbers of reducers keep failing at the shuffle phase due to timeouts (see a sample reducer syslog entry below). our setup consists of 8-core machines, with one box acting as both a slave and the namenode. the namenode box is not running at full capacity, so load doesn't appear to be the problem. we run hadoop 0.18.1.
reducers which run on the namenode are fine; it is only those running on the slaves which seem affected. note that i still get this when i vary the number of reducers, so it doesn't appear to be a function of the shard size.

is there some flag i should modify to increase the timeout value? or is this fixed in the latest release?

(i found one thread on this which talked about DNS entries and another which mentioned HADOOP-3155)

thanks

Miles

> 2009-01-30 10:26:14,085 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=SHUFFLE, sessionId=
> 2009-01-30 10:26:14,229 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/disk2/hadoop/mapred/local/taskTracker/jobcache/job_200901301017_0001/attempt_200901301017_0001_r_000011_0/work/./r-compute-ngram-counts]
> 2009-01-30 10:26:14,368 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=78643200, MaxSingleShuffleLimit=19660800
> 2009-01-30 10:26:14,488 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0 Thread started: Thread for merging on-disk files
> 2009-01-30 10:26:14,488 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0 Thread waiting: Thread for merging on-disk files
> 2009-01-30 10:26:14,488 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0 Thread started: Thread for merging in memory files
> 2009-01-30 10:26:14,489 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0 Need another 3895 map output(s) where 0 is already in progress
> 2009-01-30 10:26:14,495 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0: Got 6 new map-outputs & number of known map outputs is 6
> 2009-01-30 10:26:14,496 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0 Scheduled 1 of 6 known outputs (0 slow hosts and 5 dup hosts)
> 2009-01-30 10:26:44,566 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0 copy failed: attempt_200901301017_0001_m_000003_0 from crom.inf.ed.ac.uk
> 2009-01-30 10:26:44,567 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>     at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1296)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1290)
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:944)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1143)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1084)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:997)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:946)
> Caused by: java.net.SocketTimeoutException: connect timed out
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>     at java.net.Socket.connect(Socket.java:519)
>     at sun.net.NetworkClient.doConnect(NetworkClient.java:152)
>     at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
>     at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
>     at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
>     at sun.net.www.http.HttpClient.New(HttpClient.java:306)
>     at sun.net.www.http.HttpClient.New(HttpClient.java:323)
>     at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:788)
>     at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:729)
>     at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:654)
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:977)
>     ... 4 more
> 2009-01-30 10:26:45,493 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_200901301017_0001_r_000011_0: Failed fetch #1 from attempt_200901301017_0001_m_000003_0
> 2009-01-30 10:26:45,494 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0 adding host crom.inf.ed.ac.uk to penalty box, next contact in 4 seconds
> 2009-01-30 10:26:46,493 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0: Got 6 map-outputs from previous fai

-- 
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
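
in the log above the copy is scheduled at 10:26:14 and fails with "connect timed out" at 10:26:44, i.e. after roughly 30 seconds, so the reducer never gets a TCP connection to the host serving the map output. one way to narrow this down might be a small probe run from an affected slave. this is only a sketch: `probe` and `probe_all` are made-up helper names, and port 50060 is an assumption (i believe it is the default tasktracker http port, but check mapred.task.tracker.http.address on your cluster).

```python
import socket
import time


def probe(host, port, timeout=30.0):
    """Try one TCP connect; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        # Name resolution happens inside create_connection, so a slow or
        # broken DNS entry shows up in the elapsed time or the error text.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, refused connects, DNS failures
        return False, time.monotonic() - start, str(exc)


def probe_all(hosts, port=50060, timeout=30.0):
    """Probe every host on the same port (50060 is an assumed default)."""
    return {host: probe(host, port, timeout) for host in hosts}
```

for example, `probe_all(["crom.inf.ed.ac.uk"])` run from a slave and again from the namenode box would show whether the connect itself is slow or failing from the slaves only, which would point at name resolution or the network rather than at a hadoop timeout setting.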