Are your nodes actually stuck, or are you, for example, in a reduce step that is reading so much data across the network that the node merely SEEMS unreachable?
Since you mention "gets stuck for a while at 25%", that suggests that eventually the node finishes up its work ...

"Life should not be a journey to the grave with the intention of arriving safely in a pretty and well preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming 'Wow! What a Ride!'" - Hunter Thompson

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Mon, Feb 9, 2015 at 2:49 AM, Telles Nobrega <tellesnobr...@gmail.com> wrote:

> Thanks
>
> On Mon Feb 09 2015 at 01:43:24 Xuan Gong <xg...@hortonworks.com> wrote:
>
>> That is for client connect retry at the IPC level.
>>
>> You can decrease the max retries by configuring
>>
>> ipc.client.connect.max.retries.on.timeouts
>>
>> in core-site.xml
>>
>> Thanks
>>
>> Xuan Gong
>>
>> From: Telles Nobrega <tellesnobr...@gmail.com>
>> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Date: Saturday, February 7, 2015 at 8:37 PM
>> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Subject: Max Connect retries
>>
>> Hi, I changed my cluster config so a failed nodemanager can be
>> detected in about 30 seconds. When I'm running a wordcount, the reduce gets
>> stuck at 25% for quite a while, and the logs show nodes trying to connect to
>> the failed node:
>>
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> hadoop-telles-844fb3f0-dfd8-456d-89c3-1d7cfdbdcad2/10.3.2.99:49911. Already
>> tried 28 time(s); maxRetries=45
>> 2015-02-08 04:26:42,088 INFO [IPC Server handler 16 on 50037]
>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: MapCompletionEvents
>> request from attempt_1423319128424_0025_r_000000_0. startIndex 24 maxEvents
>> 10000
>>
>> Is this the expected behaviour? Should I change max retries to a lower
>> value? If so, which config is that?
>>
>> Thanks
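For reference, a minimal sketch of what the suggested core-site.xml change might look like. The property name `ipc.client.connect.max.retries.on.timeouts` comes from the thread above (its default of 45 matches the `maxRetries=45` in the quoted log); the value 5 here is purely illustrative, chosen to make reducers give up on a dead node in seconds rather than minutes:

```xml
<?xml version="1.0"?>
<!-- core-site.xml fragment (illustrative): lower the IPC client's
     retry count on connect timeouts from the default of 45.
     The value 5 is an example, not a recommendation from the thread. -->
<configuration>
  <property>
    <name>ipc.client.connect.max.retries.on.timeouts</name>
    <value>5</value>
  </property>
</configuration>
```

Note that this is a client-side IPC setting, so it affects every Hadoop IPC connection, not just reducers fetching from NodeManagers; setting it too low may cause tasks to fail on transiently slow but healthy nodes.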