Guys,

We sometimes see these warnings on our cluster nodes after a redeploy.
The primary cluster node stays up,
but the other two (we have three) die and keep logging this:
 
2009-10-01 20:04:34,302 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
2009-10-01 20:04:37,315 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
2009-10-01 20:04:40,318 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
2009-10-01 20:04:43,321 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
2009-10-01 20:04:46,323 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
2009-10-01 20:04:49,326 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
2009-10-01 20:04:52,329 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
2009-10-01 20:04:55,332 main WARN  [org.jgroups.protocols.pbcast.GMS] join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms), retrying
 
They keep logging these warnings even after a restart. Only restarting the
primary node fixes things. We can see that 192.168.1.1:52187 is listening on
the primary with netstat -panu.
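
In case it helps anyone triaging a similar log, here is a rough Python sketch for summarising the join warnings quoted above. The regular expression is only inferred from those lines, not taken from any documented JGroups log format, and the function name summarize_join_timeouts is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the WARN lines in this post; an assumption, not an
# official JGroups log format.
JOIN_TIMEOUT = re.compile(
    r"join\((?P<joiner>[\d.]+:\d+)\) sent to (?P<coord>[\d.]+:\d+) "
    r"timed out \(after (?P<ms>\d+) ms\)"
)

def summarize_join_timeouts(log_text):
    """Count join timeouts per (joiner, coordinator, timeout_ms) triple."""
    # Collapse line wraps and leading '|' gutters so wrapped entries still match.
    flat = re.sub(r"\s*\n\s*(?:\|\s*)?", " ", log_text)
    counts = Counter()
    for m in JOIN_TIMEOUT.finditer(flat):
        counts[(m["joiner"], m["coord"], int(m["ms"]))] += 1
    return counts

# Two entries copied from the log above, including the mail-style wrapping.
sample = """\
2009-10-01 20:04:34,302 main WARN  [org.jgroups.protocols.pbcast.GMS]
join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms),
retrying
  | 2009-10-01 20:04:37,315 main WARN  [org.jgroups.protocols.pbcast.GMS]
join(192.168.1.2:37802) sent to 192.168.1.1:52187 timed out (after 3000 ms),
retrying
"""

for (joiner, coord, ms), n in summarize_join_timeouts(sample).items():
    print(f"{joiner} -> {coord}: {n} timeouts of {ms} ms")
```

Run against the full server log, this would show whether every failed join targets the same coordinator address, which is what the excerpt suggests.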
 
On that server, the logs say:

192.168.1.2:37802 ERROR [org.jgroups.protocols.FD_SOCK] socket address for 192.168.1.2:37802 could not be fetched, retrying.
 
However, on the 192.168.1.2 box, we can see 192.168.1.2:37802 is also listening.
 
We can ping and telnet to TCP ports on both boxes and between each box, and
ifconfig shows no packet loss, collisions or other
networking errors. Also, a restart of JBoss on the primary fixes the problem.
 
Therefore the problem seems to be something within JBoss.
 
On the primary server there are a number of threads referring to port 52187 and 
none of these seem locked up (as far as I can see).
 
Here are some of the threads from the Primary:
 
This looks like the server socket:
 
"FD_SOCK server socket acceptor,192.168.1.1:52187" daemon prio=10 tid=0x0ac19800 nid=0x6281 runnable [0x66293000]
   java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
        - locked <0xab976ea0> (a java.net.SocksSocketImpl)
        at java.net.ServerSocket.implAccept(ServerSocket.java:453)
        at java.net.ServerSocket.accept(ServerSocket.java:421)
        at org.jgroups.protocols.FD_SOCK$ServerSocketHandler.run(FD_SOCK.java:1022)
        at java.lang.Thread.run(Thread.java:619)
 

This looks like the connection handler:
 
"FD_SOCK client connection handler,AvantProduction,192.168.1.1:52187" daemon prio=10 tid=0x097e8000 nid=0x1a6c runnable [0x62c71000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.net.SocketInputStream.read(SocketInputStream.java:182)
        at org.jgroups.protocols.FD_SOCK$ClientConnectionHandler.run(FD_SOCK.java:1089)
        at java.lang.Thread.run(Thread.java:619)
 

And these are threads waiting in a pool by the look of it:
 
"Incoming-14,192.168.1.1:52187" prio=10 tid=0x091c9400 nid=0x5f7f waiting on condition [0x68e1f000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x8b246af0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:619)

"Incoming-13,192.168.1.1:52187" prio=10 tid=0x091c7c00 nid=0x5f7e waiting on condition [0x68e70000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x8b246af0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:619)
 

Here is a list of all threads referring to port 52187:
 
"VERIFY_SUSPECT.TimerThread,AvantProduction-HAPartitionCache,192.168.1.1:52187" daemon prio=10 tid=0x0b838400 nid=0x1d9d waiting on condition [0x5f1fe000]
"OOB-680,192.168.1.1:52187" prio=10 tid=0x73163400 nid=0x1d9c waiting on condition [0x66c48000]
"OOB-678,192.168.1.1:52187" prio=10 tid=0x6ef51800 nid=0x1c8e waiting on condition [0x66a62000]
"FD_SOCK client connection handler,AvantProduction,192.168.1.1:52187" daemon prio=10 tid=0x097e8000 nid=0x1a6c runnable [0x62c71000]
"FD_SOCK pinger,AvantProduction-HAPartitionCache,192.168.1.1:52187" daemon prio=10 tid=0x092ba400 nid=0x69b7 waiting on condition [0x5f15c000]
"ViewHandler,AvantProduction-HAPartitionCache,192.168.1.1:52187" prio=10 tid=0x0c48d400 nid=0x6506 waiting on condition [0x5f1ad000]
"FD_SOCK server socket acceptor,192.168.1.1:52187" daemon prio=10 tid=0x0ac19800 nid=0x6281 runnable [0x66293000]
"Incoming-20,192.168.1.1:52187" prio=10 tid=0x71df4400 nid=0x5f8d waiting on condition [0x68a02000]
"Incoming-19,192.168.1.1:52187" prio=10 tid=0x091ed400 nid=0x5f89 waiting on condition [0x68b46000]
"Incoming-18,192.168.1.1:52187" prio=10 tid=0x7161dc00 nid=0x5f88 waiting on condition [0x68b97000]
"Incoming-17,192.168.1.1:52187" prio=10 tid=0x091ecc00 nid=0x5f84 waiting on condition [0x68c8a000]
"Incoming-16,192.168.1.1:52187" prio=10 tid=0x72a4a800 nid=0x5f83 waiting on condition [0x68cdb000]
"Incoming-15,192.168.1.1:52187" prio=10 tid=0x091ca400 nid=0x5f80 waiting on condition [0x68dce000]
"Incoming-14,192.168.1.1:52187" prio=10 tid=0x091c9400 nid=0x5f7f waiting on condition [0x68e1f000]
"Incoming-13,192.168.1.1:52187" prio=10 tid=0x091c7c00 nid=0x5f7e waiting on condition [0x68e70000]
"Incoming-12,192.168.1.1:52187" prio=10 tid=0x08f7ec00 nid=0x5f7d waiting on condition [0x68ec1000]
"Incoming-11,192.168.1.1:52187" prio=10 tid=0x6d3f8400 nid=0x5f7c waiting on condition [0x68f12000]
"Incoming-10,192.168.1.1:52187" prio=10 tid=0x091c5800 nid=0x5f7b waiting on condition [0x68f63000]
"Incoming-9,192.168.1.1:52187" prio=10 tid=0x713f5800 nid=0x5f7a waiting on condition [0x68fb4000]
"Incoming-8,192.168.1.1:52187" prio=10 tid=0x09cbe000 nid=0x5f79 waiting on condition [0x69005000]
"Timer-12,192.168.1.1:52187" daemon prio=10 tid=0x713e5800 nid=0x5f78 waiting on condition [0x69056000]
"Timer-11,192.168.1.1:52187" daemon prio=10 tid=0x6cdf8800 nid=0x5f77 waiting on condition [0x690a7000]
"Timer-10,192.168.1.1:52187" daemon prio=10 tid=0x71471c00 nid=0x5f75 waiting on condition [0x69149000]
"Incoming-7,192.168.1.1:52187" prio=10 tid=0x09cbac00 nid=0x5f74 waiting on condition [0x6919a000]
"Incoming-6,192.168.1.1:52187" prio=10 tid=0x08f7f800 nid=0x5f72 waiting on condition [0x6923c000]
"Timer-9,192.168.1.1:52187" daemon prio=10 tid=0x08d8a800 nid=0x5f71 waiting on condition [0x6928d000]
"Timer-8,192.168.1.1:52187" daemon prio=10 tid=0x08fd1000 nid=0x5f6f waiting on condition [0x6932f000]
"Timer-7,192.168.1.1:52187" daemon prio=10 tid=0x0927f800 nid=0x5f6d waiting on condition [0x693d1000]
"Incoming-5,192.168.1.1:52187" prio=10 tid=0x091ae800 nid=0x5f6c waiting on condition [0x69422000]
"Incoming-4,192.168.1.1:52187" prio=10 tid=0x091ad400 nid=0x5f6b waiting on condition [0x69473000]
"Incoming-3,192.168.1.1:52187" prio=10 tid=0x08e98800 nid=0x5f6a waiting on condition [0x694c4000]
"Incoming-2,192.168.1.1:52187" prio=10 tid=0x08e97800 nid=0x5f69 waiting on condition [0x69515000]
"Incoming-1,192.168.1.1:52187" prio=10 tid=0x08ee5800 nid=0x5f68 waiting on condition [0x69566000]
"Timer-6,192.168.1.1:52187" daemon prio=10 tid=0x717d2800 nid=0x5f67 waiting on condition [0x695b7000]
"Timer-5,192.168.1.1:52187" daemon prio=10 tid=0x712ff000 nid=0x5f66 waiting on condition [0x69608000]
"Timer-4,192.168.1.1:52187" daemon prio=10 tid=0x711fd800 nid=0x5f64 waiting on condition [0x696aa000]
"Timer-3,192.168.1.1:52187" daemon prio=10 tid=0x09297800 nid=0x5f62 waiting on condition [0x696fb000]
"Timer-2,192.168.1.1:52187" daemon prio=10 tid=0x090a6800 nid=0x5f5f waiting on condition [0x697ee000]
"UDP mcast,192.168.1.1:52187" prio=10 tid=0x09956c00 nid=0x5f5e runnable [0x6983f000]
"UDP ucast,192.168.1.1:52187" prio=10 tid=0x09956800 nid=0x5f5d runnable [0x69890000]
"DiagnosticsHandler,192.168.1.1:52187" daemon prio=10 tid=0x08ee6c00 nid=0x5f5c runnable [0x698e1000]
"Timer-1,192.168.1.1:52187" daemon prio=10 tid=0x08e56c00 nid=0x5f5a waiting on condition [0x69983000]
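
To back up the impression that nothing is locked up, a rough Python sketch like the one below could tally the thread header lines by thread type and state. The header pattern is inferred from the dump in this post, and the function name tally_thread_states is hypothetical. Note that the header state is only a coarse hint: a thread blocked in a native socket read or accept still reports as runnable, as the two FD_SOCK threads above do.

```python
import re
from collections import Counter

# Header-line shape inferred from the thread dump in this post (an assumption).
THREAD_HEADER = re.compile(
    r'"(?P<name>[^"]+)"\s+(?:daemon\s+)?prio=\d+\s+tid=0x[0-9a-f]+\s+'
    r'nid=0x[0-9a-f]+\s+(?P<state>.+?)\s+\[0x[0-9a-f]+\]'
)

def tally_thread_states(dump_text):
    """Count threads per (thread kind, header state) pair."""
    # Collapse line wraps and leading '|' gutters so wrapped headers still match.
    flat = re.sub(r"\s*\n\s*(?:\|\s*)?", " ", dump_text)
    tally = Counter()
    for m in THREAD_HEADER.finditer(flat):
        # Kind = name up to the first comma, with a trailing "-<n>" counter removed.
        kind = re.sub(r"-\d+$", "", m["name"].split(",")[0])
        tally[(kind, m["state"])] += 1
    return tally

# Three headers copied from the list above, including the mail-style wrapping.
sample = '''\
"Incoming-14,192.168.1.1:52187" prio=10 tid=0x091c9400 nid=0x5f7f waiting on
condition [0x68e1f000]
  | "Incoming-13,192.168.1.1:52187" prio=10 tid=0x091c7c00 nid=0x5f7e waiting
on condition [0x68e70000]
  | "UDP mcast,192.168.1.1:52187" prio=10 tid=0x09956c00 nid=0x5f5e runnable
[0x6983f000]
'''

for (kind, state), n in sorted(tally_thread_states(sample).items()):
    print(f"{kind}: {n} x {state}")
```

Any state other than "runnable" or "waiting on condition" in the output (for example a thread waiting for a monitor entry) would be the first place to look.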
 
This is really hurting our uptime. Could anybody point us in the right
direction?
 
 Regards,
 
 Richard.

View the original post : 
http://www.jboss.org/index.html?module=bb&op=viewtopic&p=4258102#4258102
