[ https://issues.apache.org/jira/browse/IGNITE-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177460#comment-15177460 ]
Neil Wightman edited comment on IGNITE-2688 at 3/3/16 8:10 AM:
---------------------------------------------------------------

We're getting the exact same issue when using data streamers. Both nodes (we're only running 2) are up, but it appears a GC causes them to time out after more than 10 seconds.

{code}
08:41:43.640 [tcp-disco-msg-worker-#2%metrics-store%] WARN org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Node is out of topology (probably, due to short-time network problems).
08:41:43.641 [disco-event-worker-#44%metrics-store%] WARN org.apache.ignite.internal.managers.discovery.GridDiscoveryManager - Local node SEGMENTED: TcpDiscoveryNode [id=a435652b-babd-41f0-96b9-33822965b779, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.0.230], sockAddrs=[int00e6/192.168.0.230:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.0.230:47500], discPort=47500, order=12, intOrder=7, lastExchangeTime=1456990903638, loc=true, ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false]
08:41:43.698 [tcp-disco-msg-worker-#2%metrics-store%] ERROR org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node in order to prevent cluster wide instability.
java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
    at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522)
    at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:5779)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2161)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
08:41:44.003 [pub-#1%metrics-store%] ERROR org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor - Failed to respond to node [nodeId=313b89e1-26f0-4208-9888-f2e361e2c275, res=DataStreamerResponse [reqId=335, forceLocDep=true]]
org.apache.ignite.IgniteCheckedException: Failed to send message (node may have left the grid or TCP connection cannot be established due to firewall issues) [node=TcpDiscoveryNode [id=313b89e1-26f0-4208-9888-f2e361e2c275, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.0.229], sockAddrs=[int00e5/192.168.0.229:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.0.229:47500], discPort=47500, order=8, intOrder=5, lastExchangeTime=1456990534913, loc=false, ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false], topic=T1 [topic=TOPIC_DATASTREAM, id=e9b5f083351-313b89e1-26f0-4208-9888-f2e361e2c275], msg=DataStreamerResponse [reqId=335, forceLocDep=true], policy=0]
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1082)
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1134)
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1104)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.sendResponse(DataStreamProcessor.java:342)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.processRequest(DataStreamProcessor.java:312)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.access$000(DataStreamProcessor.java:49)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor$1.onMessage(DataStreamProcessor.java:79)
    at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:821)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$1600(GridIoManager.java:103)
    at org.apache.ignite.internal.managers.communication.GridIoManager$5.run(GridIoManager.java:784)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.ignite.spi.IgniteSpiException: Failed to send message to remote node: TcpDiscoveryNode [id=313b89e1-26f0-4208-9888-f2e361e2c275, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.0.229], sockAddrs=[int00e5/192.168.0.229:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.0.229:47500], discPort=47500, order=8, intOrder=5, lastExchangeTime=1456990534913, loc=false, ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1959)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1899)
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1077)
    ... 12 common frames omitted
Caused by: org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each GridComputeTask and GridCacheTransaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=313b89e1-26f0-4208-9888-f2e361e2c275, addrs=[int00e5/192.168.0.229:47100, /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2462)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2103)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:1997)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1933)
    ... 14 common frames omitted
    Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address: int00e5/192.168.0.229:47100
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2467)
        ... 17 common frames omitted
    Caused by: org.apache.ignite.IgniteCheckedException: Failed to read remote node recovery handshake (connection closed).
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeHandshake(TcpCommunicationSpi.java:2672)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2334)
        ... 17 common frames omitted
    Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address: /0:0:0:0:0:0:0:1%lo:47100
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2467)
        ... 17 common frames omitted
    Caused by: org.apache.ignite.IgniteCheckedException: Remote node ID is not as expected [expected=313b89e1-26f0-4208-9888-f2e361e2c275, rcvd=a435652b-babd-41f0-96b9-33822965b779]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeHandshake(TcpCommunicationSpi.java:2577)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2334)
        ... 17 common frames omitted
    Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address: /127.0.0.1:47100
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2467)
        ... 17 common frames omitted
    Caused by: org.apache.ignite.IgniteCheckedException: Remote node ID is not as expected [expected=313b89e1-26f0-4208-9888-f2e361e2c275, rcvd=a435652b-babd-41f0-96b9-33822965b779]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeHandshake(TcpCommunicationSpi.java:2577)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2334)
        ... 17 common frames omitted
{code}

was (Author: neilwightman):
We're getting the exact same issue when using data streamers. Both nodes (we're only running 2) are up, but it appears a GC causes them to time out even though the timeout is 10 seconds and the nodes are not doing GC for 10 seconds.

{code}
08:41:43.640 [tcp-disco-msg-worker-#2%metrics-store%] WARN org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Node is out of topology (probably, due to short-time network problems).
08:41:43.641 [disco-event-worker-#44%metrics-store%] WARN org.apache.ignite.internal.managers.discovery.GridDiscoveryManager - Local node SEGMENTED: TcpDiscoveryNode [id=a435652b-babd-41f0-96b9-33822965b779, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.0.230], sockAddrs=[int00e6/192.168.0.230:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.0.230:47500], discPort=47500, order=12, intOrder=7, lastExchangeTime=1456990903638, loc=true, ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false]
08:41:43.698 [tcp-disco-msg-worker-#2%metrics-store%] ERROR org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node in order to prevent cluster wide instability.
java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
    at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522)
    at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:5779)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2161)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
08:41:44.003 [pub-#1%metrics-store%] ERROR org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor - Failed to respond to node [nodeId=313b89e1-26f0-4208-9888-f2e361e2c275, res=DataStreamerResponse [reqId=335, forceLocDep=true]]
org.apache.ignite.IgniteCheckedException: Failed to send message (node may have left the grid or TCP connection cannot be established due to firewall issues) [node=TcpDiscoveryNode [id=313b89e1-26f0-4208-9888-f2e361e2c275, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.0.229], sockAddrs=[int00e5/192.168.0.229:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.0.229:47500], discPort=47500, order=8, intOrder=5, lastExchangeTime=1456990534913, loc=false, ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false], topic=T1 [topic=TOPIC_DATASTREAM, id=e9b5f083351-313b89e1-26f0-4208-9888-f2e361e2c275], msg=DataStreamerResponse [reqId=335, forceLocDep=true], policy=0]
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1082)
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1134)
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1104)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.sendResponse(DataStreamProcessor.java:342)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.processRequest(DataStreamProcessor.java:312)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.access$000(DataStreamProcessor.java:49)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor$1.onMessage(DataStreamProcessor.java:79)
    at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:821)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$1600(GridIoManager.java:103)
    at org.apache.ignite.internal.managers.communication.GridIoManager$5.run(GridIoManager.java:784)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.ignite.spi.IgniteSpiException: Failed to send message to remote node: TcpDiscoveryNode [id=313b89e1-26f0-4208-9888-f2e361e2c275, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.0.229], sockAddrs=[int00e5/192.168.0.229:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.0.229:47500], discPort=47500, order=8, intOrder=5, lastExchangeTime=1456990534913, loc=false, ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1959)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1899)
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1077)
    ... 12 common frames omitted
Caused by: org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each GridComputeTask and GridCacheTransaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=313b89e1-26f0-4208-9888-f2e361e2c275, addrs=[int00e5/192.168.0.229:47100, /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2462)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2103)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:1997)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1933)
    ... 14 common frames omitted
    Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address: int00e5/192.168.0.229:47100
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2467)
        ... 17 common frames omitted
    Caused by: org.apache.ignite.IgniteCheckedException: Failed to read remote node recovery handshake (connection closed).
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeHandshake(TcpCommunicationSpi.java:2672)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2334)
        ... 17 common frames omitted
    Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address: /0:0:0:0:0:0:0:1%lo:47100
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2467)
        ... 17 common frames omitted
    Caused by: org.apache.ignite.IgniteCheckedException: Remote node ID is not as expected [expected=313b89e1-26f0-4208-9888-f2e361e2c275, rcvd=a435652b-babd-41f0-96b9-33822965b779]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeHandshake(TcpCommunicationSpi.java:2577)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2334)
        ... 17 common frames omitted
    Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address: /127.0.0.1:47100
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2467)
        ... 17 common frames omitted
    Caused by: org.apache.ignite.IgniteCheckedException: Remote node ID is not as expected [expected=313b89e1-26f0-4208-9888-f2e361e2c275, rcvd=a435652b-babd-41f0-96b9-33822965b779]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeHandshake(TcpCommunicationSpi.java:2577)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2334)
        ... 17 common frames omitted
{code}

> InterruptException for segmentation issues
> ------------------------------------------
>
>                 Key: IGNITE-2688
>                 URL: https://issues.apache.org/jira/browse/IGNITE-2688
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Sergey Kozlov
>            Assignee: Denis Magda
>            Priority: Minor
>
> We're still seeing the following exception for segmentation issues:
> {noformat}
> [18:16:31,566][WARNING][tcp-disco-msg-worker-#2%null%][TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems).
> [18:16:31,566][WARNING][disco-event-worker-#46%null%][GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=19cf4b0f-d520-4915-be9f-813a99f945a5, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 172.22.222.44, 192.168.1.117], sockAddrs=[work-pc/172.22.222.44:47501, /0:0:0:0:0:0:0:1:47501, /172.22.222.44:47501, /127.0.0.1:47501, /172.22.222.44:47501, /192.168.1.117:47501], discPort=47501, order=4, intOrder=4, lastExchangeTime=1455808591566, loc=true, ver=1.6.0#19700101-sha1:00000000, isClient=false]
> [18:16:31,629][SEVERE][tcp-disco-msg-worker-#2%null%][TcpDiscoverySpi] TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node in order to prevent cluster wide instability.
> java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
>     at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:519)
>     at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:682)
>     at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:5786)
>     at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2160)
>     at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> [18:16:31,851][WARNING][sys-#22%null%][GridDhtAtomicCache] <cache_fad03851_2_08519933018899859> Failed to send near update reply to node because it left grid: fad03851-2077-4b50-92b3-00ec6d85fa39
> [18:16:31,866][WARNING][disco-event-worker-#46%null%][GridDiscoveryManager] Stopping local node according to configured segmentation policy.
> {noformat}