[jira] [Commented] (IGNITE-2688) InterruptException for segmentation issues

2016-04-18 Thread Semen Boikov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245466#comment-15245466
 ] 

Semen Boikov commented on IGNITE-2688:
--

Hi Denis,

The fix looks good, but I do not like the idea of introducing a special 
'stoppedAbnormally' field and a related method just for testing. I think the 
test can check the node's log for the 'failed abnormally' message (you can use 
GridStringLogger for this).
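
The log-based check suggested above can be sketched in plain Java. This is only 
an illustration under stated assumptions: it uses java.util.logging and a 
hypothetical CapturingHandler class as a stand-in for a string-capturing test 
logger, and it does not reproduce the API of GridStringLogger or of any Ignite 
test.

```java
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical stand-in for a string-capturing test logger (not an Ignite class). */
final class CapturingHandler extends Handler {
    /** Everything published so far, one record per line. */
    private final StringBuilder buf = new StringBuilder();

    /** Attaches a new capturing handler to the given logger and returns it. */
    static CapturingHandler attachTo(Logger log) {
        CapturingHandler h = new CapturingHandler();
        log.addHandler(h);
        log.setUseParentHandlers(false); // keep the sketch's output off the console
        return h;
    }

    @Override public synchronized void publish(LogRecord rec) {
        buf.append(rec.getLevel()).append(": ").append(rec.getMessage()).append('\n');
    }

    @Override public void flush() {
        // Nothing is buffered outside of 'buf'.
    }

    @Override public void close() {
        // No resources to release.
    }

    /** Returns true if any captured record contains the given text. */
    synchronized boolean contains(String text) {
        return buf.indexOf(text) >= 0;
    }
}
```

A test written this way would attach the handler before triggering 
segmentation and then assert that contains("failed abnormally") is false, so 
no test-only field is needed in production code.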

> InterruptException for segmentation issues
> --
>
> Key: IGNITE-2688
> URL: https://issues.apache.org/jira/browse/IGNITE-2688
> Project: Ignite
>  Issue Type: Bug
>Reporter: Sergey Kozlov
>Assignee: Denis Magda
>Priority: Minor
>
> We're still seeing the following exception for segmentation issues:
> {noformat}
> [18:16:31,566][WARNING][tcp-disco-msg-worker-#2%null%][TcpDiscoverySpi] Node 
> is out of topology (probably, due to short-time network problems).
> [18:16:31,566][WARNING][disco-event-worker-#46%null%][GridDiscoveryManager] 
> Local node SEGMENTED: TcpDiscoveryNode 
> [id=19cf4b0f-d520-4915-be9f-813a99f945a5, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 
> 172.22.222.44, 192.168.1.117], sockAddrs=[work-pc/172.22.222.44:47501, 
> /0:0:0:0:0:0:0:1:47501, /172.22.222.44:47501, /127.0.0.1:47501, 
> /172.22.222.44:47501, /192.168.1.117:47501], discPort=47501, order=4, 
> intOrder=4, lastExchangeTime=1455808591566, loc=true, 
> ver=1.6.0#19700101-sha1:, isClient=false]
> [18:16:31,629][SEVERE][tcp-disco-msg-worker-#2%null%][TcpDiscoverySpi] 
> TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node 
> in order to prevent cluster wide instability.
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
>   at 
> java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:519)
>   at 
> java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:682)
>   at 
> org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:5786)
>   at 
> org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2160)
>   at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> [18:16:31,851][WARNING][sys-#22%null%][GridDhtAtomicCache] 
>  Failed to send near update reply to node 
> because it left grid: fad03851-2077-4b50-92b3-00ec6d85fa39
> [18:16:31,866][WARNING][disco-event-worker-#46%null%][GridDiscoveryManager] 
> Stopping local node according to configured segmentation policy.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (IGNITE-2688) InterruptException for segmentation issues

2016-04-16 Thread Denis Magda (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244533#comment-15244533
 ] 

Denis Magda commented on IGNITE-2688:
-

TC looks good. [~yzhdanov] or [~sboikov], please review the changes 
incorporated in the IGNITE-2688 branch.



[jira] [Commented] (IGNITE-2688) InterruptException for segmentation issues

2016-04-15 Thread Denis Magda (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243649#comment-15243649
 ] 

Denis Magda commented on IGNITE-2688:
-

When segmentation happens, the SPI is moved into the {{DISCONNECTING}} state 
before the worker threads get interrupted. The fix for the bug takes the 
{{DISCONNECTING}} state into account as well: if the node is in this state, the 
worker avoids printing the error from the description and stopping the node.

Checking the fix with TC.
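
The state check described in this comment can be sketched roughly as follows. 
This is a simplified, hypothetical model, not the code in ServerImpl: the class 
MessageWorkerSketch, its State enum, and the failedAbnormally flag are 
assumptions made for illustration only.

```java
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

/** Hypothetical, simplified model of a message worker; not Ignite's ServerImpl. */
final class MessageWorkerSketch implements Runnable {
    /** Assumed subset of SPI states, for illustration only. */
    enum State { CONNECTED, DISCONNECTING, STOPPING }

    private final LinkedBlockingDeque<String> queue = new LinkedBlockingDeque<>();

    volatile State state = State.CONNECTED;

    /** Set only when the interrupt was not part of an orderly stop. */
    volatile boolean failedAbnormally;

    @Override public void run() {
        try {
            while (true) {
                // poll() throws InterruptedException once the thread is interrupted.
                String msg = queue.poll(10, TimeUnit.MILLISECONDS);
                // Message processing is elided in this sketch.
            }
        }
        catch (InterruptedException e) {
            // An interrupt while disconnecting (segmentation) or stopping is expected,
            // so it must not be reported as an abnormal failure.
            if (!isExpectedInterrupt(state))
                failedAbnormally = true;

            Thread.currentThread().interrupt(); // restore the interrupt flag
        }
    }

    /** Returns true if an interrupt in the given state is part of a normal shutdown path. */
    static boolean isExpectedInterrupt(State s) {
        return s == State.DISCONNECTING || s == State.STOPPING;
    }

    /** Starts a worker in the given state, interrupts it, and reports the outcome. */
    static boolean interruptedAbnormally(State s) {
        MessageWorkerSketch w = new MessageWorkerSketch();
        w.state = s;

        Thread t = new Thread(w, "msg-worker-sketch");
        t.start();
        t.interrupt();

        try {
            t.join();
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        return w.failedAbnormally;
    }
}
```

In this model, interruptedAbnormally(State.DISCONNECTING) returns false while 
interruptedAbnormally(State.CONNECTED) returns true, which mirrors the intent 
of skipping the error when the node is already disconnecting.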



[jira] [Commented] (IGNITE-2688) InterruptException for segmentation issues

2016-03-03 Thread Denis Magda (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179379#comment-15179379
 ] 

Denis Magda commented on IGNITE-2688:
-

The issue described in this ticket is not the cause of the problem observed on 
your side. The fix for this issue will simply check whether a node is stopping 
due to segmentation and, if it is, will avoid printing the error below:

{noformat}
[18:16:31,629][SEVERE][tcp-disco-msg-worker-#2%null%][TcpDiscoverySpi] 
TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node in 
order to prevent cluster wide instability.
java.lang.InterruptedException
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
at 
java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:519)
at 
java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:682)
at 
org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:5786)
at 
org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2160)
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
{noformat}

In your case you should check the reason for the long GC pauses and probably 
fix it, or tune the VM by increasing the heap size or setting specific GC 
parameters [1].
In addition, you may want to increase the generic 
IgniteConfiguration.failureDetectionTimeout on all the nodes, setting it to a 
value larger than the GC pauses.

[1] https://apacheignite.readme.io/docs/performance-tips#tune-garbage-collection
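
As a rough illustration of the advice above, the following plain-Java sketch 
reads cumulative GC time from the standard JVM management beans and picks a 
timeout with headroom above an observed pause. The class GcPauseSketch and the 
2x headroom factor are assumptions made for this example, not Ignite 
recommendations. Cumulative collection time is only a coarse signal, so GC logs 
remain the better source for individual pause lengths.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper for sizing a failure detection timeout; not part of Ignite. */
final class GcPauseSketch {
    /** Returns the approximate cumulative GC time of this JVM in milliseconds. */
    static long totalGcMillis() {
        long total = 0;

        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it

            if (t > 0)
                total += t;
        }

        return total;
    }

    /**
     * Picks a timeout that is never lower than the current one and leaves
     * headroom above the longest observed pause (2x is an arbitrary assumption).
     */
    static long timeoutFor(long longestPauseMs, long currentTimeoutMs) {
        return Math.max(currentTimeoutMs, longestPauseMs * 2);
    }
}
```

The resulting value would then be passed to 
IgniteConfiguration.setFailureDetectionTimeout(long) on every node. For 
example, timeoutFor(12_000, 10_000) yields 24000 ms.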



[jira] [Commented] (IGNITE-2688) InterruptException for segmentation issues

2016-03-03 Thread Neil Wightman (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177460#comment-15177460
 ] 

Neil Wightman commented on IGNITE-2688:
---

We're getting the exact same issue when using data streamers. Both nodes 
(we're only running 2) are up, but it appears that a GC causes them to time 
out, even though the timeout is 10 seconds and the nodes do not spend 10 
seconds in GC.

{code}
08:41:43.640 [tcp-disco-msg-worker-#2%metrics-store%] WARN  
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Node is out of topology 
(probably, due to short-time network problems).
08:41:43.641 [disco-event-worker-#44%metrics-store%] WARN  
org.apache.ignite.internal.managers.discovery.GridDiscoveryManager - Local node 
SEGMENTED: TcpDiscoveryNode [id=a435652b-babd-41f0-96b9-33822965b779, 
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.0.230], 
sockAddrs=[int00e6/192.168.0.230:47500, /0:0:0:0:0:0:0:1%lo:47500, 
/127.0.0.1:47500, /192.168.0.230:47500], discPort=47500, order=12, intOrder=7, 
lastExchangeTime=1456990903638, loc=true, ver=1.5.0#20151229-sha1:f1f8cda2, 
isClient=false]
08:41:43.698 [tcp-disco-msg-worker-#2%metrics-store%] ERROR 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - TcpDiscoverSpi's message 
worker thread failed abnormally. Stopping the node in order to prevent cluster 
wide instability.
java.lang.InterruptedException: null
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
at 
java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522)
at 
java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684)
at 
org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:5779)
at 
org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2161)
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
08:41:44.003 [pub-#1%metrics-store%] ERROR 
org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor - Failed 
to respond to node [nodeId=313b89e1-26f0-4208-9888-f2e361e2c275, 
res=DataStreamerResponse [reqId=335, forceLocDep=true]]
org.apache.ignite.IgniteCheckedException: Failed to send message (node may have 
left the grid or TCP connection cannot be established due to firewall issues) 
[node=TcpDiscoveryNode [id=313b89e1-26f0-4208-9888-f2e361e2c275, 
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.0.229], 
sockAddrs=[int00e5/192.168.0.229:47500, /0:0:0:0:0:0:0:1%lo:47500, 
/127.0.0.1:47500, /192.168.0.229:47500], discPort=47500, order=8, intOrder=5, 
lastExchangeTime=1456990534913, loc=false, ver=1.5.0#20151229-sha1:f1f8cda2, 
isClient=false], topic=T1 [topic=TOPIC_DATASTREAM, 
id=e9b5f083351-313b89e1-26f0-4208-9888-f2e361e2c275], msg=DataStreamerResponse 
[reqId=335, forceLocDep=true], policy=0]
at 
org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1082)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1134)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1104)
at 
org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.sendResponse(DataStreamProcessor.java:342)
at 
org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.processRequest(DataStreamProcessor.java:312)
at 
org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.access$000(DataStreamProcessor.java:49)
at 
org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor$1.onMessage(DataStreamProcessor.java:79)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:821)
at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$1600(GridIoManager.java:103)
at 
org.apache.ignite.internal.managers.communication.GridIoManager$5.run(GridIoManager.java:784)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.ignite.spi.IgniteSpiException: Failed to send message to 
remote node: TcpDiscoveryNode [id=313b89e1-26f0-4208-9888-f2e361e2c275, 
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.0.229], 
sockAddrs=[int00e5/192.168.0.229:47500, /0:0:0:0:0:0:0:1%lo:47500, 
/127.0.0.1:47500, /192.168.0.229:47500], discPort=47500, order=8, intOrder=5, 
lastExchangeTime=1456990534913, loc=false, ver=1.5.0#20151229-sha1:f1f8cda2, 
isClient=false]
at