[ 
https://issues.apache.org/jira/browse/IGNITE-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991651#comment-14991651
 ] 

Semen Boikov edited comment on IGNITE-1758 at 11/9/15 1:30 PM:
---------------------------------------------------------------

Created a test that restarts only server nodes; it fails intermittently with 
this assertion:
{noformat}
[11:51:24]W:             [org.apache.ignite:ignite-core] java.lang.AssertionError: Invalid node order: TcpDiscoveryNode [id=005cd5de-f1f1-435c-8ac4-f4474c28d000, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47503], discPort=47503, order=0, intOrder=61, lastExchangeTime=1446724284050, loc=false, ver=1.5.0#20151105-sha1:94119c29, isClient=false]
[11:51:24]W:             [org.apache.ignite:ignite-core]        at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:51)
[11:51:24]W:             [org.apache.ignite:ignite-core]        at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:48)
[11:51:24]W:             [org.apache.ignite:ignite-core]        at org.apache.ignite.internal.util.lang.GridFunc.isAll(GridFunc.java:3362)
[11:51:24]W:             [org.apache.ignite:ignite-core]        at org.apache.ignite.internal.util.IgniteUtils.arrayList(IgniteUtils.java:9176)
[11:51:24]W:             [org.apache.ignite:ignite-core]        at org.apache.ignite.internal.util.IgniteUtils.arrayList(IgniteUtils.java:9149)
[11:51:24]W:             [org.apache.ignite:ignite-core]        at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.nodes(TcpDiscoveryNodesRing.java:616)
[11:51:24]W:             [org.apache.ignite:ignite-core]        at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.visibleNodes(TcpDiscoveryNodesRing.java:128)
[11:51:24]W:             [org.apache.ignite:ignite-core]        at org.apache.ignite.spi.discovery.tcp.ServerImpl.notifyDiscovery(ServerImpl.java:1260)
[11:51:24]W:             [org.apache.ignite:ignite-core]        at org.apache.ignite.spi.discovery.tcp.ServerImpl.access$2700(ServerImpl.java:157)
[11:51:24]W:             [org.apache.ignite:ignite-core]        at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeAddFinishedMessage(ServerImpl.java:3685)
[11:51:24]W:             [org.apache.ignite:ignite-core]        at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2157)
[11:51:24]W:             [org.apache.ignite:ignite-core]        at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:5600)
[11:51:24]W:             [org.apache.ignite:ignite-core]        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
{noformat}

The assertion fails because, when the NodeAddFinished event is received for 
some node, there are nodes with a lower internal order that have not received 
NodeAddFinished.
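
To make the failing condition concrete, here is a minimal sketch in plain Java. 
It is not Ignite code: RingNode, NodeOrderCheck and unfinishedBefore are 
hypothetical names, and it assumes that "did not receive NodeAddFinished" shows 
up as order == 0, as for the node in the trace above (order=0, intOrder=61).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Simplified stand-in for a discovery node; NOT Ignite's TcpDiscoveryNode. */
final class RingNode {
    final String name;
    final long intOrder; // internal order, assigned when the node is added
    long order;          // topology order; assumed to stay 0 until add is finished

    RingNode(String name, long intOrder) {
        this.name = name;
        this.intOrder = intOrder;
    }
}

final class NodeOrderCheck {
    /**
     * Returns the nodes that have a lower internal order than 'finished'
     * but still have order == 0, sorted by internal order.
     */
    static List<RingNode> unfinishedBefore(List<RingNode> ring, RingNode finished) {
        List<RingNode> res = new ArrayList<>();

        for (RingNode n : ring) {
            if (n.intOrder < finished.intOrder && n.order == 0)
                res.add(n);
        }

        res.sort(Comparator.comparingLong((RingNode n) -> n.intOrder));

        return res;
    }

    public static void main(String[] args) {
        RingNode a = new RingNode("a", 60);
        RingNode b = new RingNode("b", 61);
        RingNode c = new RingNode("c", 62);

        a.order = 60; // add finished for 'a'
        c.order = 62; // add finished for 'c', while 'b' was skipped

        for (RingNode n : unfinishedBefore(Arrays.asList(a, b, c), c))
            System.out.println("Invalid node order: " + n.name +
                " [order=" + n.order + ", intOrder=" + n.intOrder + ']');
    }
}
```

In this model a non-empty result for the node that just finished joining 
corresponds to the state in which the assertion fires.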

While debugging this failure I found that a node can get an I/O error when 
trying to send a message to the next node and add that node to its failed 
list, even though the next node is still alive and can process messages from 
other nodes. To fix this, the failedNodes collection must be passed on to the 
next nodes so that it is consistent across all nodes.
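
A rough sketch of that idea in plain Java follows. It only models the 
failed-node bookkeeping and is not the actual patch: RingMessage, ServerNode 
and their methods are hypothetical names, and node IDs are plain strings.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical ring message that carries the sender's failed-node IDs. */
final class RingMessage {
    final Set<String> failedNodes;

    RingMessage(Set<String> failedNodes) {
        // Copy so that later changes on the sender do not leak into the message.
        this.failedNodes = Collections.unmodifiableSet(new HashSet<>(failedNodes));
    }
}

/** Simplified server node; NOT Ignite's ServerImpl. */
final class ServerNode {
    final String id;
    final Set<String> failedNodes = new HashSet<>();

    ServerNode(String id) {
        this.id = id;
    }

    /** Called when sending to 'nextId' failed with an I/O error. */
    void onSendError(String nextId) {
        failedNodes.add(nextId);
    }

    /** Builds a message that includes this node's current failed set. */
    RingMessage createMessage() {
        return new RingMessage(failedNodes);
    }

    /**
     * Merges the sender's failed set into the local one. What a node should
     * do when it finds its own ID in the set is outside this sketch, so the
     * own ID is simply skipped here.
     */
    void onMessage(RingMessage msg) {
        for (String failedId : msg.failedNodes) {
            if (!failedId.equals(id))
                failedNodes.add(failedId);
        }
    }
}

final class FailedNodesDemo {
    public static void main(String[] args) {
        ServerNode n1 = new ServerNode("n1");
        ServerNode n3 = new ServerNode("n3");

        n1.onSendError("n2");             // n1 cannot reach n2 and marks it failed
        n3.onMessage(n1.createMessage()); // n1 skips n2 and sends to n3 instead

        System.out.println("n1 failed=" + n1.failedNodes + ", n3 failed=" + n3.failedNodes);
    }
}
```

Without the merge in onMessage, n3 would still treat n2 as alive while n1 
treats it as failed, which is the inconsistency described above.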


was (Author: sboikov):
While debugging this failure I found that TcpDiscoveryNodeAddedMessage was 
sometimes missed during topology changes and the new coordinator did not 
handle it. The logs do not show exactly why the message was missed. If the 
size of PendingMessages is increased, the assertion does not reproduce, but 
the test sometimes hangs on continuous query start.

> Clients don't survive during massive servers shutdown
> -----------------------------------------------------
>
>                 Key: IGNITE-1758
>                 URL: https://issues.apache.org/jira/browse/IGNITE-1758
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: ignite-1.4
>            Reporter: Denis Magda
>            Assignee: Semen Boikov
>            Priority: Blocker
>             Fix For: 1.5
>
>         Attachments: ignite-1758-test.patch
>
>
> This is a real-world use case:
> Start a reasonable number of servers and clients.
> Perform cache operations under a transaction.
> Stop half of the servers. Clients must survive and keep executing their 
> transactions.
> Ran the following test:
> - Started 14 servers and 14 clients;
> - Clients executed transactional put operations;
> - Stopped 7 servers.
> Got various assertion failures on the client side:
> {noformat}
> [15:47:33,401][ERROR][tcp-client-disco-msg-worker-#521%internal.IgniteClientReconnectCacheMultiThreadedTest18][TcpDiscoverySpi] Runtime error caught during grid runnable execution: IgniteSpiThread [name=tcp-client-disco-msg-worker-#521%internal.IgniteClientReconnectCacheMultiThreadedTest18]
> java.lang.AssertionError: lastVer=29, newVer=32, locNode=TcpDiscoveryNode [id=80f14def-9d49-43a0-96bc-6b83aedb3008, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:0], discPort=0, order=26, intOrder=0, lastExchangeTime=1445428036418, loc=true, ver=1.4.1#19700101-sha1:00000000, isClient=true], msg=TcpDiscoveryNodeFailedMessage [failedNodeId=3020dc65-ed3e-426f-8784-5bb766961003, order=4, warning=null, super=TcpDiscoveryAbstractMessage [sndNodeId=10c5cfe9-df07-4dfe-a5c0-460087aa9001, id=eed3e3a8051-008a978d-28cc-4f0c-8728-4a815f858000, verifierNodeId=800cf998-828e-4f56-af6a-c2760c5ed008, topVer=32, pendingIdx=0, isClient=false]]
>       at org.apache.ignite.spi.discovery.tcp.ClientImpl.updateTopologyHistory(ClientImpl.java:720)
>       at org.apache.ignite.spi.discovery.tcp.ClientImpl.access$2700(ClientImpl.java:118)
>       at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processNodeFailedMessage(ClientImpl.java:1812)
>       at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processDiscoveryMessage(ClientImpl.java:1543)
>       at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1467)
>       at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> {noformat}
> {noformat}
> java.lang.AssertionError: Missed message future [rcvCnt=141, acked=0, desc=GridNioRecoveryDescriptor [acked=0, resendCnt=0, rcvCnt=0, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode [id=6090f64b-e019-440b-9d0e-c3642bd3a006, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47503], discPort=47503, order=3, intOrder=3, lastExchangeTime=1445428027468, loc=false, ver=1.4.1#19700101-sha1:00000000, isClient=false], connected=false, connectCnt=1, queueLimit=5120]]
>       at org.apache.ignite.internal.util.nio.GridNioRecoveryDescriptor.ackReceived(GridNioRecoveryDescriptor.java:181)
>       at org.apache.ignite.internal.util.nio.GridNioRecoveryDescriptor.onHandshake(GridNioRecoveryDescriptor.java:251)
>       at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2331)
>       at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2084)
>       at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:1978)
>       at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1914)
>       at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1880)
>       at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1066)
>       at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1214)
>       at org.apache.ignite.internal.processors.clock.GridClockSyncProcessor.publish(GridClockSyncProcessor.java:305)
>       at org.apache.ignite.internal.processors.clock.GridClockSyncProcessor.access$800(GridClockSyncProcessor.java:54)
>       at org.apache.ignite.internal.processors.clock.GridClockSyncProcessor$TimeCoordinator.body(GridClockSyncProcessor.java:382)
>       at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>       at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
