[ https://issues.apache.org/jira/browse/IGNITE-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991651#comment-14991651 ]
Semen Boikov edited comment on IGNITE-1758 at 11/9/15 1:30 PM:
---------------------------------------------------------------

Created a test which restarts only server nodes; it fails from time to time with this assert:
{noformat}
[11:51:24]W: [org.apache.ignite:ignite-core] java.lang.AssertionError: Invalid node order: TcpDiscoveryNode [id=005cd5de-f1f1-435c-8ac4-f4474c28d000, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47503], discPort=47503, order=0, intOrder=61, lastExchangeTime=1446724284050, loc=false, ver=1.5.0#20151105-sha1:94119c29, isClient=false]
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:51)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:48)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.util.lang.GridFunc.isAll(GridFunc.java:3362)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.util.IgniteUtils.arrayList(IgniteUtils.java:9176)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.util.IgniteUtils.arrayList(IgniteUtils.java:9149)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.nodes(TcpDiscoveryNodesRing.java:616)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.visibleNodes(TcpDiscoveryNodesRing.java:128)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.ServerImpl.notifyDiscovery(ServerImpl.java:1260)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.ServerImpl.access$2700(ServerImpl.java:157)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeAddFinishedMessage(ServerImpl.java:3685)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2157)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:5600)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
{noformat}

The assert fails because, when the NodeAddFinished event is received for some node, there are nodes with a lower internal order which have not received NodeAddFinished.

Debugged this failure and found that a node can get an IO error while trying to send a message to the next node and add that node to its failed list, although the node is still alive and can process messages from other nodes. To fix this issue it is necessary to pass the failedNodes collection to the next nodes so that it is consistent across all nodes.


was (Author: sboikov):
Created a test which restarts only server nodes; it fails from time to time with this assert:
{noformat}
[11:51:24]W: [org.apache.ignite:ignite-core] java.lang.AssertionError: Invalid node order: TcpDiscoveryNode [id=005cd5de-f1f1-435c-8ac4-f4474c28d000, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47503], discPort=47503, order=0, intOrder=61, lastExchangeTime=1446724284050, loc=false, ver=1.5.0#20151105-sha1:94119c29, isClient=false]
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:51)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing$1.apply(TcpDiscoveryNodesRing.java:48)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.util.lang.GridFunc.isAll(GridFunc.java:3362)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.util.IgniteUtils.arrayList(IgniteUtils.java:9176)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.internal.util.IgniteUtils.arrayList(IgniteUtils.java:9149)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.nodes(TcpDiscoveryNodesRing.java:616)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.visibleNodes(TcpDiscoveryNodesRing.java:128)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.ServerImpl.notifyDiscovery(ServerImpl.java:1260)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.ServerImpl.access$2700(ServerImpl.java:157)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeAddFinishedMessage(ServerImpl.java:3685)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2157)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:5600)
[11:51:24]W: [org.apache.ignite:ignite-core] at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
{noformat}

The assert fails because, when the NodeAddFinished event is received for some node, there are nodes with a lower internal order which have not received NodeAddFinished.

Debugged this failure and found that sometimes TcpDiscoveryNodeAddedMessage was missed during topology changes and the new coordinator did not handle it. From the logs it is not possible to say exactly why the message was missed. If the size of PendingMessages is increased, this assert does not reproduce, but the test sometimes hangs on continuous query start.

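The fix proposed in the edited comment (passing the failedNodes collection along the ring) can be illustrated with a minimal sketch. It is only a toy model under stated assumptions: RingNode, RingMessage and circulate are hypothetical names invented for this illustration and are not the classes or methods used in ServerImpl, and the real discovery protocol is considerably more involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified ring model; these are not Ignite's real classes. */
class FailedNodesSketch {
    /** Ring message that carries the sender's view of failed nodes. */
    static final class RingMessage {
        final String payload;
        final Set<String> failedNodes;

        RingMessage(String payload, Set<String> failedNodes) {
            this.payload = payload;
            // Copy so that later changes on the sender do not leak into the message.
            this.failedNodes = Collections.unmodifiableSet(new LinkedHashSet<>(failedNodes));
        }
    }

    /** Server node with its own local list of nodes it considers failed. */
    static final class RingNode {
        final String id;
        final Set<String> failedNodes = new LinkedHashSet<>();

        RingNode(String id) {
            this.id = id;
        }

        /** Called when sending to the next node ends with an IO error. */
        void onSendFailure(String nextNodeId) {
            failedNodes.add(nextNodeId);
        }

        /** Attaches the local failed-node view to an outgoing message. */
        RingMessage prepare(String payload) {
            return new RingMessage(payload, failedNodes);
        }

        /** Merges the sender's failed-node view into the local one. */
        void receive(RingMessage msg) {
            failedNodes.addAll(msg.failedNodes);
        }
    }

    /**
     * Passes a message once around the ring, starting after ring.get(start).
     * Each holder skips nodes it considers failed. Returns the ids that received it.
     */
    static List<String> circulate(List<RingNode> ring, int start, String payload) {
        List<String> visited = new ArrayList<>();
        RingNode holder = ring.get(start);
        RingMessage msg = holder.prepare(payload);

        for (int i = 1; i < ring.size(); i++) {
            RingNode candidate = ring.get((start + i) % ring.size());

            if (holder.failedNodes.contains(candidate.id))
                continue; // The current holder believes this node has failed.

            candidate.receive(msg);
            visited.add(candidate.id);

            // The receiver forwards the message with its merged view attached.
            holder = candidate;
            msg = holder.prepare(payload);
        }

        return visited;
    }

    public static void main(String[] args) {
        List<RingNode> ring = new ArrayList<>();
        for (String id : new String[] {"A", "B", "C", "D"})
            ring.add(new RingNode(id));

        // A gets an IO error sending to B, although B is actually still alive.
        ring.get(0).onSendFailure("B");

        List<String> visited = circulate(ring, 0, "NodeAddFinished");

        System.out.println("visited: " + visited);
        for (RingNode n : ring)
            System.out.println(n.id + " failedNodes=" + n.failedNodes);
    }
}
```

In this model, without the merge in receive() only A would treat B as failed while C and D would still treat it as alive, which is the kind of inconsistent view the comment describes. How a node should react when it learns that it is in another node's failed list is not covered by the comment, so the sketch leaves that out.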
> Clients don't survive during massive servers shutdown
> -----------------------------------------------------
>
>                 Key: IGNITE-1758
>                 URL: https://issues.apache.org/jira/browse/IGNITE-1758
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: ignite-1.4
>            Reporter: Denis Magda
>            Assignee: Semen Boikov
>            Priority: Blocker
>             Fix For: 1.5
>
>         Attachments: ignite-1758-test.patch
>
>
> There is a real world use case.
> Start a reasonable number of servers and clients.
> Perform cache operations under a transaction.
> Stop half of the servers. Clients must survive and keep executing their
> transactions.
> Did the following test:
> - Started 14 servers and 14 clients;
> - Clients execute transactional put operations;
> - Stopped 7 servers.
> Getting different assertions on the client side.
> {noformat}
> [15:47:33,401][ERROR][tcp-client-disco-msg-worker-#521%internal.IgniteClientReconnectCacheMultiThreadedTest18][TcpDiscoverySpi]
> Runtime error caught during grid runnable execution: IgniteSpiThread
> [name=tcp-client-disco-msg-worker-#521%internal.IgniteClientReconnectCacheMultiThreadedTest18]
> java.lang.AssertionError: lastVer=29, newVer=32, locNode=TcpDiscoveryNode
> [id=80f14def-9d49-43a0-96bc-6b83aedb3008, addrs=[127.0.0.1],
> sockAddrs=[/127.0.0.1:0], discPort=0, order=26, intOrder=0,
> lastExchangeTime=1445428036418, loc=true, ver=1.4.1#19700101-sha1:00000000,
> isClient=true], msg=TcpDiscoveryNodeFailedMessage
> [failedNodeId=3020dc65-ed3e-426f-8784-5bb766961003, order=4, warning=null,
> super=TcpDiscoveryAbstractMessage
> [sndNodeId=10c5cfe9-df07-4dfe-a5c0-460087aa9001,
> id=eed3e3a8051-008a978d-28cc-4f0c-8728-4a815f858000,
> verifierNodeId=800cf998-828e-4f56-af6a-c2760c5ed008, topVer=32, pendingIdx=0,
> isClient=false]]
>     at org.apache.ignite.spi.discovery.tcp.ClientImpl.updateTopologyHistory(ClientImpl.java:720)
>     at org.apache.ignite.spi.discovery.tcp.ClientImpl.access$2700(ClientImpl.java:118)
>     at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processNodeFailedMessage(ClientImpl.java:1812)
>     at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processDiscoveryMessage(ClientImpl.java:1543)
>     at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1467)
>     at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> {noformat}
> {noformat}
> java.lang.AssertionError: Missed message future [rcvCnt=141, acked=0,
> desc=GridNioRecoveryDescriptor [acked=0, resendCnt=0, rcvCnt=0,
> reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode
> [id=6090f64b-e019-440b-9d0e-c3642bd3a006, addrs=[127.0.0.1],
> sockAddrs=[/127.0.0.1:47503], discPort=47503, order=3, intOrder=3,
> lastExchangeTime=1445428027468, loc=false, ver=1.4.1#19700101-sha1:00000000,
> isClient=false], connected=false, connectCnt=1, queueLimit=5120]]
>     at org.apache.ignite.internal.util.nio.GridNioRecoveryDescriptor.ackReceived(GridNioRecoveryDescriptor.java:181)
>     at org.apache.ignite.internal.util.nio.GridNioRecoveryDescriptor.onHandshake(GridNioRecoveryDescriptor.java:251)
>     at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2331)
>     at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2084)
>     at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:1978)
>     at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1914)
>     at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1880)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1066)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1214)
>     at org.apache.ignite.internal.processors.clock.GridClockSyncProcessor.publish(GridClockSyncProcessor.java:305)
>     at org.apache.ignite.internal.processors.clock.GridClockSyncProcessor.access$800(GridClockSyncProcessor.java:54)
>     at org.apache.ignite.internal.processors.clock.GridClockSyncProcessor$TimeCoordinator.body(GridClockSyncProcessor.java:382)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)