[ 
https://issues.apache.org/jira/browse/IGNITE-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979980#comment-14979980
 ] 

Semen Boikov commented on IGNITE-1758:
--------------------------------------

So far found several issue in logic handling client reconnect to cluster:
- some messages were not added to EnsuredMessageHistory (i.e. NodeAddedMessage 
created during processJoinRequestMessage)
- on client side BufferedInputStream was created two times for the same socket 
(in Reconnector and then in SocketReader), I suspect some messages could be not 
read in this case
- TcpDiscoveryNodeAddedMessage sent to joinning client node should contain 
'topology' collection with all nodes joined before, but 'topology' was 
initialized at the moment when this message was sent using nodes at current 
moment, so topology could contain not all needed nodes if it was sent during 
client reconnect

> Clients don't survive during massive servers shutdown
> -----------------------------------------------------
>
>                 Key: IGNITE-1758
>                 URL: https://issues.apache.org/jira/browse/IGNITE-1758
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: ignite-1.4
>            Reporter: Denis Magda
>            Assignee: Semen Boikov
>            Priority: Blocker
>             Fix For: 1.5
>
>         Attachments: ignite-1758-test.patch
>
>
> There is a real world use case.
> Start sensible amount of servers and clients.
> Perform cache operations under a transaction.
> Stop a half of the servers. Clients must survive and keep execution their 
> transactions.
> Did the following test:
> - Started 14 servers and 14 clients;
> - Clients execute transactional put operations;
> - Stopped 7 servers.
> Getting different assertions on clients side.
> {noformat}
> [15:47:33,401][ERROR][tcp-client-disco-msg-worker-#521%internal.IgniteClientReconnectCacheMultiThreadedTest18][TcpDiscoverySpi]
>  Runtime error caught during grid runnable execution: IgniteSpiThread 
> [name=tcp-client-disco-msg-worker-#521%internal.IgniteClientReconnectCacheMultiThreadedTest18]
> java.lang.AssertionError: lastVer=29, newVer=32, locNode=TcpDiscoveryNode 
> [id=80f14def-9d49-43a0-96bc-6b83aedb3008, addrs=[127.0.0.1], 
> sockAddrs=[/127.0.0.1:0], discPort=0, order=26, intOrder=0, 
> lastExchangeTime=1445428036418, loc=true, ver=1.4.1#19700101-sha1:00000000, 
> isClient=true], msg=TcpDiscoveryNodeFailedMessage 
> [failedNodeId=3020dc65-ed3e-426f-8784-5bb766961003, order=4, warning=null, 
> super=TcpDiscoveryAbstractMessage 
> [sndNodeId=10c5cfe9-df07-4dfe-a5c0-460087aa9001, 
> id=eed3e3a8051-008a978d-28cc-4f0c-8728-4a815f858000, 
> verifierNodeId=800cf998-828e-4f56-af6a-c2760c5ed008, topVer=32, pendingIdx=0, 
> isClient=false]]
>       at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl.updateTopologyHistory(ClientImpl.java:720)
>       at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl.access$2700(ClientImpl.java:118)
>       at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processNodeFailedMessage(ClientImpl.java:1812)
>       at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processDiscoveryMessage(ClientImpl.java:1543)
>       at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1467)
>       at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> {noformat}
> {noformat}
> java.lang.AssertionError: Missed message future [rcvCnt=141, acked=0, 
> desc=GridNioRecoveryDescriptor [acked=0, resendCnt=0, rcvCnt=0, 
> reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode 
> [id=6090f64b-e019-440b-9d0e-c3642bd3a006, addrs=[127.0.0.1], 
> sockAddrs=[/127.0.0.1:47503], discPort=47503, order=3, intOrder=3, 
> lastExchangeTime=1445428027468, loc=false, ver=1.4.1#19700101-sha1:00000000, 
> isClient=false], connected=false, connectCnt=1, queueLimit=5120]]
>       at 
> org.apache.ignite.internal.util.nio.GridNioRecoveryDescriptor.ackReceived(GridNioRecoveryDescriptor.java:181)
>       at 
> org.apache.ignite.internal.util.nio.GridNioRecoveryDescriptor.onHandshake(GridNioRecoveryDescriptor.java:251)
>       at 
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2331)
>       at 
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2084)
>       at 
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:1978)
>       at 
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1914)
>       at 
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1880)
>       at 
> org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1066)
>       at 
> org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1214)
>       at 
> org.apache.ignite.internal.processors.clock.GridClockSyncProcessor.publish(GridClockSyncProcessor.java:305)
>       at 
> org.apache.ignite.internal.processors.clock.GridClockSyncProcessor.access$800(GridClockSyncProcessor.java:54)
>       at 
> org.apache.ignite.internal.processors.clock.GridClockSyncProcessor$TimeCoordinator.body(GridClockSyncProcessor.java:382)
>       at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>       at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to