[
https://issues.apache.org/jira/browse/IGNITE-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579022#comment-14579022
]
Semen Boikov commented on IGNITE-1003:
--------------------------------------
Did some testing with one server and one client, and found one suspicious place
in the server thread dump taken at the moment the client complains about an
exchange timeout:
{noformat}
"grid-nio-worker-0-#67%null%" prio=10 tid=0x00007ff3888ce800 nid=0x1824 runnable [0x00007ff30dfbd000]
   java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    - locked <0x00000000ed988a28> (a java.net.SocksSocketImpl)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
    at java.net.Socket.connect(Socket.java:579)
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.openSocket(TcpDiscoverySpi.java:1097)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl.pingNode(ServerImpl.java:541)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl.pingNode(ServerImpl.java:470)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl.pingNode(ServerImpl.java:433)
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.pingNode(TcpDiscoverySpi.java:346)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.tryFailNode(GridDiscoveryManager.java:1459)
    at org.apache.ignite.internal.managers.GridManagerAdapter$1.tryFailNode(GridManagerAdapter.java:484)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$2.onDisconnected(TcpCommunicationSpi.java:256)
    at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onExceptionCaught(GridNioFilterChain.java:253)
    at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedExceptionCaught(GridNioFilterAdapter.java:100)
    at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onExceptionCaught(GridNioCodecFilter.java:74)
    at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedExceptionCaught(GridNioFilterAdapter.java:100)
    at org.apache.ignite.internal.util.nio.GridConnectionBytesVerifyFilter.onExceptionCaught(GridConnectionBytesVerifyFilter.java:65)
    at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedExceptionCaught(GridNioFilterAdapter.java:100)
    at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onExceptionCaught(GridNioServer.java:1985)
    at org.apache.ignite.internal.util.nio.GridNioFilterChain.onExceptionCaught(GridNioFilterChain.java:157)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:1521)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeys(GridNioServer.java:1346)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1275)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1159)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:108)
    at java.lang.Thread.run(Thread.java:722)
{noformat}
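Here the NIO worker hangs in tryFailNode() (it is blocked in a socket connect
made by pingNode()), so communication I/O on that worker is blocked. We need to
move the tryFailNode() call off the NIO worker thread.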
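
To illustrate the direction, here is a minimal sketch of handing a blocking
failure check over to a separate thread so that the I/O thread returns
immediately. This is not Ignite code: the class OffloadFailureCheck, the
NodePinger interface, and the single-thread "failure-check" pool are all
hypothetical names chosen for the example, and the actual fix may look
different.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch, not Ignite code: keep blocking pings off the I/O thread. */
public class OffloadFailureCheck {
    /** Stand-in for a blocking ping, such as the socket connect seen in the dump. */
    interface NodePinger {
        boolean ping(String nodeId) throws Exception;
    }

    /** Dedicated daemon thread that runs the blocking checks (an assumption of this sketch). */
    private final ExecutorService failureCheckPool = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "failure-check");
        t.setDaemon(true);
        return t;
    });

    private final NodePinger pinger;

    OffloadFailureCheck(NodePinger pinger) {
        this.pinger = pinger;
    }

    /** Called on the I/O thread: only enqueues the check and returns at once. */
    void onDisconnected(String nodeId, Runnable onNodeFailed) {
        failureCheckPool.execute(() -> {
            boolean alive;
            try {
                alive = pinger.ping(nodeId); // May block; now on the pool thread.
            }
            catch (Exception e) {
                alive = false; // Treat a failed ping as "node is not reachable".
            }
            if (!alive)
                onNodeFailed.run();
        });
    }

    void shutdown() {
        failureCheckPool.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch failed = new CountDownLatch(1);

        // Simulated slow ping that eventually reports the node as unreachable.
        OffloadFailureCheck check = new OffloadFailureCheck(nodeId -> {
            Thread.sleep(500);
            return false;
        });

        long start = System.nanoTime();
        check.onDisconnected("node-1", failed::countDown);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        System.out.println("onDisconnected returned after ms: " + elapsedMs);
        System.out.println("failure callback ran: " + failed.await(5, TimeUnit.SECONDS));

        check.shutdown();
    }
}
```

With this shape the caller of onDisconnected() is never held up by the ping;
the cost is that the node-failed decision is made asynchronously, so any code
that relied on it being made before onDisconnected() returns would need to be
revisited.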
> Communication issues when running client node in separate subnetwork
> --------------------------------------------------------------------
>
> Key: IGNITE-1003
> URL: https://issues.apache.org/jira/browse/IGNITE-1003
> Project: Ignite
> Issue Type: Bug
> Components: general
> Affects Versions: sprint-4
> Reporter: Valentin Kulichenko
> Priority: Blocker
> Fix For: sprint-5
>
> Attachments: client.zip, server.zip, test.xml
>
>
> Test is the following:
> * Run 8 server nodes on one box.
> * Start and stop client node in a loop on a different box in different
> subnetwork (e.g., over VPN).
> On one of the iterations, the node join process will hang for several minutes
> due to timeouts in the initial partition exchange. At some point communication
> between some of the server nodes stops working: messages wait in the queue
> until the connection is closed and these messages are recovered.
> Attached are the configuration file used to run the test and logs with
> communication debug enabled.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)