[ 
https://issues.apache.org/jira/browse/IGNITE-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexei Scherbakov updated IGNITE-5457:
--------------------------------------
    Description: 
I observe buggy behavior in the case of a simulated split brain.

Nodes in DataCenter1 (where the coordinator is located) slowly leave the grid,

while nodes in DataCenter2 stay in the grid forever.

In the logs I see multiple attempts to kick the coordinator on a communication 
socket timeout, but the number of nodes does not change.

Note that my failureDetectionTimeout is significantly higher than the 
communication socket timeout.

It looks like the coordinator cannot be kicked from the topology by 
TcpCommunicationSpi.

{noformat}
19:13:53.978 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - 
Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, 
checkpointLockHoldTime=131ms, reason='timeout']
19:14:03.289 [WARN ] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - Connect timed 
out (consider increasing 'connTimeout' configuration property) 
[addr=grid457.ca.sbrf.ru/10.116.206.193:47100, connTimeout=120000]
19:14:03.289 [ERROR] [o.a.i.s.c.tcp.TcpCommunicationSpi] [T:] - 
TcpCommunicationSpi failed to establish connection to node, node will be 
dropped from cluster [rmtNode=TcpDiscoveryNode 
[id=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, addrs=[10.116.206.193], 
sockAddrs=[grid457.ca.sbrf.ru/10.116.206.193:47500], discPort=47500, order=1, 
intOrder=1, lastExchangeTime=1496936257121, loc=false, 
ver=1.10.3#20170604-sha1:30521a17, isClient=false]]
org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node 
still alive?). Make sure that each ComputeTask and cache Transaction has a 
timeout set in order to prevent parties from waiting forever in case of network 
issues [nodeId=a8ac1b24-8377-4064-a3d9-02bad9c6f2bb, 
addrs=[grid457.ca.sbrf.ru/10.116.206.193:47100]]
        at 
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3022)
 [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at 
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2636)
 [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at 
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2528)
 [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at 
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.access$5800(TcpCommunicationSpi.java:245)
 [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at 
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.processDisconnect(TcpCommunicationSpi.java:3830)
 [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at 
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.body(TcpCommunicationSpi.java:3656)
 [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) 
[ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
        Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect 
to address [addr=grid457.ca.sbrf.ru/10.116.206.193:47100, err=null]
                at 
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3027)
 [ignite-core-1.10.3.ea10.jar:1.10.3.ea10]
                ... 6 common frames omitted
        Caused by: java.net.SocketTimeoutException: null
                at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:118)
                at 
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2884)
                ... 6 common frames omitted
19:14:23.989 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - 
Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, 
checkpointLockHoldTime=130ms, reason='timeout']
19:14:34.078 [INFO ] [o.g.g.i.p.c.d.GridCacheDatabaseSharedManager] [T:] - 
Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, 
checkpointLockHoldTime=211ms, reason='timeout']
19:14:37.967 [INFO ] [o.a.i.i.IgniteKernal%DPL_GRID%grid880] [T:] - 
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=21e01ea7, name=DPL_GRID%grid880, uptime=00:37:00:200]
    ^-- H/N/C [hosts=144, nodes=160, CPUs=8064]
    ^-- CPU [cur=0.2%, avg=2.37%, GC=0%]
    ^-- PageMemory [pages=604144]
    ^-- Heap [used=33396MB, free=49.04%, comm=65536MB]
    ^-- Non heap [used=171MB, free=-1%, comm=173MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=0, qSize=0]
    ^-- Outbound messages queue [size=0]
{noformat}
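
A minimal sketch of how the two observations above could be checked against a node's log: it counts the "node will be dropped from cluster" errors and collects the `nodes=` value from each periodic metrics line. The class and method names (`TopologyLogCheck`, `countDropAttempts`, `nodeCounts`) are hypothetical and not part of Ignite, and the sketch assumes each log message sits on a single line (the excerpt above is wrapped by the mail client).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Ignite: summarizes a node log for this scenario. */
final class TopologyLogCheck {
    /** Topology size in the metrics line, e.g. "H/N/C [hosts=144, nodes=160, CPUs=8064]". */
    private static final Pattern NODES =
        Pattern.compile("H/N/C \\[hosts=\\d+, nodes=(\\d+),");

    /** Text logged by TcpCommunicationSpi in the excerpt above when it gives up on a peer. */
    private static final String DROP_MSG = "node will be dropped from cluster";

    /** Counts log lines that announce an attempt to drop a node. */
    static int countDropAttempts(List<String> lines) {
        int cnt = 0;
        for (String line : lines) {
            if (line.contains(DROP_MSG))
                cnt++;
        }
        return cnt;
    }

    /** Returns the "nodes=" value of every metrics line, in log order. */
    static List<Integer> nodeCounts(List<String> lines) {
        List<Integer> res = new ArrayList<>();
        for (String line : lines) {
            Matcher m = NODES.matcher(line);
            if (m.find())
                res.add(Integer.parseInt(m.group(1)));
        }
        return res;
    }

    public static void main(String[] args) {
        // Sample lines shortened from the excerpt above (assumed unwrapped).
        List<String> log = Arrays.asList(
            "TcpCommunicationSpi failed to establish connection to node, node will be dropped from cluster",
            "    ^-- H/N/C [hosts=144, nodes=160, CPUs=8064]");

        System.out.println("drop attempts: " + countDropAttempts(log));
        System.out.println("node counts:   " + nodeCounts(log));
    }
}
```

If the drop attempts keep increasing while the node counts stay constant, the log matches the behavior described here.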


> Weird discovery behavior on split brain.
> ----------------------------------------
>
>                 Key: IGNITE-5457
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5457
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 2.0
>            Reporter: Alexei Scherbakov
>            Priority: Critical
>             Fix For: 2.2
>



