[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC
[ https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717071#comment-16717071 ]

ASF GitHub Bot commented on IGNITE-8783:

Github user anton-vinogradov closed the pull request at:

    https://github.com/apache/ignite/pull/4364

> Failover tests periodically cause hanging of the whole Data Structures suite on TC
> --
>
> Key: IGNITE-8783
> URL: https://issues.apache.org/jira/browse/IGNITE-8783
> Project: Ignite
> Issue Type: Bug
> Components: data structures
> Reporter: Ivan Rakov
> Assignee: Anton Vinogradov
> Priority: Major
> Labels: MakeTeamcityGreenAgain
> Fix For: 2.7
>
> History of suite runs:
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_DataStructures=buildTypeHistoryList_IgniteTests24Java8=%3Cdefault%3E
> Chance of suite hang is 18% in master (based on previous 50 runs).
> Hang is always caused by one of the following failover tests:
> {noformat}
> GridCacheReplicatedDataStructuresFailoverSelfTest#testAtomicSequenceConstantTopologyChange
> GridCachePartitionedDataStructuresFailoverSelfTest#testFairReentrantLockConstantTopologyChangeNonFailoverSafe
> {noformat}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC
[ https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550706#comment-16550706 ]

Pavel Kovalenko commented on IGNITE-8783:

[~avinogradov] I've checked TC; no suspicious failures have been observed. Overall the fix looks good, no objections from my side. Ready to merge.
[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC
[ https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548076#comment-16548076 ]

Pavel Kovalenko commented on IGNITE-8783:

[~avinogradov] OK, now it makes sense. By the way, I propose adding an explicit test for the hanging scenario, to make sure everything keeps working after future changes. An 18% flaky rate in suites that are not directly related to this functionality is not a good way to demonstrate that the problem is fixed. Once the test is added, I think the PR will be ready to merge. We can have a call on Friday, when I am fully back from vacation :)
[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC
[ https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548009#comment-16548009 ]

Anton Vinogradov commented on IGNITE-8783:

[~Jokser], the problems are explained in the initial message; the PR fixes #1 and #2.

1) T2 is replaced with CompletableLatchUid; that is sugar that makes the code more readable.
2) I removed some code that looked broken and replaced it with new code and asserts.
- Sorting by order fixes #2.
- I removed the code as explained in the initial message: there is no way to have a final pending ack at client latch creation, so this solves #1.

BTW, can we have a call/chat to discuss the changes more comfortably?
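The "sorting by order" point in the comment above can be illustrated with a minimal sketch. It assumes only that every node has a monotonically assigned join order; the `Node` class and the `coordinator` method are hypothetical stand-ins, not Ignite's `ClusterNode` or `getLatchCoordinator`, and whatever filtering the real method applies is not shown in this thread and is omitted here.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class OldestNodeCoordinator {
    /** Hypothetical stand-in for a cluster node: a name plus its join order. */
    static final class Node {
        final String name;
        final long order;

        Node(String name, long order) {
            this.name = name;
            this.order = order;
        }

        long order() {
            return order;
        }
    }

    /** Picks the node with the smallest order, regardless of how the input is iterated. */
    static Optional<Node> coordinator(Collection<Node> aliveNodes) {
        return aliveNodes.stream()
            .sorted(Comparator.comparing(Node::order))
            .findFirst();
    }

    public static void main(String[] args) {
        // Two views of the topology with different iteration order; the second has one more node.
        List<Node> before = Arrays.asList(new Node("n3", 3), new Node("n1", 1), new Node("n2", 2));
        List<Node> after = Arrays.asList(new Node("n2", 2), new Node("n4", 4), new Node("n1", 1), new Node("n3", 3));

        // As long as the oldest node is alive, both views agree on the coordinator.
        System.out.println(coordinator(before).get().name);
        System.out.println(coordinator(after).get().name);
    }
}
```

Without the sort, `findFirst()` would depend on the iteration order of the node collection, which is one plausible way a topology change could move the coordinator even though the old coordinator is still alive.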
[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC
[ https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547997#comment-16547997 ]

Pavel Kovalenko commented on IGNITE-8783:

[~avinogradov] I see a lot of refactoring in the PR, and it is a little difficult to tell where the actual fix is. Could you please briefly explain the test scenario that hangs and which change is the key fix? Thank you in advance.
[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC
[ https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546540#comment-16546540 ]

Alexey Goncharuk commented on IGNITE-8783:

[~avinogradov], Currently your change never completes the client latch because {{coordinator}} is a {{ClusterNode}}, but {{nodeIds}} is a {{Set}}. Can you please clarify why you need to check for coordinator?
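The type mismatch pointed out above compiles silently because `Set.contains` accepts any `Object`. A minimal sketch of the effect, assuming the set holds node ids of type `UUID` (the element type is not visible in this thread, so that is an assumption); `Node` is a hypothetical stand-in for `ClusterNode`.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

class ContainsTypeMismatch {
    /** Hypothetical stand-in for a cluster node; only the id matters here. */
    static final class Node {
        private final UUID id;

        Node(UUID id) {
            this.id = id;
        }

        UUID id() {
            return id;
        }
    }

    /** Looks up the node object itself: compiles, but can never match a UUID element. */
    static boolean containsNode(Set<UUID> nodeIds, Node node) {
        return nodeIds.contains(node);
    }

    /** Looks up the node's id: this is the comparison that can succeed. */
    static boolean containsNodeId(Set<UUID> nodeIds, Node node) {
        return nodeIds.contains(node.id());
    }

    public static void main(String[] args) {
        Node coordinator = new Node(UUID.randomUUID());

        // Assumption: pending acks are recorded by node id, not by node object.
        Set<UUID> nodeIds = new HashSet<>();
        nodeIds.add(coordinator.id());

        System.out.println("contains(node)      = " + containsNode(nodeIds, coordinator));   // false
        System.out.println("contains(node.id()) = " + containsNodeId(nodeIds, coordinator)); // true
    }
}
```

Under that assumption, a check written as `nodeIds.contains(coordinator)` is always false, which matches the observation that the client latch never completes.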
[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC
[ https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545297#comment-16545297 ]

Anton Vinogradov commented on IGNITE-8783:

[~ilantukh], 4 problems related to the ExchangeLatch hang were found:

1) pendingAcks was ignored at client latch recreation on coordinator change. +Fixed+.
{noformat}
// There is final ack for created latch.
if (pendingAcks.containsKey(latchId)) {
{noformat}
was replaced with
{noformat}
Set nodeIds = pendingAcks.get(latchId);

// There is final ack for created latch.
if (nodeIds != null && nodeIds.contains(coordinator)) {
{noformat}

2) A topology change could cause a coordinator change even when the coordinator node had not failed. +Fixed+. Added sorting by order to {{getLatchCoordinator}}:
{noformat}
.sorted(Comparator.comparing(ClusterNode::order))
{noformat}
Now the coordinator is always the oldest node.

3) Sometimes the latch fails on message send when the connection has not been established yet.
{noformat}
[2018-07-13 19:06:43,910][ERROR][exchange-worker-#233015%partitioned.GridCachePartitionedDataStructuresFailoverSelfTest4%][TcpCommunicationSpi] Failed to send message to remote node [node=TcpDiscoveryNode [id=3838f6ed-1b4d-484d-9773-df449370, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1531498003891, loc=false, ver=2.6.0#19700101-sha1:, isClient=false], msg=GridIoMessage [plc=2, topic=TOPIC_EXCHANGE, topicOrd=31, ordered=false, timeout=0, skipOnTimeout=false, msg=org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.LatchAckMessage@77e7c81f]]
class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=3838f6ed-1b4d-484d-9773-df449370, addrs=[/127.0.0.1:45010]]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3449)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2977)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2860)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2703)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2662)
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1643)
    at org.apache.ignite.internal.managers.communication.GridIoManager.sendToGridTopic(GridIoManager.java:1715)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch.sendAck(ExchangeLatchManager.java:624)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch.countDown(ExchangeLatchManager.java:642)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.waitPartitionRelease(GridDhtPartitionsExchangeFuture.java:1406)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1177)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:732)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2477)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2357)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
    at java.lang.Thread.run(Thread.java:748)
    Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/127.0.0.1:45010, err=Address already in use: no further information]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3452)
        ... 15 more
    Caused by: java.net.BindException: Address already in use: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:111)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3289)
        ... 15 more
[2018-07-13 19:06:43,911][INFO
[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC
[ https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545072#comment-16545072 ]

ASF GitHub Bot commented on IGNITE-8783:

GitHub user anton-vinogradov opened a pull request:

    https://github.com/apache/ignite/pull/4364

    IGNITE-8783

    Signed-off-by: Anton Vinogradov

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/ignite ignite-8783

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/ignite/pull/4364.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #4364

commit 87acce74295d483e27794aee2c5e1493b665ee2a
Author: Anton Vinogradov
Date: 2018-07-13T12:38:26Z

    IGNITE-8783

    Signed-off-by: Anton Vinogradov
[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC
[ https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541789#comment-16541789 ]

Ilya Lantukh commented on IGNITE-8783:

[~avinogradov], I think this code was written to avoid a race between handling an ack and processing a node failure. As far as I understand, there is no mechanism to cancel a latch for an outdated topology version.
[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC
[ https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541697#comment-16541697 ]

Anton Vinogradov commented on IGNITE-8783:

The reason for the hang was found. In {{org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager#createClientLatch}} you can see this code:
{noformat}
// There is final ack for created latch.
if (pendingAcks.containsKey(latchId)) {
    latch.complete();
    pendingAcks.remove(latchId); // this causes loss of pending acks when the coordinator failure has not been handled yet (e.g. we are handling another node's failure)
}
else
    clientLatches.put(latchId, latch);
{noformat}
So I propose replacing this code with simply
{noformat}
clientLatches.put(latchId, latch);
{noformat}

[~Jokser], could you please explain the idea behind handling the final message from old_coordinator? As far as I can see, latches will be recreated on each topology change and acks will be resent.