[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717071#comment-16717071
 ] 

ASF GitHub Bot commented on IGNITE-8783:


Github user anton-vinogradov closed the pull request at:

https://github.com/apache/ignite/pull/4364


> Failover tests periodically cause hanging of the whole Data Structures suite 
> on TC
> --
>
> Key: IGNITE-8783
> URL: https://issues.apache.org/jira/browse/IGNITE-8783
> Project: Ignite
>  Issue Type: Bug
>  Components: data structures
>Reporter: Ivan Rakov
>Assignee: Anton Vinogradov
>Priority: Major
>  Labels: MakeTeamcityGreenAgain
> Fix For: 2.7
>
>
> History of suite runs: 
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_DataStructures=buildTypeHistoryList_IgniteTests24Java8=%3Cdefault%3E
> Chance of suite hang is 18% in master (based on previous 50 runs).
> Hang is always caused by one of the following failover tests:
> {noformat}
> GridCacheReplicatedDataStructuresFailoverSelfTest#testAtomicSequenceConstantTopologyChange
> GridCachePartitionedDataStructuresFailoverSelfTest#testFairReentrantLockConstantTopologyChangeNonFailoverSafe
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC

2018-07-20 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550706#comment-16550706
 ] 

Pavel Kovalenko commented on IGNITE-8783:
-

[~avinogradov] I've checked TC; no suspicious failures were observed. Overall, 
the fix looks good; no objections from my side. Ready to merge.



[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC

2018-07-18 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548076#comment-16548076
 ] 

Pavel Kovalenko commented on IGNITE-8783:
-

[~avinogradov] OK, now it makes sense. BTW, I propose adding an explicit test 
for the hanging scenario to make sure everything keeps working after future 
changes. An 18% flaky rate in suites not directly related to this 
functionality is not a good way to demonstrate that the problem is fixed. So, 
once the test is added, I think the PR will be ready to merge.
We can have a call on Friday, when I am fully back from vacation :)



[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC

2018-07-18 Thread Anton Vinogradov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548009#comment-16548009
 ] 

Anton Vinogradov commented on IGNITE-8783:
--

[~Jokser], 

The problems are explained in the initial message; the PR fixes #1 and #2.
1) T2 is replaced with CompletableLatchUid; that is just sugar that makes the 
code readable.
2) I removed some code that looked broken and replaced it with new code and 
asserts.
- Sorting by order fixes #2.
- I removed the code as explained in the initial message: since there is no 
way to have a final pending ack on client latch creation, this solves #1.
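
The "sorting by order" fix can be sketched roughly as follows. This is only an 
illustration in plain Java, not the code from the PR: {{Node}} is a 
hypothetical stand-in for {{ClusterNode}}, and {{coordinator}} is a 
hypothetical helper rather than the real {{getLatchCoordinator}}. It assumes 
that a smaller order means an older node.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class OldestNodeCoordinator {
    /** Hypothetical stand-in for ClusterNode: only the join order matters here. */
    static final class Node {
        private final String name;
        private final long order;

        Node(String name, long order) {
            this.name = name;
            this.order = order;
        }

        String name() { return name; }

        long order() { return order; }
    }

    /** Picks the node with the smallest order, i.e. the oldest node (assumption). */
    static Optional<Node> coordinator(Collection<Node> aliveNodes) {
        return aliveNodes.stream()
            .sorted(Comparator.comparing(Node::order))
            .findFirst();
    }

    public static void main(String[] args) {
        List<Node> topology = Arrays.asList(
            new Node("n3", 3), new Node("n1", 1), new Node("n2", 2));

        // The choice does not depend on the iteration order of the collection.
        System.out.println(coordinator(topology).map(Node::name).orElse("none"));
    }
}
```

With an explicit sort, the choice depends only on which nodes are alive, so 
a join or leave of a younger node should not move the coordinator.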

BTW, can we have a call/chat to discuss the changes comfortably?




[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC

2018-07-18 Thread Pavel Kovalenko (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547997#comment-16547997
 ] 

Pavel Kovalenko commented on IGNITE-8783:
-

[~avinogradov] I see a lot of refactoring in the PR, and it's a little bit 
difficult to determine where the actual fix is.
Could you please briefly explain the test scenario that hangs and what the 
key fix is?
Thank you in advance.



[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC

2018-07-17 Thread Alexey Goncharuk (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546540#comment-16546540
 ] 

Alexey Goncharuk commented on IGNITE-8783:
--

[~avinogradov], 
Currently your change never completes the client latch because {{coordinator}} 
is a {{ClusterNode}}, but {{nodeIds}} is a {{Set}} of node IDs, so 
{{nodeIds.contains(coordinator)}} can never be true. Can you please 
clarify why you need to check for the coordinator?
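
To illustrate the mismatch, here is a minimal sketch in plain Java. {{Node}} 
is a hypothetical stand-in for {{ClusterNode}}, and the sketch assumes that 
{{nodeIds}} holds {{UUID}} values (the element type is not shown in the 
snippet under review). {{Set.contains}} accepts any {{Object}}, so passing 
the node compiles but never matches, while passing the node's ID does.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class ContainsMismatch {
    /** Hypothetical stand-in for ClusterNode; not the Ignite interface. */
    static final class Node {
        private final UUID id;

        Node(UUID id) { this.id = id; }

        UUID id() { return id; }
    }

    /** Mirrors the reviewed check: looks up a node object in a set of IDs. */
    static boolean ackedByObject(Set<UUID> nodeIds, Node coordinator) {
        // Compiles because Set.contains takes Object, but a Node never equals a UUID.
        return nodeIds != null && nodeIds.contains(coordinator);
    }

    /** Compares like with like: the coordinator's ID against the set of IDs. */
    static boolean ackedById(Set<UUID> nodeIds, Node coordinator) {
        return nodeIds != null && nodeIds.contains(coordinator.id());
    }

    public static void main(String[] args) {
        Node coordinator = new Node(UUID.randomUUID());
        Set<UUID> nodeIds = new HashSet<>();
        nodeIds.add(coordinator.id());

        System.out.println("by object: " + ackedByObject(nodeIds, coordinator));
        System.out.println("by id: " + ackedById(nodeIds, coordinator));
    }
}
```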



[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC

2018-07-16 Thread Anton Vinogradov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545297#comment-16545297
 ] 

Anton Vinogradov commented on IGNITE-8783:
--

[~ilantukh],
 4 problems related to the ExchangeLatch hang were found:

1) pendingAcks was ignored at client latch recreation on coordinator change. 
+Fixed+.
{noformat}
// There is final ack for created latch.
if (pendingAcks.containsKey(latchId)) {
{noformat}
was replaced with
{noformat}
Set nodeIds = pendingAcks.get(latchId);

// There is final ack for created latch.
if (nodeIds != null && nodeIds.contains(coordinator)) {
{noformat}
2) A topology change could cause a coordinator change even when the 
coordinator node had not failed. +Fixed+.
 Added sorting by order to {{getLatchCoordinator}}:
{noformat}
.sorted(Comparator.comparing(ClusterNode::order))
{noformat}
Now the coordinator is always the oldest node.

3) Sometimes the latch fails on message send when the connection has not been 
established yet.
{noformat}
[2018-07-13 19:06:43,910][ERROR][exchange-worker-#233015%partitioned.GridCachePartitionedDataStructuresFailoverSelfTest4%][TcpCommunicationSpi] Failed to send message to remote node [node=TcpDiscoveryNode [id=3838f6ed-1b4d-484d-9773-df449370, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1531498003891, loc=false, ver=2.6.0#19700101-sha1:, isClient=false], msg=GridIoMessage [plc=2, topic=TOPIC_EXCHANGE, topicOrd=31, ordered=false, timeout=0, skipOnTimeout=false, msg=org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.LatchAckMessage@77e7c81f]]
class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=3838f6ed-1b4d-484d-9773-df449370, addrs=[/127.0.0.1:45010]]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3449)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2977)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2860)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2703)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2662)
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1643)
    at org.apache.ignite.internal.managers.communication.GridIoManager.sendToGridTopic(GridIoManager.java:1715)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch.sendAck(ExchangeLatchManager.java:624)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch.countDown(ExchangeLatchManager.java:642)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.waitPartitionRelease(GridDhtPartitionsExchangeFuture.java:1406)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1177)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:732)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2477)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2357)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
    at java.lang.Thread.run(Thread.java:748)
    Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address [addr=/127.0.0.1:45010, err=Address already in use: no further information]
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3452)
        ... 15 more
    Caused by: java.net.BindException: Address already in use: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:111)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3289)
        ... 15 more
[2018-07-13 19:06:43,911][INFO 

[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC

2018-07-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545072#comment-16545072
 ] 

ASF GitHub Bot commented on IGNITE-8783:


GitHub user anton-vinogradov opened a pull request:

https://github.com/apache/ignite/pull/4364

IGNITE-8783

Signed-off-by: Anton Vinogradov 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/ignite ignite-8783

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/ignite/pull/4364.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4364


commit 87acce74295d483e27794aee2c5e1493b665ee2a
Author: Anton Vinogradov 
Date:   2018-07-13T12:38:26Z

IGNITE-8783

Signed-off-by: Anton Vinogradov 






[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC

2018-07-12 Thread Ilya Lantukh (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541789#comment-16541789
 ] 

Ilya Lantukh commented on IGNITE-8783:
--

[~avinogradov],
I think this code was written to avoid a race between handling an ack and 
processing a node failure.
As far as I understand, there is no mechanism to cancel a latch for an 
outdated topology version.



[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC

2018-07-12 Thread Anton Vinogradov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541697#comment-16541697
 ] 

Anton Vinogradov commented on IGNITE-8783:
--

The reason for the hang has been found. In 
{{org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager#createClientLatch}}
you can see this code:
{noformat}
// There is final ack for created latch.
if (pendingAcks.containsKey(latchId)) {
    latch.complete();
    // This causes the loss of pending acks when the coordinator failure was
    // not handled yet (e.g. while we are handling another node's failure).
    pendingAcks.remove(latchId);
}
else
    clientLatches.put(latchId, latch);
{noformat}

So I propose replacing this code with a simple 

{noformat}
clientLatches.put(latchId, latch);
{noformat}

[~Jokser],
Could you please explain the idea behind handling the final message from 
old_coordinator?
As far as I can see, latches will be recreated on each topology change and 
acks will be resent.
