[ 
https://issues.apache.org/jira/browse/IGNITE-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492783#comment-16492783
 ] 

Alexey Kuznetsov edited comment on IGNITE-5968 at 5/28/18 3:25 PM:
-------------------------------------------------------------------

[~DmitriyGovorukhin] [~agoncharuk] 
The bug is that the "lost partition" event is fired only on the new primary node, not 
on the new backup node (after the old primary and backup nodes go down).

The partition loss policy is _IGNORE_. The test scenario is as follows:

{code:java}
startGrid(0);
startGrid(1);
startGrid(2);
startGrid(3);

ignite(2).events().localListen(lsnr1, 
EventType.EVT_CACHE_REBALANCE_PART_DATA_LOST);
ignite(3).events().localListen(lsnr2, 
EventType.EVT_CACHE_REBALANCE_PART_DATA_LOST);

// Node 0 is primary for key1, node 1 is backup for key1.
cache.put(key1, key1);

stopGrid(0, true);
stopGrid(1, true); // After both grids are stopped, the partition for key1 is lost.

// Node 2 is the new primary node for key1, node 3 is the new backup node for key1.

// EVT_CACHE_REBALANCE_PART_DATA_LOST is fired only on the new primary node.
checkEventIsFired(lsnr1, lsnr2);
{code}

When the 2 nodes holding the partition for key1 crash, the "lost 
partition" event is fired only on the new primary node (not on the backup).

The essential reason for this bug is that the new primary node *doesn't set* the LOST 
state on the partitions. 
Instead, it pretends that no partition loss has happened and clears the 
partition loss state right away, see 
_GridDhtPartitionTopologyImpl#detectLostPartitions_.
The primary node then sends the partition map to the backup node, and the backup node 
detects *no* lost partitions. So, no events are fired on the backup node.

One solution is to broadcast the partition map, including the lost partitions, via 
_GridDhtPartitionsFullMessage_.
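
To make the mechanism above easier to discuss, here is a toy model in plain Java. It is only a sketch and not Ignite code: the names _LostPartitionSketch_, _Node_, _State_ and _detectAndBroadcast_ are hypothetical, and the partition map is reduced to a simple map of partition id to state. It assumes the backup fires an event only for partitions that arrive marked LOST in the map it receives.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types are Ignite classes.
class LostPartitionSketch {
    enum State { OWNING, LOST }

    /** A node that records the ids of partitions it fired a "lost" event for. */
    static class Node {
        final List<Integer> firedEvents = new ArrayList<>();

        /** Fires a local event for every partition marked LOST in the given map. */
        void onPartitionMap(Map<Integer, State> partMap) {
            for (Map.Entry<Integer, State> e : partMap.entrySet()) {
                if (e.getValue() == State.LOST)
                    firedEvents.add(e.getKey());
            }
        }
    }

    /**
     * Simulates the new primary detecting a lost partition and sending its map to the backup.
     *
     * @param keepLostState false models the behavior described in this comment (loss state
     *                      cleared before the map is sent); true models the proposed change.
     */
    static void detectAndBroadcast(Node primary, Node backup, int lostPart, boolean keepLostState) {
        Map<Integer, State> partMap = new HashMap<>();
        partMap.put(lostPart, State.LOST);

        // The primary fires its own event as soon as it detects the loss.
        primary.onPartitionMap(partMap);

        // Assumption: clearing the state is modeled as flipping the entry back to OWNING.
        if (!keepLostState)
            partMap.put(lostPart, State.OWNING);

        // The backup only sees whatever state the sent map carries.
        backup.onPartitionMap(new HashMap<>(partMap));
    }

    public static void main(String[] args) {
        Node primary1 = new Node(), backup1 = new Node();
        detectAndBroadcast(primary1, backup1, 7, false);
        // prints: state cleared: primary=[7] backup=[]
        System.out.println("state cleared: primary=" + primary1.firedEvents + " backup=" + backup1.firedEvents);

        Node primary2 = new Node(), backup2 = new Node();
        detectAndBroadcast(primary2, backup2, 7, true);
        // prints: state kept: primary=[7] backup=[7]
        System.out.println("state kept: primary=" + primary2.firedEvents + " backup=" + backup2.firedEvents);
    }
}
```

In this model the backup misses the event only because the LOST state never reaches it. Whether keeping the state in the broadcast map is compatible with the _IGNORE_ policy in the real topology code is the open question.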

Do you agree with this solution?


> Test fail in Ignite Cache 2: GridCachePartitionNotLoadedEventSelfTest.testPrimaryAndBackupDead
> ----------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-5968
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5968
>             Project: Ignite
>          Issue Type: Test
>    Affects Versions: 2.1
>            Reporter: Dmitriy Govorukhin
>            Assignee: Alexey Kuznetsov
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain
>             Fix For: 2.6
>
>
> java.util.concurrent.TimeoutException: Test has been timed out [test=testPrimaryAndBackupDead, timeout=300000]
>     at org.apache.ignite.testframework.junits.GridAbstractTest.runTest(GridAbstractTest.java:1949)
>     at junit.framework.TestCase.runBare(TestCase.java:141)
>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>     at junit.framework.TestResult.run(TestResult.java:125)
>     at junit.framework.TestCase.run(TestCase.java:129)
>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>     at junit.framework.TestSuite.run(TestSuite.java:250)
>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>     at junit.framework.TestSuite.run(TestSuite.java:250)
>     at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:84)


