Aleksey Plekhanov created IGNITE-8166:
-----------------------------------------
Summary: stopGrid() hangs in some cases when node is invalidated
and PDS is enabled
Key: IGNITE-8166
URL: https://issues.apache.org/jira/browse/IGNITE-8166
Project: Ignite
Issue Type: Bug
Affects Versions: 2.5
Reporter: Aleksey Plekhanov
Node invalidation via FailureProcessor can hang {{exchange-worker}} and
{{stopGrid()}} when PDS is enabled.
Reproducer (racy; it sometimes finishes without hanging):
{code:java}
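import java.util.HashMap;
import java.util.Map;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureContext;
import org.apache.ignite.failure.FailureHandler;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.internal.IgniteEx;
import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;

public class StopNodeHangsTest extends GridCommonAbstractTest {
    /** Offheap size for memory policy. */
    private static final int SIZE = 10 * 1024 * 1024;

    /** Page size. */
    static final int PAGE_SIZE = 2048;

    /** Number of entries. */
    static final int ENTRIES = 2_000;

    /** {@inheritDoc} */
    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
        IgniteConfiguration cfg = super.getConfiguration(igniteInstanceName);

        DataStorageConfiguration dsCfg = new DataStorageConfiguration();
        DataRegionConfiguration dfltPlcCfg = new DataRegionConfiguration();

        // Small persistent default data region.
        dfltPlcCfg.setName("dfltPlc");
        dfltPlcCfg.setInitialSize(SIZE);
        dfltPlcCfg.setMaxSize(SIZE);
        dfltPlcCfg.setPersistenceEnabled(true);

        dsCfg.setDefaultDataRegionConfiguration(dfltPlcCfg);
        dsCfg.setPageSize(PAGE_SIZE);

        cfg.setDataStorageConfiguration(dsCfg);

        // Report the node as invalidated on any failure, without stopping it.
        cfg.setFailureHandler(new FailureHandler() {
            @Override public boolean onFailure(Ignite ignite, FailureContext failureCtx) {
                return true;
            }
        });

        return cfg;
    }

    public void testStopNodeHangs() throws Exception {
        cleanPersistenceDir();

        IgniteEx ignite0 = startGrid(0);
        IgniteEx ignite1 = startGrid(1);

        ignite1.cluster().active(true);

        awaitPartitionMapExchange();

        IgniteCache<Integer, Object> cache = ignite1.getOrCreateCache("TEST");

        Map<Integer, Object> entries = new HashMap<>();

        for (int i = 0; i < ENTRIES; i++)
            entries.put(i, new byte[PAGE_SIZE * 2 / 3]);

        cache.putAll(entries);

        // Invalidate node 1 through the failure processor.
        ignite1.context().failure().process(new FailureContext(FailureType.CRITICAL_ERROR, null));

        stopGrid(0);
        stopGrid(1); // Hangs here when the race is hit.
    }
}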
{code}
{{stopGrid(1)}} waits for the exchange to finish. {{exchange-worker}} is blocked
in {{GridCacheDatabaseSharedManager#checkpointReadLock}}, waiting for
{{CheckpointProgressSnapshot#cpBeginFut}}, but this future is never completed:
{{db-checkpoint-thread}} gets an exception in
{{GridCacheDatabaseSharedManager.Checkpointer#markCheckpointBegin}}, thrown by
{{FileWriteAheadLogManager#checkNode}}, and leaves {{markCheckpointBegin}}
before the future is completed ({{curr.cpBeginFut.onDone();}}).
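
The pattern described above can be illustrated outside Ignite with a minimal, self-contained sketch. It is not Ignite code and not a proposed patch: {{CheckpointFutureSketch}}, {{markBeginUnsafe}}, {{markBeginSafe}} and {{waiterReleased}} are hypothetical names, and {{CompletableFuture}} stands in for {{cpBeginFut}}. It only shows that a waiter stays blocked when an exception skips the completion call, and is released if the failure path also completes the future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CheckpointFutureSketch {
    /** Hypothetical stand-in for the node check: throws once the node is invalidated. */
    static void checkNode(boolean invalidated) {
        if (invalidated)
            throw new IllegalStateException("Node is invalidated");
    }

    /** Mirrors the reported flow: an exception skips the completion call. */
    static void markBeginUnsafe(CompletableFuture<Void> cpBeginFut, boolean invalidated) {
        checkNode(invalidated);
        cpBeginFut.complete(null);
    }

    /** One possible remedy (an assumption): complete the future on the failure path too. */
    static void markBeginSafe(CompletableFuture<Void> cpBeginFut, boolean invalidated) {
        try {
            checkNode(invalidated);
            cpBeginFut.complete(null);
        }
        catch (RuntimeException e) {
            cpBeginFut.completeExceptionally(e);
            throw e;
        }
    }

    /** Returns true if a thread waiting on the future is released within the timeout. */
    static boolean waiterReleased(boolean safe) throws InterruptedException {
        CompletableFuture<Void> fut = new CompletableFuture<>();

        Thread checkpointer = new Thread(() -> {
            try {
                if (safe)
                    markBeginSafe(fut, true);
                else
                    markBeginUnsafe(fut, true);
            }
            catch (RuntimeException ignored) {
                // The simulated checkpointer exits here, as in the report.
            }
        }, "db-checkpoint-thread");

        checkpointer.start();
        checkpointer.join();

        try {
            fut.get(200, TimeUnit.MILLISECONDS);
            return true;
        }
        catch (ExecutionException e) {
            return true; // Released, with the failure propagated to the waiter.
        }
        catch (TimeoutException e) {
            return false; // Still blocked: the future was never completed.
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("unsafe released: " + waiterReleased(false)); // false
        System.out.println("safe released: " + waiterReleased(true));    // true
    }
}
```

In the unsafe variant the waiter times out only because the sketch uses a bounded {{get}}; an unbounded wait, as in the reported {{exchange-worker}} stack, would block indefinitely.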