[ https://issues.apache.org/jira/browse/IGNITE-12048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitriy Govorukhin updated IGNITE-12048:
----------------------------------------
Description:

Page replacement can reload an invalid page during checkpoint

There is a race between {{writeCheckpointPages}} and the page replacement process:
* The checkpointer thread begins a checkpoint.
* The checkpointer thread calls {{getPageForCheckpoint()}}, which copies the page content *and clears the dirty flag*.
* Page replacement looks for a page to replace and chooses this page; the page is thrown away.
* Before the page is written back to the store, the page is acquired again.

As a result, an older copy of the page is brought back into memory, which causes all kinds of corruption exceptions and assertions.

----

checkpointReadLock() may hang during node stop

I got this hang during one of the PDS (Indexing) runs (a thread dump is attached). The following code hangs:
{code:java}
checkpointer.wakeupForCheckpoint(0, "too many dirty pages").cpBeginFut
    .getUninterruptibly();
{code}
It looks like {{wakeupForCheckpoint}} can be called after the checkpointer is stopped, in which case {{cpBeginFut}} will never be completed.

----

Fixed ZookeeperDiscoveryCommunicationFailureTest.testCommunicationFailureResolve_CachesInfo1
Fixed *.testFailAfterStart

> Bugs & tests fixes
> ------------------
>
>                 Key: IGNITE-12048
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12048
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Dmitriy Govorukhin
>            Priority: Major

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
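
The hang described under "checkpointReadLock() may hang during node stop" can be illustrated with a minimal sketch. This is not the actual Ignite fix: {{CheckpointerSketch}}, {{Progress}}, the {{stopped}} flag and the {{stop()}} method are hypothetical stand-ins, and {{java.util.concurrent.CompletableFuture}} is used in place of Ignite's own future type. The idea it shows is only that a wakeup request arriving after stop should get a future that is already failed, and that stopping should fail any pending future, so no caller waits forever.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for the checkpointer; not Ignite's actual class. */
class CheckpointerSketch {
    /** Hypothetical progress holder; only the begin future from the report is modelled. */
    static final class Progress {
        final CompletableFuture<Void> cpBeginFut = new CompletableFuture<>();
    }

    private final Object mux = new Object();

    /** Assumed flag: set once the checkpointer will never start another checkpoint. */
    private boolean stopped;

    /** The checkpoint that the next wakeup would trigger. */
    private final Progress scheduled = new Progress();

    /** Returns the scheduled progress, or an already-failed one if the checkpointer is stopped. */
    Progress wakeupForCheckpoint(long delayMs, String reason) {
        synchronized (mux) {
            if (stopped) {
                Progress failed = new Progress();
                failed.cpBeginFut.completeExceptionally(
                    new IllegalStateException("Checkpointer is stopped: " + reason));
                return failed;
            }
            // A real implementation would also reschedule using delayMs here.
            return scheduled;
        }
    }

    /** Marks the checkpointer as stopped and fails the pending begin future. */
    void stop() {
        synchronized (mux) {
            stopped = true;
        }
        // Waiters that obtained the future before stop are released with an error.
        scheduled.cpBeginFut.completeExceptionally(
            new IllegalStateException("Checkpointer is stopped"));
    }
}
```

With this shape, a caller that waits on {{cpBeginFut}} after the node has begun stopping receives an exception instead of blocking indefinitely. The stop check and the hand-out of the future happen under the same lock, so a wakeup cannot slip in between the flag being set and the future being returned.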