[ https://issues.apache.org/jira/browse/CASSANDRA-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188237#comment-14188237 ]
Sam Tunnicliffe commented on CASSANDRA-8176: -------------------------------------------- I believe this is a problem which at the moment can only affect the tests. There's a pre-existing race which the changes introduced by CASSANDRA-6904 make more obvious. The race is in each of the RecoveryManagerTest tests: {noformat} 1 - [main] calls CommitLog.instance.resetUnsafe() -- clears CL active/avail segments 2 - [main] calls CommitLog.instance.recover() -- attempts to replay all unmanaged files in the CL dir, then recycle them. 3 - [COMMIT-LOG-ALLOCATOR] detects that there are no available segments, so it: 3a - creates a new segment 3b - adds it to the list of available segments {noformat} The issue is when steps 2 and 3a/3b are interspersed. In that scenario, the new segment created in 3a is replayed by the {{\[main\]}} thread and subsequently recycled whereby its file is renamed. However, as well as being recycled it is added to the list of available segments (3b). So when the next test in the suite runs, this segment is used when performing the initial writes. The recycling has renamed the file but the segment id has not changed, so the checksums written are calculated using the id. Now when this second test resets then tries to recover, it discards the segment due to the checksum mismatches. Ultimately, that's what leads to the NPE & the test failure as the mutations from that dropped segment are not replayed. This race is highlighted by the changes in the commit [~mshuler] mentioned because now {{CL.recover()}} scans the commit log dir twice and even though in this test the archive/restore calls are no-ops, this widens the window for the race to occur. So it's only the (ab)use of {{resetUnsafe()}} coupled with the way that {{recover()}} expects the segments it replays to be unmanaged that causes this problem. The simplest fix right now is to add another {{resetUnsafe()}} to the tests so that each starts with a clean CL, avoiding the possibility of reusing a segment that has already been recycled. This is something to keep in mind with regard to CASSANDRA-7232 if we start replaying commit logs outside of a restart. > Intermittent NPE from RecoveryManagerTest RecoverPIT unit test > -------------------------------------------------------------- > > Key: CASSANDRA-8176 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8176 > Project: Cassandra > Issue Type: Bug > Components: Tests > Reporter: Michael Shuler > Assignee: Sam Tunnicliffe > Fix For: 2.1.2 > > Attachments: RecoveryManagerTest_failure_system.log.gz > > > {noformat} > [junit] Testsuite: org.apache.cassandra.db.RecoveryManagerTest > [junit] Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 7.654 sec > [junit] > [junit] ------------- Standard Output --------------- > [junit] WARN 16:40:38 No host ID found, created > 2cbd54a8-79a5-40e0-a8e6-c8bf2c575877 (Note: This should happen exactly once > per node). > [junit] WARN 16:40:38 No host ID found, created > 2cbd54a8-79a5-40e0-a8e6-c8bf2c575877 (Note: This should happen exactly once > per node). > [junit] WARN 16:40:38 Encountered bad header at position 16 of commit > log > /home/mshuler/git/cassandra/build/test/cassandra/commitlog:0/CommitLog-4-1414082433807.log, > with invalid CRC. The end of segment marker should be zero. > [junit] WARN 16:40:38 Encountered bad header at position 16 of commit > log > /home/mshuler/git/cassandra/build/test/cassandra/commitlog:0/CommitLog-4-1414082433807.log, > with invalid CRC. The end of segment marker should be zero. > [junit] ------------- ---------------- --------------- > [junit] Testcase: > testRecoverPIT(org.apache.cassandra.db.RecoveryManagerTest): Caused an > ERROR > [junit] null > [junit] java.lang.NullPointerException > [junit] at > org.apache.cassandra.db.RecoveryManagerTest.testRecoverPIT(RecoveryManagerTest.java:129) > [junit] > [junit] > [junit] Test org.apache.cassandra.db.RecoveryManagerTest FAILED > {noformat} > Test fails roughly 20-25% of CI runs. Several 10x and 25x bisections for 2.1 > {{git bisect start cassandra-2.1 f03e505}} resulted in {noformat}first bad > commit: [1394b128c65ef1ad59f765e9c9c5058cac04ca69]{noformat} which is > CASSANDRA-6904. > That patch went to 2.0 and I still need to dig there to see if we're getting > the same error, but I've attached the unit test failure system.log from 2.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)