[ 
https://issues.apache.org/jira/browse/CASSANDRA-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188237#comment-14188237
 ] 

Sam Tunnicliffe commented on CASSANDRA-8176:
--------------------------------------------

I believe this is a problem which at the moment can only affect the tests. 
There's a pre-existing race which the changes introduced by CASSANDRA-6904 make 
more obvious. 

The race is in each of the RecoveryManagerTest tests:

{noformat}
1 - [main] calls CommitLog.instance.resetUnsafe() -- clears CL active/avail 
segments
2 - [main] calls CommitLog.instance.recover() -- attempts to replay all 
unmanaged files in the CL dir, then recycle them.
3 - [COMMIT-LOG-ALLOCATOR] detects that there are no available segments, so it:
    3a - creates a new segment
    3b - adds it to the list of available segments
{noformat}

The issue is when steps 2 and 3a/3b are interspersed. In that scenario, the new 
segment created in 3a is replayed by the {{\[main\]}} thread and subsequently 
recycled whereby its file is renamed. However, as well as being recycled it is 
added to the list of available segments (3b). So when the next test in the 
suite runs, this segment is used when performing the initial writes. The 
recycling has renamed the file but the segment id has not changed, so the 
checksums written are calculated using the id. Now when this second test resets 
then tries to recover, it discards the segment due to the checksum mismatches. 
Ultimately, that's what leads to the NPE & the test failure as the mutations 
from that dropped segment are not replayed. This race is highlighted by the 
changes in the commit [~mshuler] mentioned because now {{CL.recover()}} scans 
the commit log dir twice and even though in this test the archive/restore calls 
are no-ops, this widens the window for the race to occur.

So it's only the (ab)use of {{resetUnsafe()}} coupled with the way that 
{{recover()}} expects the segments it replays to be unmanaged that causes this 
problem. The simplest fix right now is to add another {{resetUnsafe()}} to the 
tests so that each starts with a clean CL, avoiding the possibility of reusing 
a segment that has already been recycled.

This is something to keep in mind with regard to CASSANDRA-7232 if we start 
replaying commit logs outside of a restart.

> Intermittent NPE from RecoveryManagerTest RecoverPIT unit test
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-8176
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8176
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tests
>            Reporter: Michael Shuler
>            Assignee: Sam Tunnicliffe
>             Fix For: 2.1.2
>
>         Attachments: RecoveryManagerTest_failure_system.log.gz
>
>
> {noformat}
>     [junit] Testsuite: org.apache.cassandra.db.RecoveryManagerTest
>     [junit] Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 7.654 sec
>     [junit] 
>     [junit] ------------- Standard Output ---------------
>     [junit] WARN  16:40:38 No host ID found, created 
> 2cbd54a8-79a5-40e0-a8e6-c8bf2c575877 (Note: This should happen exactly once 
> per node).
>     [junit] WARN  16:40:38 No host ID found, created 
> 2cbd54a8-79a5-40e0-a8e6-c8bf2c575877 (Note: This should happen exactly once 
> per node).
>     [junit] WARN  16:40:38 Encountered bad header at position 16 of commit 
> log 
> /home/mshuler/git/cassandra/build/test/cassandra/commitlog:0/CommitLog-4-1414082433807.log,
>  with invalid CRC. The end of segment marker should be zero.
>     [junit] WARN  16:40:38 Encountered bad header at position 16 of commit 
> log 
> /home/mshuler/git/cassandra/build/test/cassandra/commitlog:0/CommitLog-4-1414082433807.log,
>  with invalid CRC. The end of segment marker should be zero.
>     [junit] ------------- ---------------- ---------------
>     [junit] Testcase: 
> testRecoverPIT(org.apache.cassandra.db.RecoveryManagerTest):      Caused an 
> ERROR
>     [junit] null
>     [junit] java.lang.NullPointerException
>     [junit]     at 
> org.apache.cassandra.db.RecoveryManagerTest.testRecoverPIT(RecoveryManagerTest.java:129)
>     [junit] 
>     [junit] 
>     [junit] Test org.apache.cassandra.db.RecoveryManagerTest FAILED
> {noformat}
> Test fails roughly 20-25% of CI runs. Several 10x and 25x bisections for 2.1 
> {{git bisect start cassandra-2.1 f03e505}} resulted in {noformat}first bad 
> commit: [1394b128c65ef1ad59f765e9c9c5058cac04ca69]{noformat} which is 
> CASSANDRA-6904.
> That patch went to 2.0 and I still need to dig there to see if we're getting 
> the same error, but I've attached the unit test failure system.log from 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to