[ 
https://issues.apache.org/jira/browse/CASSANDRA-18119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656282#comment-17656282
 ] 

Josh McKenzie commented on CASSANDRA-18119:
-------------------------------------------

+1 on the code change.

Looks like the trunk and 4.1 ci runs have failures in repeat runs on 
LogTransactionTest.java; the one on 4.1 [looks like it might be 
related|https://app.circleci.com/pipelines/github/krummas/cassandra/846/workflows/f9ff701a-22f6-463b-b356-c37a61d24a75/jobs/7054/tests]
 to this change here; it now chains down to 
{code:java}
static boolean removeUnfinishedLeftovers(Map.Entry<String, List<File>> entry)
{
    try(LogFile txn = LogFile.make(entry.getKey(), entry.getValue()))
    {
        logger.info("Verifying logfile transaction {}", txn);
        // We don't check / include the stats file timestamp on LogRecord 
creation / verification as that might
        // be modified by a race in compaction notification and then needlessly 
fail subsequent node starts.
        if (txn.verify(true)) // REVIEWER NOTE: this param leading to skipping 
TS calc on stats file is what we changed here{code}
>From the logs on the test failure, looks like it was expecting true (files 
>found and deleted correctly) but received fals:
{code:java}
 34409 [junit-timeout] ERROR [main] 2022-12-13 15:45:01,465 LogFile.java:170 - 
Failed to read records for transaction log 
[nb_txn_compaction_1ae0fb60-7afd-11ed-bd8f-0b2a035c989d.log in 
build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/1,
 build/test/cassandra/data/TransactionLogs
 34410 [junit-timeout] ERROR [main] 2022-12-13 15:45:01,465 
LogTransaction.java:561 - Unexpected disk state: failed to read transaction log 
[nb_txn_compaction_1ae0fb60-7afd-11ed-bd8f-0b2a035c989d.log in 
build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/1,
 build/test/cassandra/da
 34411 [junit-timeout] Files and contents follow:^M
 34412 [junit-timeout] 
build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/1/nb_txn_compaction_1ae0fb60-7afd-11ed-bd8f-0b2a035c989d.log^M
 34413 [junit-timeout]     
REMOVE:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/1/nb-0-big-,1670946301437,5][2873235910]^M
 34414 [junit-timeout]     
REMOVE:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/2/nb-2-big-,1670946301441,5][1283732776]^M
 34415 [junit-timeout]     
ADD:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/1/nb-1-big-,0,5][1197593494]^M
 34416 [junit-timeout]     
ADD:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/2/nb-3-big-,0,5][1374830355]^M
 34417 [junit-timeout]     COMMIT:[,0,0][2613697770]^M
 34418 [junit-timeout]         ***This record should have been the last one in 
all replicas^M
 34419 [junit-timeout] 
build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/2/nb_txn_compaction_1ae0fb60-7afd-11ed-bd8f-0b2a035c989d.log^M
 34420 [junit-timeout]     
REMOVE:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/1/nb-0-big-,1670946301437,5][2873235910]^M
 34421 [junit-timeout]     
REMOVE:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/2/nb-2-big-,1670946301441,5][1283732776]^M
 34422 [junit-timeout]     
ADD:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/1/nb-1-big-,0,5][1197593494]^M
 34423 [junit-timeout]     
ADD:[/tmp/cassandra/build/test/cassandra/data/TransactionLogsTest/mockcf39-1add51e07afd11edbd8f0b2a035c989d/2/nb-3-big-,0,5][1374830355]^M
 34424 [junit-timeout]     COMMIT:[,0,0][2613697770]^M
 34425 [junit-timeout]         ***This record should have been the last one in 
all replicas^M
 34426 [junit-timeout]     COMMIT:[,0,0][2613697770]^M
 34427 [junit-timeout]         ***This record should have been the last one in 
all replicas^M
 34428 [junit-timeout] ^M {code}
Not clear to me how the change in this patch could have contributed to that; 
might be worth multiplexing this unit test on trunk w/out the change to see if 
there's a pre-existing race / issue in there. The repeated leaks of the 
TransactionTidier from LogTransaction throughout the repeated runs also look a 
smidge suspicious:
{code:java}
 34435 [junit-timeout] ERROR [Reference-Reaper] 2022-12-13 15:45:01,466 
Ref.java:237 - LEAK DETECTED: a reference (class 
org.apache.cassandra.db.lifecycle.LogTransaction$TransactionTidier@1776643200:[nb_txn_compaction_198e1590-7afd-11ed-bd8f-0b2a035c989d.log
 in /tmp/cassandra/build/test/cassandra/data/TransactionLog {code}

> Handle sstable metadata stats file getting a new mtime after compaction has 
> finished
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18119
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18119
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Compaction, Local/Startup and Shutdown
>            Reporter: Marcus Eriksson
>            Assignee: Marcus Eriksson
>            Priority: Normal
>             Fix For: 3.11.x, 4.0.x, 4.1.x, 4.x
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Due to a race between compaction finishing and compaction strategies getting 
> reloaded there is a chance that we try to add both the new sstable and the 
> old compacted sstable to the compaction strategy, and in the LCS case this 
> can cause the old sstable to get sent to L0 to avoid overlap. This changes 
> the mtime of the stats metadata file and if the node is shut down before the 
> sstable is actually deleted from disk, we fail starting with the following 
> exception:
> {code}
> .../mockcf1-392b3ff07c5a11ed8c662f5760cb10b3/nb_txn_compaction_3983c030-7c5a-11ed-8c66-2f5760cb10b3.log
> REMOVE:[.../data/TransactionLogsTest/mockcf1-392b3ff07c5a11ed8c662f5760cb10b3/nb-0-big-,1671096247000,5][4003386800]
>          ***Unexpected files detected for sstable [nb-0-big-]: last update 
> time [Thu Dec 15 10:24:09 CET 2022] (1671096249000) should have been [Thu Dec 
> 15 10:24:07 CET 2022] (1671096247000)
> ADD:[.../data/TransactionLogsTest/mockcf1-392b3ff07c5a11ed8c662f5760cb10b3/nb-2-big-,0,5][319189529]
> {code}
> A workaround for this (until we properly fix the way compaction strategies 
> get notified about sstable changes) is to ignore the timestamp of the STATS 
> component when cleaning up compaction leftovers on startup. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to