Gary Tully created AMQ-7488:
-------------------------------

             Summary: Corruption to mKahaDB transaction journal can lead to 
partial transaction completion in error
                 Key: AMQ-7488
                 URL: https://issues.apache.org/jira/browse/AMQ-7488
             Project: ActiveMQ
          Issue Type: Bug
          Components: mKahaDB, KahaDB
    Affects Versions: 5.15.0
            Reporter: Gary Tully
            Assignee: Gary Tully
             Fix For: 5.16.0


mKahaDB has a transaction journal to track the outcome of local JMS 
transactions that span kahaDB instances, i.e. a send to A and B where each has 
its own persistence adapter.
To ensure N instances are in sync, a local 2PC is required.
On recovery, if there is an outcome recorded in the txStore journal, a commit 
must be replayed. If the outcome has not been recorded, rollback is assumed.
If the txJournal is corrupt and detected as corrupt, the broker fails to start. 
However, it is not possible to manually recover via JMX without a live broker.
Each kahaDB instance would need to be restarted in isolation to force heuristic 
completion.
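
The recovery rule described above can be sketched roughly as follows. This is 
an illustration only, not the mKahaDB implementation: the class, enum and 
method names are hypothetical, and transaction ids are modelled as plain 
strings.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: the type and method names are hypothetical, not ActiveMQ API.
final class TxRecoveryRuleSketch {

    enum Resolution { COMMIT, ROLLBACK }

    // A prepared transaction commits only if the tx journal recorded its outcome;
    // otherwise rollback is assumed.
    static Resolution resolve(String txId, Set<String> recordedOutcomes) {
        return recordedOutcomes.contains(txId) ? Resolution.COMMIT : Resolution.ROLLBACK;
    }

    // Applies the rule to every transaction left prepared in the kahaDB instances.
    static Map<String, Resolution> resolveAll(List<String> preparedTxIds,
                                              Set<String> recordedOutcomes) {
        Map<String, Resolution> result = new LinkedHashMap<>();
        for (String txId : preparedTxIds) {
            result.put(txId, resolve(txId, recordedOutcomes));
        }
        return result;
    }
}
```

Under this rule an empty set of recorded outcomes resolves every prepared 
transaction to rollback, whether or not a commit had in fact been decided.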

Deleting the corrupt journal to allow restart is potentially problematic *if* 
there are pending outcomes. 
It can lead to partial outcomes: recovery assumes its information is good, so 
with no information from an empty tx journal it assumes rollback of any 
relevant pending transaction.

Note: the failure window is very small, but it is present and can lead to 
message loss once the txStore data is lost or deleted. 

workaround:
Only delete a corrupt txStore journal once there are no recovered XA 
transactions in play in any of the nested kahaDB instances. This is visible 
from the recovery logging.
If there are recovered XA transactions with formatId=61616 (the internal id 
used to identify these local 2pc transactions), then careful recovery of the 
relevant kahaDB instance directory in standalone mode (without mKahaDB) will be 
required if the transactions should commit.
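
The check in the workaround can be sketched as below. It assumes the formatId 
values of the recovered XA transactions have already been read by hand from 
the recovery logging of each nested kahaDB instance; the class and method 
names are hypothetical and no log parsing is attempted. Only the value 61616 
comes from this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical helper, not part of ActiveMQ.
final class TxJournalDeleteCheckSketch {

    // Internal format id used for the local 2PC transactions (from the issue text).
    static final int LOCAL_2PC_FORMAT_ID = 61616;

    // Strict reading of the workaround: delete the corrupt txStore journal only
    // when no nested instance reports any recovered XA transaction.
    static boolean safeToDeleteTxJournal(Map<String, List<Integer>> formatIdsByInstance) {
        for (List<Integer> formatIds : formatIdsByInstance.values()) {
            if (!formatIds.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    // Instances holding local 2PC transactions, i.e. the ones that would need
    // careful standalone recovery if those transactions should commit.
    static List<String> instancesNeedingStandaloneRecovery(
            Map<String, List<Integer>> formatIdsByInstance) {
        List<String> affected = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : formatIdsByInstance.entrySet()) {
            if (entry.getValue().contains(LOCAL_2PC_FORMAT_ID)) {
                affected.add(entry.getKey());
            }
        }
        return affected;
    }
}
```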

solution:
The journal needs to better detect corruption *and* not do any recovery 
processing in the absence of correct (non-corrupt) information, leaving pending 
transactions to be heuristically completed via JMX. 
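
A minimal sketch of the intended behaviour, assuming the journal read can 
report whether its contents are trustworthy. This is not the actual fix: the 
names are hypothetical and the corruption detection itself is reduced to a 
boolean flag.

```java
import java.util.Set;

// Illustrative only: hypothetical names, not the change made for this issue.
final class CorruptionAwareRecoverySketch {

    enum Action { COMMIT, ROLLBACK, LEAVE_PENDING }

    // journalTrusted is false when the tx journal is corrupt or otherwise
    // cannot be relied on; in that case no outcome is assumed.
    static Action decide(boolean journalTrusted, Set<String> recordedOutcomes, String txId) {
        if (!journalTrusted) {
            // Leave the transaction prepared for heuristic completion via JMX.
            return Action.LEAVE_PENDING;
        }
        return recordedOutcomes.contains(txId) ? Action.COMMIT : Action.ROLLBACK;
    }
}
```

The point of the third state is that a missing outcome is only read as 
rollback when the journal itself is known to be good.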



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
