[jira] [Updated] (IGNITE-25538) ROLLED_BACK transactions are not removed from active transactions list

Mikhail Petrov (Jira) Mon, 02 Jun 2025 06:13:11 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mikhail Petrov updated IGNITE-25538:
------------------------------------
    Description: 
User can observe the following output of `control.sh tx` command:


{code:java}
Matching transactions:
TcpDiscoveryNode [id=34fd49ed-c325-4a93-a32c-3726c1c19130, 
addrs=[10.19.138.119], order=3, ver=16.1.3#20241226-sha1:900bfa69, 
isClient=false, consistentId=epk_rb_si_pplad-pprbrbepk0071.ca.sbrf.ru]
Tx: [xid=0a2e8e50791-00000000-156e-2f01-0000-000000000013, 
label=UcpSearchServiceDecorator.searchByClientId, state=ROLLED_BACK, 
startTime=2025-05-26 23:53:58.515, duration=224437 sec, 
isolation=READ_COMMITTED, concurrency=PESSIMISTIC, topVer=N/A, timeout=0 sec, 
size=0, dhtNodes=[], nearXid=0a2e8e50791-00000000-156e-2f01-0000-000000000013, 
parentNodeIds=[86cc9e5e]]
Tx: [xid=087e3040791-00000000-156e-2f01-0000-000000000030, 
label=bs-ucp-4g-update-service, state=ROLLED_BACK, startTime=2025-05-25 
23:45:45.961, duration=311329 sec, isolation=READ_COMMITTED, 
concurrency=PESSIMISTIC, topVer=N/A, timeout=0 sec, size=0, dhtNodes=[], 
nearXid=087e3040791-00000000-156e-2f01-0000-000000000030, 
parentNodeIds=[60400a24]]
Tx: [xid=0e60d620791-00000000-156e-2f01-0000-000000000035, 
label=CloudClientSearchService.byCriteria, state=ROLLED_BACK, 
startTime=2025-05-24 23:49:05.016, duration=397530 sec, 
isolation=READ_COMMITTED, concurrency=PESSIMISTIC, topVer=N/A, timeout=0 sec, 
size=0, dhtNodes=[], nearXid=0e60d620791-00000000-156e-2f01-0000-000000000035, 
parentNodeIds=[448e854c]]
TcpDiscoveryNode [id=9f11128e-c5a2-4700-af6b-c4777edfa31b, 
addrs=[10.19.138.75], order=54, ver=16.1.3#20241226-sha1:900bfa69, 
isClient=false, consistentId=epk_rb_si_pplad-pprbrbepk0025.ca.sbrf.ru]

Command [TX] finished with code: 0
{code}

From the user's perspective, this output can be interpreted as a bunch of 
LRTs (long-running transactions). Moreover, these transactions cannot be 
killed through the control.sh --kill command and remain in the active 
transactions list until the node is restarted.

It is worth mentioning that the problem does not reproduce for every 
rolled-back transaction, only for some under certain conditions.

Reproducer:


1. Start server node.
2. Start tx through thin client with timeout.
3. Inject a sleep into IgniteTxManager#onCreated after the isCompleted check, 
with a duration greater than the tx timeout. Such a delay can occur in practice 
if the thread that started the transaction is preempted by the scheduler.
4. Wait for tx to complete with timeout error.

As a result, the transaction is rolled back by the timeout worker and is then 
stored in the active transactions map by the IgniteTxManager#onCreated method.
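
The interleaving from the reproducer can be illustrated with a small standalone 
model. This is only a sketch of the check-then-act pattern, not Ignite code: 
the classes Tx and TxRegistry, the method rollbackOnTimeout and the delay hook 
are hypothetical stand-ins, and the real IgniteTxManager and timeout worker are 
assumed (not verified here) to behave analogously. Latches replace the injected 
sleep so the ordering is deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class RollbackRaceSketch {
    enum State { ACTIVE, ROLLED_BACK }

    /** Hypothetical stand-in for a transaction; not an Ignite class. */
    static final class Tx {
        final String xid;
        final AtomicReference<State> state = new AtomicReference<>(State.ACTIVE);

        Tx(String xid) { this.xid = xid; }

        boolean isCompleted() { return state.get() != State.ACTIVE; }
    }

    /** Hypothetical stand-in for a manager holding the active transactions map. */
    static final class TxRegistry {
        private final Map<String, Tx> active = new ConcurrentHashMap<>();

        /** Check-then-act registration; the hook models the injected sleep. */
        void onCreated(Tx tx, Runnable delayAfterCheck) {
            if (tx.isCompleted())
                return;
            delayAfterCheck.run();   // window in which the timeout worker can run
            active.put(tx.xid, tx);  // may register a tx that is already rolled back
        }

        /** Models the timeout worker: roll back, then deregister. */
        void rollbackOnTimeout(Tx tx) {
            tx.state.set(State.ROLLED_BACK);
            active.remove(tx.xid);   // no-op in this interleaving: not registered yet
        }

        boolean isListed(Tx tx) { return active.containsKey(tx.xid); }
    }

    /** Runs the interleaving once; true if the rolled-back tx is still listed. */
    public static boolean reproduce() {
        TxRegistry registry = new TxRegistry();
        Tx tx = new Tx("tx-1");
        CountDownLatch checkPassed = new CountDownLatch(1);
        CountDownLatch rolledBack = new CountDownLatch(1);

        Thread starter = new Thread(() -> registry.onCreated(tx, () -> {
            checkPassed.countDown();           // the isCompleted check has passed
            awaitQuietly(rolledBack);          // "sleep" until the rollback is done
        }));
        Thread timeoutWorker = new Thread(() -> {
            awaitQuietly(checkPassed);
            registry.rollbackOnTimeout(tx);
            rolledBack.countDown();
        });

        starter.start();
        timeoutWorker.start();
        try {
            starter.join();
            timeoutWorker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return tx.isCompleted() && registry.isListed(tx);
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("rolled back but still listed: " + reproduce());
    }
}
```

In this model the entry is never removed afterwards, because the only removal 
attempt ran before the entry was put into the map.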

The "hanging" transactions in ROLLED_BACK state described above do not hold any 
data key locks and do not affect PME in any way. 


  was:
User can observe the following output of `control.sh tx` command:


{code:java}
Matching transactions:
TcpDiscoveryNode [id=34fd49ed-c325-4a93-a32c-3726c1c19130, 
addrs=[10.19.138.119], order=3, ver=16.1.3#20241226-sha1:900bfa69, 
isClient=false, consistentId=epk_rb_si_pplad-pprbrbepk0071.ca.sbrf.ru]
Tx: [xid=0a2e8e50791-00000000-156e-2f01-0000-000000000013, 
label=UcpSearchServiceDecorator.searchByClientId, state=ROLLED_BACK, 
startTime=2025-05-26 23:53:58.515, duration=224437 sec, 
isolation=READ_COMMITTED, concurrency=PESSIMISTIC, topVer=N/A, timeout=0 sec, 
size=0, dhtNodes=[], nearXid=0a2e8e50791-00000000-156e-2f01-0000-000000000013, 
parentNodeIds=[86cc9e5e]]
Tx: [xid=087e3040791-00000000-156e-2f01-0000-000000000030, 
label=bs-ucp-4g-update-service, state=ROLLED_BACK, startTime=2025-05-25 
23:45:45.961, duration=311329 sec, isolation=READ_COMMITTED, 
concurrency=PESSIMISTIC, topVer=N/A, timeout=0 sec, size=0, dhtNodes=[], 
nearXid=087e3040791-00000000-156e-2f01-0000-000000000030, 
parentNodeIds=[60400a24]]
Tx: [xid=0e60d620791-00000000-156e-2f01-0000-000000000035, 
label=CloudClientSearchService.byCriteria, state=ROLLED_BACK, 
startTime=2025-05-24 23:49:05.016, duration=397530 sec, 
isolation=READ_COMMITTED, concurrency=PESSIMISTIC, topVer=N/A, timeout=0 sec, 
size=0, dhtNodes=[], nearXid=0e60d620791-00000000-156e-2f01-0000-000000000035, 
parentNodeIds=[448e854c]]
TcpDiscoveryNode [id=9f11128e-c5a2-4700-af6b-c4777edfa31b, 
addrs=[10.19.138.75], order=54, ver=16.1.3#20241226-sha1:900bfa69, 
isClient=false, consistentId=epk_rb_si_pplad-pprbrbepk0025.ca.sbrf.ru]

Command [TX] finished with code: 0
{code}

From the user's perspective, this output can be interpreted as a bunch of 
LRTs (long-running transactions). Moreover, these transactions cannot be 
killed through the control.sh --kill command and remain in the active 
transactions list until the node is restarted.

It is worth mentioning that the problem does not reproduce for every 
rolled-back transaction, only for some under certain conditions.

Reproducer:


1. Start server node.
2. Start tx through thin client with timeout.
3. Inject a sleep into IgniteTxManager#onCreated after the isCompleted check, 
with a duration greater than the tx timeout. Such a delay can occur in practice 
if the thread that started the transaction is preempted by the scheduler.
4. Wait for tx to complete with timeout error.

As a result, the transaction is rolled back by the timeout worker thread and 
then stored in the active transactions map.

The "hanging" transactions in ROLLED_BACK state described above do not hold any 
data key locks and do not affect PME in any way. 



> ROLLED_BACK transactions are not removed from active transactions list 
> -----------------------------------------------------------------------
>
>                 Key: IGNITE-25538
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25538
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Petrov
>            Priority: Minor
>              Labels: ise
>
> User can observe the following output of `control.sh tx` command:
> {code:java}
> Matching transactions:
> TcpDiscoveryNode [id=34fd49ed-c325-4a93-a32c-3726c1c19130, 
> addrs=[10.19.138.119], order=3, ver=16.1.3#20241226-sha1:900bfa69, 
> isClient=false, consistentId=epk_rb_si_pplad-pprbrbepk0071.ca.sbrf.ru]
> Tx: [xid=0a2e8e50791-00000000-156e-2f01-0000-000000000013, 
> label=UcpSearchServiceDecorator.searchByClientId, state=ROLLED_BACK, 
> startTime=2025-05-26 23:53:58.515, duration=224437 sec, 
> isolation=READ_COMMITTED, concurrency=PESSIMISTIC, topVer=N/A, timeout=0 sec, 
> size=0, dhtNodes=[], 
> nearXid=0a2e8e50791-00000000-156e-2f01-0000-000000000013, 
> parentNodeIds=[86cc9e5e]]
> Tx: [xid=087e3040791-00000000-156e-2f01-0000-000000000030, 
> label=bs-ucp-4g-update-service, state=ROLLED_BACK, startTime=2025-05-25 
> 23:45:45.961, duration=311329 sec, isolation=READ_COMMITTED, 
> concurrency=PESSIMISTIC, topVer=N/A, timeout=0 sec, size=0, dhtNodes=[], 
> nearXid=087e3040791-00000000-156e-2f01-0000-000000000030, 
> parentNodeIds=[60400a24]]
> Tx: [xid=0e60d620791-00000000-156e-2f01-0000-000000000035, 
> label=CloudClientSearchService.byCriteria, state=ROLLED_BACK, 
> startTime=2025-05-24 23:49:05.016, duration=397530 sec, 
> isolation=READ_COMMITTED, concurrency=PESSIMISTIC, topVer=N/A, timeout=0 sec, 
> size=0, dhtNodes=[], 
> nearXid=0e60d620791-00000000-156e-2f01-0000-000000000035, 
> parentNodeIds=[448e854c]]
> TcpDiscoveryNode [id=9f11128e-c5a2-4700-af6b-c4777edfa31b, 
> addrs=[10.19.138.75], order=54, ver=16.1.3#20241226-sha1:900bfa69, 
> isClient=false, consistentId=epk_rb_si_pplad-pprbrbepk0025.ca.sbrf.ru]
> Command [TX] finished with code: 0
> {code}
> From the user's perspective, this output can be interpreted as a bunch of 
> LRTs (long-running transactions). Moreover, these transactions cannot be 
> killed through the control.sh --kill command and remain in the active 
> transactions list until the node is restarted.
> It is worth mentioning that the problem does not reproduce for every 
> rolled-back transaction, only for some under certain conditions.
> Reproducer:
> 1. Start server node.
> 2. Start tx through thin client with timeout.
> 3. Inject a sleep into IgniteTxManager#onCreated after the isCompleted 
> check, with a duration greater than the tx timeout. Such a delay can occur in 
> practice if the thread that started the transaction is preempted by the 
> scheduler.
> 4. Wait for tx to complete with timeout error.
> As a result, the transaction is rolled back by the timeout worker and is then 
> stored in the active transactions map by the IgniteTxManager#onCreated method.
> The "hanging" transactions in ROLLED_BACK state described above do not hold 
> any data key locks and do not affect PME in any way. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
