[ https://issues.apache.org/jira/browse/IGNITE-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mikhail Petrov updated IGNITE-25538:
------------------------------------
    Description:
User can observe the following output of the `control.sh tx` command:
{code:java}
Matching transactions:
TcpDiscoveryNode [id=34fd49ed-c325-4a93-a32c-3726c1c19130, addrs=[10.19.138.119], order=3, ver=16.1.3#20241226-sha1:900bfa69, isClient=false, consistentId=epk_rb_si_pplad-pprbrbepk0071.ca.sbrf.ru]
Tx: [xid=0a2e8e50791-00000000-156e-2f01-0000-000000000013, label=UcpSearchServiceDecorator.searchByClientId, state=ROLLED_BACK, startTime=2025-05-26 23:53:58.515, duration=224437 sec, isolation=READ_COMMITTED, concurrency=PESSIMISTIC, topVer=N/A, timeout=0 sec, size=0, dhtNodes=[], nearXid=0a2e8e50791-00000000-156e-2f01-0000-000000000013, parentNodeIds=[86cc9e5e]]
Tx: [xid=087e3040791-00000000-156e-2f01-0000-000000000030, label=bs-ucp-4g-update-service, state=ROLLED_BACK, startTime=2025-05-25 23:45:45.961, duration=311329 sec, isolation=READ_COMMITTED, concurrency=PESSIMISTIC, topVer=N/A, timeout=0 sec, size=0, dhtNodes=[], nearXid=087e3040791-00000000-156e-2f01-0000-000000000030, parentNodeIds=[60400a24]]
Tx: [xid=0e60d620791-00000000-156e-2f01-0000-000000000035, label=CloudClientSearchService.byCriteria, state=ROLLED_BACK, startTime=2025-05-24 23:49:05.016, duration=397530 sec, isolation=READ_COMMITTED, concurrency=PESSIMISTIC, topVer=N/A, timeout=0 sec, size=0, dhtNodes=[], nearXid=0e60d620791-00000000-156e-2f01-0000-000000000035, parentNodeIds=[448e854c]]
TcpDiscoveryNode [id=9f11128e-c5a2-4700-af6b-c4777edfa31b, addrs=[10.19.138.75], order=54, ver=16.1.3#20241226-sha1:900bfa69, isClient=false, consistentId=epk_rb_si_pplad-pprbrbepk0025.ca.sbrf.ru]
Command [TX] finished with code: 0
{code}
From the user's perspective, this output can be interpreted as a bunch of LRTs (long-running transactions). Moreover, these transactions cannot be killed through the `control.sh --kill` command and remain in the active transactions list until the node is rebooted.

It is worth mentioning that the problem does not reproduce for every rolled-back transaction, only for some under certain conditions.

Reproducer:
1. Start a server node.
2. Start a tx through a thin client with a timeout.
3. Inject a sleep into IgniteTxManager#onCreated after the isCompleted check, with a duration greater than the tx timeout. Such a delay can occur in practice if the thread that started the transaction is switched out by the scheduler.
4. Wait for the tx to complete with a timeout error.

As a result, the transaction is rolled back by the timeout worker and is then stored in the active transactions map by the IgniteTxManager#onCreated method.

The "hanging" transactions described above, in the ROLLED_BACK state, do not hold any data key locks and do not affect PME in any way.

  was:
User can observe the following output of the `control.sh tx` command:
{code:java}
Matching transactions:
TcpDiscoveryNode [id=34fd49ed-c325-4a93-a32c-3726c1c19130, addrs=[10.19.138.119], order=3, ver=16.1.3#20241226-sha1:900bfa69, isClient=false, consistentId=epk_rb_si_pplad-pprbrbepk0071.ca.sbrf.ru]
Tx: [xid=0a2e8e50791-00000000-156e-2f01-0000-000000000013, label=UcpSearchServiceDecorator.searchByClientId, state=ROLLED_BACK, startTime=2025-05-26 23:53:58.515, duration=224437 sec, isolation=READ_COMMITTED, concurrency=PESSIMISTIC, topVer=N/A, timeout=0 sec, size=0, dhtNodes=[], nearXid=0a2e8e50791-00000000-156e-2f01-0000-000000000013, parentNodeIds=[86cc9e5e]]
Tx: [xid=087e3040791-00000000-156e-2f01-0000-000000000030, label=bs-ucp-4g-update-service, state=ROLLED_BACK, startTime=2025-05-25 23:45:45.961, duration=311329 sec, isolation=READ_COMMITTED, concurrency=PESSIMISTIC, topVer=N/A, timeout=0 sec, size=0, dhtNodes=[], nearXid=087e3040791-00000000-156e-2f01-0000-000000000030, parentNodeIds=[60400a24]]
Tx: [xid=0e60d620791-00000000-156e-2f01-0000-000000000035, label=CloudClientSearchService.byCriteria, state=ROLLED_BACK, startTime=2025-05-24 23:49:05.016, duration=397530 sec, isolation=READ_COMMITTED, concurrency=PESSIMISTIC, topVer=N/A, timeout=0 sec, size=0, dhtNodes=[], nearXid=0e60d620791-00000000-156e-2f01-0000-000000000035, parentNodeIds=[448e854c]]
TcpDiscoveryNode [id=9f11128e-c5a2-4700-af6b-c4777edfa31b, addrs=[10.19.138.75], order=54, ver=16.1.3#20241226-sha1:900bfa69, isClient=false, consistentId=epk_rb_si_pplad-pprbrbepk0025.ca.sbrf.ru]
Command [TX] finished with code: 0
{code}
From the user's perspective, this output can be interpreted as a bunch of LRTs (long-running transactions). Moreover, these transactions cannot be killed through the `control.sh --kill` command and remain in the active transactions list until the node is rebooted.

It is worth mentioning that the problem does not reproduce for every rolled-back transaction, only for some under certain conditions.

Reproducer:
1. Start a server node.
2. Start a tx through a thin client with a timeout.
3. Inject a sleep into IgniteTxManager#onCreated after the isCompleted check, with a duration greater than the tx timeout. Such a delay can occur in practice if the thread that started the transaction is switched out by the scheduler.
4. Wait for the tx to complete with a timeout error.

As a result, the transaction is rolled back by the timeout worker thread and then stored in the active transactions map.

The "hanging" transactions described above, in the ROLLED_BACK state, do not hold any data key locks and do not affect PME in any way.
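
The interleaving from the reproducer can be sketched in plain Java as a check-then-act race. This is only a simplified model under stated assumptions, not Ignite code: `Tx`, `TxManager`, `activeTxs`, `rollbackOnTimeout` and the `afterCheck` hook (which stands in for the injected sleep) are hypothetical names, and the real IgniteTxManager#onCreated may differ in its details.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified model; none of these types are Ignite classes. */
final class RolledBackTxRace {
    enum State { ACTIVE, ROLLED_BACK }

    static final class Tx {
        final long id;
        volatile State state = State.ACTIVE;

        Tx(long id) { this.id = id; }

        boolean isCompleted() { return state != State.ACTIVE; }
    }

    static final class TxManager {
        final Map<Long, Tx> activeTxs = new ConcurrentHashMap<>();

        /** Check-then-act registration; afterCheck stands in for the injected sleep. */
        void onCreated(Tx tx, Runnable afterCheck) {
            if (tx.isCompleted())
                return;

            afterCheck.run(); // Window in which the timeout worker may run.

            activeTxs.put(tx.id, tx); // Registers a tx that may already be rolled back.
        }

        /** Timeout worker in this model: roll back and unregister. */
        void rollbackOnTimeout(Tx tx) {
            tx.state = State.ROLLED_BACK;
            activeTxs.remove(tx.id); // No-op in the race: the tx is not registered yet.
        }
    }

    /** Runs the interleaving from the reproducer and returns the resulting active map. */
    static Map<Long, Tx> reproduce() {
        TxManager mgr = new TxManager();
        Tx tx = new Tx(1L);

        mgr.onCreated(tx, () -> {
            Thread timeoutWorker = new Thread(() -> mgr.rollbackOnTimeout(tx));
            timeoutWorker.start();

            try {
                timeoutWorker.join(); // Rollback finishes before registration resumes.
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        });

        return mgr.activeTxs;
    }

    public static void main(String[] args) {
        // Prints: active tx 1 state=ROLLED_BACK
        reproduce().forEach((id, tx) -> System.out.println("active tx " + id + " state=" + tx.state));
    }
}
```

In this model nothing removes the entry afterwards, because the only cleanup (the rollback) already ran before the `put`. That matches the observation that such transactions stay in the active list until the node is rebooted.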
> ROLLED_BACK transactions are not removed from active transactions list
> -----------------------------------------------------------------------
>
>                 Key: IGNITE-25538
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25538
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Petrov
>            Priority: Minor
>              Labels: ise
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)