yujun777 opened a new pull request, #24273:
URL: https://github.com/apache/doris/pull/24273
## Proposed changes
**BUG**
Publishing a transaction has a bug that can leave a replica recorded in FE
at a certain version while the tablet on the BE does not actually contain
that version's data. For example:
1. Suppose a partition's current visible version is V and it has a replica A.
A committed txn T has version V + 1 and is not published yet;
2. Clone a new replica B from A. After cloning, B's version = V;
3. Publish txn T. After publishing, the txn finishes and increases the
replicas' versions. B's version already equals T's version - 1 = (V+1) - 1
= V, and B is not in T's error replica ids, so FE treats B as succ and
updates B's version to V + 1. But B's BE does not actually contain txn T's data.
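The scenario above can be replayed with a small model. This is only an illustrative Python sketch, not Doris code: `Replica`, `old_publish_update`, and `be_versions_after_publish` are hypothetical names, and the update rule is reduced to the two conditions described in step 3.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    replica_id: int
    version: int  # version that FE records for this replica


def old_publish_update(replicas, txn_version, error_replica_ids):
    """Pre-fix rule as described in step 3 (simplified, hypothetical):
    advance a replica when it is not an error replica and its version
    equals txn_version - 1. BE-side data is never consulted."""
    for r in replicas:
        if r.replica_id not in error_replica_ids and r.version == txn_version - 1:
            r.version = txn_version


V = 3
a = Replica(replica_id=1, version=V)  # original replica, received T's data
b = Replica(replica_id=2, version=V)  # cloned from A before T was published

# Versions each replica's tablet really holds on its BE after publishing T.
be_versions_after_publish = {1: {V, V + 1}, 2: {V}}

old_publish_update([a, b], txn_version=V + 1, error_replica_ids=set())

# FE now believes B is at V + 1, but B's BE has no data for V + 1.
fe_be_mismatch = (V + 1) not in be_versions_after_publish[b.replica_id]
print(b.version, fe_be_mismatch)
```

The mismatch appears because the rule looks only at FE-side metadata and never asks the BE whether the version is present.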
**FIX**
1. When publishing, BE checks whether each tablet contains the version data
related to this txn. If the tablet contains the version, BE reports the
tablet as succ;
2. FE checks BE's succ tablets. If a replica is not among the succ tablets
reported by its BE, FE treats the replica as failed for this version.
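The two-sided check can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the actual Doris implementation: `be_succ_tablets` and `fe_failed_replicas` are hypothetical helpers, and a replica is modeled as a `(backend_id, tablet_id)` pair.

```python
def be_succ_tablets(tablet_versions, txn_version):
    """BE side (hypothetical): report only the tablets that really
    contain the version being published."""
    return {
        tablet_id
        for tablet_id, versions in tablet_versions.items()
        if txn_version in versions
    }


def fe_failed_replicas(replicas, succ_reports):
    """FE side (hypothetical): a replica fails this version when its tablet
    is missing from the succ report of the backend that hosts it."""
    return {
        replica_id
        for replica_id, (backend_id, tablet_id) in replicas.items()
        if tablet_id not in succ_reports.get(backend_id, set())
    }


TXN_VERSION = 4

# Tablet 100 has replica A on backend 1 and a freshly cloned replica B on
# backend 2. Only backend 1 holds the data for version 4.
backend_tablets = {
    1: {100: {3, 4}},
    2: {100: {3}},
}
replicas = {"A": (1, 100), "B": (2, 100)}

succ_reports = {
    backend_id: be_succ_tablets(tablets, TXN_VERSION)
    for backend_id, tablets in backend_tablets.items()
}
failed = fe_failed_replicas(replicas, succ_reports)
print(failed)
```

With this check, replica B is marked as failed for version 4 instead of being advanced, because its backend never reported the tablet as succ.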
**TEST**
FE sends a test txn publish task to the BEs. The test txn does not exist on
the BEs, so each BE reports its tablets as failed, and the publish finally
fails on FE.
BE log:
```
W20230912 15:28:41.339694 1623 engine_publish_version_task.cpp:227] publish version failed on transaction, tablet version not exists. transaction_id=9999999, tablet_id=10268, version=4
W20230912 15:28:41.339735 1623 engine_publish_version_task.cpp:227] publish version failed on transaction, tablet version not exists. transaction_id=9999999, tablet_id=10272, version=4
W20230912 15:28:41.339742 1623 engine_publish_version_task.cpp:227] publish version failed on transaction, tablet version not exists. transaction_id=9999999, tablet_id=10276, version=4
W20230912 15:28:41.339751 1623 engine_publish_version_task.cpp:227] publish version failed on transaction, tablet version not exists. transaction_id=9999999, tablet_id=10280, version=4
W20230912 15:28:41.339762 1623 engine_publish_version_task.cpp:227] publish version failed on transaction, tablet version not exists. transaction_id=9999999, tablet_id=10284, version=4
```
FE log:
```
2023-09-12 15:29:55,082 INFO (PUBLISH_VERSION|36)
[DatabaseTransactionMgr.finishTransaction():1032] publish version failed for
transaction TransactionState. transaction id: 1002, label:
insert_d50da1383dcd4f4d_8b9c3cfe227dd56d, db id: 10256, table id list: 10258,
callback id: -1, coordinator: FE: 128.0.1.1, transaction status: COMMITTED,
error replicas num: 30, replica ids: 10261,10262,10263,10265,10266, prepare
time: 1694532521141, commit time: 1694532521285, finish time: -1, reason:
on tablet 10292 with version 4, and has failed replicas, quorum num 2. table
10258, partition 10257, tablet detail: 3 replicas write
data failed: { [replicaId=10293, backendId=10003, backendAlive=true,
version=3, state=NORMAL], [replicaId=10294, backendId=10002, backendAlive=true,
version=3, state=NORMAL], [replicaId=10295, backendId=10004, backendAlive=true,
version=3, state=NORMAL] };
```