yujun777 opened a new pull request, #24273:
URL: https://github.com/apache/doris/pull/24273

   ## Proposed changes
   
   **BUG**
   
   There is a bug in publishing transactions that can leave a replica recorded 
in FE at a certain version while the tablet on the BE does not actually contain 
that version's data. An example is as follows:
   1. Suppose a partition's current visible version is V and it has a replica A. 
There is a committed txn T whose version is V + 1; T is not published yet.
   2. Clone a new replica B from A. After cloning, B's version = V.
   3. Publish txn T. After publishing, the txn finishes and increases each 
replica's version. B's version has caught up with T's version - 1 = (V + 1) - 1 
= V, and B is not in T's error replica ids, so FE treats B as successful and 
updates B's version to V + 1. But B's BE does not actually contain txn T's data.
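
A minimal sketch of the scenario above, for illustration only. It is a toy Python model, not Doris code: the `Replica` class, `publish_old`, and the idea of tracking BE data as a set of version numbers are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    replica_id: str
    fe_version: int                                 # version FE records for this replica
    be_versions: set = field(default_factory=set)   # versions the BE tablet really holds

def publish_old(replicas, txn_version, error_replica_ids):
    """Pre-fix rule as described above (hypothetical model): a replica counts
    as successful if it is not an error replica and its version has caught up
    with txn_version - 1."""
    for r in replicas:
        if r.replica_id not in error_replica_ids and r.fe_version >= txn_version - 1:
            r.fe_version = max(r.fe_version, txn_version)

V = 3
a = Replica("A", V, set(range(1, V + 2)))  # A received T's data (version V + 1) at load time
b = Replica("B", V, set(range(1, V + 1)))  # B was cloned at V and never received T's data

publish_old([a, b], V + 1, error_replica_ids=set())

# Replicas whose FE version is not backed by data on the BE.
inconsistent = [r.replica_id for r in (a, b) if r.fe_version not in r.be_versions]
print(inconsistent)
```

In this model FE ends up recording B at V + 1 even though B's tablet only holds versions up to V, which is the inconsistency described above.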
   
   **FIX**
   1. When publishing, BE checks whether each tablet contains the version data 
related to this txn. If the tablet contains the version, BE reports the tablet 
as successful.
   2. FE checks the successful tablets reported by each BE. If a replica is not 
in its BE's reported successful tablets, FE treats the replica as failed for 
this version.
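
The two checks can be sketched roughly as follows. This is a simplified Python model under stated assumptions, not the actual FE/BE implementation: the function names `be_report_succ_tablets` and `fe_find_failed_replicas`, the backend names, and the dictionary layouts are hypothetical.

```python
def be_report_succ_tablets(tablets, txn_version):
    """BE side (hypothetical model): report only the tablets that actually
    contain the txn's version. `tablets` maps tablet_id -> set of versions."""
    return {tablet_id for tablet_id, versions in tablets.items()
            if txn_version in versions}

def fe_find_failed_replicas(replicas, succ_by_backend):
    """FE side (hypothetical model): a replica whose tablet is missing from
    its backend's success report is treated as failed for this version.
    `replicas` maps replica_id -> (backend_id, tablet_id)."""
    failed = set()
    for replica_id, (backend_id, tablet_id) in replicas.items():
        if tablet_id not in succ_by_backend.get(backend_id, set()):
            failed.add(replica_id)
    return failed

txn_version = 4
backends = {
    "be1": {10268: {1, 2, 3, 4}},  # replica A's tablet holds the txn's version
    "be2": {10268: {1, 2, 3}},     # replica B's tablet was cloned before the txn's data arrived
}
replicas = {"A": ("be1", 10268), "B": ("be2", 10268)}

succ_by_backend = {be: be_report_succ_tablets(t, txn_version)
                   for be, t in backends.items()}
failed = fe_find_failed_replicas(replicas, succ_by_backend)
print(failed)
```

With this rule, the cloned replica B from the example is marked as failed instead of being advanced to a version it does not hold.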
   
   **TEST**
   
   FE sends a test txn publish task to the BEs. The test txn does not exist on 
the BEs, so each BE reports its tablets as failed, and the publish finally fails 
on FE.
   
   BE log:
   
   ```
   W20230912 15:28:41.339694  1623 engine_publish_version_task.cpp:227] publish 
version failed on transaction, tablet version not exists. 
transaction_id=9999999, tablet_id=10268, version=4
   W20230912 15:28:41.339735  1623 engine_publish_version_task.cpp:227] publish 
version failed on transaction, tablet version not exists. 
transaction_id=9999999, tablet_id=10272, version=4
   W20230912 15:28:41.339742  1623 engine_publish_version_task.cpp:227] publish 
version failed on transaction, tablet version not exists. 
transaction_id=9999999, tablet_id=10276, version=4
   W20230912 15:28:41.339751  1623 engine_publish_version_task.cpp:227] publish 
version failed on transaction, tablet version not exists. 
transaction_id=9999999, tablet_id=10280, version=4
   W20230912 15:28:41.339762  1623 engine_publish_version_task.cpp:227] publish 
version failed on transaction, tablet version not exists. 
transaction_id=9999999, tablet_id=10284, version=4
   ``` 
   
   FE log:
   
   ```
   2023-09-12 15:29:55,082 INFO (PUBLISH_VERSION|36) 
[DatabaseTransactionMgr.finishTransaction():1032] publish version failed for 
transaction TransactionState. transaction id: 1002, label: 
insert_d50da1383dcd4f4d_8b9c3cfe227dd56d, db id: 10256, table id list: 10258, 
callback id: -1, coordinator: FE: 128.0.1.1, transaction status: COMMITTED, 
error replicas num: 30, replica ids: 10261,10262,10263,10265,10266, prepare
   time: 1694532521141, commit time: 1694532521285, finish time: -1, reason:  
on tablet 10292 with version 4, and has failed replicas, quorum num 2. table 
10258, partition 10257, tablet detail: 3 replicas write
   data failed: { [replicaId=10293, backendId=10003, backendAlive=true, 
version=3, state=NORMAL], [replicaId=10294, backendId=10002, backendAlive=true, 
version=3, state=NORMAL], [replicaId=10295, backendId=10004, backendAlive=true, 
version=3, state=NORMAL] };
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

