howzi opened a new issue, #3058:
URL: https://github.com/apache/brpc/issues/3058

   **Describe the bug**
   See the changes in https://github.com/apache/brpc/issues/1773 and https://github.com/apache/brpc/pull/1817.
   The `_hc_started` flag introduced there can, in an extreme case, cause one health check to be skipped.
   
   
   <img width="1324" height="1266" alt="Image" src="https://github.com/user-attachments/assets/ade077ae-d09e-46ac-a7a4-f7d49c643c25" />
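   The underlying problem is the same as in #1773: Revive restores the version of `_versioned_ref` at point 1. If the thread is switched out before reaching point 2: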
   <img width="1460" height="1072" alt="Image" src="https://github.com/user-attachments/assets/a517db2f-9eed-4a65-9549-0e31248c7a72" />
   At this point SetFailed updates the version of `_versioned_ref` again (+1) and enters OnFailed.
   
   <img width="1474" height="1030" alt="Image" src="https://github.com/user-attachments/assets/3e242e66-7572-43ad-92b4-b03ac926a394" />
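   At this point the previous HC has not finished yet, so the attempt to change `_hc_started` fails and triggering the health check is skipped. After that, `_versioned_ref` can never be restored.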
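
The interleaving described above can be replayed in a single thread with a toy model. This is only a sketch under assumptions, not brpc's `Socket` code: `ToySocket`, `RevivePoint1`, `RevivePoint2`, `Outcome`, and `ReplayInterleaving` are hypothetical names, the version change is reduced to a `failed` flag, and point 2 is assumed to be where the running health check clears `_hc_started`.

```cpp
#include <atomic>

// Toy model of the race. NOT brpc's Socket; every name here is hypothetical.
struct ToySocket {
    std::atomic<bool> failed{false};      // stands in for the version change in _versioned_ref
    std::atomic<bool> hc_started{false};  // stands in for _hc_started
    int hc_launched = 0;                  // number of health checks actually started

    // Marks the socket failed. Returns true only if a health check was started.
    bool SetFailed() {
        bool was_failed = false;
        if (!failed.compare_exchange_strong(was_failed, true)) {
            return false;  // already failed, nothing to do
        }
        bool expected = false;
        if (hc_started.compare_exchange_strong(expected, true)) {
            ++hc_launched;  // a new health check is started
            return true;
        }
        return false;  // flag still set by the previous HC: the check is skipped
    }

    // Point 1 (assumed): Revive makes the socket healthy again.
    void RevivePoint1() { failed.store(false); }

    // Point 2 (assumed): the running health check finishes and clears the flag.
    void RevivePoint2() { hc_started.store(false); }
};

struct Outcome {
    bool first_started;   // did the first SetFailed start an HC?
    bool second_started;  // did the second SetFailed start an HC?
    bool failed;          // final socket state
    bool hc_started;      // final flag state
    int hc_launched;      // total health checks started
};

// Replays the problematic order: SetFailed, point 1, SetFailed, point 2.
Outcome ReplayInterleaving() {
    ToySocket s;
    Outcome o{};
    o.first_started = s.SetFailed();   // first failure starts the HC
    s.RevivePoint1();                  // HC revives the socket, then is switched out
    o.second_started = s.SetFailed();  // second failure arrives before point 2
    s.RevivePoint2();                  // first HC resumes and finishes
    o.failed = s.failed.load();
    o.hc_started = s.hc_started.load();
    o.hc_launched = s.hc_launched;
    return o;
}
```

In this model the replay ends with the socket failed and no health check running or pending, which corresponds to the stuck state reported here.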
   
   
   **To Reproduce**
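   Same as #1773; the steps are copied here.
   
   You can use the client/server test example from the brpc examples. Add a `bthread_usleep(10000*1000)` (a 10 s sleep) between points 1 and 2 in the figure above to simulate the pthread being switched away between the two operations, then: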
   
   ./server
   ./client
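   restart server, to trigger the first SetFailed on the client side, which enters the health check and reaches position 1 in Socket::Revive (right after the CAS succeeds), where the sleep begins
   restart server again before the sleep completes, to trigger the second SetFailed on the client side, which does not decrement the socket refcount by 1
   This reproduces the issue reliably.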
   
   **Expected behavior**
   
   **Versions**
   All
   
   **Additional context/screenshots**
   



