gaodayue opened a new issue #1168:
URL: https://github.com/apache/incubator-brpc/issues/1168


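   **Describe the bug**
   After a downstream node X restarts because of a failure, the cluster sometimes ends up with one upstream node Y that never reconnects to X, while the other upstream nodes re-establish their connections to X after health checking. For example: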
   
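   1) After downstream node 10.26.44.32 restarted at 09:02:17 because of a failure, one upstream node Y did not re-establish its connection to 10.26.44.32, and its log kept printing "Not connected to 10.26.44.32:8060 yet".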
   
   ```
   W0716 09:02:17.695824 142210 input_messenger.cpp:212] Fail to read from Socket{id=896 fd=4228 addr=10.26.44.32:8060:55309} (0xa352000): Connection reset by peer [104]
   W0716 09:02:17.702852 141955 data_stream_sender.cpp:138] failed to send brpc batch, error=Host is down, error_text=[E104]Fail to read from Socket{id=896 fd=4228 addr=10.26.44.32:8060:55309} (0x0xa352000): Connection reset by peer [R1][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R2][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R3][E112]Not connected to 10.26.44.32:8060 yet, server_id=896
   ....(similar lines omitted)....
   W0716 09:09:58.714361 38298 data_stream_sender.cpp:138] failed to send brpc batch, error=Host is down, error_text=[E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R1][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R2][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R3][E112]Not connected to 10.26.44.32:8060 yet, server_id=896
   ```
   
   2) netstat shows no TCP connection between Y and 10.26.44.32.
   3) Y's /connections page shows the Socket in the Broken state, with the following details:
   
   ```
   $ curl http://localhost:8060/connections | grep 10.26.44.32:8060
   Broken                    |10.26.44.32:8060   |55309|-  |-           |-    |-        |-     |-         |-       |-         |-     |-         |-       |-          |896
   
   $ curl http://localhost:8060/sockets/896
   # This is a broken Socket
   version=1
   shared_part={
     ref_count=1
     socket_pool=null
     creator_socket=896
     in_size=316616369
     in_num_messages=12120483
     out_size=114960271066
     out_num_messages=12120511
   }
   nref=1
   nevent=1
   fd=4228
   tos=0
   reset_fd_to_now=485975008182us
   remote_side=10.26.44.32:8060
   local_side=10.22.180.15:55309
   on_et_events=0x1bc5dd0
   user=(brpc::InputMessenger*)0x5c0ab40
   this_id=896
   preferred_index=1 (baidu_std)
   hc_count=0
   avg_input_msg_size=26
   read_buf=0
   last_read_to_now=960432766us
   last_write_to_now=960412394us
   overcrowded=0
   id_wait_list={}
   parsing_context=0
   pipeline_q=0
   hc_interval_s=3
   ninprocess=1
   auth_flag_error=0
   auth_id=177098681547473
   auth_context=0
   logoff_flag=0
   recycle_flag=1
   agent_socket_id=(none)
   cid=0
   write_head=0
   ssl_state=SSL_OFF
   tcpi={
     state=7
     ca_state=0
     retransmits=0
     probes=0
     backoff=0
     options=7
     snd_wscale=7
     rcv_wscale=7
     rto=205000
     ato=40000
     snd_mss=1448
     rcv_mss=736
     unacked=0
     sacked=0
     lost=0
     retrans=0
     fackets=0
     last_data_sent=960413
     last_ack_sent=0
     last_data_recv=960433
     last_ack_recv=960413
     pmtu=1500
     rcv_ssthresh=52260
     rtt=2750
     rttvar=3000
     snd_ssthresh=18
     snd_cwnd=18
     advmss=1448
     reordering=3
   }
   ```
   
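   4) Y's log contains no "Checking Socket" or "Revived Socket" messages (health checking is enabled in the cluster with health_check_interval = 3, and the other upstream nodes do log Checking and Revived messages).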
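
To make observations 3) and 4) easier to check across many nodes, the `/sockets/<id>` dump can be parsed mechanically. The following is only a minimal sketch: `parse_socket_dump` and `health_check_never_ran` are hypothetical helper names, and it assumes that `hc_count` counts health-check attempts made since the socket failed, which this report does not confirm.

```python
def parse_socket_dump(text):
    """Parse the key=value lines of a /sockets/<id> dump into a flat dict.

    Keys inside a `name={ ... }` section are stored as "name.key".
    """
    fields = {"broken": False}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The dump in this report starts with "# This is a broken Socket".
            if "broken Socket" in line:
                fields["broken"] = True
            continue
        if line == "}":
            section = None
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        if value == "{":
            section = key
            continue
        fields[f"{section}.{key}" if section else key] = value
    return fields


def health_check_never_ran(fields):
    """True if the socket is broken and health checking is configured,
    yet hc_count is still 0 (ASSUMPTION: hc_count counts attempts)."""
    return (
        fields["broken"]
        and int(fields.get("hc_interval_s", "0")) > 0
        and int(fields.get("hc_count", "0")) == 0
    )


# Excerpt of the dump quoted in this report.
SAMPLE = """\
# This is a broken Socket
version=1
shared_part={
  ref_count=1
  socket_pool=null
}
nref=1
fd=4228
remote_side=10.26.44.32:8060
hc_count=0
hc_interval_s=3
recycle_flag=1
tcpi={
  state=7
}
"""

if __name__ == "__main__":
    fields = parse_socket_dump(SAMPLE)
    # tcpi.state is the Linux TCP state number; 7 is TCP_CLOSE.
    print(fields["remote_side"], health_check_never_ran(fields), fields["tcpi.state"])
```

In practice the text would come from `curl http://localhost:8060/sockets/896` on the affected node instead of the embedded excerpt.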
   
   **To Reproduce**
   
   This occurs with low probability in production. Currently the only way to recover is to restart the upstream node.
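
Since there is no reliable reproduction yet, a log scan may help spot affected upstream nodes before a manual restart. The sketch below is an illustration under stated assumptions, not part of brpc: `stuck_endpoints` is a hypothetical helper, each log entry is assumed to be on a single line, and "Checking Socket" / "Revived Socket" lines are assumed to contain the remote `ip:port`.

```python
import re
import sys
from collections import Counter

# Matches the message quoted in this report, e.g.
# "Not connected to 10.26.44.32:8060 yet".
NOT_CONNECTED = re.compile(r"Not connected to (\d+\.\d+\.\d+\.\d+:\d+) yet")

# Substrings of the health-check messages named in this report.
HC_MARKERS = ("Checking Socket", "Revived Socket")


def stuck_endpoints(log_lines):
    """Return {endpoint: error_count} for endpoints that have
    "Not connected" errors but no health-check log line mentioning them.

    ASSUMPTION: health-check lines include the endpoint's ip:port.
    """
    failures = Counter()
    hc_lines = []
    for line in log_lines:
        for endpoint in NOT_CONNECTED.findall(line):
            failures[endpoint] += 1
        if any(marker in line for marker in HC_MARKERS):
            hc_lines.append(line)
    return {
        endpoint: count
        for endpoint, count in failures.items()
        if not any(endpoint in line for line in hc_lines)
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_log.py /path/to/upstream.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for endpoint, count in stuck_endpoints(log).items():
            print(f"{endpoint}: {count} 'Not connected' errors, no health-check lines")
```

An endpoint reported by this scan matches the symptom described above only loosely, so it is worth confirming with `/connections` on the same node.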
   
   **Versions**
   OS: CentOS Linux release 7.1.1503 (Core)
   Compiler: gcc (GCC) 7.2.0
   brpc: 0.9.5
   
   




