gaodayue opened a new issue #1168: URL: https://github.com/apache/incubator-brpc/issues/1168
**Describe the bug**

After a downstream node X fails and restarts, the cluster sometimes ends up with one upstream node Y that never reconnects to X, while the other upstream nodes re-establish their connections to X after health checking. For example:

1) Downstream node 10.26.44.32 restarted at 09:02:17 because of a failure. One upstream node Y did not re-establish its connection to 10.26.44.32 and kept logging "Not connected to 10.26.44.32:8060 yet":

```
W0716 09:02:17.695824 142210 input_messenger.cpp:212] Fail to read from Socket{id=896 fd=4228 addr=10.26.44.32:8060:55309} (0xa352000): Connection reset by peer [104]
W0716 09:02:17.702852 141955 data_stream_sender.cpp:138] failed to send brpc batch, error=Host is down, error_text=[E104]Fail to read from Socket{id=896 fd=4228 addr=10.26.44.32:8060:55309} (0x0xa352000): Connection reset by peer [R1][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R2][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R3][E112]Not connected to 10.26.44.32:8060 yet, server_id=896
....similar entries omitted....
W0716 09:09:58.714361 38298 data_stream_sender.cpp:138] failed to send brpc batch, error=Host is down, error_text=[E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R1][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R2][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R3][E112]Not connected to 10.26.44.32:8060 yet, server_id=896
```

2) netstat shows no TCP connection between Y and 10.26.44.32.

3) Y's /connections page shows the Socket in the Broken state, with the following details:

```
$ curl http://localhost:8060/connections | grep 10.26.44.32:8060
Broken |10.26.44.32:8060 |55309|- |- |- |- |- |- |- |- |- |- |- |- |896
$ curl http://localhost:8060/sockets/896
# This is a broken Socket
version=1
shared_part={
  ref_count=1
  socket_pool=null
  creator_socket=896
  in_size=316616369
  in_num_messages=12120483
  out_size=114960271066
  out_num_messages=12120511
}
nref=1
nevent=1
fd=4228
tos=0
reset_fd_to_now=485975008182us
remote_side=10.26.44.32:8060
local_side=10.22.180.15:55309
on_et_events=0x1bc5dd0
user=(brpc::InputMessenger*)0x5c0ab40
this_id=896
preferred_index=1 (baidu_std)
hc_count=0
avg_input_msg_size=26
read_buf=0
last_read_to_now=960432766us
last_write_to_now=960412394us
overcrowded=0
id_wait_list={}
parsing_context=0
pipeline_q=0
hc_interval_s=3
ninprocess=1
auth_flag_error=0
auth_id=177098681547473
auth_context=0
logoff_flag=0
recycle_flag=1
agent_socket_id=(none)
cid=0
write_head=0
ssl_state=SSL_OFF
tcpi={
  state=7
  ca_state=0
  retransmits=0
  probes=0
  backoff=0
  options=7
  snd_wscale=7
  rcv_wscale=7
  rto=205000
  ato=40000
  snd_mss=1448
  rcv_mss=736
  unacked=0
  sacked=0
  lost=0
  retrans=0
  fackets=0
  last_data_sent=960413
  last_ack_sent=0
  last_data_recv=960433
  last_ack_recv=960413
  pmtu=1500
  rcv_ssthresh=52260
  rtt=2750
  rttvar=3000
  snd_ssthresh=18
  snd_cwnd=18
  advmss=1448
  reordering=3
}
```

4) Y's log contains no "Checking Socket" or "Revived Socket" entries. Health checking is enabled on the cluster (health_check_interval = 3), and the other upstream nodes do log both Checking and Revived messages.

**To Reproduce**

Occurs with low probability in production. At present the only way to recover is to restart the upstream node.

**Versions**

OS: CentOS Linux release 7.1.1503 (Core)
Compiler: gcc (GCC) 7.2.0
brpc: 0.9.5
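
The check in step 3 can be scripted so that a stuck socket is noticed without manually grepping each upstream node. What follows is a minimal Python sketch, not part of brpc. It assumes the `/connections` rows keep the pipe-separated layout seen in this report: the `Broken` marker in the first column, the remote address in the second, and the SocketId in the last. `find_broken` is a hypothetical helper name introduced only for illustration.

```python
def find_broken(connections_text, remote):
    """Return SocketIds of rows marked Broken whose remote address matches.

    Assumes the row layout seen in this report:
    first column = state marker, second = remote ip:port, last = SocketId.
    """
    ids = []
    for line in connections_text.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) < 3:
            continue  # not a table row
        if cols[0] == "Broken" and cols[1] == remote and cols[-1].isdigit():
            ids.append(int(cols[-1]))
    return ids


# Row copied from the /connections output in this report.
sample = "Broken |10.26.44.32:8060 |55309|- |- |- |- |- |- |- |- |- |- |- |- |896"
print(find_broken(sample, "10.26.44.32:8060"))
```

In practice the text would come from `curl http://localhost:8060/connections` on the upstream node. A SocketId that stays in the result for much longer than the health-check interval would match the symptom described here.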