Hi All,

We ran into unexpected health-check behavior in the load balancer when the backend is a Windows server, for example:
ovn-controller --> win_srv SYN
ovn-controller <-- win_srv SYN_ACK
ovn-controller --> win_srv RST_ACK
ovn-controller <-- win_srv SYN_ACK TCP Retransmission
ovn-controller --> win_srv RST_ACK
ovn-controller <-- win_srv SYN_ACK TCP Retransmission
ovn-controller --> win_srv RST_ACK
ovn-controller <-- win_srv RST

Dump, where 172.20.2.2 is ovn-controller and 172.20.2.49 is the Windows server backend:

17:18:20.324464 IP 172.20.2.2.29636 > 172.20.2.49.7045: Flags [S], seq 2289025863, win 65160, length 0
17:18:20.324572 IP 172.20.2.49.7045 > 172.20.2.2.29636: Flags [S.], seq 2447380613, ack 2289025864, win 8192, options [mss 1460], length 0
17:18:20.325233 IP 172.20.2.2.29636 > 172.20.2.49.7045: Flags [R.], seq 2, ack 1, win 65160, length 0
17:18:23.336091 IP 172.20.2.49.7045 > 172.20.2.2.29636: Flags [S.], seq 2447380613, ack 2289025864, win 8192, options [mss 1460], length 0
17:18:23.336559 IP 172.20.2.2.29636 > 172.20.2.49.7045: Flags [R.], seq 2, ack 1, win 65160, length 0
17:18:29.335992 IP 172.20.2.49.7045 > 172.20.2.2.29636: Flags [S.], seq 2447380613, ack 2289025864, win 8192, options [mss 1460], length 0
17:18:29.336423 IP 172.20.2.2.29636 > 172.20.2.49.7045: Flags [R.], seq 2, ack 1, win 65160, length 0
17:18:41.335919 IP 172.20.2.49.7045 > 172.20.2.2.29636: Flags [R], seq 2447380614, win 0, length 0

Linux is not affected, because a Linux backend simply ignores the RST_ACK and does not retransmit the SYN_ACK. It looks like:
ovn-controller --> linux_srv SYN
ovn-controller <-- linux_srv SYN_ACK
ovn-controller --> linux_srv RST_ACK
The Linux server does not make any TCP retransmissions.

The main issue:
The status in svc_mon flaps between online and offline because of this Windows backend behavior. Every status change in svc_mon triggers ovn-northd and drives its CPU usage up to 100%.

I reproduced this with a simple HTTP server as the backend on a Windows server, so it looks like all Windows backends are affected.

I want to make a patch to fix this behavior, but I don't know which approach is preferred:

1) Add a new boolean field to struct svc_mon. After svc_monitor_run() is called for the init/online/offline cases, set this field in svc_monitor_send_health_check(). Then, when process_packet_in() handles the ACTION_OPCODE_HANDLE_SVC_CHECK case: if the packet_in has the SYN_ACK flags set, flip the boolean field and send an RST_ACK to the backend; if the packet_in has the RST flag set, use the new boolean field to decide whether or not to change svc_mon->state.
The function where the decision is made:
https://github.com/ovn-org/ovn/blob/main/controller/pinctrl.c#L7810-L7858

2) Implement a fully established TCP connection with a graceful close, for example:
ovn-controller --> win_srv SYN
ovn-controller <-- win_srv SYN_ACK
ovn-controller --> win_srv ACK
ovn-controller --> win_srv FIN_ACK
ovn-controller <-- win_srv FIN_ACK
ovn-controller --> win_srv ACK

3) Send an RST instead of an RST_ACK.

It would be great if you could advise which method is better, or whether I should fix it in another way.
Thanks.

--
Regards,
Evgenii Kovalev
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
