On Thu, Apr 11, 2024 at 12:20 PM Vladislav Odintsov <odiv...@gmail.com> wrote:
> Hi all,
>
> I’m running ovn 22.09 and sometimes see ovn-controllers crash with a
> segmentation fault. The backtrace is as follows:
>
> (gdb) bt
> #0  0x00007f0742707de1 in __strlen_sse2 () from /lib64/libc.so.6
> #1  0x00007f0742788c5d in inet_pton () from /lib64/libc.so.6
> #2  0x0000564f45a1c784 in ip_parse (s=<optimized out>,
>     ip=ip@entry=0x7f074040f90c) at lib/packets.c:698
> #3  0x0000564f4594cbfb in svc_monitor_send_tcp_health_check__ (
>     swconn=swconn@entry=0x7f0738000940,
>     svc_mon=svc_mon@entry=0x564f4c2960c0, ctl_flags=ctl_flags@entry=2,
>     tcp_seq=3858078915, tcp_ack=tcp_ack@entry=0,
>     tcp_src=<optimized out>) at controller/pinctrl.c:7513
> #4  0x0000564f4594d47c in svc_monitor_send_tcp_health_check__ (
>     tcp_src=<optimized out>, tcp_ack=0, tcp_seq=<optimized out>,
>     ctl_flags=2, svc_mon=0x564f4c2960c0, swconn=0x7f0738000940)
>     at controller/pinctrl.c:7502
> #5  svc_monitor_send_health_check (swconn=swconn@entry=0x7f0738000940,
>     svc_mon=svc_mon@entry=0x564f4c2960c0) at controller/pinctrl.c:7621
> #6  0x0000564f4595869b in svc_monitors_run (
>     svc_monitors_next_run_time=0x564f45dd3970
>     <svc_monitors_next_run_time.37793>, swconn=0x7f0738000940)
>     at controller/pinctrl.c:7693
> #7  pinctrl_handler (arg_=0x564f45e11240 <pinctrl>)
>     at controller/pinctrl.c:3499
> #8  0x0000564f45a0ad6f in ovsthread_wrapper (aux_=<optimized out>)
>     at lib/ovs-thread.c:422
> #9  0x00007f074325bea5 in start_thread () from /lib64/libpthread.so.0
> #10 0x00007f07427798dd in clone () from /lib64/libc.so.6
>
> After moving to frame #3, I can read the actual data from the svc_mon
> structure (port/protocol/dp_key/port_key). I’ve looked them up in the
> SB DB and found the port_binding, which belongs to a logical port that
> resides on this chassis. It has a load balancer with a health check
> configured. Everything here looks good. But if I check the
> svc_mon->sb_svc_mon structure, it seems to me that it contains garbage:
> "Address 0x564f00000000 out of bounds", logical_port == 0, etc. (but I
> can be wrong):
>
> (gdb) print svc_mon->sb_svc_mon
> $1 = (const struct sbrec_service_monitor *) 0x564f54db2b40
> (gdb) print *svc_mon->sb_svc_mon
> $2 = {header_ = {hmap_node = {hash = 94898726054728, next = 0x0},
>     uuid = {parts = {0, 0, 0, 0}}, src_arcs = {prev = 0x564f54aae0d0,
>     next = 0x0}, dst_arcs = {prev = 0x564f7f8bd470,
>     next = 0x564f7f8bd540}, table = 0x64, old_datum = 0xf,
>     parsed = 152, reparse_node = {prev = 0x0, next = 0x0},
>     new_datum = 0x0, prereqs = 0x52eb8916, written = 0x171,
>     txn_node = {hash = 1, next = 0x564f54db2db0}, map_op_written = 0x0,
>     map_op_lists = 0x0, set_op_written = 0x0, set_op_lists = 0x0,
>     change_seqno = {0, 0, 0}, track_node = {prev = 0x564f00000000,
>     next = 0x0}, updated = 0x0, tracked_old_datum = 0x0},
>   external_ids = {map = {buckets = 0x1, one = 0x564f54db2d90,
>     mask = 0, n = 0}},
>   ip = 0x564f00000000 <Address 0x564f00000000 out of bounds>,
>   logical_port = 0x0, options = {map = {buckets = 0x0, one = 0x0,
>     mask = 1, n = 94898780242768}}, port = 0, protocol = 0x0,
>   src_ip = 0x1 <Address 0x1 out of bounds>,
>   src_mac = 0x564f54db2d70 "`Ջ\177OV", status = 0x0}
> …
> (gdb) print svc_mon->state
> $8 = SVC_MON_S_ONLINE
> (gdb) print svc_mon->status
> $9 = SVC_MON_ST_ONLINE
> (gdb) print svc_mon->protocol
> $10 = SVC_MON_PROTO_TCP
>
> This crash occurred right after the SB ovsdb connection was lost due to
> an inactivity probe failure. So ovn-controller was re-connecting to the
> SB and, I guess, this could somehow re-initialize the SB IDL objects.
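That last guess fits the gdb output: the sbrec_service_monitor contents
look like freed and re-used heap memory rather than a valid row. To make
the suspected failure mode concrete, below is a minimal, self-contained
simulation: a long-lived structure caches a pointer into the IDL, the
rows are freed behind its back (as happens when the IDL flushes its
contents on reconnect), and the next health-check wakeup dereferences
the stale pointer. The structs are deliberately simplified stand-ins for
the real IDL row and the svc_monitor state in controller/pinctrl.c, not
the actual code:

/* Self-contained simulation of the suspected use-after-free.  The
 * structs below are simplified stand-ins, not the real definitions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct sbrec_service_monitor {      /* stand-in for the IDL row */
    char *ip;
};

struct svc_monitor {                /* stand-in for pinctrl's state */
    const struct sbrec_service_monitor *sb_svc_mon;  /* cached pointer */
};

int main(void)
{
    struct sbrec_service_monitor *row = malloc(sizeof *row);
    row->ip = strdup("192.0.2.10");

    struct svc_monitor mon = { .sb_svc_mon = row };

    /* SB connection lost: the IDL clears its contents, freeing rows. */
    free(row->ip);
    free(row);

    /* Next health-check wakeup: same dereference pattern as
     * ip_parse(svc_mon->sb_svc_mon->ip, ...) in frame #3, i.e. a
     * use-after-free that may print garbage or segfault. */
    printf("ip = %s\n", mon.sb_svc_mon->ip);
    return 0;
}

Built with -fsanitize=address, this reports a heap-use-after-free on the
final printf, which is the same class of failure the backtrace shows
inside inet_pton()/ip_parse().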
> I’m not sure I can try to reproduce this behaviour on the latest main
> branch, so my question is: can this theoretically be connected with
> re-initialization of the IDL? If yes, what should be done to avoid such
> behavior? Should ovn-controller process changes if its IDL is in an
> inconsistent state?
>
> Any help is appreciated.
>
> Regards,
> Vladislav Odintsov

Hi Vladislav,

I didn't look closely at what could cause the issue; however, I have an
idea of how to reproduce it on main. You could try a test that uses
`sleep_sb` [0], combining it with `ovn-remote-probe-interval`. That
should eventually lead to the same state.

Regards,
Ales

--
Ales Musil
Senior Software Engineer - OVN Core
Red Hat EMEA <https://www.redhat.com>
amu...@redhat.com
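On the question of what should be done to avoid such behavior: a
defensive pattern is not to cache IDL row pointers across main-loop
iterations at all, since the IDL owns those rows and frees them when it
flushes its contents, e.g. on reconnect. Long-lived state can instead be
keyed by the row UUID, with the row re-resolved every iteration. A
minimal sketch, assuming the generated SB IDL accessor
sbrec_service_monitor_get_for_uuid() and a simplified svc_monitor; this
is illustrative only, not an actual patch against pinctrl.c:

/* Key health-check state by the SB row's UUID and re-resolve the row
 * each iteration instead of caching the pointer. */
#include "lib/ovn-sb-idl.h"    /* generated SB IDL bindings */

struct svc_monitor {
    struct uuid sb_uuid;       /* stable key instead of a raw pointer */
    /* ... port, protocol, state, ... */
};

static const struct sbrec_service_monitor *
svc_monitor_resolve(const struct ovsdb_idl *sb_idl,
                    const struct svc_monitor *svc_mon)
{
    /* Generated accessor; returns NULL once the row is gone, e.g.
     * after the IDL flushed its contents on reconnect. */
    return sbrec_service_monitor_get_for_uuid(sb_idl, &svc_mon->sb_uuid);
}

If the lookup returns NULL, the health check for that monitor can simply
be skipped until the row is re-learned from the SB; either way nothing
dereferences freed memory.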