On Thu, Apr 11, 2024 at 12:20 PM Vladislav Odintsov <odiv...@gmail.com>
wrote:

> Hi all,
>
> I’m running OVN 22.09 and sometimes see ovn-controllers crash with a
> segmentation fault.  The backtrace is as follows:
>
> (gdb) bt
> #0  0x00007f0742707de1 in __strlen_sse2 () from /lib64/libc.so.6
> #1  0x00007f0742788c5d in inet_pton () from /lib64/libc.so.6
> #2  0x0000564f45a1c784 in ip_parse (s=<optimized out>, 
> ip=ip@entry=0x7f074040f90c)
> at lib/packets.c:698
> #3  0x0000564f4594cbfb in svc_monitor_send_tcp_health_check__
> (swconn=swconn@entry=0x7f0738000940,
>     svc_mon=svc_mon@entry=0x564f4c2960c0, ctl_flags=ctl_flags@entry=2,
> tcp_seq=3858078915, tcp_ack=tcp_ack@entry=0,
>     tcp_src=<optimized out>) at controller/pinctrl.c:7513
> #4  0x0000564f4594d47c in svc_monitor_send_tcp_health_check__
> (tcp_src=<optimized out>, tcp_ack=0, tcp_seq=<optimized out>,
>     ctl_flags=2, svc_mon=0x564f4c2960c0, swconn=0x7f0738000940) at
> controller/pinctrl.c:7502
> #5  svc_monitor_send_health_check (swconn=swconn@entry=0x7f0738000940,
> svc_mon=svc_mon@entry=0x564f4c2960c0)
>     at controller/pinctrl.c:7621
> #6  0x0000564f4595869b in svc_monitors_run
> (svc_monitors_next_run_time=0x564f45dd3970
> <svc_monitors_next_run_time.37793>,
>     swconn=0x7f0738000940) at controller/pinctrl.c:7693
> #7  pinctrl_handler (arg_=0x564f45e11240 <pinctrl>) at
> controller/pinctrl.c:3499
> #8  0x0000564f45a0ad6f in ovsthread_wrapper (aux_=<optimized out>) at
> lib/ovs-thread.c:422
> #9  0x00007f074325bea5 in start_thread () from /lib64/libpthread.so.0
> #10 0x00007f07427798dd in clone () from /lib64/libc.so.6
>
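> As I read the trace, frames #0-#3 boil down to ip_parse() handing a string
> taken from the cached SB row (presumably sb_svc_mon->ip) straight to
> inet_pton(), which strlen()s it.  Below is a minimal, self-contained sketch
> of why a bad pointer there faults inside libc - my own illustration with
> made-up names, not the OVS/OVN source:
>
> #include <arpa/inet.h>
> #include <stdbool.h>
> #include <stdio.h>
>
> /* Roughly what lib/packets.c:ip_parse() does (paraphrased). */
> static bool
> ip_parse_sketch(const char *s, struct in_addr *ip)
> {
>     return inet_pton(AF_INET, s, ip) == 1;
> }
>
> int
> main(void)
> {
>     struct in_addr ip;
>
>     /* A valid, NUL-terminated string is fine... */
>     printf("valid:   %d\n", ip_parse_sketch("192.0.2.10", &ip));
>
>     /* ...but inet_pton() scans the string up front (the strlen() in frame
>      * #0), so a dangling/garbage pointer like the 'ip' value in the dump
>      * below almost certainly faults. */
>     const char *garbage = (const char *) 0x564f00000000UL;
>     printf("garbage: %d\n", ip_parse_sketch(garbage, &ip)); /* expect SIGSEGV */
>
>     return 0;
> }
>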
> After moving to frame #3, I can read the actual data from the svc_mon
> structure (port/protocol/dp_key/port_key).  I looked them up in the SB DB
> and found the port_binding, which belongs to a logical port that resides
> on this chassis.
> That port has an LB with a health check configured, so everything looks
> good up to here.  But if I check the svc_mon->sb_svc_mon structure, it
> seems to contain garbage - "Address 0x564f00000000 out of bounds",
> logical_port == 0, etc. (though I could be wrong):
>
> $1 = (const struct sbrec_service_monitor *) 0x564f54db2b40
> (gdb) print *svc_mon->sb_svc_mon
> $2 = {header_ = {hmap_node = {hash = 94898726054728, next = 0x0}, uuid =
> {parts = {0, 0, 0, 0}}, src_arcs = {prev = 0x564f54aae0d0, next = 0x0},
> dst_arcs = {prev = 0x564f7f8bd470, next = 0x564f7f8bd540}, table = 0x64,
> old_datum = 0xf,
>     parsed = 152, reparse_node = {prev = 0x0, next = 0x0}, new_datum =
> 0x0, prereqs = 0x52eb8916, written = 0x171, txn_node = {hash = 1, next =
> 0x564f54db2db0}, map_op_written = 0x0, map_op_lists = 0x0, set_op_written =
> 0x0,
>     set_op_lists = 0x0, change_seqno = {0, 0, 0}, track_node = {prev =
> 0x564f00000000, next = 0x0}, updated = 0x0, tracked_old_datum = 0x0},
> external_ids = {map = {buckets = 0x1, one = 0x564f54db2d90, mask = 0, n =
> 0}},
>   ip = 0x564f00000000 <Address 0x564f00000000 out of bounds>, logical_port
> = 0x0, options = {map = {buckets = 0x0, one = 0x0, mask = 1, n =
> 94898780242768}}, port = 0, protocol = 0x0, src_ip = 0x1 <Address 0x1 out
> of bounds>,
>   src_mac = 0x564f54db2d70 "`Ջ\177OV", status = 0x0}
> …
> (gdb) print svc_mon->state
> $8 = SVC_MON_S_ONLINE
> (gdb) print svc_mon->status
> $9 = SVC_MON_ST_ONLINE
> (gdb) print svc_mon->protocol
> $10 = SVC_MON_PROTO_TCP
> (gdb) print svc_mon->sb_svc_mon
>
> This crash occurred right after the SB ovsdb connection was lost due to
> an inactivity probe failure.  So ovn-controller was re-connecting to the
> SB, and I guess this could somehow have re-initialized the SB IDL objects.
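>
> To make that hypothesis concrete, the pattern I suspect looks roughly like
> the sketch below (my own stand-in types, not the actual pinctrl.c/IDL code):
> pinctrl caches a raw pointer to an IDL-owned row, and if the IDL frees and
> rebuilds its rows on reconnect while the cache still holds the old pointer,
> every later dereference is a use-after-free:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> struct sb_row_sketch {          /* stands in for sbrec_service_monitor */
>     char *ip;
>     int port;
> };
>
> struct svc_monitor_sketch {     /* stands in for pinctrl's svc_monitor */
>     const struct sb_row_sketch *sb_svc_mon;  /* borrowed, not owned */
> };
>
> int
> main(void)
> {
>     struct sb_row_sketch *row = malloc(sizeof *row);
>     row->ip = strdup("192.0.2.10");
>     row->port = 80;
>
>     struct svc_monitor_sketch mon = { .sb_svc_mon = row };
>
>     /* SB connection drops; the IDL tears down its rows.  Nothing clears
>      * or refreshes the cached pointer in 'mon'. */
>     free(row->ip);
>     free(row);
>
>     /* The health-check path later dereferences the stale pointer, just
>      * like frame #3 does with svc_mon->sb_svc_mon->ip: use-after-free,
>      * which can look exactly like the garbage in the dump above. */
>     printf("ip=%s port=%d\n", mon.sb_svc_mon->ip, mon.sb_svc_mon->port);
>     return 0;
> }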
>
> I’m not sure I can reproduce this behaviour on the latest main branch,
> so my question is whether this could theoretically be connected with the
> re-initialization of the IDL.
> If yes, what should be done to avoid such behaviour?
> Should ovn-controller process changes at all while its IDL is in an
> inconsistent state?
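>
> (Just an idea, not a patch: would it make sense for pinctrl to copy the few
> fields the health-check thread needs out of the SB row while the row is
> known to be valid, instead of keeping the raw pointer?  Something along the
> lines of this sketch, with made-up names:)
>
> #include <netinet/in.h>   /* INET6_ADDRSTRLEN */
> #include <stdint.h>
> #include <stdio.h>
>
> /* Hypothetical owned copy of the fields used by the health-check thread. */
> struct svc_monitor_owned {
>     char ip[INET6_ADDRSTRLEN];   /* owned copy instead of sb_svc_mon->ip */
>     uint16_t port;
> };
>
> static void
> svc_monitor_owned_init(struct svc_monitor_owned *m,
>                        const char *row_ip, uint16_t row_port)
> {
>     snprintf(m->ip, sizeof m->ip, "%s", row_ip);
>     m->port = row_port;
> }
>
> int
> main(void)
> {
>     struct svc_monitor_owned m;
>
>     svc_monitor_owned_init(&m, "192.0.2.10", 80);
>     printf("ip=%s port=%u\n", m.ip, (unsigned) m.port);
>     return 0;
> }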
>
> Any help is appreciated.
>
> Regards,
> Vladislav Odintsov
>
> _______________________________________________
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>

Hi Vladislav,

I haven't looked closely at what could cause the issue, but I have an idea
how to reproduce it on main. You could try a test that uses `sleep_sb` [0],
combined with `ovn-remote-probe-interval`. That should eventually lead to
the same state.

Regards,
Ales
-- 
Ales Musil
Senior Software Engineer - OVN Core
Red Hat EMEA <https://www.redhat.com>
amu...@redhat.com