** Also affects: neutron
   Importance: Undecided
       Status: New

** Changed in: neutron
   Importance: Undecided => Medium
** Changed in: neutron
     Assignee: (unassigned) => Rodolfo Alonso (rodolfo-alonso-hernandez)

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1896734

Title:
  A privsep daemon spawned by neutron-openvswitch-agent hangs when
  debug logging is enabled (large number of registered NICs) - an RPC
  response is too large for msgpack

Status in OpenStack neutron-openvswitch charm:
  Invalid
Status in neutron:
  New
Status in oslo.privsep:
  New
Status in python-oslo.privsep package in Ubuntu:
  New

Bug description:
  When there is a large number of netdevs registered in the kernel and
  debug logging is enabled, neutron-openvswitch-agent and the privsep
  daemon spawned by it hang, since the RPC call result sent by the
  privsep daemon over a unix socket exceeds the message size that the
  msgpack library can handle. The impact is that enabling debug
  logging on the cloud completely stalls the neutron-openvswitch-agents
  and makes them "dead" from the Neutron server's perspective.

  The issue is summarized in detail in comment #5:
  https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5

  ========================================================================
  Old description:

  While trying to debug a different issue, I encountered a situation
  where privsep hangs while handling a request from
  neutron-openvswitch-agent when debug logging is enabled
  (juju debug-log neutron-openvswitch=true):

  https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
  https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12

  The issue reproduces reliably on all units in the environment where I
  encountered it. As a result, the neutron-openvswitch-agent services
  hang while waiting for a response from the privsep daemon and do not
  progress past basic initialization. They never post any state back to
  the Neutron server and are thus marked dead by it.
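  The failure mode can be illustrated with a minimal, self-contained
  sketch (this is not oslo.privsep code; the payload contents and the
  deliberately small buffer cap are hypothetical). A size-limited
  msgpack Unpacker rejects a message larger than its buffer, the same
  way an oversized privsep RPC reply cannot be handled on the receiving
  side:

```python
import msgpack

# Hypothetical stand-in for a privsep RPC reply describing many
# netdevs; with debug logging enabled the real reply can be even
# larger because of the extra data being returned and logged.
reply = msgpack.packb(["netdev-%d" % i for i in range(100000)])

# A deliberately small cap to demonstrate the limit being exceeded;
# Unpacker.feed() enforces max_buffer_size and raises BufferFull
# when the accumulated message would exceed it.
unpacker = msgpack.Unpacker(max_buffer_size=1024)

rejected = False
try:
    unpacker.feed(reply)
    result = next(unpacker)
except msgpack.exceptions.BufferFull:
    rejected = True
    print("reply rejected: message exceeds the unpacker's buffer limit")
```

  When the reply is rejected (or never fully parsed), the client side
  keeps waiting for a response that will never arrive, which matches
  the observed hang.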
  The processes, though, are shown as "active (running)" by systemd,
  which adds to the confusion, since they do indeed start from
  systemd's perspective.

  systemctl --no-pager status neutron-openvswitch-agent.service
  ● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
     Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
   Main PID: 247772 (/usr/bin/python)
      Tasks: 4 (limit: 9830)
     CGroup: /system.slice/neutron-openvswitch-agent.service
             ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
             └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck

  --------------------------------------------------------

  An strace shows that the privsep daemon tries to receive input from
  fd 3, which is the unix socket it uses to communicate with the
  client.
  However, this is just one thread out of many spawned by the privsep
  daemon, so it is unlikely to be the root cause (there are 65 threads
  in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/).

  # There is one extra neutron-openvswitch-agent running in a LXD
  # container which can be ignored here (there is an octavia unit on
  # the node which has a neutron-openvswitch subordinate).

  root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
   860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
   248272 root   /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
   363905 root   grep --color=auto -P privsep.*openvswitch

  root@node2:~# strace -f -p 248453 2>&1
  [pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
  [pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
  [pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
  [pid 248453] recvfrom(3,

  root@node2:~# lsof -p 248453 | grep 3u
  privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM

  root@node2:~# ss -pax | grep 356522977
  u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978 * 356522977 users:(("/usr/bin/python",pid=247567,fd=16))
  u_str ESTAB 0 0 * 356522977 * 356522978 users:(("privsep-helper",pid=248453,fd=3))

  root@node2:~# lsof -p 247567 | grep 16u
  /usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM

To manage notifications about this bug go to:
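  The client-side symptom can be reproduced in isolation with a short
  sketch (hypothetical code, not taken from neutron or oslo.privsep):
  one end of a unix socket pair blocks in recv() while the other end
  never sends a reply, which is the state the strace above shows as
  "recvfrom(3,". A timeout stands in for the real agent's indefinite
  blocking recv:

```python
import socket

# One end plays the agent, the other the privsep daemon; the daemon
# side stays silent, as it would after failing to deliver its reply.
agent_sock, daemon_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
agent_sock.settimeout(0.2)  # the real agent blocks forever here

timed_out = False
try:
    agent_sock.recv(4096)  # nothing ever arrives from the daemon side
except socket.timeout:
    timed_out = True
    print("no reply from the daemon; a blocking recv() would hang here")

agent_sock.close()
daemon_sock.close()
```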
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1896734/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp