** Also affects: neutron (Ubuntu)
   Importance: Undecided
       Status: New

** Also affects: neutron (Ubuntu Focal)
   Importance: Undecided
       Status: New

** Also affects: python-oslo.privsep (Ubuntu Focal)
   Importance: Undecided
       Status: New

** Also affects: neutron (Ubuntu Hirsute)
   Importance: Undecided
       Status: New

** Also affects: python-oslo.privsep (Ubuntu Hirsute)
   Importance: Undecided
       Status: New

** Also affects: neutron (Ubuntu Groovy)
   Importance: Undecided
       Status: New

** Also affects: python-oslo.privsep (Ubuntu Groovy)
   Importance: Undecided
       Status: New

** Changed in: neutron (Ubuntu Focal)
   Importance: Undecided => Medium

** Changed in: neutron (Ubuntu Focal)
       Status: New => Triaged

** Changed in: neutron (Ubuntu Groovy)
   Importance: Undecided => Medium

** Changed in: neutron (Ubuntu Groovy)
       Status: New => Triaged

** Changed in: neutron (Ubuntu Hirsute)
   Importance: Undecided => Medium

** Changed in: neutron (Ubuntu Hirsute)
       Status: New => Triaged

** Also affects: cloud-archive
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
       Status: New

** Changed in: cloud-archive/ussuri
   Importance: Undecided => Medium

** Changed in: cloud-archive/ussuri
       Status: New => Triaged

** Changed in: cloud-archive/victoria
   Importance: Undecided => Medium

** Changed in: cloud-archive/victoria
       Status: New => Triaged

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1896734

Title:
  A privsep daemon spawned by neutron-openvswitch-agent hangs when debug
  logging is enabled (large number of registered NICs) - an RPC response
  is too large for msgpack

Status in OpenStack neutron-openvswitch charm:
  Invalid
Status in Ubuntu Cloud Archive:
  Triaged
Status in Ubuntu Cloud Archive ussuri series:
  Triaged
Status in Ubuntu Cloud Archive victoria series:
  Triaged
Status in neutron:
  Fix Released
Status in oslo.privsep:
  New
Status in neutron package in Ubuntu:
  Triaged
Status in python-oslo.privsep package in Ubuntu:
  New
Status in neutron source package in Focal:
  Triaged
Status in python-oslo.privsep source package in Focal:
  New
Status in neutron source package in Groovy:
  Triaged
Status in python-oslo.privsep source package in Groovy:
  New
Status in neutron source package in Hirsute:
  Triaged
Status in python-oslo.privsep source package in Hirsute:
  New

Bug description:
  When a large number of netdevs are registered in the kernel and
  debug logging is enabled, neutron-openvswitch-agent and the privsep
  daemon it spawns hang because the RPC call result sent by the
  privsep daemon over a unix socket exceeds the message size that the
  msgpack library can handle.

  The impact of this is that enabling debug logging on the cloud
  completely stalls neutron-openvswitch-agents and makes them "dead"
  from the Neutron server perspective.

  The issue is described in detail in comment #5:
  https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5
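
  To picture why an oversized reply can show up as a hang rather than an
  error, here is a toy model that uses only the Python standard library.
  It does not reproduce the actual oslo.privsep or msgpack code paths:
  the length-prefixed framing, the MAX_MESSAGE_SIZE cap and every
  function name are assumptions made for illustration. The idea shown is
  only that a reader which rejects a message above its size limit never
  wakes the caller that is waiting for the reply.

```python
import socket
import struct
import threading

MAX_MESSAGE_SIZE = 1024  # hypothetical cap standing in for the serializer limit


class MessageTooLarge(Exception):
    """Raised by the toy reader when a frame exceeds MAX_MESSAGE_SIZE."""


def send_message(sock, payload):
    # Frame = 4-byte big-endian length followed by the payload.
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, size):
    # Read exactly `size` bytes from a stream socket.
    data = b""
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            raise EOFError("socket closed mid-message")
        data += chunk
    return data


def recv_message(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    if length > MAX_MESSAGE_SIZE:
        raise MessageTooLarge(length)
    return recv_exact(sock, length)


def call_gets_reply(reply, timeout=0.5):
    """Return True if the caller sees the reply within `timeout` seconds."""
    client, daemon = socket.socketpair()  # connected stream socket pair
    replied = threading.Event()

    def reader():
        try:
            recv_message(client)
            replied.set()
        except MessageTooLarge:
            pass  # the reader gives up, so nobody wakes the caller

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()
    send_message(daemon, reply)
    try:
        # A real caller without a timeout would block here forever.
        return replied.wait(timeout)
    finally:
        thread.join(timeout)
        client.close()
        daemon.close()
```

  In this model call_gets_reply(b"x" * 16) completes, while
  call_gets_reply(b"x" * 4096) times out. The timeout exists only so the
  sketch terminates; without it the caller would wait indefinitely.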

  ========================================================================
  Old Description

  While trying to debug a different issue, I encountered a situation
  where privsep hangs in the process of handling a request from
  neutron-openvswitch-agent when debug logging is enabled (juju
  debug-log neutron-openvswitch=true):

  https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
  https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12

  The issue reproduces reliably on all units in the environment where I
  encountered it. As a result, neutron-openvswitch-agent services hang
  while waiting for a response from the privsep daemon and do not
  progress past basic initialization. They never post any state back to
  the Neutron server and are therefore marked dead by it.

  The processes, however, are shown as "active (running)" by systemd,
  which adds to the confusion, since from systemd's perspective they
  did start successfully.

  systemctl --no-pager status neutron-openvswitch-agent.service
  ● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
     Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
   Main PID: 247772 (/usr/bin/python)
      Tasks: 4 (limit: 9830)
     CGroup: /system.slice/neutron-openvswitch-agent.service
             ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
             └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck

  --------------------------------------------------------

  An strace shows that the privsep daemon tries to receive input from fd
  3, which is the unix socket it uses to communicate with the client.
  However, this is just one thread out of many spawned by the privsep
  daemon, so it is unlikely to be the root cause (there are 65 threads
  in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/).

  # there is one extra neutron-openvswitch-agent running in an LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate)
  root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
   860690 100000   /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
   248272 root     /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
   363905 root     grep --color=auto -P privsep.*openvswitch

  root@node2:~# strace -f -p 248453 2>&1
  [pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
  [pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
  [pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
  [pid 248453] recvfrom(3,

  root@node2:~# lsof -p 248453  | grep 3u
  privsep-h 248453 root    3u  unix 0xffff8e6d8abdec00      0t0 356522977 type=STREAM

  root@node2:~# ss -pax | grep 356522977
  u_str   ESTAB   0   0   /tmp/tmp2afa3enn/privsep.sock 356522978   * 356522977   users:(("/usr/bin/python",pid=247567,fd=16))
  u_str   ESTAB   0   0   * 356522977   * 356522978   users:(("privsep-helper",pid=248453,fd=3))

  root@node2:~# lsof -p 247567  | grep 16u
  /usr/bin/ 247567 neutron   16u     unix 0xffff8e6d8abdb400      0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM
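
  The manual lsof/ss lookup above (take the socket inode held by the
  privsep daemon, then find which process holds the other end) can be
  scripted. The following is a minimal sketch, not a tested tool: the
  function name find_peer_users is hypothetical, and it assumes that
  each socket occupies one line of `ss -pax` output with the fields in
  the order shown above and with no spaces in the socket path.

```python
import re

# Sample taken from the `ss -pax` output above, with wrapped lines rejoined.
SS_OUTPUT = """\
u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978 * 356522977 users:(("/usr/bin/python",pid=247567,fd=16))
u_str ESTAB 0 0 * 356522977 * 356522978 users:(("privsep-helper",pid=248453,fd=3))
"""

# Matches one ("command",pid=N,fd=M) entry inside the users:(...) column.
USER_RE = re.compile(r'\("([^"]*)",pid=(\d+),fd=(\d+)\)')


def find_peer_users(ss_output, inode):
    """Return (command, pid, fd) tuples for sockets whose peer is `inode`."""
    peers = []
    for line in ss_output.splitlines():
        # Assumed fields: netid, state, recv-q, send-q, local address,
        # local inode, peer address, peer inode, process info.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[0] != "u_str":
            continue
        if fields[7] == inode:
            peers.extend(
                (cmd, int(pid), int(fd))
                for cmd, pid, fd in USER_RE.findall(fields[8])
            )
    return peers
```

  Given the inode of the daemon's fd 3 (356522977), this returns the
  /usr/bin/python process with pid 247567 and fd 16, which matches the
  result of the manual lookup.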

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1896734/+subscriptions
