[Yahoo-eng-team] [Bug 1855919] Re: broken pipe errors cause neutron metadata agent to fail

2023-01-12 Thread OpenStack Infra
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/869770
Committed: https://opendev.org/openstack/neutron/commit/ed048638f4ef6a1799a1dcbd8819627b3fa8e75e
Submitter: "Zuul (22348)"
Branch: master

commit ed048638f4ef6a1799a1dcbd8819627b3fa8e75e
Author: Brian Haley 
Date:   Tue Jan 10 16:40:29 2023 -0500

Add text to WSGI config guide for debugging

Add a "broken pipe" example error message to the WSGI guide in the
rpc_workers section; it could help diagnose agent issues.

Trivialfix

Change-Id: I6e95614bce42124f125be4c23fff0d923ea9ebcc
Closes-bug: #1855919
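
For reference, both options discussed in this thread live in neutron.conf
on the controller. A minimal, illustrative sketch (the values are examples
only and need tuning per deployment):

  # /etc/neutron/neutron.conf (illustrative values, not a recommendation)
  [DEFAULT]
  # More RPC workers help neutron-server keep up with agent load; too few
  # can leave metadata-agent calls waiting until the client side gives up,
  # which surfaces as "broken pipe" in neutron-metadata-agent.log.
  rpc_workers = 8
  # Seconds to wait for an RPC reply before MessagingTimeout is raised.
  rpc_response_timeout = 60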


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1855919

Title:
  broken pipe errors cause neutron metadata agent to fail

Status in neutron:
  Fix Released

Bug description:
  After we increased the number of computes to 200, we started seeing
  "broken pipe" errors in neutron-metadata-agent.log on the controllers.
  After a neutron restart the errors are reduced, but they then increase
  until the log is mostly errors, the neutron metadata service fails, and
  VMs cannot boot. Another symptom is that unacked RMQ messages build up
  in the q-plugin queue (see the quick checks sketched after the log
  excerpts below). This is the first error we see; it occurs as the
  server is starting:

  
  2019-12-10 10:56:01.942 1838536 INFO eventlet.wsgi.server [-] (1838536) wsgi starting up on http:/var/lib/neutron/metadata_proxy
  2019-12-10 10:56:01.943 1838538 INFO eventlet.wsgi.server [-] (1838538) wsgi starting up on http:/var/lib/neutron/metadata_proxy
  2019-12-10 10:56:01.945 1838539 INFO eventlet.wsgi.server [-] (1838539) wsgi starting up on http:/var/lib/neutron/metadata_proxy
  2019-12-10 10:56:21.138 1838538 INFO eventlet.wsgi.server [-] Traceback (most recent call last):
    File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 521, in handle_one_response
      write(b''.join(towrite))
    File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 462, in write
      wfile.flush()
    File "/usr/lib/python2.7/socket.py", line 307, in flush
      self._sock.sendall(view[write_offset:write_offset+buffer_size])
    File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 390, in sendall
      tail = self.send(data, flags)
    File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 384, in send
      return self._send_loop(self.fd.send, data, flags)
    File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 371, in _send_loop
      return send_method(data, *args)
  error: [Errno 32] Broken pipe

  2019-12-10 10:56:21.138 1838538 INFO eventlet.wsgi.server [-] 10.195.74.25, "GET /latest/meta-data/instance-id HTTP/1.0" status: 200  len: 0 time: 19.0296111
  2019-12-10 10:56:25.059 1838516 INFO eventlet.wsgi.server [-] 10.195.74.28, "GET /latest/meta-data/instance-id HTTP/1.0" status: 200  len: 146 time: 0.2840948
  2019-12-10 10:56:25.181 1838529 INFO eventlet.wsgi.server [-] 10.195.74.68, "GET /latest/meta-data/instance-id HTTP/1.0" status: 200  len: 146 time: 0.2695429
  2019-12-10 10:56:25.259 1838518 INFO eventlet.wsgi.server [-] 10.195.74.28, "GET /latest/meta-data/instance-id HTTP/1.0" status: 200  len: 146 time: 0.1980510
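
  The traceback above is the generic symptom of writing to a socket whose
  peer has already hung up: the first request above took ~19 seconds and
  returned len: 0, presumably because the requesting side gave up while
  the agent was stuck waiting on RPC. A minimal Python sketch
  (illustrative only, not neutron code) that reproduces the same errno 32:

    # Peer disconnects before the server writes its reply, so the write
    # eventually fails with EPIPE ("Broken pipe").
    import socket

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)

    client = socket.create_connection(server.getsockname())
    conn, _ = server.accept()
    client.close()                      # the client gives up waiting

    try:
        for _ in range(10):             # early writes may still be buffered
            conn.sendall(b"HTTP/1.0 200 OK\r\n\r\n")
    except OSError as exc:
        print(exc)                      # e.g. [Errno 32] Broken pipe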

  Then we see some "call queues" warnings and the threshold increases to
  40:

  2019-12-10 10:56:31.414 1838515 WARNING oslo_messaging._drivers.amqpdriver [-] Number of call queues is 11, greater than warning threshold: 10. There could be a leak. Increasing threshold to: 20

  Next we see RPC timeout errors:

  2019-12-10 10:57:02.043 1838520 WARNING oslo_messaging._drivers.amqpdriver [-] Number of call queues is 11, greater than warning threshold: 10. There could be a leak. Increasing threshold to: 20
  2019-12-10 10:57:02.059 1838534 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 37 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 1ed3e021607e466f8b9b84cd3b05b188
  2019-12-10 10:57:02.059 1838534 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 1ed3e021607e466f8b9b84cd3b05b188
  2019-12-10 10:57:02.285 1838521 INFO eventlet.wsgi.server [-] 10.195.74.27, "GET /latest/meta-data/instance-id HTTP/1.0" status: 200  len: 146 time: 0.7959940

  2019-12-10 10:57:16.215 1838531 WARNING oslo_messaging._drivers.amqpdriver [-] Number of call queues is 21, greater than warning threshold: 20. There could be a leak. Increasing threshold to: 40

  2019-12-10 10:57:17.339 1838539 WARNING oslo_messaging._drivers.amqpdriver [-] Number of call queues is 11, greater than warning threshold: 10. There could be a leak. Increasing threshold to: 20

  2019-12-10 10:57:24.838 1838524 INFO eventlet.wsgi.server [-] 10.195.73.242, "GET /latest/meta-data/instance-id HTTP/1.0" status: 200  len: 146 time: 0.6842020
  2019-12-10 10:57:24.882 1838524 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 3 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
[Yahoo-eng-team] [Bug 1855919] Re: broken pipe errors cause neutron metadata agent to fail

2019-12-15 Thread YAMAMOTO Takashi
Lowering priority as there's a workaround. Of course, documentation
improvements are welcome.

** Changed in: neutron
   Importance: High => Low

** Changed in: neutron
   Status: New => Opinion

[Yahoo-eng-team] [Bug 1855919] Re: broken pipe errors cause neutron metadata agent to fail

2019-12-17 Thread OpenStack Infra
Fix proposed to branch: master
Review: https://review.opendev.org/699372

** Changed in: neutron
   Status: Opinion => In Progress
