[Yahoo-eng-team] [Bug 1992161] [NEW] Unknown quota resource security_group_rule in neutron-rpc-server

2022-10-07 Thread Johannes Kulik
Public bug reported:

When restarting our linuxbridge-agents, we see exceptions for some of the
networks: "Unknown quota resources ['security_group_rule']". This stops the
linuxbridge-agent from fully bringing up those networks.

Prerequisites:
* run api-server and rpc-server in different processes
  We have neutron-server running with uWSGI and start the neutron-rpc-server
  in another container.

Steps to reproduce:
* have a project with server/network/ports
* have an unused default security group
* delete the default security group
* restart the appropriate linuxbridge-agent

Version:
* Ussuri with custom patches on top: https://github.com/sapcc/neutron

Expected behavior:
linuxbridge-agent should bring up all networks even if the user deleted the
default security group.

Either don't create a default security group when the request comes in via the
linuxbridge-agent instead of the API, or make the quota resources available in
the rpc-server so the default security group can be created there.

Creating/updating a port or creating a network via the API will create the
default security group and fix the problem for the linuxbridge-agent, too. I
just don't think it's acceptable to require the user/admin to perform API
actions because the user did something they maybe shouldn't have.

We've also seen the same exception from a dhcp-agent. Attached are tracebacks
from both the linuxbridge-agent and the dhcp-agent.

Trying to debug this, we found that no quota resources are registered in the
neutron-rpc-server. This can be seen via the eventlet backdoor with these
commands:
  from neutron.quota import resource_registry
  resource_registry.get_all_resources()
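
For illustration, here is a minimal sketch of the second option (registering
the quota resources in the rpc-server at startup). It assumes that
register_resource_by_name() is the same helper the API-side extension loading
uses to populate the registry; treat the exact call and the place to hook it
in as assumptions rather than a tested patch:

  # Hedged sketch: teach the rpc-server about the security-group quota
  # resources so that creating the default security group no longer fails
  # with "Unknown quota resources".
  from neutron.quota import resource_registry

  def register_sg_quota_resources():
      for name in ('security_group', 'security_group_rule'):
          # get_all_resources() is the same call used in the eventlet
          # backdoor above; in our rpc-server it returns an empty dict.
          if name not in resource_registry.get_all_resources():
              resource_registry.register_resource_by_name(name)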

** Affects: neutron
 Importance: Undecided
 Status: New

** Attachment added: "tracebacks from dhcp-agent and linuxbridge agent calling
   neutron-rpc-server"
   https://bugs.launchpad.net/bugs/1992161/+attachment/5622035/+files/rpc-no-default-security-group-creation.txt



[Yahoo-eng-team] [Bug 1989361] Re: extension using collection_actions and collection_methods with path_prefix doesn't get proper URLs

2022-09-13 Thread Johannes Kulik
Looking a little more into it, the tests [0] actually always have a leading
"/" in their "path_prefix", which works fine because the "routes" library
calls "stripslashes()" in the "resource()" call, so we don't end up with
double slashes.

[0] https://github.com/sapcc/neutron/blob/64bef10cd97d1f56647a4d20a7ce0644c18b8ece/neutron/tests/unit/api/test_extensions.py#L237
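
To illustrate why the leading "/" in the tests is harmless, here is a small
standalone sketch against the "routes" library (the resource and controller
names are made up for the example):

  from routes import Mapper

  mapper = Mapper()
  # resource() runs the assembled path through stripslashes(), so a
  # path_prefix that already starts with "/" does not produce "//" routes.
  mapper.resource('thing', 'things', path_prefix='/agents')

  print(mapper.match('/agents/things', environ={'REQUEST_METHOD': 'GET'}))
  # -> something like {'controller': 'things', 'action': 'index'}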

** Changed in: neutron
   Status: New => Invalid



[Yahoo-eng-team] [Bug 1989361] [NEW] extension using collection_actions and collection_methods with path_prefix doesn't get proper URLs

2022-09-12 Thread Johannes Kulik
Public bug reported:

We're creating a new extension downstream to add some special-sauce API
endpoints. During that, we tried to use "collection_actions" to create
some special actions for our resource. Those ended up being uncallable,
always returning a 404, because the call was interpreted as a standard
"update" call instead of being dispatched to our special function.

We debugged this and it turns out the Route object created when
registering the API endpoint in [0] ff. doesn't contain a "/" at the
start of its regexp. Therefore, it never matches.

This seems to come from the fact that we - unlike e.g. the quotasv2
extension [1] - have to set a "path_prefix".

Looking at the underlying "routes" library, a "/" is automatically prefixed
for the "resource()" call [2], while a "submapper()" call needs the
"path_prefix" to already contain the leading "/", as exemplified in [3].

Therefore, I propose to prepend a "/" to the "path_prefix" for the code
handling "collection_actions" and "collection_methods" and will open a
review-request for this.

[0] https://github.com/sapcc/neutron/blob/64bef10cd97d1f56647a4d20a7ce0644c18b8ece/neutron/api/extensions.py#L159
[1] https://github.com/sapcc/neutron/blob/64bef10cd97d1f56647a4d20a7ce0644c18b8ece/neutron/extensions/quotasv2.py#L210-L215
[2] https://github.com/bbangert/routes/blob/main/routes/mapper.py#L1126-L1132
[3] https://github.com/bbangert/routes/blob/main/routes/mapper.py#L78
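
As a quick standalone illustration of the submapper() side of this (the names
are invented and this only sketches the behaviour described above, it is not
neutron code):

  from routes import Mapper

  # Without a leading "/" in path_prefix the generated regexp starts with
  # "agents/..." and can never match a request path:
  mapper = Mapper()
  sub = mapper.submapper(path_prefix='agents')
  sub.connect('special', '/things/special', action='special')
  print(mapper.match('/agents/things/special'))   # -> None, hence the 404

  # With the "/" prepended, as proposed, the route matches:
  mapper = Mapper()
  sub = mapper.submapper(path_prefix='/agents')
  sub.connect('special', '/things/special', action='special')
  print(mapper.match('/agents/things/special'))   # -> {'action': 'special'}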

** Affects: neutron
 Importance: Undecided
 Status: New



[Yahoo-eng-team] [Bug 1949767] [NEW] FIP ports count into quota as they get a project_id set

2021-11-04 Thread Johannes Kulik
Public bug reported:

With
https://github.com/openstack/neutron/commit/d0c172afa6ea38e94563afb4994471420b27cddf
Neutron started adding a "project_id" to a FIP's external port, even though
https://github.com/openstack/neutron/blob/f97baa0b16687453735e46e7a0f73fe03d7d4db7/neutron/db/l3_db.py#L326
states that this is "intentionally not set".

This makes the ports visible to the customer in "openstack port list" and
lets the ports count against their port quota, which was not the case
pre-Train.

Is this change intentional?
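
To make the quota effect concrete, here is a toy illustration (plain Python,
not neutron's actual quota counting code) of why the project_id on the FIP's
external port changes the result:

  # Quota usage is essentially "count the resources owned by this project".
  ports = [
      {'name': 'instance-port', 'project_id': 'customer-project'},
      # pre-Train: the FIP's external port intentionally has no project_id
      {'name': 'fip-external-port', 'project_id': ''},
  ]

  def ports_counted_against_quota(ports, project_id):
      return [p for p in ports if p['project_id'] == project_id]

  print(len(ports_counted_against_quota(ports, 'customer-project')))  # 1

  # After commit d0c172a the external port carries the customer's project_id,
  # so the same lookup now counts it (and "openstack port list" shows it):
  ports[1]['project_id'] = 'customer-project'
  print(len(ports_counted_against_quota(ports, 'customer-project')))  # 2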

** Affects: neutron
 Importance: Undecided
 Status: New



[Yahoo-eng-team] [Bug 1930406] [NEW] parallel volume-attachment requests might starve out nova-api for others

2021-06-01 Thread Johannes Kulik
Public bug reported:

When doing volume attachments, nova-api does an RPC call (with a
long_rpc_timeout) into nova-compute to reserve_block_device_name(). This
takes a lock on the instance. If another volume attachment is already in
progress, which also holds the instance lock, nova-api's RPC call needs to
wait.

Having RPC calls in nova-api that can take a long time will block the
process handling the request. If a project does a lot of volume
attachments (e.g. for a k8s workload with > 10 attachments per instance),
this can starve out other users of nova-api by occupying all available
processes.

When running nova-api with eventlet, a small number of processes can
handle a lot of requests in parallel, and a few blocking RPC calls don't
matter too much.

When switching to uWSGI, the number of processes would have to be
increased drastically to accommodate that - unless it's possible to
map those requests to threads and use a high number of threads instead.

What's the recommended way to run nova-api on uWSGI to handle this? A low
number of processes with a high number of threads to mimic eventlet?
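
As a rough back-of-the-envelope sketch of the concern (the numbers are made
up, only the shape of the calculation matters; "processes" and "threads" refer
to the uWSGI options of the same name):

  # Each synchronous worker slot serves exactly one request at a time, so a
  # blocking reserve_block_device_name() call pins a slot for the whole RPC
  # round-trip (up to long_rpc_timeout).
  processes = 8            # assumed worker count
  threads_per_process = 1  # assumed: no request threading

  slots = processes * threads_per_process
  print(slots)   # 8 -> eight parallel volume attachments block everyone else

  # The option asked about above: keep processes low and threads high,
  # e.g. 2 processes x 32 threads = 64 concurrent slots, which is closer to
  # what eventlet's green threads give a single process.
  print(2 * 32)  # 64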

** Affects: nova
 Importance: Undecided
 Status: New



[Yahoo-eng-team] [Bug 1915815] [NEW] vmware: Rescue impossible if VM folder renamed

2021-02-16 Thread Johannes Kulik
Public bug reported:

Steps to reproduce
==================

* storage-vMotion a VM (this renames the folder to the VM name, i.e. "$uuid"
  to "$name ($uuid)")
* openstack server rescue $uuid

Actual Result
=============

Nova's vmware driver raises an exception:

Traceback (most recent call last):
  File "/nova-base-source/nova-base-archive-stable-queens-m3/nova/compute/manager.py", line 3621, in rescue_instance
    rescue_image_meta, admin_password)
  File "/nova-base-source/nova-base-archive-stable-queens-m3/nova/virt/vmwareapi/driver.py", line 601, in rescue
    self._vmops.rescue(context, instance, network_info, image_meta)
  File "/nova-base-source/nova-base-archive-stable-queens-m3/nova/virt/vmwareapi/vmops.py", line 1802, in rescue
    vi.cache_image_path, rescue_disk_path)
  File "/nova-base-source/nova-base-archive-stable-queens-m3/nova/virt/vmwareapi/ds_util.py", line 311, in disk_copy
    session._wait_for_task(copy_disk_task)
  File "/nova-base-source/nova-base-archive-stable-queens-m3/nova/virt/vmwareapi/driver.py", line 725, in _wait_for_task
    return self.wait_for_task(task_ref)
  File "/plugins/openstack-base-plugin-oslo-vmware-archive-stable-queens-m3/oslo_vmware/api.py", line 402, in wait_for_task
    return evt.wait()
  File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/eventlet/event.py", line 121, in wait
    return hubs.get_hub().switch()
  File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
    return self.greenlet.switch()
  File "/plugins/openstack-base-plugin-oslo-vmware-archive-stable-queens-m3/oslo_vmware/common/loopingcall.py", line 75, in _inner
    self.f(*self.args, **self.kw)
  File "/plugins/openstack-base-plugin-oslo-vmware-archive-stable-queens-m3/oslo_vmware/api.py", line 449, in _poll_task
    raise exceptions.translate_fault(task_info.error)
FileNotFoundException: File [eph-bb145-3] 551e5570-cf70-4ca0-9f37-e50210c4d2f5/ was not found


Expected Result
===============

VM is put into rescue mode and boots the rescue image.

Environment
===========

This happened on Queens, but the same code is still there on master:
https://github.com/openstack/nova/blob/a7dd1f8881484ba0bf4270dd48109c2be142c333/nova/virt/vmwareapi/vmops.py#L1228-L1229
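
To spell out the mismatch (a hedged illustration with made-up names and an
illustrative rescue-disk filename, not the actual vmops code): the rescue path
is built from the instance UUID, while the folder on the datastore no longer
carries that name after a storage vMotion:

  instance_uuid = '551e5570-cf70-4ca0-9f37-e50210c4d2f5'
  instance_name = 'my-server'
  datastore = 'eph-bb145-3'

  # What the rescue code assumes the folder is called:
  assumed_folder = instance_uuid
  # What the folder is actually called after a storage vMotion:
  actual_folder = '%s (%s)' % (instance_name, instance_uuid)

  rescue_disk_path = '[%s] %s/%s-rescue.vmdk' % (datastore, assumed_folder,
                                                 instance_uuid)
  print(rescue_disk_path)
  # The disk copy task then fails with the FileNotFoundException shown above,
  # because '[eph-bb145-3] 551e5570-.../' no longer exists on the datastore.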

** Affects: nova
 Importance: Undecided
 Assignee: Johannes Kulik (jkulik)
 Status: New

** Changed in: nova
 Assignee: (unassigned) => Johannes Kulik (jkulik)


[Yahoo-eng-team] [Bug 1870096] [NEW] soft-affinity weight not normalized based on server group's maximum

2020-04-01 Thread Johannes Kulik
Public bug reported:

Description
===========

When using soft-affinity to schedule instances on the same host, the
weight is unexpectedly low if a server was previously scheduled to any
server-group with more members on a host.

Steps to reproduce
==================

Do not restart nova-scheduler during these steps, or the bug doesn't appear.

* Create a server-group with soft-affinity (let's call it A)
* Create 6 servers in server-group A, one after the other so they end up on
  the same host.
* Create another server-group with soft-affinity (B)
* Create 1 server in server-group B
* Create 1 more server in server-group B and look at the scheduler's weights
  assigned to the hosts by the ServerGroupSoftAffinityWeigher.

Expected result
===============

The weight assigned to the host by the ServerGroupSoftAffinityWeigher
should be 1, as that host holds the maximum number of server-group B's
instances (the one we created there before).

Actual result
=============

The weight assigned to the host by the ServerGroupSoftAffinityWeigher is 0.2,
as the maximum number of instances ever encountered on a host (across all
server-groups) is 5.

Environment
===========

We noticed this on a Queens version of nova a year ago. I can't give the
exact commit anymore, but the code still looks broken in current master.

I've opened a review request to fix this bug here:
https://review.opendev.org/#/c/713863/
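
As a toy model of the observed behaviour (this mimics the effect we see, it is
not the weigher's actual code):

  class RunningMaxWeigher:
      # Normalizes against the largest member count it has ever seen,
      # which is what the reported numbers suggest is happening.
      def __init__(self):
          self.max_seen = 0  # survives across scheduling requests

      def weigh(self, members_on_host):
          self.max_seen = max(self.max_seen, members_on_host)
          return float(members_on_host) / self.max_seen

  w = RunningMaxWeigher()
  w.weigh(5)          # server-group A filled a host with 5 members -> 1.0
  print(w.weigh(1))   # server-group B's single member -> 0.2 instead of 1.0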

** Affects: nova
 Importance: Undecided
 Assignee: Johannes Kulik (jkulik)
 Status: In Progress
