[Yahoo-eng-team] [Bug 2020028] [NEW] evacuate an instance on non-shared storage succeeded and boot image is rebuilt

2023-05-17 Thread norman shen
Public bug reported:

Description
===

evacuate an instance on non-shared storage succeeded and boot image is
rebuilt

Steps to reproduce
==

1. Create a two-compute-node cluster without shared storage
2. Boot an image-backed virtual machine
3. Shut down the compute node where the VM is running
4. Evacuate the instance to another node

Expected:

evacuate fails

Actual:

evacuate succeeds and the boot image is rebuilt.

Version
===

Using nova victoria version

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2020028

Title:
  evacuate an instance on non-shared storage succeeded and boot image is
  rebuilt

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===

  evacuate an instance on non-shared storage succeeded and boot image is
  rebuilt

  Steps to reproduce
  ==

  1. Create a two-compute-node cluster without shared storage
  2. Boot an image-backed virtual machine
  3. Shut down the compute node where the VM is running
  4. Evacuate the instance to another node

  Expected:

  evacuate fails

  Actual:

  evacuate succeeds and the boot image is rebuilt.

  Version
  ===

  Using nova victoria version

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2020028/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1999126] [NEW] resize to the same host unexpectedly clears host_info cache

2022-12-07 Thread norman shen
Public bug reported:

Description
===
We are using Victoria nova and find that after a same-host cold migrate, a
subsequent cold migrate can break the anti-affinity policy.

Steps to reproduce
==

1 provision an OpenStack cluster with 2 compute nodes
2 create a server group with an anti-affinity rule
3 create two VMs bound to the server group
4 disable compute node B
5 cold migrate the server on node A and confirm it
6 right after the confirm succeeds, migrate the server on node B
7 check the VM hosts; if they are not on the same host, repeat steps 5 - 6

Expected result
===

servers will reside on different nodes

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1999126

Title:
  resize to the same host unexpectedly clears host_info cache

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===
  We are using Victoria nova and find that after a same-host cold migrate, a
  subsequent cold migrate can break the anti-affinity policy.

  Steps to reproduce
  ==

  1 provision an OpenStack cluster with 2 compute nodes
  2 create a server group with an anti-affinity rule
  3 create two VMs bound to the server group
  4 disable compute node B
  5 cold migrate the server on node A and confirm it
  6 right after the confirm succeeds, migrate the server on node B
  7 check the VM hosts; if they are not on the same host, repeat steps 5 - 6

  Expected result
  ===

  servers will reside on different nodes

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1999126/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1996966] [NEW] get_machine_ips took too long to complete

2022-11-17 Thread norman shen
Public bug reported:

Description
===

I found that get_machine_ips can take too long before returning IP addresses.
There are around 160 instances with about 200 NICs, which results in around
1000 network adapters on the host. Each call to netifaces.ifaddresses takes
roughly 0.2 ~ 0.5 seconds.
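
A minimal sketch (not nova's actual implementation) of a get_machine_ips-style
loop shows why the runtime scales with the number of host network adapters:
one netifaces.ifaddresses call per adapter, so ~1000 adapters at 0.2 ~ 0.5
seconds each adds up to hundreds of seconds.

```python
# Minimal sketch (not nova's actual code) of a get_machine_ips-style helper:
# one netifaces.ifaddresses() call per interface, so the total runtime grows
# linearly with the ~1000 network adapters present on the host.
import time

import netifaces


def get_machine_ips_sketch():
    addresses = []
    for iface in netifaces.interfaces():                 # ~1000 entries here
        try:
            iface_addrs = netifaces.ifaddresses(iface)   # the slow call
        except ValueError:
            continue
        for addr in iface_addrs.get(netifaces.AF_INET, []):
            if 'addr' in addr:
                addresses.append(addr['addr'])
    return addresses


if __name__ == '__main__':
    start = time.monotonic()
    ips = get_machine_ips_sketch()
    print('%d addresses in %.1f seconds' % (len(ips), time.monotonic() - start))
```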

Steps to reproduce
==

1. use an arm64 host or a host whose load is high
2. boot 200 instances with the neutron hybrid SG driver enabled
3. restart nova-compute

Expected result
===

get_machine_ips should take no more than 2 seconds to return

Actual result
=

Took around 500 seconds

Environment
===
1. Phytium arm64 as compute node

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1996966

Title:
  get_machine_ips took too long to complete

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===

  I found that get_machine_ips can take too long before returning IP
  addresses. There are around 160 instances with about 200 NICs, which results
  in around 1000 network adapters on the host. Each call to
  netifaces.ifaddresses takes roughly 0.2 ~ 0.5 seconds.

  Steps to reproduce
  ==

  1. use an arm64 host or a host whose load is high
  2. boot 200 instances with the neutron hybrid SG driver enabled
  3. restart nova-compute

  Expected result
  ===

  get_machine_ips should take no more than 2 seconds to return

  Actual result
  =

  Took around 500 seconds

  Environment
  ===
  1. Phytium arm64 as compute node

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1996966/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1995229] [NEW] [Opinion] Update instance availability_zone when reset host AZ

2022-10-30 Thread norman shen
Public bug reported:

Description
===

Instance.availability_zone is set in nova.conductor while scheduling. But a
host's availability_zone can be modified when the host is added to an
aggregate; instance.availability_zone will not be changed, and instead
'availability_zone' is cached in applications like memcached.

The problem with this strategy is that /servers/detail costs around an extra 1
second when the returned list contains more than 500 servers.

So my proposal is to update instance.availability_zone when a host is
added to a new aggregate.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1995229

Title:
  [Opinion] Update instance availability_zone when reset host AZ

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===

  Instance.availability_zone is set in nova.conductor while scheduling. But a
  host's availability_zone can be modified when the host is added to an
  aggregate; instance.availability_zone will not be changed, and instead
  'availability_zone' is cached in applications like memcached.

  The problem with this strategy is that /servers/detail costs around an extra
  1 second when the returned list contains more than 500 servers.

  So my proposal is to update instance.availability_zone when a host is
  added to a new aggregate.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1995229/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1995028] [NEW] list os-service causing reconnects to memcached all the time

2022-10-27 Thread norman shen
Public bug reported:

Description
===

We are running a Victoria OpenStack cluster (Python 3), and I observe
that every time an openstack compute service list is executed, nova-api
creates a new connection to memcached. There are actually several
reasons for this behavior:

1. when running natively with eventlet's WSGI server, a new coroutine is
created to host every web request, and this causes keystonemiddleware's
auth_token, which uses python-memcached, to reconnect to memcached all the time
2. os-services triggers nova.availability_zones.set_availability_zones, which
updates the cache every time; since cells v2 is enabled, this method runs in a
coroutine as well
3. python-memcached's Client inherits from threading.local, which is
monkey-patched to use eventlet's implementation, so every coroutine context
creates a new connection (see the sketch after this list)
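
A minimal sketch of reason 3 (illustrative only; FakeClient below is a
stand-in, not python-memcached's real Client): once eventlet monkey-patches
the stdlib, a threading.local subclass keeps separate state per greenthread,
so every request coroutine opens its own connection instead of reusing one.

```python
# Minimal sketch of reason 3 above. FakeClient is only a stand-in for
# memcache.Client, which also subclasses threading.local: after
# eventlet.monkey_patch(), each greenthread sees an empty local object
# and therefore "reconnects" instead of reusing an existing socket.
import eventlet
eventlet.monkey_patch()

import threading

connections_made = []


class FakeClient(threading.local):
    def get_socket(self):
        if not hasattr(self, 'sock'):           # empty in every new greenthread
            connections_made.append(object())   # pretend: a new TCP connection
            self.sock = connections_made[-1]
        return self.sock


client = FakeClient()
pool = eventlet.GreenPool()
for _ in range(5):                              # five concurrent "API requests"
    pool.spawn(client.get_socket)
pool.waitall()
print('connections made:', len(connections_made))   # prints 5, not 1
```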

Steps to reproduce
==

1. Patch def _get_socket and print connection
2. execute openstack compute service list

Expected result
===

Maintain stable connections to memcached

Actual result
=

Reconnects

Environment
===

1. devstack victoria openstack

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1995028

Title:
  list os-service causing reconnects to memcached all the time

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===

  We are running a Victoria OpenStack cluster (Python 3), and I observe
  that every time an openstack compute service list is executed, nova-api
  creates a new connection to memcached. There are actually several
  reasons for this behavior:

  1. when running natively with eventlet's WSGI server, a new coroutine is
  created to host every web request, and this causes keystonemiddleware's
  auth_token, which uses python-memcached, to reconnect to memcached all the time
  2. os-services triggers nova.availability_zones.set_availability_zones, which
  updates the cache every time; since cells v2 is enabled, this method runs in
  a coroutine as well
  3. python-memcached's Client inherits from threading.local, which is
  monkey-patched to use eventlet's implementation, so every coroutine context
  creates a new connection

  Steps to reproduce
  ==

  1. Patch def _get_socket and print connection
  2. execute openstack compute service list

  Expected result
  ===

  Maintain stable connections to memcached

  Actual result
  =

  Reconnects

  Environment
  ===

  1. devstack victoria openstack

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1995028/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1995029] [NEW] list os-service causing reconnects to memcached all the time

2022-10-27 Thread norman shen
Public bug reported:

Description
===

We are running a Victoria OpenStack cluster (Python 3), and I observe
that every time an openstack compute service list is executed, nova-api
creates a new connection to memcached. There are actually several
reasons for this behavior:

1. when running natively with eventlet's WSGI server, a new coroutine is
created to host every web request, and this causes keystonemiddleware's
auth_token, which uses python-memcached, to reconnect to memcached all the time
2. os-services triggers nova.availability_zones.set_availability_zones, which
updates the cache every time; since cells v2 is enabled, this method runs in a
coroutine as well
3. python-memcached's Client inherits from threading.local, which is
monkey-patched to use eventlet's implementation, so every coroutine context
creates a new connection

Steps to reproduce
==

1. Patch def _get_socket and print connection
2. execute openstack compute service list

Expected result
===

Maintain stable connections to memcached

Actual result
=

Reconnects

Environment
===

1. devstack victoria openstack

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1995029

Title:
  list os-service causing reconnects to memcached all the time

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===

  We are running a Victoria OpenStack cluster (Python 3), and I observe
  that every time an openstack compute service list is executed, nova-api
  creates a new connection to memcached. There are actually several
  reasons for this behavior:

  1. when running natively with eventlet's WSGI server, a new coroutine is
  created to host every web request, and this causes keystonemiddleware's
  auth_token, which uses python-memcached, to reconnect to memcached all the time
  2. os-services triggers nova.availability_zones.set_availability_zones, which
  updates the cache every time; since cells v2 is enabled, this method runs in
  a coroutine as well
  3. python-memcached's Client inherits from threading.local, which is
  monkey-patched to use eventlet's implementation, so every coroutine context
  creates a new connection

  Steps to reproduce
  ==

  1. Patch def _get_socket and print connection
  2. execute openstack compute service list

  Expected result
  ===

  Maintain stable connections to memcached

  Actual result
  =

  Reconnects

  Environment
  ===

  1. devstack victoria openstack

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1995029/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1991380] [NEW] centos 7.6 cannot access 169.254.169.254

2022-09-30 Thread norman shen
Public bug reported:

Hello,

  I am testing CentOS 7.6 on a Victoria OpenStack. In the virtual
machine, the routing table looks like below:

# ip r
default via 172.31.0.1 dev eth0
192.168.0.0/16 dev eth1 proto kernel scope link src 192.168.0.9
169.254.0.0/16 dev eth0 scope link metric 1002
169.254.0.0/16 dev eth1 scope link metric 1003

As it shows, the 169.254.0.0/16 routes seem to override the 169.254.169.254
route and cause the VM to fail to access the metadata service.

Any idea why this situation happens? Thank you.
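
For what it's worth, a small stdlib-only sketch of the longest-prefix match
over those routes shows where packets to the metadata address actually go: the
link-scope 169.254.0.0/16 entry wins over the default route, so the request is
sent out the link-local interface directly instead of via the default gateway.

```python
# Longest-prefix match over the routes shown above (stdlib-only sketch):
# 169.254.169.254 falls into the link-scope /16 route, not the default route,
# so the request never goes through the default gateway at 172.31.0.1.
import ipaddress

routes = {
    '0.0.0.0/0': 'via 172.31.0.1 dev eth0',
    '192.168.0.0/16': 'dev eth1',
    '169.254.0.0/16': 'dev eth0 scope link metric 1002',
}

dst = ipaddress.ip_address('169.254.169.254')
matching = [net for net in routes if dst in ipaddress.ip_network(net)]
best = max(matching, key=lambda net: ipaddress.ip_network(net).prefixlen)
print(best, '->', routes[best])   # 169.254.0.0/16 -> dev eth0 scope link ...
```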

** Affects: cloud-init
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1991380

Title:
  centos 7.6 cannot access 169.254.169.254

Status in cloud-init:
  New

Bug description:
  Hello,

    I am testing CentOS 7.6 on a Victoria OpenStack. In the virtual
  machine, the routing table looks like below:

  # ip r
  default via 172.31.0.1 dev eth0
  192.168.0.0/16 dev eth1 proto kernel scope link src 192.168.0.9
  169.254.0.0/16 dev eth0 scope link metric 1002
  169.254.0.0/16 dev eth1 scope link metric 1003

  As it shows, the 169.254.0.0/16 routes seem to override the 169.254.169.254
  route and cause the VM to fail to access the metadata service.

  Any idea why this situation happens? Thank you.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1991380/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1988281] [NEW] neutron dhcp agent state not consistent with real status

2022-08-31 Thread norman shen
Public bug reported:

We are observing that neutron-dhcp-agent's state deviates from its "real
state"; by real state, I mean that all hosted dnsmasq processes are running
and configured.

For example, agent A is hosting 1,000 networks. If I reboot agent A, all
dnsmasq processes are gone and the DHCP agent will try to restart every
dnsmasq; this introduces a long delay between the agent starting and the agent
handling new RabbitMQ messages. But weirdly, openstack network agent list
shows the agent as up and running, which IMO is inconsistent. I think in this
situation, openstack network agent list should report the corresponding agent
as down.

** Affects: neutron
 Importance: Undecided
 Status: New

** Description changed:

- We are observing that neutron-dhcp-agent's state is deviating from "real 
state", by saying real state, I mean  
- all hosted dnsmasq are running and configured. For example, agent A is 
hosting 1,000 networks, if I reboot agent A then all dnsmasq processes are 
gone, and dhcp agent will try to reboot every dnsmasq, this will introduce a 
long delay between agent start and agent handles new rabbitmq messages. But 
weirdly, openstack network agent list will show that the agent is up and 
running which IMO is inconsistent.
+ We are observing that neutron-dhcp-agent's state is deviating from "real
+ state", by saying real state, I mean all hosted dnsmasq are running and
+ configured.
+ 
+ For example, agent A is hosting 1,000 networks, if I reboot agent A then
+ all dnsmasq processes are gone, and dhcp agent will try to reboot every
+ dnsmasq, this will introduce a long delay between agent start and agent
+ handles new rabbitmq messages. But weirdly, openstack network agent list
+ will show that the agent is up and running which IMO is inconsistent. I
+ think under this situation, openstack network agent list should report
+ the corresponding agent to be down.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1988281

Title:
  neutron  dhcp agent state not consistent with real status

Status in neutron:
  New

Bug description:
  We are observing that neutron-dhcp-agent's state deviates from its "real
  state"; by real state, I mean that all hosted dnsmasq processes are running
  and configured.

  For example, agent A is hosting 1,000 networks. If I reboot agent A, all
  dnsmasq processes are gone and the DHCP agent will try to restart every
  dnsmasq; this introduces a long delay between the agent starting and the
  agent handling new RabbitMQ messages. But weirdly, openstack network agent
  list shows the agent as up and running, which IMO is inconsistent. I think
  in this situation, openstack network agent list should report the
  corresponding agent as down.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1988281/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1982902] [NEW] umount /run/cloud-init/tmp/tmpl5n7csdd failed

2022-07-26 Thread norman shen
Public bug reported:

Hello,

I am using cloud-init version /usr/bin/cloud-init 20.4.1-0ubuntu1~18.04.1; the
Ubuntu version is:

root@ubuntu:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.5 LTS
Release:        18.04
Codename:       bionic

I found that unmounting the config drive fails with "device busy" reported,
which further causes the temp folder to fail to be deleted.
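
A minimal sketch of the cleanup ordering visible in the traceback below (the
helpers here are made up, not cloud-init's real mount_cb/tempdir): the unmount
happens inside a temporary-directory context, so when umount fails the
tempdir's rmtree still runs, and in the real case it runs against a directory
that still has the read-only config drive mounted in it, hence the
"Read-only file system" OSError.

```python
# Sketch of the cleanup ordering in the traceback below (made-up helpers):
# the umount failure propagates through the tempdir context manager, whose
# finally block still calls shutil.rmtree -- against a directory that, in the
# real case, still contains the mounted read-only config drive.
import contextlib
import shutil
import tempfile


@contextlib.contextmanager
def tempdir_sketch():
    tdir = tempfile.mkdtemp(prefix='cloud-init-sketch-')
    try:
        yield tdir
    finally:
        shutil.rmtree(tdir)       # in the real case this hits the RO mount


@contextlib.contextmanager
def unmounter_sketch(mountpoint):
    try:
        yield mountpoint
    finally:
        # stand-in for subp(['umount', mountpoint]) exiting with "target is busy"
        raise RuntimeError('umount %s: target is busy' % mountpoint)


try:
    with tempdir_sketch() as tdir:
        with unmounter_sketch(tdir):
            pass                  # read metadata from the mounted config drive
except RuntimeError as exc:
    print('umount failure surfaced after rmtree ran:', exc)
```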


Logs are 

```
2022-07-25 02:13:01,732 - handlers.py[DEBUG]: finish: 
init-local/search-ConfigDrive: FAIL: no local data found from 
DataSourceConfigDrive
2022-07-25 02:13:01,733 - util.py[WARNING]: Getting data from  failed
2022-07-25 02:13:01,733 - util.py[DEBUG]: Getting data from  failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/temp_utils.py", line 90, in 
tempdir
yield tdir
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1687, in 
mount_cb
return ret
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1571, in 
unmounter
subp.subp(umount_cmd)
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 295, in subp
cmd=args)
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['umount', '/run/cloud-init/tmp/tmpl5n7csdd']
Exit code: 32
Reason: -
Stdout:
Stderr: umount: /run/cloud-init/tmp/tmpl5n7csdd: target is busy.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 
771, in find_source
if s.update_metadata([EventType.BOOT_NEW_INSTANCE]):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 
660, in update_metadata
result = self.get_data()
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 
279, in get_data
return_value = self._get_data()
  File 
"/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceConfigDrive.py", 
line 81, in _get_data
mtype=mtype)
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1687, in 
mount_cb
return ret
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
  File "/usr/lib/python3/dist-packages/cloudinit/temp_utils.py", line 92, in 
tempdir
shutil.rmtree(tdir, ignore_errors=rmtree_ignore_errors)
  File "/usr/lib/python3.6/shutil.py", line 486, in rmtree
_rmtree_safe_fd(fd, path, onerror)
  File "/usr/lib/python3.6/shutil.py", line 424, in _rmtree_safe_fd
_rmtree_safe_fd(dirfd, fullname, onerror)
  File "/usr/lib/python3.6/shutil.py", line 424, in _rmtree_safe_fd
_rmtree_safe_fd(dirfd, fullname, onerror)
  File "/usr/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
  File "/usr/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
os.unlink(name, dir_fd=topfd)
OSError: [Errno 30] Read-only file system: 'network_data.json'
2022-07-25 02:13:01,783 - main.py[DEBUG]: No local datasource found

```

** Affects: cloud-init
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1982902

Title:
  umount  /run/cloud-init/tmp/tmpl5n7csdd failed

Status in cloud-init:
  New

Bug description:
  Hello,

  I am using cloud-init version /usr/bin/cloud-init 20.4.1-0ubuntu1~18.04.1;
  the Ubuntu version is:

  root@ubuntu:~# lsb_release -a
  No LSB modules are available.
  Distributor ID: Ubuntu
  Description:    Ubuntu 18.04.5 LTS
  Release:        18.04
  Codename:       bionic

  I found that unmounting the config drive fails with "device busy" reported,
  which further causes the temp folder to fail to be deleted.

  
  Logs are 

  ```
  2022-07-25 02:13:01,732 - handlers.py[DEBUG]: finish: 
init-local/search-ConfigDrive: FAIL: no local data found from 
DataSourceConfigDrive
  2022-07-25 02:13:01,733 - util.py[WARNING]: Getting data from  failed
  2022-07-25 02:13:01,733 - util.py[DEBUG]: Getting data from  failed
  Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/cloudinit/temp_utils.py", line 90, in 
tempdir
  yield tdir
File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1687, in 
mount_cb
  return ret
File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
  next(self.gen)
File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1571, in 
unmounter
  subp.subp(umount_cmd)
File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 295, in subp
  cmd=args)
  cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
  Command: ['umount', '/run/cloud-init/tmp/tmpl5n7csdd']
  Exit code: 32
  Reason: -
  Stdout:
  Stderr: umount: /run/cloud-init/tmp/tmpl5n7csdd: target is busy.

  During handling of the above 

[Yahoo-eng-team] [Bug 1978827] [NEW] rebuild instance continues to flush old mpath on failure

2022-06-15 Thread norman shen
Public bug reported:

Description
===

When rebuilding an instance fails due to a potentially problematic Cinder API,
then on trying to rebuild again, nova will try to disconnect the volume again
although the path has already been cleared. This is generally OK for the rbd
backend, but it can cause problems for an FC SAN if a new volume gets attached
between the two consecutive rebuilds and consumes the previous LUN ID.


Environment


It could happen for all versions.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1978827

Title:
  rebuild instance continues to flush old mpath on failure

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===

  When rebuilding an instance fails due to a potentially problematic Cinder
  API, then on trying to rebuild again, nova will try to disconnect the volume
  again although the path has already been cleared. This is generally OK for
  the rbd backend, but it can cause problems for an FC SAN if a new volume
  gets attached between the two consecutive rebuilds and consumes the previous
  LUN ID.

  
  Environment
  

  It could happen for all versions.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1978827/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1973656] [NEW] meaning of option "router_auto_schedule" is ambiguous

2022-05-16 Thread norman shen
Public bug reported:

I found the meaning of the option "router_auto_schedule" hard to follow. A
quick code review finds it is only used in (tests excluded):

```python
def get_router_ids(self, context, host):
    """Returns IDs of routers scheduled to l3 agent on <host>

    This will autoschedule unhosted routers to l3 agent on <host> and then
    return all ids of routers scheduled to it.
    """
    if extensions.is_extension_supported(
            self.l3plugin, constants.L3_AGENT_SCHEDULER_EXT_ALIAS):
        if cfg.CONF.router_auto_schedule:
            self.l3plugin.auto_schedule_routers(context, host)
    return self.l3plugin.list_router_ids_on_host(context, host)
```

which seems to be fixing routers that have no agents associated with them. And
even if I turn this option off, routers are still properly scheduled to
agents, because

```python
@registry.receives(resources.ROUTER, [events.AFTER_CREATE],
                   priority_group.PRIORITY_ROUTER_EXTENDED_ATTRIBUTE)
def _after_router_create(self, resource, event, trigger, context,
                         router_id, router, router_db, **kwargs):
    if not router['ha']:
        return
    try:
        self.schedule_router(context, router_id)
        router['ha_vr_id'] = router_db.extra_attributes.ha_vr_id
        self._notify_router_updated(context, router_id)
    except Exception as e:
        with excutils.save_and_reraise_exception() as ctx:
            if isinstance(e, l3ha_exc.NoVRIDAvailable):
                ctx.reraise = False
                LOG.warning("No more VRIDs for router: %s", e)
            else:
                LOG.exception("Failed to schedule HA router %s.",
                              router_id)
            router['status'] = self._update_router_db(
                context, router_id,
                {'status': constants.ERROR})['status']
```

seems to not respect this option.

So IMO router_auto_schedule might better be renamed to something like
`fix_dangling_routers` etc. and be turned off if the user wants to fix wrongly
scheduled routers manually. The reason is that auto-scheduling routers for
each agent is pretty expensive for a relatively large deployment with around
10,000 routers.

** Affects: neutron
 Importance: Undecided
 Status: New

** Description changed:

  I found meaning of option "router_auto_schedule" is hard to follow.  A
- quick code review finds it is only used at
+ quick code review finds it is only used at (tests excluded)
  
  ```python
- def get_router_ids(self, context, host):
- """Returns IDs of routers scheduled to l3 agent on 
+ def get_router_ids(self, context, host):
+ """Returns IDs of routers scheduled to l3 agent on 
  
- This will autoschedule unhosted routers to l3 agent on  and then
- return all ids of routers scheduled to it.
- """
- if extensions.is_extension_supported(
- self.l3plugin, constants.L3_AGENT_SCHEDULER_EXT_ALIAS):
- if cfg.CONF.router_auto_schedule:
- self.l3plugin.auto_schedule_routers(context, host)
- return self.l3plugin.list_router_ids_on_host(context, host)
+ This will autoschedule unhosted routers to l3 agent on  and then
+ return all ids of routers scheduled to it.
+ """
+ if extensions.is_extension_supported(
+ self.l3plugin, constants.L3_AGENT_SCHEDULER_EXT_ALIAS):
+ if cfg.CONF.router_auto_schedule:
+ self.l3plugin.auto_schedule_routers(context, host)
+ return self.l3plugin.list_router_ids_on_host(context, host)
  ```
  
  which seems to be fixing router without agents associated with it. And
  even if I turn this option off, router is still able to be properly
  scheduled to agents. because
  
  ```python
- @registry.receives(resources.ROUTER, [events.AFTER_CREATE],
-priority_group.PRIORITY_ROUTER_EXTENDED_ATTRIBUTE)
- def _after_router_create(self, resource, event, trigger, context,
-  router_id, router, router_db, **kwargs):
- if not router['ha']:
- return
- try:
- self.schedule_router(context, router_id)
- router['ha_vr_id'] = router_db.extra_attributes.ha_vr_id
- self._notify_router_updated(context, router_id)
- except Exception as e:
- with excutils.save_and_reraise_exception() as ctx:
- if isinstance(e, l3ha_exc.NoVRIDAvailable):
- ctx.reraise = False
- LOG.warning("No more VRIDs for router: %s", e)
- else:
- LOG.exception("Failed to schedule HA router %s.",
-   router_id)
- router['status'] = self._update_router_db(
- context, router_id,
- 

[Yahoo-eng-team] [Bug 1973576] [NEW] remove eager subquery load for DistributedPortBinding

2022-05-16 Thread norman shen
Public bug reported:

We observe excessive DB calls to load DistributedPortBindings. We have enabled
DVR and have some huge virtual routers with around 60 router interfaces
scheduled on around 200 compute nodes.
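
A minimal SQLAlchemy sketch (illustrative only; these are not neutron's real
models) of what an eagerly subquery-loaded relationship costs: every query for
the parent rows also issues a second query for all of their bindings, whether
or not the caller needs them, which is what makes the port-binding load so
heavy for big DVR routers.

```python
# Illustrative SQLAlchemy sketch (not neutron's real models): with an eagerly
# subquery-loaded relationship, every query for the parent rows issues a
# second SELECT covering *all* of their bindings, even if the caller never
# touches them. Run with echo=True to see the extra query.
import sqlalchemy as sa
from sqlalchemy import orm

Base = orm.declarative_base()


class Port(Base):
    __tablename__ = 'ports'
    id = sa.Column(sa.Integer, primary_key=True)
    bindings = orm.relationship('PortBinding', lazy='subquery')


class PortBinding(Base):
    __tablename__ = 'port_bindings'
    id = sa.Column(sa.Integer, primary_key=True)
    port_id = sa.Column(sa.Integer, sa.ForeignKey('ports.id'))
    host = sa.Column(sa.String(64))


engine = sa.create_engine('sqlite://', echo=True)
Base.metadata.create_all(engine)

with orm.Session(engine) as session:
    session.add(Port(id=1, bindings=[PortBinding(host='compute-%d' % i)
                                     for i in range(3)]))
    session.commit()
    # This single query for ports also triggers a second SELECT covering all
    # matching port_bindings rows -- the part that gets huge for DVR routers.
    session.query(Port).all()
```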

We saw something like

```console
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server Traceback (most 
recent call last):
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/sqlalchemy/engine/base.py", 
line 1193, in _execute_context
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server context)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/sqlalchemy/engine/default.py", 
line 509, in do_execute
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server 
cursor.execute(statement, parameters)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/pymysql/cursors.py", line 170, 
in execute
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server result = 
self._query(query)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/pymysql/cursors.py", line 328, 
in _query
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server conn.query(q)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 
516, in query
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server 
self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 
727, in _read_query_result
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server result.read()
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 
1073, in read
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server 
self._read_result_packet(first_packet)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 
1143, in _read_result_packet
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server 
self._read_rowdata_packet()
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 
1177, in _read_rowdata_packet
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server packet = 
self.connection._read_packet()
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 
673, in _read_packet
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server recv_data = 
self._read_bytes(bytes_to_read)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 
702, in _read_bytes
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server 
CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server 
pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during 
query')
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server The above exception 
was the direct cause of the following exception:
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server Traceback (most 
recent call last):
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", 
line 166, in _process_incoming
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server res = 
self.dispatcher.dispatch(message)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py",
 line 265, in dispatch
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server return 
self._do_dispatch(endpoint, method, ctxt, args)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py",
 line 194, in _do_dispatch
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server result = 
func(ctxt, **new_args)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/neutron/api/rpc/handlers/l3_rpc.py",
 line 104, in get_router_ids
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server 
self.l3plugin.auto_schedule_routers(context, host)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File 

[Yahoo-eng-team] [Bug 1968837] [NEW] too many l3 dvr agents got notifications after a server got deleted

2022-04-13 Thread norman shen
Public bug reported:


We are using Rocky 13.0.6 neutron, which seems to remove the router namespace
if the retry limit is hit.

After some investigation, it seems that deleting a server which already has a
floating IP address associated with it causes a broadcast notification to all
related routers. In our case, we have around 300 compute nodes and they all
have l3 DVR agents running.

The related code snippet is
https://github.com/openstack/neutron/blob/bb4c26eb7245465bf7cea7e0f07342601eb78ede/neutron/db/l3_db.py#L1999,
so my question is: is it still relevant to have it if DVR is enabled?

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1968837

Title:
  too many l3 dvr agents got notifications after a server got deleted

Status in neutron:
  New

Bug description:
  
  We are using Rocky 13.0.6 neutron, which seems to remove the router
  namespace if the retry limit is hit.

  After some investigation, it seems that deleting a server which already has
  a floating IP address associated with it causes a broadcast notification to
  all related routers. In our case, we have around 300 compute nodes and they
  all have l3 DVR agents running.

  The related code snippet is
  https://github.com/openstack/neutron/blob/bb4c26eb7245465bf7cea7e0f07342601eb78ede/neutron/db/l3_db.py#L1999,
  so my question is: is it still relevant to have it if DVR is enabled?

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1968837/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1964587] Re: default video driver

2022-03-11 Thread norman shen
** Changed in: nova
   Status: Incomplete => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1964587

Title:
  default video driver

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Hello, I saw that on the amd64 platform nova defaults to using cirrus as the
  video driver, and Windows virtual machines get a small resolution.

  The virtio video driver could allow a larger resolution. And it looks like
  the driver type cannot be set by the user.

  My question is: why use cirrus as the default, and do we have a plan to
  adopt virtio? Thank you.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1964587/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1964587] [NEW] default video driver

2022-03-11 Thread norman shen
Public bug reported:

Hello, I saw that on the amd64 platform nova defaults to using cirrus as the
video driver, and Windows virtual machines get a small resolution.

The virtio video driver could allow a larger resolution. And it looks like the
driver type cannot be set by the user.

My question is: why use cirrus as the default, and do we have a plan to adopt
virtio? Thank you.

** Affects: nova
 Importance: Undecided
 Status: Incomplete

** Changed in: nova
   Status: New => Incomplete

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1964587

Title:
  default video driver

Status in OpenStack Compute (nova):
  Incomplete

Bug description:
  Hello, I saw that on the amd64 platform nova defaults to using cirrus as the
  video driver, and Windows virtual machines get a small resolution.

  The virtio video driver could allow a larger resolution. And it looks like
  the driver type cannot be set by the user.

  My question is: why use cirrus as the default, and do we have a plan to
  adopt virtio? Thank you.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1964587/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1954619] Re: device_name is too narrow

2022-01-06 Thread norman shen
Thank you, I saw a patch has been merged upstream for new releases, and
this should be fixed.

** Changed in: horizon
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1954619

Title:
  device_name is too narrow

Status in OpenStack Dashboard (Horizon):
  Invalid

Bug description:
  Horizon will auto-fill a device_name of "vda" by default. But vda only
  makes sense for a virtio-blk block device. For a SCSI device, sda makes
  more sense.

  Nova will take care of the device name if it is not specified, so why not
  make this field null by default and let nova choose a better device_name
  instead?

To manage notifications about this bug go to:
https://bugs.launchpad.net/horizon/+bug/1954619/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1954619] [NEW] device_name is too narrow

2021-12-12 Thread norman shen
Public bug reported:

Horizon will auto-fill a device_name of "vda" by default. But vda only
makes sense for a virtio-blk block device. For a SCSI device, sda makes more
sense.

Nova will take care of the device name if it is not specified, so why not make
this field null by default and let nova choose a better device_name
instead?

** Affects: horizon
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1954619

Title:
  device_name is too narrow

Status in OpenStack Dashboard (Horizon):
  New

Bug description:
  Horizon will auto-fill a device_name of "vda" by default. But vda only
  makes sense for a virtio-blk block device. For a SCSI device, sda makes
  more sense.

  Nova will take care of the device name if it is not specified, so why not
  make this field null by default and let nova choose a better device_name
  instead?

To manage notifications about this bug go to:
https://bugs.launchpad.net/horizon/+bug/1954619/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1953718] [NEW] nova compute failed to update placement if mdev max available is 0

2021-12-08 Thread norman shen
Public bug reported:

Description
===
nova-compute will fail to update vGPU mdev placement data if the mdev type is
changed while there are some previously created mdev devices with different
types. For NVIDIA, under such circumstances the max available instances will
be 0.
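
For context, the 400 shown in the "Actual result" log below is plain schema
validation on the placement side: an inventory's 'total' must be at least 1,
so reporting a VGPU total of 0 is rejected before anything is updated. A tiny
sketch (using the jsonschema library directly, just to reproduce the message)
follows.

```python
# Reproduce the validation message from the placement error in the log below:
# the inventory schema requires 'total' >= 1, so a reported VGPU total of 0
# fails validation and placement answers with a 400.
import jsonschema

total_schema = {'type': 'integer', 'minimum': 1, 'maximum': 2147483647}

try:
    jsonschema.validate(0, total_schema)
except jsonschema.ValidationError as exc:
    print(exc.message)   # "0 is less than the minimum of 1"
```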

Steps to reproduce
==

configure the vgpu type as nvidia-231 at first,
boot one instance,
then change the vgpu type to nvidia-233 and restart the nova-compute service;
then it will fail to update placement

Expected result
===
Better observability, for example refusing to start the nova-compute service,
or better logging to help the operator understand the possible cause.

Actual result
=

2021-12-09 07:18:13.774 632001 ERROR nova.scheduler.client.report [None 
req-d717a248-4d90-4262-bf8b-11875c60aea6 - - - - -] 
[req-03944f1d-79bb-4d2f-b37a-99db24d78653] Failed to update inventory to 
[{'VGPU': {'total': 0, 'min_unit': 1, 'step_size': 1, 'reserved': 0, 
'allocation_ratio': 1.0, 'max_unit': 0}}] for resource provider with UUID 
9b6dd7c7-50c8-4780-b343-4c2e65dd0c67.  Got 400: {"errors": [{"status": 400, 
"title": "Bad Request", "detail": "The server could not comply with the request 
since it is either malformed or otherwise incorrect.\n\n JSON does not 
validate: 0 is less than the minimum of 1  Failed validating 'minimum' in 
schema['properties']['inventories']['patternProperties']['^[A-Z0-9_]+$']['properties']['total']:
 {'maximum': 2147483647, 'minimum': 1, 'type': 'integer'}  On 
instance['inventories']['VGPU']['total']: 0  ", "code": 
"placement.undefined_code", "request_id": 
"req-03944f1d-79bb-4d2f-b37a-99db24d78653"}]}
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager [None 
req-d717a248-4d90-4262-bf8b-11875c60aea6 - - - - -] Error updating resources 
for node compute-009.: nova.exception.ResourceProviderSyncFailed: Failed to 
synchronize the placement service with resource provider information supplied 
by the compute host.
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager Traceback (most 
recent call last):
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/scheduler/client/report.py",
 line 1342, in catch_all
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager yield
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/scheduler/client/report.py",
 line 1430, in update_from_provider_tree
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager 
self.set_inventory_for_provider(
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/scheduler/client/report.py",
 line 951, in set_inventory_for_provider
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager raise 
exception.ResourceProviderUpdateFailed(url=url, error=resp.text)
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager 
nova.exception.ResourceProviderUpdateFailed: Failed to update resource provider 
via URL /resource_providers/9b6dd7c7-50c8-4780-b343-4c2e65dd0c67/inventories: 
{"errors": [{"status": 400, "title": "Bad Request", "detail": "The server could 
not comply with the request since it is either malformed or otherwise 
incorrect.\n\n JSON does not validate: 0 is less than the minimum of 1  Failed 
validating 'minimum' in 
schema['properties']['inventories']['patternProperties']['^[A-Z0-9_]+$']['properties']['total']:
 {'maximum': 2147483647, 'minimum': 1, 'type': 'integer'}  On 
instance['inventories']['VGPU']['total']: 0  ", "code": 
"placement.undefined_code", "request_id": 
"req-03944f1d-79bb-4d2f-b37a-99db24d78653"}]}
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager 
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager During handling of 
the above exception, another exception occurred:
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager 
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager Traceback (most 
recent call last):
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/compute/manager.py", line 
10293, in _update_available_resource_for_node
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager 
self.rt.update_available_resource(context, nodename,
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/compute/resource_tracker.py",
 line 910, in update_available_resource
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager 
self._update_available_resource(context, resources, startup=startup)
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/oslo_concurrency/lockutils.py", 
line 360, in inner
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager return f(*args, 
**kwargs)
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager   File 

[Yahoo-eng-team] [Bug 1946546] [NEW] nova-compute endlessly waits for snapshot completes

2021-10-09 Thread norman shen
Public bug reported:

Description
===

When trying to create a server image, nova-compute will wait endlessly for the
snapshot to be created. This is quite dangerous because the server's file
system has already been frozen and IO operations have been disabled.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1946546

Title:
  nova-compute endlessly waits for snapshot completes

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===

  When trying to create a server image, nova-compute will wait endlessly for
  the snapshot to be created. This is quite dangerous because the server's
  file system has already been frozen and IO operations have been disabled.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1946546/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1940641] [NEW] nova compute with allocated vgpu device failed to start after host reboot

2021-08-20 Thread norman shen
Public bug reported:

Description
=

The nova-compute service fails to start after a host reboot if there are
pre-existing vGPU virtual machines.

Error log

2021-08-20 09:37:30.331 284159 DEBUG nova.virt.libvirt.volume.mount [None 
req-6ad4e06c-980e-4759-8b36-6c696e596dab - - - - -] Initialising 
_HostMountState generation 0 host_up 
/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/volume/mount.py:131
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service [-] Error starting 
thread.: libvirt.libvirtError: Node device not found: no node device with 
matching name 'mdev_74527849_d08c_4243_b868_f84a1437c9b5'
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service Traceback (most 
recent call last):
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File 
"/var/lib/openstack/lib/python3.8/site-packages/oslo_service/service.py", line 
807, in run_service
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service service.start()
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/service.py", line 159, in 
start
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service 
self.manager.init_host()
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/compute/manager.py", line 
1414, in init_host
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service 
self.driver.init_host(host=self.host)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", 
line 733, in init_host
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service 
self._recreate_assigned_mediated_devices()
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", 
line 862, in _recreate_assigned_mediated_devices
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service dev_info = 
self._get_mediated_device_information(dev_name)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", 
line 7380, in _get_mediated_device_information
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service virtdev = 
self._host.device_lookup_by_name(devname)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/host.py", 
line 1153, in device_lookup_by_name
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service return 
self.get_connection().nodeDeviceLookupByName(name)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File 
"/var/lib/openstack/lib/python3.8/site-packages/eventlet/tpool.py", line 190, 
in doit
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service result = 
proxy_call(self._autowrap, f, *args, **kwargs)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File 
"/var/lib/openstack/lib/python3.8/site-packages/eventlet/tpool.py", line 148, 
in proxy_call
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service rv = execute(f, 
*args, **kwargs)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File 
"/var/lib/openstack/lib/python3.8/site-packages/eventlet/tpool.py", line 129, 
in execute
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service six.reraise(c, e, 
tb)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File 
"/var/lib/openstack/lib/python3.8/site-packages/six.py", line 703, in reraise
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service raise value
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File 
"/var/lib/openstack/lib/python3.8/site-packages/eventlet/tpool.py", line 83, in 
tworker
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service rv = meth(*args, 
**kwargs)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File 
"/var/lib/openstack/lib/python3.8/site-packages/libvirt.py", line 4614, in 
nodeDeviceLookupByName
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service if ret is 
None:raise libvirtError('virNodeDeviceLookupByName() failed', conn=self)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service libvirt.libvirtError: 
Node device not found: no node device with matching name 
'mdev_74527849_d08c_4243_b868_f84a1437c9b5'
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service


Environment


nova: victoria
os ubuntu 20.04

Steps to Reproduce
===


create vgpu virtual machines (mdev) and then reboot host.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1940641

Title:
  nova compute with allocated vgpu device failed to start after host
  reboot

Status in OpenStack Compute (nova):
  

[Yahoo-eng-team] [Bug 1940012] [NEW] allow attaching pci devices as different functions

2021-08-15 Thread norman shen
Public bug reported:

Description
===

We have a use case to attach an FPGA device to a virtual machine. This FPGA
card has two functions, and we can attach both of them using an alias. After
both of them are passed through to the virtual machine, we found that they do
not appear as different functions of the same PCI device. Instead, they show
up as two separate PCI devices, as denoted by the 'slot' ID.

I think it should be possible to allow setting the function, as libvirt
allows it.

** Affects: nova
 Importance: Undecided
 Status: Opinion

** Changed in: nova
   Status: New => Opinion

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1940012

Title:
  allow attaching pci devices as different functions

Status in OpenStack Compute (nova):
  Opinion

Bug description:
  Description
  ===

  We have a use case to attach an FPGA device to a virtual machine. This FPGA
  card has two functions, and we can attach both of them using an alias. After
  both of them are passed through to the virtual machine, we found that they
  do not appear as different functions of the same PCI device. Instead, they
  show up as two separate PCI devices, as denoted by the 'slot' ID.

  I think it should be possible to allow setting the function, as libvirt
  allows it.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1940012/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1934203] [NEW] cannot multi attach enabled volume after swap volume

2021-06-30 Thread norman shen
Public bug reported:

Description
===

Detaching a multi-attach enabled volume fails after swapping the volume.

Steps to reproduce
==

1. Create two volume types with multi-attach enabled (A, B)
2. Create a new volume using type A
3. Attach it to a server
4. Retype this volume to type B
5. Wait for the retype to succeed; detaching the volume will then fail

Expected result
===

volume should be successfully detached

Actual result
=

it fails because nova-compute uses a non-existent volume ID from the
connection info

Environment
===

1. openstack nova victoria
2. ubuntu 18.04 with docker image using 20.04

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1934203

Title:
  cannot multi attach enabled volume after swap volume

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===

  Detaching a multi-attach enabled volume fails after swapping the volume.

  Steps to reproduce
  ==

  1. Create two volume types with multi-attach enabled (A, B)
  2. Create a new volume using type A
  3. Attach it to a server
  4. Retype this volume to type B
  5. Wait for the retype to succeed; detaching the volume will then fail

  Expected result
  ===

  volume should be successfully detached

  Actual result
  =

  it fails because nova-compute uses a non-existent volume ID from the
  connection info

  Environment
  ===

  1. openstack nova victoria
  2. ubuntu 18.04 with docker image using 20.04

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1934203/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1931209] [NEW] Circular reference detected during cold migration

2021-06-08 Thread norman shen
Public bug reported:

Description
===

cold migration fails when the server has a NUMA topology

Steps to reproduce
==

create a server from a flavor that specifies NUMA topology parameters, then do
a cold migrate or resize

Expected


success

Actual
==

the operation fails with the following messages:

2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server 
[req-c299c521-2a07-483b-b19e-deb136572da0 dde6a5842265470a8e2f40938ae66097 
f3d6994dfaf043479c9cf5bbac19ab87 - default default] Exception during message 
handling: ValueError: Circular reference detected
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server Traceback (most 
recent call last):
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", 
line 166, in _process_incoming
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server res = 
self.dispatcher.dispatch(message)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py",
 line 265, in dispatch
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server return 
self._do_dispatch(endpoint, method, ctxt, args)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py",
 line 194, in _do_dispatch
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server result = 
func(ctxt, **new_args)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", 
line 229, in inner
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server return 
func(*args, **kwargs)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/nova/conductor/manager.py", 
line 94, in wrapper
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server return fn(self, 
context, *args, **kwargs)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/nova/compute/utils.py", line 
1164, in decorated_function
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server return 
function(self, context, *args, **kwargs)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/nova/conductor/manager.py", 
line 298, in migrate_server
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server host_list)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/nova/conductor/manager.py", 
line 358, in _cold_migrate
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server updates, ex, 
request_spec)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/oslo_utils/excutils.py", line 
220, in __exit__
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server 
self.force_reraise()
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/oslo_utils/excutils.py", line 
196, in force_reraise
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server 
six.reraise(self.type_, self.value, self.tb)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/six.py", line 693, in reraise
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server raise value
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/nova/conductor/manager.py", 
line 327, in _cold_migrate
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server task.execute()
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/nova/conductor/tasks/base.py", 
line 27, in wrap
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server self.rollback()
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/oslo_utils/excutils.py", line 
220, in __exit__
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server 
self.force_reraise()
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/oslo_utils/excutils.py", line 
196, in force_reraise
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server 
six.reraise(self.type_, self.value, self.tb)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/six.py", line 693, in reraise
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server raise value
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File 
"/var/lib/openstack/lib/python3.6/site-packages/nova/conductor/tasks/base.py", 
line 
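
The "Circular reference detected" in this traceback is what Python's JSON
encoder raises when asked to serialize an object graph that refers back to
itself; a minimal standalone reproduction (plain dicts standing in for the
real request-spec and NUMA topology objects) looks like this:

```python
import json

# Plain dicts standing in for the real request-spec / NUMA topology objects:
# a graph that refers back to itself cannot be JSON-serialized for RPC.
spec = {'numa_topology': {}}
spec['numa_topology']['parent'] = spec   # introduces the cycle

try:
    json.dumps(spec)
except ValueError as exc:
    print(exc)   # Circular reference detected
```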

[Yahoo-eng-team] [Bug 1929480] [NEW] cloud-init for ubuntu 18.04

2021-05-24 Thread norman shen
Public bug reported:

Ubuntu 18.04 uses netplan to manage networking; netplan can use either
NetworkManager or systemd-networkd as its backend, but it does not use the
legacy networking service.

cloud-init.service explicitly depends on networking.service completing, which
can be problematic because that service might never become ready.

** Affects: cloud-init
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1929480

Title:
  cloud-init for ubuntu 18.04

Status in cloud-init:
  New

Bug description:
  Ubuntu 18.04 uses netplan to manage networking; netplan can use either
  NetworkManager or systemd-networkd as its backend, but it does not use the
  legacy networking service.

  cloud-init.service explicitly depends on networking.service completing,
  which can be problematic because that service might never become ready.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1929480/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1927747] [NEW] neutron ovs agent apply openvswitch security group slow

2021-05-07 Thread norman shen
Public bug reported:

I am running neutron-ovs-agent with the openvswitch firewall driver; there
are around 40 ports sharing the same security group on the same compute node.
Updating the security group appears to take close to 3 seconds per port,
which adds up to around 100 seconds in total. This significantly affects
the speed of spawning virtual machines.

** Affects: neutron
 Importance: Undecided
 Status: New

** Description changed:

  I am using neutron-ovs-agent using openvswitch firewall, there are
  around 40 ports with same security group on the same compute node. it
  seems update security group for each port will consume near 3 seconds
  which sums up to around 100 seconds in total. This significantly affects
- the speed of creating new ports.
+ the speed of spawning virtual machines.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1927747

Title:
  neutron ovs agent apply openvswitch security group  slow

Status in neutron:
  New

Bug description:
  I am running neutron-ovs-agent with the openvswitch firewall driver; there
  are around 40 ports sharing the same security group on the same compute
  node. Updating the security group appears to take close to 3 seconds per
  port, which adds up to around 100 seconds in total. This significantly
  affects the speed of spawning virtual machines.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1927747/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1926049] [NEW] check_changed_vlans failed

2021-04-24 Thread norman shen
Public bug reported:

2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None 
req-413ff802-0c14-47ad-8221-14d7e972bad3 - - - - -] Error while processing VIF 
ports: TypeError: %d format: a number is required, not list
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most 
recent call last):
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/var/lib/openstack/lib/python3.8/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py",
 line 2658, in rpc_loop
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
ports_not_ready_yet) = (self.process_port_info(
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/var/lib/openstack/lib/python3.8/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py",
 line 2453, in process_port_info
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent port_info = 
self.scan_ports(reg_ports, sync,
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/var/lib/openstack/lib/python3.8/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py",
 line 1764, in scan_ports
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
updated_ports.update(self.check_changed_vlans())
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/var/lib/openstack/lib/python3.8/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py",
 line 1795, in check_changed_vlans
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent LOG.info(
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/usr/lib/python3.8/logging/__init__.py", line 1794, in info
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
self.log(INFO, msg, *args, **kwargs)
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/usr/lib/python3.8/logging/__init__.py", line 1832, in log
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
self.logger.log(level, msg, *args, **kwargs)
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/usr/lib/python3.8/logging/__init__.py", line 1500, in log
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
self._log(level, msg, args, **kwargs)
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/usr/lib/python3.8/logging/__init__.py", line 1577, in _log
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
self.handle(record)
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/usr/lib/python3.8/logging/__init__.py", line 1587, in handle
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
self.callHandlers(record)
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/usr/lib/python3.8/logging/__init__.py", line 1649, in callHandlers
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
hdlr.handle(record)
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/usr/lib/python3.8/logging/__init__.py", line 950, in handle
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
self.emit(record)
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/var/lib/openstack/lib/python3.8/site-packages/fluent/handler.py", line 237, 
in emit
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent data = 
self.format(record)
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/usr/lib/python3.8/logging/__init__.py", line 925, in format
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent return 
fmt.format(record)
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/var/lib/openstack/lib/python3.8/site-packages/oslo_log/formatters.py", line 
315, in format
2021-04-25 03:19:37.303 1 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent message = 
{'message': 
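
What this traceback boils down to is a log call interpolating a list into a
%d format specifier (the ovsdb "tag" column can come back as a list rather
than a single number). A minimal standalone reproduction with made-up port
data, and the %s variant that tolerates it:

```python
# The ovsdb "tag" column can come back as a list, and interpolating a list
# into %d raises exactly the error logged above.
lost_vlan_tag = [5]          # made-up value standing in for the ovsdb result
try:
    "Port %(name)s has lost its vlan tag %(tag)d!" % {
        'name': 'tap0', 'tag': lost_vlan_tag}
except TypeError as exc:
    print(exc)   # %d format: a number is required, not list

# Formatting with %s instead accepts any value, so a list no longer breaks
# the log formatter:
print("Port %(name)s has lost its vlan tag %(tag)s!" % {
    'name': 'tap0', 'tag': lost_vlan_tag})
```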

[Yahoo-eng-team] [Bug 1925144] [NEW] timeout in rados connect does not take effect

2021-04-20 Thread norman shen
Public bug reported:

Description
===

Judging from
https://github.com/ceph/ceph/blob/0be78da368f2dc1c891e3caafac38f7aa96d3c49/src/pybind/rados/rados.pyx#L660,
the connect function of the rados object ignores its timeout argument, which
means the currently configured timeout has no effect.

Steps to reproduce
==

Configure rbd to use a non-existent monitor IP address; rados.connect then
hangs indefinitely.


Expected result
===

timeout should take effect

Actual result
=

when there is a network problem, the call hangs for longer than the
configured timeout
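
Since the timeout passed to connect() is ignored, one possible workaround is
to enforce the deadline from the caller's side. This is only a rough sketch,
not nova's actual code: the helper name is made up, and the blocked worker
thread is merely abandoned rather than cancelled:

```python
from concurrent import futures

import rados

def connect_with_timeout(conf_path, timeout):
    # Hypothetical helper: run the blocking librados connect() in a worker
    # thread and stop waiting for it after `timeout` seconds.
    client = rados.Rados(conffile=conf_path)
    pool = futures.ThreadPoolExecutor(max_workers=1)
    try:
        pool.submit(client.connect).result(timeout=timeout)
        return client
    except futures.TimeoutError:
        client.shutdown()   # best effort; the worker may still be blocked
        raise
    finally:
        pool.shutdown(wait=False)
```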

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1925144

Title:
  timeout in rados connect does not take effect

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===

  Judging from
  https://github.com/ceph/ceph/blob/0be78da368f2dc1c891e3caafac38f7aa96d3c49/src/pybind/rados/rados.pyx#L660,
  the connect function of the rados object ignores its timeout argument,
  which means the currently configured timeout has no effect.

  Steps to reproduce
  ==

  Configure rbd to use a non-existent monitor IP address; rados.connect then
  hangs indefinitely.

  
  Expected result
  ===

  timeout should take effect

  Actual result
  =

  when there is a network problem, the call hangs for longer than the
  configured timeout

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1925144/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1925143] [NEW] timeout in rados connect does not take effect

2021-04-20 Thread norman shen
Public bug reported:

Description
===

Judging from
https://github.com/ceph/ceph/blob/0be78da368f2dc1c891e3caafac38f7aa96d3c49/src/pybind/rados/rados.pyx#L660,
the connect function of the rados object ignores its timeout argument, which
means the currently configured timeout has no effect.

Steps to reproduce
==

Configure rbd to use a non-existent monitor IP address; rados.connect then
hangs indefinitely.


Expected result
===

timeout should take effect

Actual result
=

when there is a network problem, the call hangs for longer than the
configured timeout

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1925143

Title:
  timeout in rados connect does not take effect

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===

  Judging from
  https://github.com/ceph/ceph/blob/0be78da368f2dc1c891e3caafac38f7aa96d3c49/src/pybind/rados/rados.pyx#L660,
  the connect function of the rados object ignores its timeout argument,
  which means the currently configured timeout has no effect.

  Steps to reproduce
  ==

  Configure rbd to use a non-existent monitor IP address; rados.connect then
  hangs indefinitely.

  
  Expected result
  ===

  timeout should take effect

  Actual result
  =

  when there is a network problem, the call hangs for longer than the
  configured timeout

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1925143/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1923560] [NEW] retrieving security group is slow for server detail

2021-04-13 Thread norman shen
Public bug reported:

Description
===

Querying a large number of VMs through the server detail API is slow, and a
lot of the time is spent calling the neutron API to obtain security group info.


Expected result
===

obtaining security group info should not consume half of the total query
time

Actual result
=

too slow...

Environment
===
1. ubuntu 18.04 + nova 22

2. libvirt + qemu + kvm

3. ceph

4. vxlan + vlan

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1923560

Title:
  retrieving security group is slow for server detail

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===

  Querying a large number of VMs through the server detail API is slow, and a
  lot of the time is spent calling the neutron API to obtain security group
  info.

  
  Expected result
  ===

  obtaining security group info should not consume half of the total query
  time

  Actual result
  =

  too slow...

  Environment
  ===
  1. ubuntu 18.04 + nova 22

  2. libvirt + qemu + kvm

  3. ceph

  4. vxlan + vlan

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1923560/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1922222] [NEW] allow using tap device on netdev enabled host

2021-04-01 Thread norman shen
Public bug reported:

Hello, after reading the code it seems nova-compute can only use
vhostuser mode if the netdev datapath is enabled on the ovs bridge. An
internal use case requires us to use tap devices as well as vhostuser devices
on the same host. Does this sound like a valid use case?

** Affects: neutron
 Importance: Undecided
 Status: Opinion

** Changed in: neutron
   Status: New => Opinion

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1922222

Title:
  allow using tap device on netdev enabled host

Status in neutron:
  Opinion

Bug description:
  Hello, after reading the code it seems nova-compute can only use
  vhostuser mode if the netdev datapath is enabled on the ovs bridge. An
  internal use case requires us to use tap devices as well as vhostuser
  devices on the same host. Does this sound like a valid use case?

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1922222/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1921804] [NEW] leftover bdm when rabbitmq unstable

2021-03-29 Thread norman shen
Public bug reported:

Description
===

When RabbitMQ is unstable, there is a chance that the call at
https://github.com/openstack/nova/blob/7a1222a8654684262a8e589d91e67f2b9a9da336/nova/compute/api.py#L4741
will time out even though the BDM record is successfully created.

In that case the volume is shown in server show but cannot be detached,
while the volume status in cinder remains available.

Steps to reproduce
==
There may be no reliable way to reproduce this failure, because when rabbitmq
is unstable, many other services also show unusual behavior.

Expected result
===
We should be able to remove such an attachment via the API without manually
fixing the db...

```console
root@mgt02:~# openstack server show 4e5c3c7d-6b4c-4841-9e6e-9a3374036a3e
+-+---+
| Field   | Value   
  |
+-+---+
| OS-DCF:diskConfig   | MANUAL  
  |
| OS-EXT-AZ:availability_zone | cn-north-3a 
  |
| OS-EXT-SRV-ATTR:host| compute01   
  |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute01   
  |
| OS-EXT-SRV-ATTR:instance_name   | instance-ce4c   
  |
| OS-EXT-STS:power_state  | Running 
  |
| OS-EXT-STS:task_state   | None
  |
| OS-EXT-STS:vm_state | active  
  |
| OS-SRV-USG:launched_at  | 2021-03-29T09:06:38.00  
  |
| OS-SRV-USG:terminated_at| None
  |
| accessIPv4  | 
  |
| accessIPv6  | 
  |
| addresses   | newsql-net=192.168.1.217; 
service_mgt=100.114.3.41|
| config_drive| True
  |
| created | 2021-03-29T09:05:19Z
  |
| flavor  | newsql_2C8G40G_general 
(51db3192-cece-4b9a-9969-7916b4543beb) |
| hostId  | 
cf1f3937a3286677b3020d817541ac33d7c8f1ca74be49b26f128093
  |
| id  | 4e5c3c7d-6b4c-4841-9e6e-9a3374036a3e
  |
| image   | 
newsql-bini2.0.0alpha-ubuntu18.04-x64-20210112-pub 
(4531e3bf-0433-40c6-816b-6763f9d02c7a) |
| key_name| None
  |
| name| 
NewSQL-1abc5b28-b9e6-45cd-893d-5bb3a7732a43-3   
  |
| progress| 0   
  |
| project_id  | acfcc87fc1db430880f0cb1cce410906
  |
| properties  | productTag='NewSQL' 
  |
| security_groups | name='default'  
  |
| | 
name='csf-NewSQL-cluster-security-group'
  |
| status  | ACTIVE  
  |
| updated | 2021-03-29T09:06:39Z
  |
| user_id | a38ef24677cc4a45a143a31c5fb59ee9
 

[Yahoo-eng-team] [Bug 1914522] [NEW] migrate from iptables firewall to ovs firewall

2021-02-03 Thread norman shen
Public bug reported:

Sorry, this is filed as a bug report but is really a request for better
clarification in the documentation.

Currently we are running the iptables firewall in production and have seen
performance degrade, so we plan to switch to the ovs firewall in place. The
upgrade process is described in the docs at
https://docs.openstack.org/neutron/latest/contributor/internals/openvswitch_firewall.html#upgrade-path-from-iptables-hybrid-driver,
which provides three methods for upgrading an existing cluster.

I am interested in method 2, which says to "plug the tap device into the
integration bridge". Since the document does not provide the commands, I
would like to ask how to actually perform this. I tried:

```console
# brctl delif qbrxxx tapxxx
# ovs-vsctl add-port br-int tapxxx
```

but it does not work; the network appears to be disconnected afterwards.

Another question: is there an option 4, where the ovs firewall takes control
of existing iptables-firewalled ports so that users can transition to the ovs
firewall gradually?

Thank you.

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1914522

Title:
  migrate from iptables firewall to ovs firewall

Status in neutron:
  New

Bug description:
  Sorry, this is filed as a bug report but is really a request for better
  clarification in the documentation.

  Currently we are running the iptables firewall in production and have seen
  performance degrade, so we plan to switch to the ovs firewall in place. The
  upgrade process is described in the docs at
  https://docs.openstack.org/neutron/latest/contributor/internals/openvswitch_firewall.html#upgrade-path-from-iptables-hybrid-driver,
  which provides three methods for upgrading an existing cluster.

  I am interested in method 2, which says to "plug the tap device into the
  integration bridge". Since the document does not provide the commands, I
  would like to ask how to actually perform this. I tried:

  ```console
  # brctl delif qbrxxx tapxxx
  # ovs-vsctl add-port br-int tapxxx
  ```

  but it does not work; the network appears to be disconnected afterwards.

  Another question: is there an option 4, where the ovs firewall takes
  control of existing iptables-firewalled ports so that users can transition
  to the ovs firewall gradually?

  Thank you.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1914522/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1910946] [NEW] ovs is dead but ovs agent is up

2021-01-10 Thread norman shen
Public bug reported:

We are using openstack-neutron rocky with openvswitch 2.10.0.

We are using ubuntu 18.04, which shipped with a libc6 bug, reported here:
https://github.com/openvswitch/ovs-issues/issues/175.

My question is that when this bug happens the ovs agent stops working and is
reported dead in its own log, but it still sends heartbeats to neutron-server.
This is problematic because users looking at the agent service state will not
realize that ovs-agent is no longer working.
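
One possible mitigation, purely hypothetical and not something the agent does
today, would be for the agent to probe the running ovs-vswitchd before
sending its heartbeat and skip the report when the daemon does not answer:

```python
import subprocess

# Hypothetical liveness probe: ask the running ovs-vswitchd for its version
# via ovs-appctl. A hung or dead daemon fails or times out, so the agent
# could skip (or flag) its heartbeat instead of reporting a healthy state.
def ovs_vswitchd_alive(timeout=5):
    try:
        subprocess.run(['ovs-appctl', 'version'], check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                       timeout=timeout)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired,
            FileNotFoundError):
        return False
```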

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1910946

Title:
  ovs is dead but ovs agent is up

Status in neutron:
  New

Bug description:
  We are using openstack-neutron rocky with openvswitch 2.10.0.

  We are using ubuntu 18.04, which shipped with a libc6 bug, reported
  here: https://github.com/openvswitch/ovs-issues/issues/175.

  My question is that when this bug happens the ovs agent stops working and
  is reported dead in its own log, but it still sends heartbeats to
  neutron-server. This is problematic because users looking at the agent
  service state will not realize that ovs-agent is no longer working.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1910946/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1910947] [NEW] ovs is dead but ovs agent is up

2021-01-10 Thread norman shen
Public bug reported:

We are using openstack-neutron rocky with openvswitch 2.10.0.

We are using ubuntu 18.04, which shipped with a libc6 bug, reported here:
https://github.com/openvswitch/ovs-issues/issues/175.

My question is that when this bug happens the ovs agent stops working and is
reported dead in its own log, but it still sends heartbeats to neutron-server.
This is problematic because users looking at the agent service state will not
realize that ovs-agent is no longer working.

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1910947

Title:
  ovs is dead but ovs agent is up

Status in neutron:
  New

Bug description:
  We are using openstack-neutron rocky with openvswitch 2.10.0.

  We are using ubuntu 18.04, which shipped with a libc6 bug, reported
  here: https://github.com/openvswitch/ovs-issues/issues/175.

  My question is that when this bug happens the ovs agent stops working and
  is reported dead in its own log, but it still sends heartbeats to
  neutron-server. This is problematic because users looking at the agent
  service state will not realize that ovs-agent is no longer working.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1910947/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1909160] Re: high cpu usage when listing security groups

2020-12-27 Thread norman shen
OK, I'll try out Victoria and compare the results. Thank you for the reply.

** Changed in: neutron
   Status: New => Opinion

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1909160

Title:
  high cpu usage when listing security groups

Status in neutron:
  Opinion

Bug description:
  Listing security groups is slow and unexpectedly causes CPU spikes.

  I run a Rocky neutron-server with api_workers set to 1. When executing a
  command like

  ```console
  root@mgt01:~# time curl -H "x-auth-token: $token" 
http://neutron-server.openstack.svc.region-stackdev.myinspurcloud.com:9696/v2.0/security_groups
 -owide -s

  real  0m2.328s
  user  0m0.016s
  sys   0m0.012s

  
  root@mgt01:~# curl -H "x-auth-token: $token" 
http://neutron-server.openstack.svc.region-stackdev.myinspurcloud.com:9696/v2.0/security_groups
 | jq '.security_groups | length'
% Total% Received % Xferd  Average Speed   TimeTime Time  
Current
   Dload  Upload   Total   SpentLeft  Speed
  100  497k  100  497k0 0   219k  0  0:00:02  0:00:02 --:--:--  219k
  225
  ```

  It returns in around 2 seconds. There are around 200 security groups, so
  maybe that is not extremely slow, but what is interesting is that calling
  this REST API seems to cause CPU spikes in the neutron-server pod:

  ```console
  CONTAINER IDNAME  
CPU %   MEM USAGE / 
LIMITMEM %   NET I/O BLOCK I/O   PIDS
  8a30733e3932
k8s_neutron-server_neutron-server-787dcd7964-2zxt5_openstack_71cbb9bc-4530-11eb-bcc6-525400d22fc9_0
   92.83%  1020MiB / 2.441GiB   40.81%  0B / 0B 
0B / 16.4kB 8
  ```

  I am wondering why security group listing is cpu bound?

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1909160/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1909160] [NEW] high cpu usage when listing security groups

2020-12-23 Thread norman shen
Public bug reported:

Listing security groups is slow and unexpectedly causes CPU spikes.

I run a Rocky neutron-server with api_workers set to 1. When executing a
command like

```console
root@mgt01:~# time curl -H "x-auth-token: $token" 
http://neutron-server.openstack.svc.region-stackdev.myinspurcloud.com:9696/v2.0/security_groups
 -owide -s

real0m2.328s
user0m0.016s
sys 0m0.012s


root@mgt01:~# curl -H "x-auth-token: $token" 
http://neutron-server.openstack.svc.region-stackdev.myinspurcloud.com:9696/v2.0/security_groups
 | jq '.security_groups | length'
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100  497k  100  497k0 0   219k  0  0:00:02  0:00:02 --:--:--  219k
225
```

It returns in around 2 seconds. There are around 200 security groups, so
maybe that is not extremely slow, but what is interesting is that calling
this REST API seems to cause CPU spikes in the neutron-server pod:

```console
CONTAINER IDNAME
  CPU %   MEM USAGE / LIMIT 
   MEM %   NET I/O BLOCK I/O   PIDS
8a30733e3932
k8s_neutron-server_neutron-server-787dcd7964-2zxt5_openstack_71cbb9bc-4530-11eb-bcc6-525400d22fc9_0
   92.83%  1020MiB / 2.441GiB   40.81%  0B / 0B 
0B / 16.4kB 8
```

I am wondering why security group listing is cpu bound?

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1909160

Title:
  high cpu usage when listing security groups

Status in neutron:
  New

Bug description:
  Listing security groups is slow and unexpectedly causes CPU spikes.

  I run a Rocky neutron-server with api_workers set to 1. When executing a
  command like

  ```console
  root@mgt01:~# time curl -H "x-auth-token: $token" 
http://neutron-server.openstack.svc.region-stackdev.myinspurcloud.com:9696/v2.0/security_groups
 -owide -s

  real  0m2.328s
  user  0m0.016s
  sys   0m0.012s

  
  root@mgt01:~# curl -H "x-auth-token: $token" 
http://neutron-server.openstack.svc.region-stackdev.myinspurcloud.com:9696/v2.0/security_groups
 | jq '.security_groups | length'
% Total% Received % Xferd  Average Speed   TimeTime Time  
Current
   Dload  Upload   Total   SpentLeft  Speed
  100  497k  100  497k0 0   219k  0  0:00:02  0:00:02 --:--:--  219k
  225
  ```

  It returns in around 2 seconds. There are around 200 security groups, so
  maybe that is not extremely slow, but what is interesting is that calling
  this REST API seems to cause CPU spikes in the neutron-server pod:

  ```console
  CONTAINER IDNAME  
CPU %   MEM USAGE / 
LIMITMEM %   NET I/O BLOCK I/O   PIDS
  8a30733e3932
k8s_neutron-server_neutron-server-787dcd7964-2zxt5_openstack_71cbb9bc-4530-11eb-bcc6-525400d22fc9_0
   92.83%  1020MiB / 2.441GiB   40.81%  0B / 0B 
0B / 16.4kB 8
  ```

  I am wondering why security group listing is cpu bound?

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1909160/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1908957] [NEW] iptable rules collision deployed with k8s iptables kube-proxy enabled

2020-12-21 Thread norman shen
Public bug reported:


Maybe it's a k8s kube-proxy related bug, but maybe it is easier to solve on 
neutron's side...

In k8s, either NodePort or ExternalIP services generate iptables rules that
affect VM traffic when the hybrid iptables firewall driver is enabled.

The problem is:

Chain PREROUTING (policy ACCEPT 650 packets, 65873 bytes)
 pkts bytes target prot opt in out source   destination 

 560K   37M ACCEPT all  --  *  *   0.0.0.0/00.0.0.0/0   
 PHYSDEV match --physdev-is-in
  56M 4944M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
  40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
  40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
  40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
  40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
  40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
  40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
  40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
  40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
  40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
  40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */

Packets get DNATed to destinations we do not want, and such traffic ends up
being dropped.

Adding the following rule seems to mitigate the problem:

iptables -t nat -I PREROUTING 2 -m physdev --physdev-is-in  -j ACCEPT

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1908957

Title:
  iptable rules collision deployed with k8s iptables kube-proxy enabled

Status in neutron:
  New

Bug description:
  
  Maybe it's a k8s kube-proxy related bug, but maybe it is easier to solve on 
neutron's side...

  In k8s, either NodePort or ExternalIP services generate iptables rules
  that affect VM traffic when the hybrid iptables firewall driver is enabled.

  The problem is:

  Chain PREROUTING (policy ACCEPT 650 packets, 65873 bytes)
   pkts bytes target prot opt in out source   
destination 
   560K   37M ACCEPT all  --  *  *   0.0.0.0/00.0.0.0/0 
   PHYSDEV match --physdev-is-in
56M 4944M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */
40M 3785M KUBE-SERVICES  all  --  *  *   0.0.0.0/0
0.0.0.0/0/* kubernetes service portals */

  Packets get DNATed to destinations we do not want, and such traffic ends
  up being dropped.

  Adding the following rule seems to mitigate the problem:

  iptables -t nat -I PREROUTING 2 -m physdev --physdev-is-in  -j ACCEPT

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1908957/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More 

[Yahoo-eng-team] [Bug 1902806] [NEW] only 7 iscsi disk could be attached

2020-11-03 Thread norman shen
Public bug reported:

With libvirt 4.0.0, a SCSI disk with a unit number of 7 cannot be attached,
due to libvirt's own limitation.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1902806

Title:
  only 7 iscsi disk could be attached

Status in OpenStack Compute (nova):
  New

Bug description:
  With libvirt 4.0.0, a SCSI disk with a unit number of 7 cannot be
  attached, due to libvirt's own limitation.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1902806/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1901124] [NEW] memcached cache not get expired

2020-10-22 Thread norman shen
Public bug reported:

We are using openstack rocky, and when I check memcached, I find:

root@compute:~# telnet compute 11211
Trying 192.168.0.17...
Connected to compute.
Escape character is '^]'.
stats cachedump 15 1
ITEM c9067b617ec1e6e7f78318c19e7ce2c7f4f9dcd6 [2034 b; 0 s]


No expiration time is set on the keys. Even after I set all of the cache_time
options, it still does not change.
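
For reference, the same check as the telnet session above can be scripted;
this is only an illustrative helper (the host, port and slab id are taken
from that session, and cachedump output is not a stable memcached interface):

```python
import socket

# Illustrative helper mirroring the telnet session above: dump a few items
# from one memcached slab. Items print as "ITEM <key> [<bytes> b; <exp> s]";
# an expiry field of 0 means the item never expires.
def cachedump(host, port, slab, limit=10):
    sock = socket.create_connection((host, port), timeout=5)
    sock.sendall(b'stats cachedump %d %d\r\n' % (slab, limit))
    data = b''
    while not data.endswith(b'END\r\n'):
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    sock.close()
    return data.decode()

print(cachedump('compute', 11211, 15))
```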


[identity]
password_hash_rounds = 4
driver = sql

[assignment]
driver = sql

[catalog]
cache_time = 300

[role]
driver = sql
cache_time = 300

[resource]
driver = sql
cache_time = 300

[application_credential]
cache_time = 300

[oslo.cache]
expiration_time = 300

[cache]
memcache_servers = compute:11211
backend = dogpile.cache.memcached
enabled = true
expiration_time = 300
cache_time = 300

[oslo_messaging_notifications]
transport_url = rabbit://stackrabbit:secret@192.168.0.5:5672/

[DEFAULT]
max_token_size = 16384
debug = True
logging_exception_prefix = ERROR %(name)s %(instance)s
logging_default_format_string = %(color)s%(levelname)s %(name)s [-%(color)s] 
%(instance)s%(color)s%(message)s
logging_context_format_string = %(color)s%(levelname)s %(name)s 
[%(global_request_id)s %(request_id)s %(project_name)s %(user_name)s%(color)s] 
%(instance)s%(color)s%(message)s
logging_debug_format_suffix = {{(pid=%(process)d) %(funcName)s 
%(pathname)s:%(lineno)d}}
admin_endpoint = http://192.168.0.5/identity
public_endpoint = http://192.168.0.5/identity

[token]
provider = fernet
cache_time = 300

[database]
connection = mysql+pymysql://root:secret@127.0.0.1/keystone?charset=utf8

[fernet_tokens]
key_repository = /etc/keystone/fernet-keys/

[credential]
key_repository = /etc/keystone/credential-keys/

[security_compliance]
unique_last_password_count = 2
lockout_duration = 10
lockout_failure_attempts = 2

[unified_limit]
cache_time = 300

** Affects: keystone
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Identity (keystone).
https://bugs.launchpad.net/bugs/1901124

Title:
  memcached cache not get expired

Status in OpenStack Identity (keystone):
  New

Bug description:
  We are using openstack rocky, and when I check memcached, I find:

  root@compute:~# telnet compute 11211
  Trying 192.168.0.17...
  Connected to compute.
  Escape character is '^]'.
  stats cachedump 15 1
  ITEM c9067b617ec1e6e7f78318c19e7ce2c7f4f9dcd6 [2034 b; 0 s]

  
  No expiration time is set on the keys. Even after I set all of the
  cache_time options, it still does not change.

  
  [identity]
  password_hash_rounds = 4
  driver = sql

  [assignment]
  driver = sql

  [catalog]
  cache_time = 300

  [role]
  driver = sql
  cache_time = 300

  [resource]
  driver = sql
  cache_time = 300

  [application_credential]
  cache_time = 300

  [oslo.cache]
  expiration_time = 300

  [cache]
  memcache_servers = compute:11211
  backend = dogpile.cache.memcached
  enabled = true
  expiration_time = 300
  cache_time = 300

  [oslo_messaging_notifications]
  transport_url = rabbit://stackrabbit:secret@192.168.0.5:5672/

  [DEFAULT]
  max_token_size = 16384
  debug = True
  logging_exception_prefix = ERROR %(name)s %(instance)s
  logging_default_format_string = %(color)s%(levelname)s %(name)s [-%(color)s] 
%(instance)s%(color)s%(message)s
  logging_context_format_string = %(color)s%(levelname)s %(name)s 
[%(global_request_id)s %(request_id)s %(project_name)s %(user_name)s%(color)s] 
%(instance)s%(color)s%(message)s
  logging_debug_format_suffix = {{(pid=%(process)d) %(funcName)s 
%(pathname)s:%(lineno)d}}
  admin_endpoint = http://192.168.0.5/identity
  public_endpoint = http://192.168.0.5/identity

  [token]
  provider = fernet
  cache_time = 300

  [database]
  connection = mysql+pymysql://root:secret@127.0.0.1/keystone?charset=utf8

  [fernet_tokens]
  key_repository = /etc/keystone/fernet-keys/

  [credential]
  key_repository = /etc/keystone/credential-keys/

  [security_compliance]
  unique_last_password_count = 2
  lockout_duration = 10
  lockout_failure_attempts = 2

  [unified_limit]
  cache_time = 300

To manage notifications about this bug go to:
https://bugs.launchpad.net/keystone/+bug/1901124/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1897236] [NEW] create port in a shared network failed for user with member role

2020-09-25 Thread norman shen
Public bug reported:

Creating a port on a shared network fails when it is done by a user who only
has the member role in another project.

** Affects: horizon
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1897236

Title:
  create port in a shared network failed for user with member role

Status in OpenStack Dashboard (Horizon):
  New

Bug description:
  Creating a port on a shared network fails when it is done by a user who
  only has the member role in another project.

To manage notifications about this bug go to:
https://bugs.launchpad.net/horizon/+bug/1897236/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1896574] Re: how to deal with hypervisor name changing

2020-09-24 Thread norman shen
I think the previous title is misleading. The hostname itself is still A;
what changes is the FQDN as seen by hostname --fqdn.

** Changed in: nova
   Status: Invalid => New

** Summary changed:

- how to deal with hypervisor name changing
+ how to deal with hypervisor host fqdn name changing

** Description changed:

- nova fails to correctly account for resources after hypervisor name
- changes. For example, if previously the hypervisor name is A, and some
- later it switches to A.B, then all of the instances which belong to A
+ nova fails to correctly account for resources after hypervisor hosntame
+ fqdn changes. For example, if previously the hypervisor hostname fqdn is
+ A, and some later it to A.B, then all of the instances which belong to A
  will not be included in the resource computation for A.B although
  effectively they are the same thing.
  
+ But under such circumstances, compute service's is still A.
+ 
  Is there any way to deal with this situation? we are using openstack
  rocky.

** Description changed:

  nova fails to correctly account for resources after hypervisor hosntame
  fqdn changes. For example, if previously the hypervisor hostname fqdn is
  A, and some later it to A.B, then all of the instances which belong to A
  will not be included in the resource computation for A.B although
  effectively they are the same thing.
  
- But under such circumstances, compute service's is still A.
+ But under such circumstances, compute service's is listed as A.
  
  Is there any way to deal with this situation? we are using openstack
  rocky.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1896574

Title:
  how to deal with hypervisor host fqdn name changing

Status in OpenStack Compute (nova):
  New

Bug description:
  nova fails to correctly account for resources after the hypervisor
  hostname FQDN changes. For example, if the hypervisor hostname FQDN was
  previously A and later changes to A.B, then none of the instances that
  belong to A are included in the resource computation for A.B, although
  they are effectively the same host.

  But under such circumstances, the compute service's host is still listed
  as A.

  Is there any way to deal with this situation? We are using openstack
  rocky.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1896574/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1896574] [NEW] how to deal with hypervisor name changing

2020-09-22 Thread norman shen
Public bug reported:

nova fails to correctly account for resources after the hypervisor name
changes. For example, if the hypervisor name was previously A and later
switches to A.B, then none of the instances that belong to A are included in
the resource computation for A.B, although they are effectively the same
host.

Is there any way to deal with this situation? We are using openstack
rocky.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1896574

Title:
  how to deal with hypervisor name changing

Status in OpenStack Compute (nova):
  New

Bug description:
  nova fails to correctly account for resources after the hypervisor name
  changes. For example, if the hypervisor name was previously A and later
  switches to A.B, then none of the instances that belong to A are included
  in the resource computation for A.B, although they are effectively the
  same host.

  Is there any way to deal with this situation? We are using openstack
  rocky.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1896574/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1542032] Re: IP reassembly issue on the Linux bridges in Openstack

2020-09-17 Thread norman shen
** Changed in: neutron
   Status: Confirmed => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1542032

Title:
  IP reassembly issue on the Linux bridges in Openstack

Status in neutron:
  Invalid

Bug description:
  Hi,

  Sorry for text diagram. It does not look very well on this screen.
  Please, copy paste in a decent fixed width text editor.

  Thanks,

  Claude.

  
  Title: IP reassembly issue on the Linux bridges in Openstack
  

  Summary: When the security groups and the Neutron firewall are active
  in Openstack, each and every VM virtual network interfaces (VNIC) is
  isolated in a Linux bridge and IP reassembly must be performed in
  order to allow firewall inspection of the traffic. The reassembled
  traffic sometimes exceed the capacity of the physical interfaces and
  the traffic is not forwarded properly.

  Linux bridge diagram:
  -

  --|   |--|
 VM |   |  OVS |
--- |   --  --- | -  - |
---
| TAP |-|---| QBR bridge |--| QVB |-|-|QVO|  | P |-|| 
FW-ADMIN || PHY |
--- |   --  --- | -  - |
---
|   |  |
  - |   |--|

  Introduction:
  -

  In Openstack, the virtual machine (VM) uses the OpenvSwitch (OVS) for
  networking purposes. This is not a mandatory setup but this is a
  common setup in Openstack.

  When the Neutron firewall and the security groups are active, each VM
  VNIC, also called a tap interface, is connected to a Linux bridge.
  This is the QBR bridge. The QVB interface enables the network
  communication with OVS. The QVB interface interacts with the QVO
  interface in OVS.

  Security analysis is performed on the Linux bridge. In order to
  perform adequate traffic inspection, the fragmented traffic has to be
  re-assembled. The traffic is then forwarded according to Maximum
  Transmit Unit (MTU) of the interfaces in the bridge.

  The MTU values on all the interfaces are set to 65000 bytes. This is
  where a part of the problem experienced with NFV applications is
  observed.

  Analysis:
  -

  As a real life example, the NFV application uses NFS between VMs. NFS
  is a well known feature in Unix environments. This feature provides
  network file systems. This is the equivalent of a network drive in the
  Windows world.

  NFS is known to produce large frames. In this example, the VM1
(169.254.4.242) sends a large NFS write instruction to the VM2. The
  example below shows a 5 KB packet. The traffic is fragmented in
  several packets as instructed by the VM1 VNIC. This is the desired
  behavior.

  root@node-11:~# tcpdump -e -n -i tap3e79842d-eb host 169.254.1.13

  23:46:48.938255 00:80:37:0e:0f:12 > 00:80:37:0e:0b:12, ethertype IPv4 
(0x0800), length 1514: 169.254.4.242.3015988240 > 169.254.1.13.2049: 1472 write 
fh Unknown/01000601B1198A1CB3CC4E1EA3AB0B26017B0AD653620700D59B28C7 
4863 (4863) bytes @ 229376
  23:46:48.938271 00:80:37:0e:0f:12 > 00:80:37:0e:0b:12, ethertype IPv4 
(0x0800), length 1514: 169.254.4.242 > 169.254.1.13: ip-proto-17
  23:46:48.938279 00:80:37:0e:0f:12 > 00:80:37:0e:0b:12, ethertype IPv4 
(0x0800), length 1514: 169.254.4.242 > 169.254.1.13: ip-proto-17
  23:46:48.938287 00:80:37:0e:0f:12 > 00:80:37:0e:0b:12, ethertype IPv4 
(0x0800), length 590: 169.254.4.242 > 169.254.1.13: ip-proto-17

  The same packet is found on the QVB interface in one large frame.

  root@node-11:~# tcpdump -e -n -i qvb3e79842d-eb host 169.254.1.13

  23:46:48.938322 00:80:37:0e:0f:12 > 00:80:37:0e:0b:12, ethertype IPv4
  (0x0800), length 5030: 169.254.4.242.3015988240 > 169.254.1.13.2049:
  4988 write fh
  Unknown/01000601B1198A1CB3CC4E1EA3AB0B26017B0AD653620700D59B28C7
  4863 (4863) bytes @ 229376

  Such large packets cannot cross physical interfaces without being
  fragmented again if jumbo frames support is not active in the network.
  Even with jumbo frames, the NFS frame size can easily cross the 9K
  barrier. NFS frame size up to 32 KB can be observed with NFS over UDP.

  For some reasons, this traffic does not seem to be transmitted
  properly between compute hosts in Openstack.

  Further investigations have revealed the large frames are leaving the
  OVS internal bridge (br-int) in direction of the private bridge (br-
  prv) using a patch interface in OVS. Once the traffic has reached this
  point, it uses the "P" interface (i.e.: p_51a2-0) to reach another
  Linux bridge (br-fw-admin) where the physical interface is connected
  to. The "P" 

[Yahoo-eng-team] [Bug 1895063] [NEW] Allow rescue volume backed instance

2020-09-09 Thread norman shen
Public bug reported:

Should we offer rescue support for volume-backed instances?

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1895063

Title:
  Allow rescue volume backed instance

Status in OpenStack Compute (nova):
  New

Bug description:
  Should we offer rescue support for volume-backed instances?

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1895063/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1893015] [NEW] ping with large package size fails

2020-08-26 Thread norman shen
Public bug reported:

We are using neutron rocky, with security driver set to iptables_hybrid,
the cluster is deployed on top of a kubernetes cluster. And all the
networks are set to mtu 1500

The problem I am facing right now is that ping across compute nodes
fails with a packet size larger than mtu.

ping -s 2000 172.20.93.171

Surprisingly, if I ping an IP address from the same node, it works
without any issue.

I have done a simple tcpdump on qvb like (both on remote and local
compute node)

tcpdump -i qvb host 172.20.93.171 and icmp

And I saw the traffic, but if I am listening on tap or qbr, no traffic
is captured.

I tried to add a LOG iptables rule to debug:

iptables -t raw -I PREROUTING 1 -m physdev --physdev-in qvb373214e3-8d
-p icmp -s 172.20.93.173/12 -j LOG --log-prefix='[netfilter] '

Oddly enough, no packets are counted by this rule when the packet size is set
to 2000.

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1893015

Title:
  ping with large package size fails

Status in neutron:
  New

Bug description:
  We are using neutron rocky, with security driver set to
  iptables_hybrid, the cluster is deployed on top of a kubernetes
  cluster. And all the networks are set to mtu 1500

  The problem I am facing right now is that ping across compute nodes
  fails with a packet size larger than mtu.

  ping -s 2000 172.20.93.171

  Surprisingly, if I ping an IP address from the same node, it works
  without any issue.

  I have done a simple tcpdump on qvb like (both on remote and local
  compute node)

  tcpdump -i qvb host 172.20.93.171 and icmp

  And I saw the traffic, but if I am listening on tap or qbr, no traffic
  is captured.

  I try to add a log iptable rule to debug, by

  iptables -t raw -I PREROUTING 1 -m physdev --physdev-in qvb373214e3-8d
  -p icmp -s 172.20.93.173/12 -j LOG --log-prefix='[netfilter] '

  Oddly enough, no packets are counted when the packet size is set to
  2000.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1893015/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1892582] [NEW] image creation does not fail immediately if volume not created

2020-08-22 Thread norman shen
Public bug reported:

Image creation with the Cinder backend fails only after a long wait
while the backing volume is still stuck in the 'creating' state.

root@mgt01:~# openstack volume list --all | grep 
fb8aee1b-e19e-4336-8fa2-864f1664b834
| b1e021bd-974d-4974-961b-47ab7f9b0a16 | 
image-fb8aee1b-e19e-4336-8fa2-864f1664b834 | 
creating   |  500 |

** Affects: glance
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to Glance.
https://bugs.launchpad.net/bugs/1892582

Title:
  image creation does not fail immediately if volume not created

Status in Glance:
  New

Bug description:
  Image creation with the Cinder backend fails only after a long wait
  while the backing volume is still stuck in the 'creating' state.

  root@mgt01:~# openstack volume list --all | grep 
fb8aee1b-e19e-4336-8fa2-864f1664b834
  | b1e021bd-974d-4974-961b-47ab7f9b0a16 | 
image-fb8aee1b-e19e-4336-8fa2-864f1664b834 | 
creating   |  500 |

To manage notifications about this bug go to:
https://bugs.launchpad.net/glance/+bug/1892582/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1887108] [NEW] wrong l2pop flows on vlan network

2020-07-09 Thread norman shen
Public bug reported:

I saw l2pop ARP responder rules for a VLAN network, which causes
problems for MAC learning. There is no DVR router associated with it;
it is a pure VLAN network.

root@compute02:/tmp# ovs-ofctl dump-flows br-tun table=21
 cookie=0xcd381baa7a6d5b5c, duration=1703630.319s, table=21, n_packets=0, 
n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.36 
actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163ec89311->NXM_NX_ARP_SHA[],load:0xacc89224->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:c8:93:11,IN_PORT
 cookie=0xcd381baa7a6d5b5c, duration=1703175.829s, table=21, n_packets=0, 
n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.38 
actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163e8b426c->NXM_NX_ARP_SHA[],load:0xacc89226->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:8b:42:6c,IN_PORT
 cookie=0xcd381baa7a6d5b5c, duration=1703156.363s, table=21, n_packets=0, 
n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.37 
actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163e31dd83->NXM_NX_ARP_SHA[],load:0xacc89225->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:31:dd:83,IN_PORT
 cookie=0xcd381baa7a6d5b5c, duration=1703137.459s, table=21, n_packets=0, 
n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.39 
actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163e2e8650->NXM_NX_ARP_SHA[],load:0xacc89227->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:2e:86:50,IN_PORT
 cookie=0xcd381baa7a6d5b5c, duration=1703090.494s, table=21, n_packets=0, 
n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.41 
actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163e0a4d1c->NXM_NX_ARP_SHA[],load:0xacc89229->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:0a:4d:1c,IN_PORT
 cookie=0xcd381baa7a6d5b5c, duration=1703068.578s, table=21, n_packets=0, 
n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.40 
actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163e99553b->NXM_NX_ARP_SHA[],load:0xacc89228->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:99:55:3b,IN_PORT
 cookie=0xcd381baa7a6d5b5c, duration=1703050.537s, table=21, n_packets=0, 
n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.45 
actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163ecc5303->NXM_NX_ARP_SHA[],load:0xacc8922d->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:cc:53:03,IN_PORT
 cookie=0xcd381baa7a6d5b5c, duration=1703033.613s, table=21, n_packets=0, 
n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.43 
actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163e5ffd39->NXM_NX_ARP_SHA[],load:0xacc8922b->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:5f:fd:39,IN_PORT
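
To make the flows above easier to read, the load: values can be decoded
back into the MAC/IP pair each ARP responder rule answers for. A small
helper (illustration only, not part of any agent code):

    # Illustration only: decode NXM_NX_ARP_SHA / NXM_OF_ARP_SPA load values
    # from an l2pop ARP responder flow (table=21) into a MAC and an IPv4.
    def decode_arp_responder(mac_hex, ip_hex):
        mac_str = '%012x' % mac_hex
        mac = ':'.join(mac_str[i:i + 2] for i in range(0, 12, 2))
        ip = '.'.join(str((ip_hex >> s) & 0xff) for s in (24, 16, 8, 0))
        return mac, ip

    # First flow above: load:0xfa163ec89311 and load:0xacc89224
    print(decode_arp_responder(0xfa163ec89311, 0xacc89224))
    # -> ('fa:16:3e:c8:93:11', '172.200.146.36'), i.e. port ecs_eni_0 below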

root@mgt01:~# openstack port list --fixed-ip ip-address=172.200.146.36
+--+---+---+---++
| ID   | Name  | MAC Address   | Fixed 
IP Addresses| 
Status |
+--+---+---+---++
| 48131502-da22-4968-9b0b-f1efc3a860a1 | ecs_eni_0 | fa:16:3e:c8:93:11 | 
ip_address='172.200.146.36', subnet_id='d0890fec-6f33-4f08-8f7c-67fc429c91b8' | 
ACTIVE |
+--+---+---+---++
root@mgt01:~# openstack network show `openstack port show 
48131502-da22-4968-9b0b-f1efc3a860a1 -c network_id -f value`
+---+--+
| Field | Value|
+---+--+
| admin_state_up| UP   |
| availability_zone_hints   |  |
| availability_zones| az-jiaozuo-zww-1 |
| created_at| 2020-06-14T00:09:26Z |
| description   |  |
| dns_domain

[Yahoo-eng-team] [Bug 1886355] [NEW] glance upload image to rbd backend stuck

2020-07-05 Thread norman shen
Public bug reported:

Uploading an image to the rbd backend gets stuck in the 'saving' state:
'rbd du' shows the image size is not increasing, and 'ceph osd pool
stats' shows there is no client I/O.

A tcpdump shows the glance side is trying to receive from the client
with a rather small TCP window (280 bytes), which is tiny compared to
the actual image size (35 GB).

** Affects: glance
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to Glance.
https://bugs.launchpad.net/bugs/1886355

Title:
  glance upload image to rbd backend stuck

Status in Glance:
  New

Bug description:
  Uploading an image to the rbd backend gets stuck in the 'saving'
  state: 'rbd du' shows the image size is not increasing, and 'ceph osd
  pool stats' shows there is no client I/O.

  A tcpdump shows the glance side is trying to receive from the client
  with a rather small TCP window (280 bytes), which is tiny compared to
  the actual image size (35 GB).

To manage notifications about this bug go to:
https://bugs.launchpad.net/glance/+bug/1886355/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1884695] [NEW] allow less strict cpu flag comparison

2020-06-22 Thread norman shen
Public bug reported:

Description
===

Nova uses a strict CPU flag comparison during live migration; this
introduces problems when the hosts differ in CPU flags that do not
actually affect the migration. For example, the `monitoring` flag could
safely be ignored.

So I think it might be reasonable to ignore a set of features supplied
by user input, whether statically via configuration or dynamically via
API input.
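
As a rough sketch of what I mean (hypothetical helper and option, not
existing Nova code): subtract an operator-supplied ignore-list from both
feature sets before the strict comparison.

    # Hypothetical sketch, not existing Nova code: drop an ignore-list of
    # CPU flags (from config or API input) before the strict comparison.
    IGNORED_FLAGS = {'monitoring'}        # e.g. the flag mentioned above

    def flags_compatible(source_flags, dest_flags, ignored=IGNORED_FLAGS):
        src = set(source_flags) - set(ignored)
        dst = set(dest_flags) - set(ignored)
        return src.issubset(dst)          # destination must cover the rest

    print(flags_compatible({'vmx', 'ssse3', 'monitoring'}, {'vmx', 'ssse3'}))
    # -> True: the difference is only in the ignored flag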

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1884695

Title:
  allow less strict cpu flag comparison

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===

  Nova uses a strict CPU flag comparison during live migration; this
  introduces problems when the hosts differ in CPU flags that do not
  actually affect the migration. For example, the `monitoring` flag
  could safely be ignored.

  So I think it might be reasonable to ignore a set of features supplied
  by user input, whether statically via configuration or dynamically via
  API input.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1884695/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1884532] [NEW] inconsistent data in ipamallocations

2020-06-22 Thread norman shen
Public bug reported:

Sometimes the database becomes inconsistent for some reason,

for example, as shown below

MariaDB [neutron]> select * from ipamsubnets where 
neutron_subnet_id='9a8fd2b0-743c-4500-8978-9e5bf9b38347'
-> ;
+--+--+
| id   | neutron_subnet_id|
+--+--+
| 85e7171c-2648-4447-ada6-a37c3c113686 | 9a8fd2b0-743c-4500-8978-9e5bf9b38347 |
+--+--+
1 row in set (0.00 sec)

MariaDB [neutron]> select * from ipamallocations where  ipam_subnet_id = 
'85e7171c-2648-4447-ada6-a37c3c113686' \G
*** 1. row ***
ip_address: 10.13.45.1
status: ALLOCATED
ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
*** 2. row ***
ip_address: 10.13.45.2
status: ALLOCATED
ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
*** 3. row ***
ip_address: 10.13.45.3
status: ALLOCATED
ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
*** 4. row ***
ip_address: 10.13.45.4
status: ALLOCATED
ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
*** 5. row ***
ip_address: 10.13.45.5
status: ALLOCATED
ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
*** 6. row ***
ip_address: 10.13.45.6
status: ALLOCATED
ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
6 rows in set (0.00 sec)

MariaDB [neutron]> select * from ipamallocations where  ipam_subnet_id =
'85e7171c-2648-4447-ada6-a37c3c113686' \G


MariaDB [neutron]> select * from ipallocations where 
subnet_id='9a8fd2b0-743c-4500-8978-9e5bf9b38347' 
-> ;
+--++--+--+
| port_id  | ip_address | subnet_id 
   | network_id   |
+--++--+--+
| 0ae2630a-76d9-47b1-bf2f-012c2356df75 | 10.13.45.1 | 
9a8fd2b0-743c-4500-8978-9e5bf9b38347 | c3ea28cd-e76a-4e49-b538-cc05c0173b83 |
| 83b4683a-fb57-4844-9e1d-55b111fa0e19 | 10.13.45.2 | 
9a8fd2b0-743c-4500-8978-9e5bf9b38347 | c3ea28cd-e76a-4e49-b538-cc05c0173b83 |
| 7f0224dd-c49b-42a8-8c8a-bd3aa6c24223 | 10.13.45.3 | 
9a8fd2b0-743c-4500-8978-9e5bf9b38347 | c3ea28cd-e76a-4e49-b538-cc05c0173b83 |
| f53b335e-535b-42ce-be53-0d6cee48cf28 | 10.13.45.4 | 
9a8fd2b0-743c-4500-8978-9e5bf9b38347 | c3ea28cd-e76a-4e49-b538-cc05c0173b83 |
+--++--+--+
4 rows in set (0.00 sec)


MariaDB [neutron]> 

Apparently the IPAM tables are not consistent with the real IP
allocations; when this happens some IP addresses are not allocatable
even though 'openstack port list' cannot find them.

We are using MariaDB in production, and I haven't seen a problem like
this with MySQL.
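
A quick way to spot such orphans is to diff the two tables per subnet,
using exactly the columns shown above. A rough sketch (hypothetical
helper, any DB-API cursor assumed):

    # Rough sketch (hypothetical helper): IPs recorded in ipamallocations
    # that have no matching row in ipallocations for the same subnet.
    def orphan_ipam_allocations(cursor, neutron_subnet_id):
        cursor.execute(
            "SELECT a.ip_address FROM ipamallocations a "
            "JOIN ipamsubnets s ON a.ipam_subnet_id = s.id "
            "WHERE s.neutron_subnet_id = %s", (neutron_subnet_id,))
        ipam_ips = {row[0] for row in cursor.fetchall()}
        cursor.execute(
            "SELECT ip_address FROM ipallocations WHERE subnet_id = %s",
            (neutron_subnet_id,))
        real_ips = {row[0] for row in cursor.fetchall()}
        return ipam_ips - real_ips   # here: {'10.13.45.5', '10.13.45.6'}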

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1884532

Title:
  inconsistent data in ipamallocations

Status in neutron:
  New

Bug description:
  Sometimes the database becomes inconsistent for some reason,

  for example, as shown below

  MariaDB [neutron]> select * from ipamsubnets where 
neutron_subnet_id='9a8fd2b0-743c-4500-8978-9e5bf9b38347'
  -> ;
  
+--+--+
  | id   | neutron_subnet_id
|
  
+--+--+
  | 85e7171c-2648-4447-ada6-a37c3c113686 | 9a8fd2b0-743c-4500-8978-9e5bf9b38347 
|
  
+--+--+
  1 row in set (0.00 sec)

  MariaDB [neutron]> select * from ipamallocations where  ipam_subnet_id = 
'85e7171c-2648-4447-ada6-a37c3c113686' \G
  *** 1. row ***
  ip_address: 10.13.45.1
  status: ALLOCATED
  ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
  *** 2. row ***
  ip_address: 10.13.45.2
  status: ALLOCATED
  ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
  *** 3. row ***
  ip_address: 10.13.45.3
  status: ALLOCATED
  

[Yahoo-eng-team] [Bug 1881455] [NEW] migrate server reporting list index out of range

2020-05-30 Thread norman shen
Public bug reported:

Description


When resize to the same host is enabled, a cold migration sometimes
fails with

1. a "migrating to the same host" failure
2. followed by a "list index out of range" error


Steps to reproduce
===

Deploy two compute nodes and create a workload imbalance, for example
so that compute01 has more allocations than compute02. Then migrate a
server on compute02.

Expected result


cold migration succeeded

Actual result
==

Sometimes fails.

log
==

8084-4fa8-a3c4-2874555fb27c held by migration 
0a8a29a5-7f9c-4af3-85a1-ea62ee5658c3 for instance
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager 
[req-ee48014a-51c1-4e82-9ef3-e3b68a9a34e4 5f0b0ff35b914c84b24efb363965530d 
0606e9bf4e9c4334b6cb9a5012c60fb8 - default default] [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] Error: Unable to migrate instance (
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b) to current host (compute02).: 
UnableToMigrateToSelf: Unable to migrate instance 
(8189fa53-3e8a-42e3-a735-1d91b9ff0c3b) to current host (compute02).
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] Traceback (most recent call last):
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b]   File 
"/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 
4555, in prep_resize
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] node, migration, clean_shutdown)
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b]   File 
"/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 
4499, in _prep_resize
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] instance_id=instance.uuid, 
host=self.host)
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] UnableToMigrateToSelf: Unable to migrate 
instance (8189fa53-3e8a-42e3-a735-1d91b9ff0c3b) to current host (compute02).
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] 
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager 
[req-ee48014a-51c1-4e82-9ef3-e3b68a9a34e4 5f0b0ff35b914c84b24efb363965530d 
0606e9bf4e9c4334b6cb9a5012c60fb8 - default default] [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] Error: Unable to migrate instance (
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b) to current host (compute02).: 
UnableToMigrateToSelf: Unable to migrate instance 
(8189fa53-3e8a-42e3-a735-1d91b9ff0c3b) to current host (compute02).
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] Traceback (most recent call last):
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b]   File 
"/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 
4555, in prep_resize
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] node, migration, clean_shutdown)
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b]   File 
"/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 
4499, in _prep_resize
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] instance_id=instance.uuid, 
host=self.host)
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] UnableToMigrateToSelf: Unable to migrate 
instance (8189fa53-3e8a-42e3-a735-1d91b9ff0c3b) to current host (compute02).
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] 
2020-05-31 02:55:51.740 2419133 ERROR nova.compute.manager 
[req-ee48014a-51c1-4e82-9ef3-e3b68a9a34e4 5f0b0ff35b914c84b24efb363965530d 
0606e9bf4e9c4334b6cb9a5012c60fb8 - default default] [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] Setting instance vm_state to ERROR:
 IndexError: list index out of range
2020-05-31 02:55:51.740 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] Traceback (most recent call last):
2020-05-31 02:55:51.740 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b]   File 
"/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 
8333, in _error_out_instance_on_exception
2020-05-31 02:55:51.740 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] yield
2020-05-31 02:55:51.740 2419133 ERROR nova.compute.manager [instance: 
8189fa53-3e8a-42e3-a735-1d91b9ff0c3b]   File 
"/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 
4576, 

[Yahoo-eng-team] [Bug 1880455] [NEW] interrupted vlan connection after live migration

2020-05-24 Thread norman shen
Public bug reported:

After this patch,
https://github.com/openstack/neutron/commit/efa8dd08957b5b6b1a05f0ed412ff00462a9f216
I saw an unexpected VLAN traffic interruption after live migration.

The steps to reproduce the problem are simple:

First create two VMs, vm01 and vm02, on compute01 and compute02
respectively, then live migrate vm02 to compute01; after it completes,
live migrate vm02 back to compute02. After this, vm01 cannot reach vm02,
and 'ovs-appctl dpif/dump-flows br-int' shows the flows from vm01 to
vm02 being dropped.

I now suspect the following code is never executed

https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L685

because the port has already been removed by Nova before the delete-port
handling gets called.

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1880455

Title:
  interrupted vlan connection after live migration

Status in neutron:
  New

Bug description:
  After this patch,
  https://github.com/openstack/neutron/commit/efa8dd08957b5b6b1a05f0ed412ff00462a9f216
  I saw an unexpected VLAN traffic interruption after live migration.

  The steps to reproduce the problem are simple:

  First create two VMs, vm01 and vm02, on compute01 and compute02
  respectively, then live migrate vm02 to compute01; after it completes,
  live migrate vm02 back to compute02. After this, vm01 cannot reach
  vm02, and 'ovs-appctl dpif/dump-flows br-int' shows the flows from
  vm01 to vm02 being dropped.

  I now suspect the following code is never executed

  https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L685

  because the port has already been removed by Nova before the
  delete-port handling gets called.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1880455/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1870866] [NEW] inconsistent connection info data after live migration

2020-04-04 Thread norman shen
Public bug reported:

Description
===

After live migration, the block device mapping's connection info stays
at "attaching", which is a confusing piece of information. The root
cause seems to be the different code paths between live migration and
volume attach.

Steps to reproduce
==

attach a volume and then live migrate to different host.

Expected result


Consistent information: either there should be no info at all, or the
connection info should be preserved.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1870866

Title:
  inconsistent connection info data after live migration

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===

  After live migration, the block device mapping's connection info
  stays at "attaching", which is a confusing piece of information. The
  root cause seems to be the different code paths between live
  migration and volume attach.

  Steps to reproduce
  ==

  attach a volume and then live migrate to different host.

  Expected result
  

  Consistent information: either there should be no info at all, or the
  connection info should be preserved.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1870866/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1869808] [NEW] reboot neutron-ovs-agent introduces a short interrupt of vlan traffic

2020-03-30 Thread norman shen
Public bug reported:

We are using Openstack Neutron 13.0.6 and it is deployed using
OpenStack-helm.

I tested pinging between servers in the same VLAN while rebooting
neutron-ovs-agent. The result shows:

root@mgt01:~# openstack server list
+--+-++--+--+---+
| ID   | Name| Status | Networks
 | Image| Flavor|
+--+-++--+--+---+
| 22d55077-b1b5-452e-8eba-cbcd2d1514a8 | test-1-1| ACTIVE | 
vlan105=172.31.10.4  | Cirros 0.4.0 64-bit  | 
m1.tiny   |
| 726bc888-7767-44bc-b68a-7a1f3a6babf1 | test-1-2| ACTIVE | 
vlan105=172.31.10.18 | Cirros 0.4.0 64-bit  | 
m1.tiny   |

$ ping 172.31.10.4
PING 172.31.10.4 (172.31.10.4): 56 data bytes
..
64 bytes from 172.31.10.4: seq=59 ttl=64 time=0.465 ms
64 bytes from 172.31.10.4: seq=60 ttl=64 time=0.510 ms <
64 bytes from 172.31.10.4: seq=61 ttl=64 time=0.446 ms
64 bytes from 172.31.10.4: seq=63 ttl=64 time=0.744 ms
64 bytes from 172.31.10.4: seq=64 ttl=64 time=0.477 ms
64 bytes from 172.31.10.4: seq=65 ttl=64 time=0.441 ms
64 bytes from 172.31.10.4: seq=66 ttl=64 time=0.376 ms
64 bytes from 172.31.10.4: seq=67 ttl=64 time=0.481 ms

As one can see, packet seq 62 is lost, I believe while the OVS agent
was restarting.

Right now, I am suspecting
https://github.com/openstack/neutron/blob/6d619ea7c13e89ec575295f04c63ae316759c50a/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py#L229
this code is refreshing flow table rules even though it is not
necessary.

When I dump flows on the physical bridge, I can see the duration
rewinding to 0, which suggests the flow has been deleted and re-created:

"""   duration=secs
  The  time,  in  seconds,  that  the entry has been in the table.
  secs includes as much precision as the switch provides, possibly
  to nanosecond resolution.
"""

root@compute01:~# ovs-ofctl dump-flows br-floating
...
 cookie=0x673522f560f5ca4f, duration=323.852s, table=2, n_packets=1100, 
n_bytes=103409, 
^-- this value resets
priority=4,in_port="phy-br-floating",dl_vlan=2 actions=mod_vlan_vid:105,NORMAL
...

IMO, rebooting the OVS agent should not affect the data plane.
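
A crude way to confirm that the flows really are being re-created
(rather than left in place) is to poll the dump and report any flow
whose duration went backwards between samples. A small sketch
(illustration only):

    # Illustration only: watch "ovs-ofctl dump-flows" on the physical
    # bridge and report flows whose duration decreased, i.e. re-created.
    import re
    import subprocess
    import time

    def sample(bridge='br-floating'):
        out = subprocess.check_output(
            ['ovs-ofctl', 'dump-flows', bridge]).decode()
        flows = {}
        for line in out.splitlines():
            m = re.search(r'duration=([\d.]+)s.*?(priority=.*)$', line)
            if m:
                flows[m.group(2)] = float(m.group(1))
        return flows

    prev = sample()
    for _ in range(60):                    # watch for about five minutes
        time.sleep(5)
        cur = sample()
        for key, dur in cur.items():
            if key in prev and dur < prev[key]:
                print('flow re-created:', key)
        prev = cur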

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1869808

Title:
  reboot neutron-ovs-agent introduces a short interrupt of vlan traffic

Status in neutron:
  New

Bug description:
  We are using Openstack Neutron 13.0.6 and it is deployed using
  OpenStack-helm.

  I tested pinging between servers in the same VLAN while rebooting
  neutron-ovs-agent. The result shows:

  root@mgt01:~# openstack server list
  
+--+-++--+--+---+
  | ID   | Name| Status | Networks  
   | Image| Flavor|
  
+--+-++--+--+---+
  | 22d55077-b1b5-452e-8eba-cbcd2d1514a8 | test-1-1| ACTIVE | 
vlan105=172.31.10.4  | Cirros 0.4.0 64-bit  | 
m1.tiny   |
  | 726bc888-7767-44bc-b68a-7a1f3a6babf1 | test-1-2| ACTIVE | 
vlan105=172.31.10.18 | Cirros 0.4.0 64-bit  | 
m1.tiny   |

  $ ping 172.31.10.4
  PING 172.31.10.4 (172.31.10.4): 56 data bytes
  ..
  64 bytes from 172.31.10.4: seq=59 ttl=64 time=0.465 ms
  64 bytes from 172.31.10.4: seq=60 ttl=64 time=0.510 ms <
  64 bytes from 172.31.10.4: seq=61 ttl=64 time=0.446 ms
  64 bytes from 172.31.10.4: seq=63 ttl=64 time=0.744 ms
  64 bytes from 172.31.10.4: seq=64 ttl=64 time=0.477 ms
  64 bytes from 172.31.10.4: seq=65 ttl=64 time=0.441 ms
  64 bytes from 172.31.10.4: seq=66 ttl=64 time=0.376 ms
  64 bytes from 172.31.10.4: seq=67 ttl=64 time=0.481 ms

  As one can see, packet seq 62 is lost, I believe while the OVS agent
  was restarting.

  Right now, I am suspecting
  
https://github.com/openstack/neutron/blob/6d619ea7c13e89ec575295f04c63ae316759c50a/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py#L229
  this code is refreshing flow table rules even though it is not
  necessary.

  Because when I dump flows on phys bridge, I can see duration is
  rewinding to 0 which suggests flow has been deleted and created again

  """   

[Yahoo-eng-team] [Bug 1866288] [NEW] tox pep8 fails on ubuntu 18.04.3

2020-03-05 Thread norman shen
Public bug reported:

pep8 checking fails for rocky branch on ubuntu 18.04.3

root@mgt02:~/src/nova# tox -epep8 -vvv
  removing /root/src/nova/.tox/log
using tox.ini: /root/src/nova/tox.ini
using tox-3.1.0 from /usr/local/lib/python2.7/dist-packages/tox/__init__.pyc
skipping sdist step
pep8 start: getenv /root/src/nova/.tox/shared
pep8 recreate: /root/src/nova/.tox/shared
ERROR: InterpreterNotFound: python3.5
pep8 finish: getenv after 0.00 seconds
__
 summary 
___
ERROR:  pep8: InterpreterNotFound: python3.5


root@mgt02:~/src/nova# uname -a
Linux mgt02 4.15.0-88-generic #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020 
x86_64 x86_64 x86_64 GNU/Linux

** Affects: nova
 Importance: Undecided
 Status: New

** Description changed:

  pep8 checking fails for rocky branch on ubuntu 18.04.3
  
- 
  root@mgt02:~/src/nova# tox -epep8 -vvv
-   removing /root/src/nova/.tox/log
+   removing /root/src/nova/.tox/log
  using tox.ini: /root/src/nova/tox.ini
  using tox-3.1.0 from /usr/local/lib/python2.7/dist-packages/tox/__init__.pyc
  skipping sdist step
  pep8 start: getenv /root/src/nova/.tox/shared
  pep8 recreate: /root/src/nova/.tox/shared
  ERROR: InterpreterNotFound: python3.5
  pep8 finish: getenv after 0.00 seconds
  
__
 summary 
___
  ERROR:  pep8: InterpreterNotFound: python3.5
+ 
+ 
+ root@mgt02:~/src/nova# uname -a
+ Linux mgt02 4.15.0-88-generic #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020 
x86_64 x86_64 x86_64 GNU/Linux

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1866288

Title:
  tox pep8 fails on ubuntu 18.04.3

Status in OpenStack Compute (nova):
  New

Bug description:
  pep8 checking fails for rocky branch on ubuntu 18.04.3

  root@mgt02:~/src/nova# tox -epep8 -vvv
    removing /root/src/nova/.tox/log
  using tox.ini: /root/src/nova/tox.ini
  using tox-3.1.0 from /usr/local/lib/python2.7/dist-packages/tox/__init__.pyc
  skipping sdist step
  pep8 start: getenv /root/src/nova/.tox/shared
  pep8 recreate: /root/src/nova/.tox/shared
  ERROR: InterpreterNotFound: python3.5
  pep8 finish: getenv after 0.00 seconds
  
__
 summary 
___
  ERROR:  pep8: InterpreterNotFound: python3.5

  
  root@mgt02:~/src/nova# uname -a
  Linux mgt02 4.15.0-88-generic #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020 
x86_64 x86_64 x86_64 GNU/Linux

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1866288/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1865120] [NEW] arm64 vm boot failed when set num_pcie_ports to 28

2020-02-28 Thread norman shen
Public bug reported:

We are testing OpenStack on Phytium,FT2000PLUS

root@compute01:~# lscpu 
Architecture:  aarch64
Byte Order:Little Endian
CPU(s):64
On-line CPU(s) list:   0-63
Thread(s) per core:1
Core(s) per socket:4
Socket(s): 16
NUMA node(s):  8
Model name:Phytium,FT2000PLUS
CPU max MHz:   2200.
CPU min MHz:   1000.
BogoMIPS:  3600.00
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
NUMA node4 CPU(s): 32-39
NUMA node5 CPU(s): 40-47
NUMA node6 CPU(s): 48-55
NUMA node7 CPU(s): 56-63
Flags: fp asimd evtstrm crc32

The problem we initially met is that we are not able to attach more
than 2 volumes (virtio-blk) if the config drive is enabled. We somehow
worked around the problem by using a SCSI bus instead.

But we are still interested in making it possible to plug more than 2
virtio-blk devices, and after some investigation I think
`num_pcie_ports` might be too small (it looks like it defaults to 9 if
unspecified); `pcie-root` does not allow hot plugging, and a
`pcie-root-port` does not allow more than 1 slot, so the only way I can
think of to mitigate the problem is to increase this option to its
maximum.

But the current problem is that VMs with previously working images fail
to boot, and when I try 'virsh console', I only see the UEFI shell.

Maybe this is not a bug in the code, but I definitely think it is
necessary to improve the docs and make these terms easier to
understand. I am glad to provide additional details if asked. Thanks.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1865120

Title:
  arm64 vm boot failed when set num_pcie_ports to 28

Status in OpenStack Compute (nova):
  New

Bug description:
  We are testing OpenStack on Phytium,FT2000PLUS

  root@compute01:~# lscpu 
  Architecture:  aarch64
  Byte Order:Little Endian
  CPU(s):64
  On-line CPU(s) list:   0-63
  Thread(s) per core:1
  Core(s) per socket:4
  Socket(s): 16
  NUMA node(s):  8
  Model name:Phytium,FT2000PLUS
  CPU max MHz:   2200.
  CPU min MHz:   1000.
  BogoMIPS:  3600.00
  NUMA node0 CPU(s): 0-7
  NUMA node1 CPU(s): 8-15
  NUMA node2 CPU(s): 16-23
  NUMA node3 CPU(s): 24-31
  NUMA node4 CPU(s): 32-39
  NUMA node5 CPU(s): 40-47
  NUMA node6 CPU(s): 48-55
  NUMA node7 CPU(s): 56-63
  Flags: fp asimd evtstrm crc32

  The problem we initially met is that we are not able to attach more
  than 2 volumes (virtio-blk) if the config drive is enabled. We
  somehow worked around the problem by using a SCSI bus instead.

  But we are still interested in making it possible to plug more than 2
  virtio-blk devices, and after some investigation I think
  `num_pcie_ports` might be too small (it looks like it defaults to 9
  if unspecified); `pcie-root` does not allow hot plugging, and a
  `pcie-root-port` does not allow more than 1 slot, so the only way I
  can think of to mitigate the problem is to increase this option to
  its maximum.

  But the current problem is that VMs with previously working images
  fail to boot, and when I try 'virsh console', I only see the UEFI
  shell.

  Maybe this is not a bug in the code, but I definitely think it is
  necessary to improve the docs and make these terms easier to
  understand. I am glad to provide additional details if asked. Thanks.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1865120/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1856962] [NEW] openid method failed when federation_group_ids is empty list

2019-12-19 Thread norman shen
s/keystone/auth/plugins/mapped.py",
 line 80, in handle_scoped_token
2019-12-17 02:25:09.345722 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi for group_dict in token.federated_groups:
2019-12-17 02:25:09.345726 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi TypeError: 'NoneType' object is not iterable
2019-12-17 02:25:09.345730 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi 
10.16.4.45 - - [17/Dec/2019:02:25:09 +] "POST /v3/auth/tokens HTTP/1.1" 400 
96 "-" "curl/7.58.0"

OpenStack Version:

Rocky

We are hitting this error message when using keystone federation. The
mapping is simple, as follows:

[
  {
    "remote": [
      {
        "type": "REMOTE_USER"
      },
      {
        "type": "OIDC-project"
      }
    ],
    "local": [
      {
        "user": {
          "name": "{0}"
        }
      },
      {
        "projects": [
          {
            "name": "{1}",
            "roles": [
              {
                "name": "member"
              }
            ]
          }
        ]
      }
    ]
  }
]
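
The failure itself is just that token.federated_groups comes back as
None instead of an empty list when no groups are mapped. A minimal
sketch of the kind of guard that avoids it (illustration only, not
necessarily the final fix):

    # Illustration only: handle_scoped_token iterates token.federated_groups,
    # which is None for a federated token that carries no mapped groups.
    def federated_group_ids(token):
        groups = token.federated_groups or []   # treat None as "no groups"
        return [group_dict['id'] for group_dict in groups]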

** Affects: keystone
 Importance: Undecided
 Assignee: norman shen (jshen28)
 Status: In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Identity (keystone).
https://bugs.launchpad.net/bugs/1856962

Title:
  openid method failed when federation_group_ids  is empty list

Status in OpenStack Identity (keystone):
  In Progress

Bug description:
  LOG:
  2019-12-17 02:25:09.269827 2019-12-17 02:25:09.269 10 INFO 
keystone.common.wsgi [req-521eb002-385e-4015-8035-16bfbdcf0d33 - - - - -] POST 
http://keystone.openstack.svc.region-guiyang-zyy.myinspurcloud.com/v3/auth/tokens
  2019-12-17 02:25:09.270180 2019-12-17 02:25:09.269 10 INFO 
keystone.common.wsgi [req-521eb002-385e-4015-8035-16bfbdcf0d33 - - - - -] POST 
http://keystone.openstack.svc.region-guiyang-zyy.myinspurcloud.com/v3/auth/tokens
  2019-12-17 02:25:09.298401 2019-12-17 02:25:09.297 10 WARNING 
keystone.common.fernet_utils [req-521eb002-385e-4015-8035-16bfbdcf0d33 - - - - 
-] key_repository is world readable: /etc/keystone/fernet-keys/: 
NeedRegenerationException
  2019-12-17 02:25:09.298764 2019-12-17 02:25:09.297 10 WARNING 
keystone.common.fernet_utils [req-521eb002-385e-4015-8035-16bfbdcf0d33 - - - - 
-] key_repository is world readable: /etc/keystone/fernet-keys/: 
NeedRegenerationException
  2019-12-17 02:25:09.344893 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi [req-521eb002-385e-4015-8035-16bfbdcf0d33 - - - - -] 
'NoneType' object is not iterable: TypeError: 'NoneType' object is not iterable
  2019-12-17 02:25:09.344916 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi Traceback (most recent call last):
  2019-12-17 02:25:09.344921 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/keystone/common/wsgi.py", 
line 148, in __call__
  2019-12-17 02:25:09.344925 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi result = method(req, **params)
  2019-12-17 02:25:09.344929 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/keystone/auth/controllers.py",
 line 67, in authenticate_for_token
  2019-12-17 02:25:09.344934 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi self.authenticate(request, auth_info, auth_context)
  2019-12-17 02:25:09.344938 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/keystone/auth/controllers.py",
 line 236, in authenticate
  2019-12-17 02:25:09.344942 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi auth_info.get_method_data(method_name))
  2019-12-17 02:25:09.344945 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/keystone/auth/plugins/mapped.py",
 line 58, in authenticate
  2019-12-17 02:25:09.344949 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi PROVIDERS.identity_api)
  2019-12-17 02:25:09.344953 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/keystone/auth/plugins/mapped.py",
 line 80, in handle_scoped_token
  2019-12-17 02:25:09.344957 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi for group_dict in token.federated_groups:
  2019-12-17 02:25:09.344961 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi TypeError: 'NoneType' object is not iterable
  2019-12-17 02:25:09.344965 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi 
  20

[Yahoo-eng-team] [Bug 1856312] [NEW] RuntimeError during calling log_opts_values

2019-12-13 Thread norman shen
Public bug reported:

During starting up nova-compute service, we are hit by the following
error message

+ sed -i s/HOST_IP// /tmp/logging-nova-compute.conf
+ exec nova-compute --config-file /etc/nova/nova.conf --config-file 
/tmp/pod-shared/nova-console.conf --config-file 
/tmp/pod-shared/nova-libvirt.conf --config-file 
/tmp/pod-shared/nova-hypervisor.conf --log-config-append 
/tmp/logging-nova-compute.conf
2019-12-13 06:53:09.556 29036 WARNING oslo_config.cfg [-] Deprecated: Option 
"use_neutron" from group "DEFAULT" is deprecated for removal (
nova-network is deprecated, as are any related configuration options.
).  Its value may be silently ignored in the future.
2019-12-13 06:53:12.000 29036 INFO nova.compute.rpcapi 
[req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Automatically selected 
compute RPC version 5.0 from minimum service version 35
2019-12-13 06:53:12.000 29036 INFO nova.compute.rpcapi 
[req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Automatically selected 
compute RPC version 5.0 from minimum service version 35
2019-12-13 06:53:12.029 29036 INFO nova.virt.driver 
[req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Loading compute driver 
'libvirt.LibvirtDriver'
2019-12-13 06:53:12.029 29036 INFO nova.virt.driver 
[req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Loading compute driver 
'libvirt.LibvirtDriver'
2019-12-13 06:53:22.064 29036 WARNING oslo_config.cfg 
[req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Deprecated: Option 
"firewall_driver" from group "DEFAULT" is deprecated for removal (
nova-network is deprecated, as are any related configuration options.
).  Its value may be silently ignored in the future.
2019-12-13 06:53:22.192 29036 WARNING os_brick.initiator.connectors.remotefs 
[req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Connection details not 
present. RemoteFsClient may not initialize properly.
2019-12-13 06:53:22.409 29036 WARNING oslo_config.cfg 
[req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Deprecated: Option 
"linuxnet_interface_driver" from group "DEFAULT" is deprecated for removal (
nova-network is deprecated, as are any related configuration options.
).  Its value may be silently ignored in the future.
2019-12-13 06:53:22.414 29036 WARNING oslo_config.cfg 
[req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Deprecated: Option 
"metadata_port" from group "DEFAULT" is deprecated for removal (
nova-network is deprecated, as are any related configuration options.
).  Its value may be silently ignored in the future.
2019-12-13 06:53:22.440 29036 INFO nova.service [-] Starting compute node 
(version 18.0.0)
2019-12-13 06:53:22.570 29036 WARNING oslo_config.cfg 
[req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Deprecated: Option 
"api_endpoint" from group "ironic" is deprecated for removal (Endpoint lookup 
uses the service catalog via common keystoneauth1 Adapter configuration 
options. In the current release, api_endpoint will override this behavior, but 
will be ignored and/or removed in a future release. To achieve the same result, 
use the endpoint_override option instead.).  Its value may be silently ignored 
in the future.
2019-12-13 06:53:22.440 29036 INFO nova.service [-] Starting compute node 
(version 18.0.0)
2019-12-13 06:53:22.594 29036 WARNING oslo_config.cfg 
[req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Deprecated: Option 
"api_endpoint" from group "ironic" is deprecated. Use option 
"endpoint-override" from group "ironic".
2019-12-13 06:53:22.911 29036 CRITICAL nova 
[req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Unhandled error: 
RuntimeError: dictionary changed size during iteration
2019-12-13 06:53:22.911 29036 ERROR nova Traceback (most recent call last):
2019-12-13 06:53:22.911 29036 ERROR nova   File 
"/var/lib/openstack/bin/nova-compute", line 8, in 
2019-12-13 06:53:22.911 29036 ERROR nova sys.exit(main())
2019-12-13 06:53:22.911 29036 ERROR nova   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/nova/inspur/cmd/compute.py",
 line 71, in main
2019-12-13 06:53:22.911 29036 ERROR nova service.wait()
2019-12-13 06:53:22.911 29036 ERROR nova   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/nova/service.py", line 
460, in wait
2019-12-13 06:53:22.911 29036 ERROR nova _launcher.wait()
2019-12-13 06:53:22.911 29036 ERROR nova   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/oslo_service/service.py", 
line 392, in wait
2019-12-13 06:53:22.911 29036 ERROR nova status, signo = 
self._wait_for_exit_or_signal()
2019-12-13 06:53:22.911 29036 ERROR nova   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/oslo_service/service.py", 
line 367, in _wait_for_exit_or_signal
2019-12-13 06:53:22.911 29036 ERROR nova self.conf.log_opt_values(LOG, 
logging.DEBUG)
2019-12-13 06:53:22.911 29036 ERROR nova   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/oslo_config/cfg.py", line 
2579, in log_opt_values
2019-12-13 06:53:22.911 29036 ERROR 
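
For reference, the error class itself is easy to reproduce and to
avoid; the snippet below only illustrates the Python behaviour, it says
nothing about where oslo.config mutates its dictionary.

    # Illustration only: mutating a dict while iterating over it raises
    # this RuntimeError; iterating over a snapshot (list(...)) avoids it.
    opts = {'a': 1, 'b': 2}

    try:
        for name in opts:
            opts[name + '_logged'] = True   # mutates the dict mid-iteration
    except RuntimeError as exc:
        print(exc)                  # dictionary changed size during iteration

    for name in list(opts):         # snapshot of the keys, safe to mutate
        opts[name + '_logged'] = True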

[Yahoo-eng-team] [Bug 1840579] [NEW] excessive number of dvrs where vm got a fixed ip on floating network

2019-08-18 Thread norman shen
Public bug reported:

We are running into an unexpected situation where the number of DVR
routers increases to nearly 2000 on a compute node on which some
instances got a NIC on the floating IP network.

We are using Queens release,

neutron-common/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed,automatic]
neutron-l3-agent/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed]
neutron-metadata-agent/xenial,now 2:12.0.5-5~u16.04+mcp155 all 
[installed,automatic]
neutron-openvswitch-agent/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed]
python-neutron/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed,automatic]
python-neutron-fwaas/xenial,xenial,now 2:12.0.1-1.0~u16.04+mcp6 all 
[installed,automatic]
python-neutron-lib/xenial,xenial,now 1.13.0-1.0~u16.04+mcp9 all 
[installed,automatic]
python-neutronclient/xenial,xenial,now 1:6.7.0-1.0~u16.04+mcp17 all 
[installed,automatic]

Currently, my guess is that some application mistakenly invokes RPC
calls like this
https://github.com/openstack/neutron/blob/490471ebd3ac56d0cee164b9c1c1211687e49437/neutron/api/rpc/agentnotifiers/l3_rpc_agent_api.py#L166
for a DVR router associated with a floating IP address, on a host which
has a fixed IP address allocated from the floating network (i.e. a port
whose device_owner starts with compute:). Such a router will then be
kept by this
https://github.com/openstack/neutron/blob/490471ebd3ac56d0cee164b9c1c1211687e49437/neutron/db/l3_dvrscheduler_db.py#L427
function, because `get_subnet_ids_on_router` does not filter out
router:gateway ports.

I think this is a bug because as long as we do not have ports with
specific device owners on a host, we should not have a DVR router on it.


Besides, it is pretty easy to replay this bug:

First create a DVR router with an external gateway on the floating network.
Then create a virtual machine with a fixed IP on the floating network.
Then call `routers_updated_on_host` manually; the DVR router will then be
created on the host where the VM resides, but it actually should not be there.
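
A sketch of the kind of filtering I mean (hypothetical helper, not the
actual neutron code): a host should only keep the DVR router if a port
with a "real" device owner, e.g. the compute: prefix, sits on one of
the router's subnets; a router:gateway port alone should not count.

    # Hypothetical sketch, not the actual neutron code: decide whether a
    # host still needs a DVR router from the device_owner of its ports.
    ROUTER_GATEWAY = 'network:router_gateway'

    def host_needs_dvr(ports_on_router_subnets):
        for port in ports_on_router_subnets:
            owner = port['device_owner']
            if owner == ROUTER_GATEWAY:
                continue               # gateway port alone does not count
            if owner.startswith('compute:'):
                return True            # a workload is wired to the subnet
        return False

    print(host_needs_dvr([{'device_owner': 'network:router_gateway'}]))
    # -> False: no DVR namespace needed on this host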

** Affects: neutron
 Importance: Undecided
 Assignee: norman shen (jshen28)
 Status: In Progress

** Description changed:

  we are running into an unexpected situation where number of dvr routers
  is increasing to nearly 2000 on a compute node on which some instances
  got a nic on floating ip network.
  
  We are using Queens release,
  
  neutron-common/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed,automatic]
  neutron-l3-agent/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed]
  neutron-metadata-agent/xenial,now 2:12.0.5-5~u16.04+mcp155 all 
[installed,automatic]
  neutron-openvswitch-agent/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed]
  python-neutron/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed,automatic]
  python-neutron-fwaas/xenial,xenial,now 2:12.0.1-1.0~u16.04+mcp6 all 
[installed,automatic]
  python-neutron-lib/xenial,xenial,now 1.13.0-1.0~u16.04+mcp9 all 
[installed,automatic]
  python-neutronclient/xenial,xenial,now 1:6.7.0-1.0~u16.04+mcp17 all 
[installed,automatic]
  
  Currently, my guess is that some applications mistakenly invokes rpc
  calls like this
  
https://github.com/openstack/neutron/blob/490471ebd3ac56d0cee164b9c1c1211687e49437/neutron/api/rpc/agentnotifiers/l3_rpc_agent_api.py#L166
  with dvr associated with a floating ip address on a host which has fixed
  ip address allocated from floating network (aka device_owner prefix with
  compute:). Then such router will be kept by this
  
https://github.com/openstack/neutron/blob/490471ebd3ac56d0cee164b9c1c1211687e49437/neutron/db/l3_dvrscheduler_db.py#L427
  function, because `get_subnet_ids_on_router` does not filter out
  router:gateway ports.
  
  I think this is a bug because as long as we do not have ports with
  specific device owners we should not have a dvr router on it.
+ 
+ 
+ besides it is pretty easy to replay this bug.
+ 
+ First create a dvr router with an external gateway on floating network
+ Then create on virtual machine with fixed ip on floating network
+ Then call `routers_updated_on_host` manually, then this dvr will be created 
on the host where vm resides on, but actually it should be there.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1840579

Title:
  excessive number of dvrs where vm got a fixed ip on floating network

Status in neutron:
  In Progress

Bug description:
  We are running into an unexpected situation where the number of DVR
  routers increases to nearly 2000 on a compute node on which some
  instances got a NIC on the floating IP network.

  We are using Queens release,

  neutron-common/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed,automatic]
  neutron-l3-agent/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed]
  neutron-metadata-agent/xenial,now 2:12.0.5-5~u16.04+mcp155 all 
[installed,automatic]
  neutron-openvswitch-agent/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed]
  python-neutron/xenial,now 2:12.0.5

[Yahoo-eng-team] [Bug 1836680] [NEW] attach volume succeeded but device not found on guest machine

2019-07-15 Thread norman shen
Public bug reported:

Sorry, I posted this bug in the wrong place.

** Affects: neutron
 Importance: Undecided
 Status: Invalid

** Changed in: neutron
   Status: New => Invalid

** Description changed:

- we are using OpenStack Queens: 
- nova-common/xenial,now 2:17.0.9-6~u16.01+mcp189 all [installed]
- nova-compute/xenial,now 2:17.0.9-6~u16.01+mcp189 all [installed,automatic]
- nova-compute-kvm/xenial,now 2:17.0.9-6~u16.01+mcp189 all [installed]
- 
- guest vm uses windows 2012 datacenter edition
- 
- after successfully executing openstack server add volume ${instance_id}
- ${volume_id}, we observe volume status has changed to in-used and
- attachments info are correctly stored in both nova and neutron. But
- device does not show up in guest machine.
- 
- we execute `virsh dumpxml ${instance_id}` but device is not there. We
- then try to edit directly by executing `virsh edit ${instance_id}` and
- we see the device with proper attachments info...
- 
- At last we have to shutdown the vm and boot again to solve the problem.
- 
- 
- command line outputs are put below,
- 
- /var/lib/libvirt/qemu# virsh dumpxml 55 --inactive
- 
- 
-   
-   
-   
- 
- .
- 
- # virsh domblklist 55
- Target Source
- 
- vdavms/xxx
- vdbvms/
- 
- manually attach vdc reports `vdc` in-used
+ sorry post bug at wrong place.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1836680

Title:
  attach volume succeeded but device not found on guest machine

Status in neutron:
  Invalid

Bug description:
  Sorry, I posted this bug in the wrong place.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1836680/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1836681] [NEW] attach volume succeeded but device not found on guest machine

2019-07-15 Thread norman shen
Public bug reported:

we are using OpenStack Queens: 
nova-common/xenial,now 2:17.0.9-6~u16.01+mcp189 all [installed]
nova-compute/xenial,now 2:17.0.9-6~u16.01+mcp189 all [installed,automatic]
nova-compute-kvm/xenial,now 2:17.0.9-6~u16.01+mcp189 all [installed]

guest vm uses windows 2012 datacenter edition

After successfully executing 'openstack server add volume ${instance_id}
${volume_id}', we observe that the volume status has changed to in-use
and the attachment info is correctly stored in both nova and cinder. But
the device does not show up in the guest machine.

We execute `virsh dumpxml ${instance_id}` but the device is not there.
We then try to edit directly by executing `virsh edit ${instance_id}`
and we do see the device with the proper attachment info...

In the end we have to shut down the VM and boot it again to resolve the
problem.


command line outputs are put below,

/var/lib/libvirt/qemu# virsh dumpxml 55 --inactive


  
  
  

.

# virsh domblklist 55
Target Source

vdavms/xxx
vdbvms/

Manually attaching vdc reports that `vdc` is in use.
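
To confirm the discrepancy between the live and the persistent domain
definition programmatically, a small sketch using libvirt-python
(illustration only; the domain name is hypothetical):

    # Illustration only: compare the live domain XML with the persistent
    # (inactive) definition and print disk targets present in only one.
    import re
    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-00000055')     # hypothetical name
    live = dom.XMLDesc(0)
    persistent = dom.XMLDesc(libvirt.VIR_DOMAIN_XML_INACTIVE)

    def disk_targets(xml):
        return set(re.findall(r"<target dev='(vd[a-z])'", xml))

    print(disk_targets(live) ^ disk_targets(persistent))   # e.g. {'vdc'}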

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1836681

Title:
  attach volume succeeded but device not found on guest machine

Status in OpenStack Compute (nova):
  New

Bug description:
  we are using OpenStack Queens: 
  nova-common/xenial,now 2:17.0.9-6~u16.01+mcp189 all [installed]
  nova-compute/xenial,now 2:17.0.9-6~u16.01+mcp189 all [installed,automatic]
  nova-compute-kvm/xenial,now 2:17.0.9-6~u16.01+mcp189 all [installed]

  guest vm uses windows 2012 datacenter edition

  After successfully executing 'openstack server add volume
  ${instance_id} ${volume_id}', we observe that the volume status has
  changed to in-use and the attachment info is correctly stored in both
  nova and cinder. But the device does not show up in the guest machine.

  We execute `virsh dumpxml ${instance_id}` but the device is not
  there. We then try to edit directly by executing `virsh edit
  ${instance_id}` and we do see the device with the proper attachment
  info...

  In the end we have to shut down the VM and boot it again to resolve
  the problem.

  
  command line outputs are put below,

  /var/lib/libvirt/qemu# virsh dumpxml 55 --inactive
  
  



  
  .

  # virsh domblklist 55
  Target Source
  
  vdavms/xxx
  vdbvms/

  Manually attaching vdc reports that `vdc` is in use.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1836681/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1830456] [NEW] dvr router slow response during port update

2019-05-24 Thread norman shen
Public bug reported:

We have a distributed router which is used by hundreds of virtual
machines scattered across around 150 compute nodes. When Nova sends a
port update request to Neutron, it generally takes nearly 4 minutes to
complete.

Neutron version is openstack Queens 12.0.5.

I found the following log entries printed by neutron-server,

2019-05-25 05:24:16,285.285 11834 INFO neutron.wsgi [req- x -
default default] x.x.x.x "PUT
/v2.0/ports/8c252d91-741a-4627-9600-916d1da5178f HTTP/1.1" status: 200
len: 0 time: 233.6103470

You can see it takes around 240 seconds to finish the request.

Right now I am suspecting this code snippet
https://github.com/openstack/neutron/blob/de59a21754747335d0d9d26082c7f0df105a30c9/neutron/db/l3_dvrscheduler_db.py#L139
leads to the issue.
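
To quantify how widespread this is, the API log can be scanned for slow
port updates; a small sketch (illustration only, log path hypothetical,
line format as shown above):

    # Illustration only: list PUT /v2.0/ports requests slower than 60s
    # from a neutron-server log whose lines end with "... time: <seconds>".
    import re

    slow = []
    with open('neutron-server.log') as f:            # hypothetical path
        for line in f:
            m = re.search(r'"PUT /v2\.0/ports/([0-9a-f-]+) .*time: ([\d.]+)',
                          line)
            if m and float(m.group(2)) > 60:
                slow.append((m.group(1), float(m.group(2))))
    print(slow)   # e.g. [('8c252d91-741a-4627-9600-916d1da5178f', 233.610347)]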

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1830456

Title:
  dvr router slow response during port update

Status in neutron:
  New

Bug description:
  We have a distributed router which is used by hundreds of virtual
  machines scattered across around 150 compute nodes. When Nova sends a
  port update request to Neutron, it generally takes nearly 4 minutes
  to complete.

  Neutron version is openstack Queens 12.0.5.

  I found the following log entries printed by neutron-server,

  2019-05-25 05:24:16,285.285 11834 INFO neutron.wsgi [req- x -
  default default] x.x.x.x "PUT
  /v2.0/ports/8c252d91-741a-4627-9600-916d1da5178f HTTP/1.1" status: 200
  len: 0 time: 233.6103470

  You can see it takes around 240 seconds to finish the request.

  Right now I am suspecting this code snippet
  
https://github.com/openstack/neutron/blob/de59a21754747335d0d9d26082c7f0df105a30c9/neutron/db/l3_dvrscheduler_db.py#L139
  leads to the issue.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1830456/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp