[Yahoo-eng-team] [Bug 2007635] Re: ask for large-scale deployment help

2023-02-18 Thread Belmiro Moreira
This is not a bug. I'm closing it.

You can find more information about large deployments in the Large Scale
SIG.
https://docs.openstack.org/large-scale/journey/index.html

** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2007635

Title:
  ask for large-scale deployment help

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Hello all, this is not a bug, I just want some help.
  I am new to OpenStack and I am looking for information about official
large-scale OpenStack deployments and stress-test indicator data. For example,
how many compute nodes and VMs can a single OpenStack region (without Nova
cells and with only three controller nodes) have at most, and how many VMs can
be created/stopped/started/migrated at the same time?

  Is there any official data about this, or anywhere I can find this
  information? Thank you for your help!

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2007635/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1951617] [NEW] "Quota exceeded" message is confusing for "resize"

2021-11-19 Thread Belmiro Moreira
Public bug reported:

"Quota exceeded" message is confusing for "resize"


When trying to create an instance and there is no quota available, the user
gets an error message. For example:
"Quota exceeded for cores: Requested 1, but already used 100 of 100 cores (HTTP 
403)"

The user can see that the project is already using 100 vCPUs out of 100
vCPUs available (vCPU quota) in the project.

However, if they try to resize an instance, they can get a similar error message:
"Quota exceeded for cores: Requested 2, but already used 42 of 100 cores (HTTP
403)"

But this has a completely different meaning!
It means that the user (the owner of the instance being resized) is using 42
vCPUs in the project, out of the 100 cores allowed by the quota.

This is hard to understand for an end user.
Read naively, the message suggests that the project still has plenty of
resources for the resize.

I believe this comes from the time when Nova allowed per-user quotas.
In my opinion this distinction shouldn't be made anymore; as mentioned, it is
not made when creating a new instance.

+++

This was tested with the master branch (19/11/2021)
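
For illustration, a minimal sketch (not Nova's actual code) of how the message
could be built from project-level usage for both "create" and "resize", so the
numbers always mean the same thing to the end user:

```
def quota_exceeded_message(resource, requested, project_used, limit):
    """Format a quota error that always reports *project* usage."""
    return ("Quota exceeded for %(res)s: Requested %(req)d, but the project "
            "already uses %(used)d of %(limit)d %(res)s" %
            {'res': resource, 'req': requested,
             'used': project_used, 'limit': limit})

# The resize case from the report: assuming (for illustration) the project is
# in fact at its limit, reporting project usage (100/100) rather than per-user
# usage (42/100) makes the rejection understandable.
print(quota_exceeded_message('cores', 2, 100, 100))
```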

** Affects: nova
     Importance: Undecided
 Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
 Status: New

** Changed in: nova
 Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1951617

Title:
  "Quota exceeded" message is confusing for "resize"

Status in OpenStack Compute (nova):
  New

Bug description:
  "Quota exceeded" message is confusing for "resize"

  
  When trying to create an instance and there is no quota available, the user
gets an error message. For example:
  "Quota exceeded for cores: Requested 1, but already used 100 of 100 cores 
(HTTP 403)"

  The user can see that the project is already using 100 vCPUs out of
  100 vCPUs available (vCPU quota) in the project.

  However, if they try to resize an instance, they can get a similar error message:
  "Quota exceeded for cores: Requested 2, but already used 42 of 100 cores
(HTTP 403)"

  But this has a completely different meaning!
  It means that the user (the owner of the instance being resized) is using
42 vCPUs in the project, out of the 100 cores allowed by the quota.

  This is hard to understand for an end user.
  Read naively, the message suggests that the project still has plenty of
resources for the resize.

  I believe this comes from the time when Nova allowed per-user quotas.
  In my opinion this distinction shouldn't be made anymore; as mentioned, it is
not made when creating a new instance.

  +++

  This was tested with the master branch (19/11/2021)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1951617/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1947753] [NEW] Evacuated instances are not removed from the source

2021-10-19 Thread Belmiro Moreira
Public bug reported:

Instance "evacuation" is a great feature and we are trying to take advantage of 
it.
But, it has some limitations, depending how "broken" is the node.

Let me give some context...

In the scenario where the compute node loses connectivity (broken switch
port, loose network cable, ...) or nova-compute is stuck (filesystem
issue), evacuating instances can have some unexpected consequences and
lead to data corruption in the application (for example in a DB
application).

If a compute node (or an entire set of compute nodes) loses connectivity,
nova-compute and the instances are "not available".
If the node runs critical applications (let's suppose a MySQL DB), the cloud
operator could be tempted to "evacuate" the instance to recover the critical
application for the user. At this point the cloud operator may not yet know
the cause of the compute node issue, it may not be possible to shut the node
down (management network affected?, ...), or the operator may simply not want
to interfere with the work of the repair team.

The repair team fixes the issue (it can take a few minutes or hours...)
and nova-compute and the instances become available again.

The problem is that nova-compute doesn't destroy the evacuated instances
in the source.

```
2021-10-19 11:17:51.519 3050 WARNING nova.compute.resource_tracker 
[req-0ed10e35-2715-466a-918b-69eb1fc770e8 - - - - -] Instance 
fc3be091-56d3-4c69-8adb-2fdb8b0a35d2 has been moved to another host 
foo.cern.ch(foo.cern.ch). There are allocations remaining against the source 
host that might need to be removed: {u'resources': {u'VCPU': 1, u'MEMORY_MB': 
1875}}.
```

At this point we have 2 instances sharing the same IP and possibly
writing into the same volume.

Only when nova-compute is restarted (I guess the assumption was always
that the compute node was really broken) are the evacuated instances on
the affected node removed.

```
2021-10-19 15:39:49.257 21189 INFO nova.compute.manager 
[req-ded45b0c-20ab-4587-9533-8c613d977f79 - - - - -] Destroying instance as it 
has been evacuated from this host but still exists in the hypervisor
2021-10-19 15:39:52.949 21189 INFO nova.virt.libvirt.driver [ ] Instance 
destroyed successfully.
```

I would expect nova-compute to periodically check for evacuated instances and
then remove them.
Otherwise, this requires a lot of coordination between different support teams.

Should this be moved to a periodic task?
https://github.com/openstack/nova/blob/e14eef0719eceef35e7e96b3e3d242ec79a80969/nova/compute/manager.py#L1440


I'm running Stein, but looking into the code, we have the same behaviour in 
master.
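
A minimal sketch (not a tested patch) of the periodic-task idea mentioned
above: run the same cleanup that init_host() performs, but periodically, so
evacuated instances left on a recovered source node are destroyed without
restarting nova-compute. It assumes nova's ComputeManager and its existing
_destroy_evacuated_instances() helper; the 600-second spacing is arbitrary.

```
from oslo_service import periodic_task


class ComputeManager(object):  # illustrative stub; nova's real class is larger

    @periodic_task.periodic_task(spacing=600)
    def _cleanup_evacuated_instances(self, context):
        """Destroy local guests whose instances were evacuated elsewhere."""
        # Today this logic only runs from init_host(), i.e. on a service
        # restart; see the manager.py link above.
        self._destroy_evacuated_instances(context)
```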

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1947753

Title:
  Evacuated instances are not removed from the source

Status in OpenStack Compute (nova):
  New

Bug description:
  Instance "evacuation" is a great feature and we are trying to take advantage 
of it.
  But, it has some limitations, depending how "broken" is the node.

  Let me give some context...

  In the scenario where the compute node loses connectivity (broken
  switch port, loose network cable, ...) or nova-compute is stuck
  (filesystem issue), evacuating instances can have some unexpected
  consequences and lead to data corruption in the application (for
  example in a DB application).

  If a compute node (or an entire set of compute nodes) loses connectivity,
nova-compute and the instances are "not available".
  If the node runs critical applications (let's suppose a MySQL DB), the cloud
operator could be tempted to "evacuate" the instance to recover the critical
application for the user. At this point the cloud operator may not yet know
the cause of the compute node issue, it may not be possible to shut the node
down (management network affected?, ...), or the operator may simply not want
to interfere with the work of the repair team.

  The repair team fixes the issue (it can take a few minutes or hours...)
  and nova-compute and the instances become available again.

  The problem is that nova-compute doesn't destroy the evacuated
  instances in the source.

  ```
  2021-10-19 11:17:51.519 3050 WARNING nova.compute.resource_tracker 
[req-0ed10e35-2715-466a-918b-69eb1fc770e8 - - - - -] Instance 
fc3be091-56d3-4c69-8adb-2fdb8b0a35d2 has been moved to another host 
foo.cern.ch(foo.cern.ch). There are allocations remaining against the source 
host that might need to be removed: {u'resources': {u'VCPU': 1, u'MEMORY_MB': 
1875}}.
  ```

  At this point we have 2 instances sharing the same IP and possibly
  writing into the same volume.

  Only when nova-compute is restarted (I guess the assumption was always
  that the compute node was really broken) are the evacuated instances on
  the affected node removed.

  ```
  2021-10-19 15:39:49.257 21189 INFO nova.compute.manager 

[Yahoo-eng-team] [Bug 1933955] [NEW] Power sync using the Ironic driver queries all the nodes from Ironic when using Conductor Groups

2021-06-29 Thread Belmiro Moreira
Public bug reported:


"""
While synchronizing instance power states, found 447 instances in the database 
and 8712 instances on the hypervisor.
"""

This is the warning message that we get when using conductor groups
during a power sync.

Conductor groups allow dedicated nova-compute nodes to manage a set of
Ironic nodes.
However, "_sync_power_states" doesn't deal with them correctly.

First, this function gets all the nodes from the DB that are managed by
the Nova compute node. Then it asks the "driver" to get all the
instances. When using the Ironic driver, it returns all the nodes in
Ironic! (With thousands of nodes, Ironic can also take several minutes
to return, but that is a different bug.)

Of course, the comparison then fails, producing the warning message
above.

There are different possibilities...
- We can change the Ironic driver to return only the nodes from the conductor
group that this Nova compute node belongs to. However, this is not good enough
if the conductor group is managed by more than one Nova compute node. Ironic
doesn't know which Nova compute node manages each node!

- We agree that this check doesn't bring a lot of value when using the
Ironic driver and we simply skip it if the Ironic driver is used (see the
sketch below).
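
A minimal sketch (not nova's actual code) of the second option: skip the count
comparison when the driver cannot scope its instance list to the nodes this
compute service manages, which is the Ironic case. The capability flag name is
hypothetical.

```
def sync_power_states(db_instances, driver):
    """Compare DB instances against what the driver reports, when meaningful."""
    if not getattr(driver, 'supports_accurate_instance_count', True):
        # e.g. the Ironic driver returns every node known to Ironic,
        # regardless of conductor group, so the comparison below would
        # always mismatch.
        return
    num_vm_instances = driver.get_num_instances()
    num_db_instances = len(db_instances)
    if num_db_instances != num_vm_instances:
        print('While synchronizing instance power states, found %d instances '
              'in the database and %d instances on the hypervisor.'
              % (num_db_instances, num_vm_instances))
```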

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1933955

Title:
  Power sync using the Ironic driver queries all the nodes from Ironic
  when using Conductor Groups

Status in OpenStack Compute (nova):
  New

Bug description:
  
  """
  While synchronizing instance power states, found 447 instances in the 
database and 8712 instances on the hypervisor.
  """

  This is the warning message that we get when using conductor groups
  during a power sync.

  Conductor groups allow dedicated nova-compute nodes to manage a set
of Ironic nodes.
  However, "_sync_power_states" doesn't deal with them correctly.

  First, this function gets all the nodes from the DB that are managed
  by the Nova compute node. Then it asks the "driver" to get all the
  instances. When using the Ironic driver, it returns all the nodes in
  Ironic! (With thousands of nodes, Ironic can also take several minutes
  to return, but that is a different bug.)

  Of course, the comparison then fails, producing the warning message
  above.

  There are different possibilities...
  - We can change the Ironic driver to return only the nodes from the conductor
group that this Nova compute node belongs to. However, this is not good enough
if the conductor group is managed by more than one Nova compute node. Ironic
doesn't know which Nova compute node manages each node!

  - We agree that this check doesn't bring a lot of value when using the
  Ironic driver and we simply skip it if the Ironic driver is used.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1933955/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1927740] [NEW] Ironic driver persistent warn msg when running only a node per conductor group

2021-05-07 Thread Belmiro Moreira
Public bug reported:


```
2021-05-07 13:55:12.570 3142 WARNING nova.virt.ironic.driver 
[req-bcca8fbe-3293-4d85-a3a3-a07328d91c17 - - - - -] This compute service 
(XXX) is the only service present in the [ironic]/peer_list option. Are you 
sure this should not include more hosts?
```

The decision about the number of compute nodes behind each conductor
group depends on the deployment architecture and risk tolerance.

Deployments that decided to run only one compute node per conductor
group get the above message in the logs every periodic task cycle.

It's good that Nova points out that this can be an issue, but the frequency
really "pollutes" the logs for operators who made a conscious
decision.

I propose moving the log level from warning to debug.

At debug level operators will still see this message. Operators usually
run in debug mode when debugging issues or during the deployment phase.
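
A minimal sketch (not the exact nova patch) of the proposed change in
nova/virt/ironic/driver.py: emit the peer_list hint at DEBUG instead of
WARNING. The helper name and the check itself are illustrative.

```
import logging

LOG = logging.getLogger(__name__)


def maybe_log_single_peer(hostname, peer_list):
    if peer_list == {hostname}:
        # Previously a LOG.warning() on every periodic cycle; DEBUG keeps the
        # hint available for operators running with debug logging enabled.
        LOG.debug('This compute service (%s) is the only service present in '
                  'the [ironic]/peer_list option. Are you sure this should '
                  'not include more hosts?', hostname)
```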

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1927740

Title:
  Ironic driver persistent warn msg when running only a node per
  conductor group

Status in OpenStack Compute (nova):
  New

Bug description:
  
  ```
  2021-05-07 13:55:12.570 3142 WARNING nova.virt.ironic.driver 
[req-bcca8fbe-3293-4d85-a3a3-a07328d91c17 - - - - -] This compute service 
(XXX) is the only service present in the [ironic]/peer_list option. Are you 
sure this should not include more hosts?
  ```

  The decision about the number of compute nodes behind each conductor
  group depends on the deployment architecture and risk tolerance.

  Deployments that decided to run only one compute node per conductor
  group get the above message in the logs every periodic task cycle.

  It's good that Nova points out that this can be an issue, but the
  frequency really "pollutes" the logs for operators who made a
  conscious decision.

  I propose moving the log level from warning to debug.

  At debug level operators will still see this message. Operators usually
  run in debug mode when debugging issues or during the deployment
  phase.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1927740/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1924612] [NEW] Can't list "killed" images using the CLI

2021-04-15 Thread Belmiro Moreira
Public bug reported:

Doing a DB cleanup I noticed that we have several images in the "killed" state.
But using the CLI I wasn't able to list them.
However, when the image_id is known, the details can be shown and the images
can be deleted.

If a user can't list "killed" images, they don't know that those images
belong to their project, so the images never get deleted. It's mostly
"cosmetics", but it would be good to clean them up.

Talking with abhishekk on IRC, he suggested trying:
"glance image-list --property-filter status=killed"

It doesn't work in the Ussuri release.
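
For reference, a small sketch (assuming a valid Keystone token and Glance
endpoint; both values below are placeholders) that sends the equivalent status
filter straight to the Images API v2. Whether the server actually returns
"killed" images is exactly what this bug is about, so this only illustrates
the request:

```
import requests

GLANCE_URL = 'http://glance.example.com:9292'  # placeholder endpoint
TOKEN = 'gAAAAA...'                            # placeholder token

resp = requests.get(GLANCE_URL + '/v2/images',
                    params={'status': 'killed'},
                    headers={'X-Auth-Token': TOKEN})
resp.raise_for_status()
for image in resp.json().get('images', []):
    print(image['id'], image['status'])
```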

** Affects: glance
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to Glance.
https://bugs.launchpad.net/bugs/1924612

Title:
  Can't list "killed" images using the CLI

Status in Glance:
  New

Bug description:
  Doing a DB cleanup I noticed that we have several images in the "killed" state.
  But using the CLI I wasn't able to list them.
  However, when the image_id is known, the details can be shown and the images
can be deleted.

  If a user can't list "killed" images, they don't know that those
  images belong to their project, so the images never get deleted. It's
  mostly "cosmetics", but it would be good to clean them up.

  Talking with abhishekk on IRC, he suggested trying:
  "glance image-list --property-filter status=killed"

  It doesn't work in the Ussuri release.

To manage notifications about this bug go to:
https://bugs.launchpad.net/glance/+bug/1924612/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1924585] [NEW] Live Migration - if libvirt timeout the instance goes to error state but the live migration continues

2021-04-15 Thread Belmiro Moreira
Public bug reported:

Recently we live migrated an entire cell to new hardware and we hit the
following problem several times...

During a live migration Nova monitors the state of the migration, querying
libvirt every 0.5s:

https://github.com/openstack/nova/blob/5eab13030bc2708c8900f7ac1bdbc8a111f5f823/nova/virt/libvirt/driver.py#L9452

If libvirt times out, the instance is left in a very bad state...
The instance goes to the error state. For Nova, the instance remains on the
source compute node. However, libvirt continues with the live migration, which
will eventually complete on the destination compute node.

I'm using the Stein release, but looking at the current release the code
path seems to be the same.

Here's the Stein trace:

```
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6796, 
in _do_live_migration
block_migration, migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 
7581, in live_migration
migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 
8068, in _live_migration
finish_event, disk_paths)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 
7873, in _live_migration_monitor
info = guest.get_job_info()
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 705, 
in get_job_info
stats = self._domain.jobStats()
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 190, in doit
result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 148, in 
proxy_call
rv = execute(f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 129, in 
execute
six.reraise(c, e, tb)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
rv = meth(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1433, in jobStats
if ret is None: raise libvirtError ('virDomainGetJobStats() failed', 
dom=self)
libvirtError: Timed out during operation: cannot acquire state change lock 
(held by monitor=remoteDispatchDomainMemoryStats)
```
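
A minimal sketch (not nova's actual fix) of one way to make the monitoring
loop tolerant to a transient libvirt timeout: treat a failed jobStats() call
as "unknown, try again" for a bounded number of attempts instead of raising
and putting the instance into ERROR while libvirt carries on migrating.
`guest` is assumed to behave like nova.virt.libvirt.guest.Guest.

```
import time

import libvirt  # requires the libvirt-python bindings


def get_job_info_with_retry(guest, attempts=5, delay=2.0):
    for attempt in range(1, attempts + 1):
        try:
            return guest.get_job_info()
        except libvirt.libvirtError:
            if attempt == attempts:
                raise
            # e.g. "cannot acquire state change lock": the migration itself
            # is still progressing, so wait and poll again.
            time.sleep(delay)
```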

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1924585

Title:
  Live Migration - if libvirt timeout the instance goes to error state
  but the live migration continues

Status in OpenStack Compute (nova):
  New

Bug description:
  Recently we live migrated an entire cell to new hardware and we hit
  the following problem several times...

  During a live migration Nova monitors the state of the migration,
  querying libvirt every 0.5s:

  
https://github.com/openstack/nova/blob/5eab13030bc2708c8900f7ac1bdbc8a111f5f823/nova/virt/libvirt/driver.py#L9452

  If libvirt times out, the instance is left in a very bad state...
  The instance goes to the error state. For Nova, the instance remains on the
source compute node. However, libvirt continues with the live migration, which
will eventually complete on the destination compute node.

  I'm using the Stein release, but looking at the current release the code
  path seems to be the same.

  Here's the Stein trace:

  ```
  Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6796, 
in _do_live_migration
  block_migration, migrate_data)
File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 
7581, in live_migration
  migrate_data)
File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 
8068, in _live_migration
  finish_event, disk_paths)
File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 
7873, in _live_migration_monitor
  info = guest.get_job_info()
File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 
705, in get_job_info
  stats = self._domain.jobStats()
File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 190, in doit
  result = proxy_call(self._autowrap, f, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 148, in 
proxy_call
  rv = execute(f, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 129, in 
execute
  six.reraise(c, e, tb)
File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in 
tworker
  rv = meth(*args, **kwargs)
File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1433, in jobStats
  if ret is None: raise libvirtError ('virDomainGetJobStats() failed', 
dom=self)
  libvirtError: Timed out during operation: cannot acquire state change lock 
(held by monitor=remoteDispatchDomainMemoryStats)
  ```

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1924585/+subscriptions

-- 

[Yahoo-eng-team] [Bug 1924123] [NEW] If source compute node is overcommitted instances can't be migrated

2021-04-14 Thread Belmiro Moreira
Public bug reported:

I'm facing an issue similar to "https://bugs.launchpad.net/nova/+bug/1918419",
but somewhat different, which is why I'm opening a new bug.

I'm giving some context to this bug to better explain how this affects
operations. Here's the story...

When a compute node needs a hardware intervention we have an automated
process that the repair team uses (they don't have access to OpenStack
APIs) to live migrate all the instances before starting the repair. The
motivation is to minimize the impact on users.

However, instances can't be live migrated if the compute node becomes
overcommitted!

It happens that if a DIMM fails in a compute node that has all the
memory allocated to VMs, it's not possible to move those VMs.

"No valid host was found. Unable to replace instance claim on source
(HTTP 400)"

The compute node becomes overcommitted (because the DIMM is not visible
anymore) and placement can't create the migration allocation in the
source.

The operator can work around this and "tune" the memory overcommit for the
affected compute node, but that requires investigation and manual
intervention by an operator, defeating automation and delegation to other
teams. This is extremely complicated in large deployments.

I don't believe this behaviour is correct.
If there are resources available to host the instances on a different compute
node, placement shouldn't block the live migration just because the source is
overcommitted.

+++

Using Nova Stein.
From what I checked, this still looks like the behaviour in recent releases.
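
A small illustration (not placement's actual code) of why the migration
allocation fails: placement enforces used + requested <= (total - reserved) *
allocation_ratio on every provider in the allocation, including the source
host whose claim is being replaced during a live migration. The numbers are
made up:

```
def capacity(total, reserved, allocation_ratio):
    return (total - reserved) * allocation_ratio

# Healthy node: 256 GiB of RAM, fully allocated, ratio 1.0 -> claim fits.
used_mb = 262144
print(used_mb <= capacity(262144, 0, 1.0))   # True

# After losing a 64 GiB DIMM the inventory shrinks but usage does not, so
# replacing the instance claim on the source fails ("No valid host...").
print(used_mb <= capacity(196608, 0, 1.0))   # False
```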

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1924123

Title:
  If source compute node is overcommitted instances can't be migrated

Status in OpenStack Compute (nova):
  New

Bug description:
  I'm facing an issue similar to "https://bugs.launchpad.net/nova/+bug/1918419",
  but somewhat different, which is why I'm opening a new bug.

  I'm giving some context to this bug to better explain how this affects
  operations. Here's the story...

  When a compute node needs a hardware intervention we have an automated
  process that the repair team uses (they don't have access to OpenStack
  APIs) to live migrate all the instances before starting the repair.
  The motivation is to minimize the impact on users.

  However, instances can't be live migrated if the compute node becomes
  overcommitted!

  It happens that if a DIMM fails in a compute node that has all the
  memory allocated to VMs, it's not possible to move those VMs.

  "No valid host was found. Unable to replace instance claim on source
  (HTTP 400)"

  The compute node becomes overcommitted (because the DIMM is not
  visible anymore) and placement can't create the migration allocation
  in the source.

  The operator can work around this and "tune" the memory overcommit for the
  affected compute node, but that requires investigation and manual
  intervention by an operator, defeating automation and delegation to
  other teams. This is extremely complicated in large deployments.

  I don't believe this behaviour is correct.
  If there are resources available to host the instances on a different compute
node, placement shouldn't block the live migration just because the source is
overcommitted.

  +++

  Using Nova Stein.
  From what I checked, this still looks like the behaviour in recent releases.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1924123/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1918419] [NEW] vCPU resource max_unit is hardcoded

2021-03-10 Thread Belmiro Moreira
Public bug reported:

Because of the Spectre/Meltdown vulnerabilities (2018) we needed to disable
SMT in all public-facing compute nodes. As a result the number of
available cores was reduced by half.

We had flavors with 32 vCPUs that couldn't be used anymore
because the placement max_unit for vCPUs is hardcoded to the total number
of CPUs, regardless of the allocation_ratio.

To me it's a sensible default, but it doesn't offer any flexibility to
operators.

See the IRC discussion at that time:
http://eavesdrop.openstack.org/irclogs/%23openstack-placement/%23openstack-placement.2018-09-20.log.html


In the end, we informed the users that we couldn't offer those flavors
anymore. The old VMs (created before disabling SMT) continued to run
without any issue.

So... after ~2 years I'm hitting this problem again :)

These compute nodes need now to be retired and we are live migrating all
the instances to the replacement hardware.

When trying to live migrate these instances (vCPUs > max_unit) it fails,
because the migration allocation can't be created against the source
compute node. For the new hardware (dest_compute) vCPUs < max_unit,
so there is no issue for the new allocation.

I'm working around this problem (to live migrate the instances) by
patching the code to report a higher max_unit for vCPUs on the compute
nodes hosting these instances.

I feel that this issue should be discussed again, considering the
possibility of making the max_unit value configurable.
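
A minimal sketch (not the libvirt driver's exact code) of the VCPU inventory a
compute node reports to placement, with a hypothetical "cpu_max_unit" override
that would let operators raise max_unit above the physical CPU count; today
max_unit is effectively the total, regardless of cpu_allocation_ratio:

```
def vcpu_inventory(total_pcpus, allocation_ratio, cpu_max_unit=None):
    return {
        'VCPU': {
            'total': total_pcpus,
            'min_unit': 1,
            # Hardcoded to the physical total today; the proposal is to make
            # this configurable.
            'max_unit': cpu_max_unit or total_pcpus,
            'step_size': 1,
            'allocation_ratio': allocation_ratio,
            'reserved': 0,
        }
    }

# SMT disabled: 16 physical cores left, but 32-vCPU flavors still needed.
print(vcpu_inventory(16, 4.0, cpu_max_unit=32)['VCPU']['max_unit'])  # 32
```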

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1918419

Title:
  vCPU resource max_unit is hardcoded

Status in OpenStack Compute (nova):
  New

Bug description:
  Because of the Spectre/Meltdown vulnerabilities (2018) we needed to
  disable SMT in all public-facing compute nodes. As a result the number
  of available cores was reduced by half.

  We had flavors with 32 vCPUs that couldn't be used anymore
  because the placement max_unit for vCPUs is hardcoded to the total
  number of CPUs, regardless of the allocation_ratio.

  To me it's a sensible default, but it doesn't offer any flexibility to
  operators.

  See the IRC discussion at that time:
  
http://eavesdrop.openstack.org/irclogs/%23openstack-placement/%23openstack-placement.2018-09-20.log.html

  
  In the end, we informed the users that we couldn't offer those flavors
anymore. The old VMs (created before disabling SMT) continued to run
without any issue.

  So... after ~2 years I'm hitting this problem again :)

  These compute nodes need now to be retired and we are live migrating
  all the instances to the replacement hardware.

  When trying to live migrate these instances (vCPUs > max_unit) it
  fails, because the migration allocation can't be created against the
  source compute node. For the new hardware (dest_compute) vCPUs <
  max_unit, so there is no issue for the new allocation.

  I'm working around this problem (to live migrate the instances) by
  patching the code to report a higher max_unit for vCPUs on the compute
  nodes hosting these instances.

  I feel that this issue should be discussed again, considering the
  possibility of making the max_unit value configurable.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1918419/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1917645] [NEW] Nova can't create instances if RabbitMQ notification cluster is down

2021-03-03 Thread Belmiro Moreira
Public bug reported:

We use independent RabbitMQ clusters for each OpenStack project, Nova
Cells and also for notifications. Recently, I noticed in our test
infrastructure that if the RabbitMQ cluster for notifications has an
outage, Nova can't create new instances. Possibly other operations will
also hang.

Not being able to send a notification/connect to the RabbitMQ cluster
shouldn't stop new instances from being created. (If this is actually a
use case for some deployments, the operator should have the possibility
to configure it.)

Tested against the master branch.

If the notification RabbitMQ is stopped, when creating an instance
nova-scheduler gets stuck with:

```
Mar 01 21:16:28 devstack nova-scheduler[18384]: DEBUG 
nova.scheduler.request_filter [None req-353318d1-f4bd-499d-98db-a0919d28ecf7 
demo demo] Request filter 'accelerators_filter' took 0.0 seconds {{(pid=18384) 
wrapper /opt/stack/nova/nova/scheduler/request_filter.py:46}}
Mar 01 21:16:32 devstack nova-scheduler[18384]: ERROR 
oslo.messaging._drivers.impl_rabbit [None 
req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 
113] EHOSTUNREACH (retrying in 2.0 seconds): OSError: [Errno 113] EHOSTUNREACH
Mar 01 21:16:35 devstack nova-scheduler[18384]: ERROR 
oslo.messaging._drivers.impl_rabbit [None 
req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 
113] EHOSTUNREACH (retrying in 4.0 seconds): OSError: [Errno 113] EHOSTUNREACH
Mar 01 21:16:42 devstack nova-scheduler[18384]: ERROR 
oslo.messaging._drivers.impl_rabbit [None 
req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 
113] EHOSTUNREACH (retrying in 6.0 seconds): OSError: [Errno 113] EHOSTUNREACH
Mar 01 21:16:51 devstack nova-scheduler[18384]: ERROR 
oslo.messaging._drivers.impl_rabbit [None 
req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 
113] EHOSTUNREACH (retrying in 8.0 seconds): OSError: [Errno 113] EHOSTUNREACH
Mar 01 21:17:02 devstack nova-scheduler[18384]: ERROR 
oslo.messaging._drivers.impl_rabbit [None 
req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 
113] EHOSTUNREACH (retrying in 10.0 seconds): OSError: [Errno 113] EHOSTUNREACH
(...)
```

Because the notification RabbitMQ cluster is down, Nova gets stuck in:

https://github.com/openstack/nova/blob/5b66caab870558b8a7f7b662c01587b959ad3d41/nova/scheduler/filter_scheduler.py#L85

because oslo messaging never gives up:

https://github.com/openstack/oslo.messaging/blob/5aa645b38b4c1cf08b00e687eb6c7c4b8a0211fc/oslo_messaging/_drivers/impl_rabbit.py#L736
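
For illustration, a small sketch (assuming oslo.messaging is installed; the
broker URL is a placeholder) of the knob that would bound this: the
oslo.messaging Notifier accepts a retry count (there is an equivalent
[oslo_messaging_notifications]/retry option), so a down notification cluster
fails the notification after a few attempts instead of blocking forever. Nova
builds its notifier internally, so this is not a drop-in fix:

```
from oslo_config import cfg
import oslo_messaging

conf = cfg.ConfigOpts()
transport = oslo_messaging.get_notification_transport(
    conf, url='rabbit://guest:guest@notifications.example.com:5672/')
notifier = oslo_messaging.Notifier(
    transport, publisher_id='nova-scheduler', driver='messagingv2',
    topics=['notifications'],
    retry=3)  # give up after 3 attempts instead of retrying indefinitely

notifier.info({}, 'scheduler.select_destinations.start', {'example': True})
```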

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1917645

Title:
  Nova can't create instances if RabbitMQ notification cluster is down

Status in OpenStack Compute (nova):
  New

Bug description:
  We use independent RabbitMQ clusters for each OpenStack project, Nova
  Cells and also for notifications. Recently, I noticed in our test
  infrastructure that if the RabbitMQ cluster for notifications has an
  outage, Nova can't create new instances. Possibly other operations
  will also hang.

  Not being able to send a notification/connect to the RabbitMQ cluster
  shouldn't stop new instances from being created. (If this is actually a
  use case for some deployments, the operator should have the
  possibility to configure it.)

  Tested against the master branch.

  If the notification RabbitMQ is stopped, when creating an instance
  nova-scheduler gets stuck with:

  ```
  Mar 01 21:16:28 devstack nova-scheduler[18384]: DEBUG 
nova.scheduler.request_filter [None req-353318d1-f4bd-499d-98db-a0919d28ecf7 
demo demo] Request filter 'accelerators_filter' took 0.0 seconds {{(pid=18384) 
wrapper /opt/stack/nova/nova/scheduler/request_filter.py:46}}
  Mar 01 21:16:32 devstack nova-scheduler[18384]: ERROR 
oslo.messaging._drivers.impl_rabbit [None 
req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 
113] EHOSTUNREACH (retrying in 2.0 seconds): OSError: [Errno 113] EHOSTUNREACH
  Mar 01 21:16:35 devstack nova-scheduler[18384]: ERROR 
oslo.messaging._drivers.impl_rabbit [None 
req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 
113] EHOSTUNREACH (retrying in 4.0 seconds): OSError: [Errno 113] EHOSTUNREACH
  Mar 01 21:16:42 devstack nova-scheduler[18384]: ERROR 
oslo.messaging._drivers.impl_rabbit [None 
req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 
113] EHOSTUNREACH (retrying in 6.0 seconds): OSError: [Errno 113] EHOSTUNREACH
  Mar 01 21:16:51 devstack nova-scheduler[18384]: ERROR 
oslo.messaging._drivers.impl_rabbit [None 
req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 
113] EHOSTUNREACH (retrying in 8.0 seconds): OSError: [Errno 113] EHOSTUNREACH
  Mar 

[Yahoo-eng-team] [Bug 1916031] [NEW] Wrong elapsed time logged during a live migration

2021-02-18 Thread Belmiro Moreira
Public bug reported:

In a recent VM live migration I noticed that the migration time reported in
the logs was not consistent with the actual time that it was taking:

```
2021-01-15 09:51:07.41 43553 INFO nova.virt.libvirt.driver [ ] Migration 
running for 0 secs, memory 100% remaining; (bytes processed=0, remaining=0, 
total=0)
2021-01-15 09:52:37.740 43553 DEBUG nova.virt.libvirt.driver [ ] Migration 
running for 5 secs, memory 100% remaining; (bytes processed=0, remaining=0, 
total=0)
2021-01-15 09:53:34.574 43553 DEBUG nova.virt.libvirt.driver [ ] Migration 
running for 10 secs, memory 100% remaining; (bytes processed=0, remaining=0, 
total=0)
2021-01-15 09:54:21.186 43553 DEBUG nova.virt.libvirt.driver [ ] Migration 
running for 15 secs, memory 100% remaining; (bytes processed=0, remaining=0, 
total=0)
(...)
```

This is because Nova doesn't log the actual time the migration is taking. It
cycles to check the migration job status every 500ms and logs the number of
cycles divided by 2.

Nova assumes that libvirt calls will report immediately, which was not
the case. (In this particular example the compute node had issues and
libvirt calls were taking a few seconds).

This behavior can cause some confusion when operators are debugging
issues.

In my opinion Nova should log the real migration time.
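
A minimal sketch (not nova's actual monitor loop) of logging the real elapsed
wall-clock time with time.monotonic() instead of deriving it from the number
of 500 ms polling cycles, which undercounts whenever the libvirt calls
themselves are slow:

```
import time


def monitor_migration(get_job_info, poll_interval=0.5):
    """get_job_info: callable returning None when the migration job is done."""
    start = time.monotonic()
    while True:
        info = get_job_info()  # may itself take seconds on an unhealthy host
        elapsed = int(time.monotonic() - start)
        print('Migration running for %d secs' % elapsed)
        if info is None:
            break
        time.sleep(poll_interval)
```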

** Affects: nova
 Importance: Undecided
 Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
 Status: New

** Changed in: nova
 Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

** Description changed:

- In a recent VM live migration I noticed that the migration time reported in 
+ In a recent VM live migration I noticed that the migration time reported in
  the logs was not consistent with the actual time that it was taking:
  
  ```
  2021-01-15 09:51:07.41 43553 INFO nova.virt.libvirt.driver [ ] Migration 
running for 0 secs, memory 100% remaining; (bytes processed=0, remaining=0, 
total=0)
  2021-01-15 09:52:37.740 43553 DEBUG nova.virt.libvirt.driver [ ] Migration 
running for 5 secs, memory 100% remaining; (bytes processed=0, remaining=0, 
total=0)
  2021-01-15 09:53:34.574 43553 DEBUG nova.virt.libvirt.driver [ ] Migration 
running for 10 secs, memory 100% remaining; (bytes processed=0, remaining=0, 
total=0)
  2021-01-15 09:54:21.186 43553 DEBUG nova.virt.libvirt.driver [ ] Migration 
running for 15 secs, memory 100% remaining; (bytes processed=0, remaining=0, 
total=0)
  (...)
  ```
  
- This is because Nova doesn’t log the actual time that is taking. It cycles 
+ This is because Nova doesn’t log the actual time that is taking. It cycles
  to check the migration job status every 500ms and it logs the number of 
cycles/2.
  
- Nova assumes that libvirt calls will report immediately, which was not the 
case. 
- (In this particular example the compute node had issues and libvirt calls were
- taking a few seconds).
+ Nova assumes that libvirt calls will report immediately, which was not
+ the case. (In this particular example the compute node had issues and
+ libvirt calls were taking a few seconds).
  
- This behavior can cause some confusion when operators are debugging issues.
+ This behavior can cause some confusion when operators are debugging
+ issues.
+ 
  In my opinion Nova should log the real migration time.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1916031

Title:
  Wrong elapsed time logged during a live migration

Status in OpenStack Compute (nova):
  New

Bug description:
  In a recent VM live migration I noticed that the migration time reported in
  the logs was not consistent with the actual time that it was taking:

  ```
  2021-01-15 09:51:07.41 43553 INFO nova.virt.libvirt.driver [ ] Migration 
running for 0 secs, memory 100% remaining; (bytes processed=0, remaining=0, 
total=0)
  2021-01-15 09:52:37.740 43553 DEBUG nova.virt.libvirt.driver [ ] Migration 
running for 5 secs, memory 100% remaining; (bytes processed=0, remaining=0, 
total=0)
  2021-01-15 09:53:34.574 43553 DEBUG nova.virt.libvirt.driver [ ] Migration 
running for 10 secs, memory 100% remaining; (bytes processed=0, remaining=0, 
total=0)
  2021-01-15 09:54:21.186 43553 DEBUG nova.virt.libvirt.driver [ ] Migration 
running for 15 secs, memory 100% remaining; (bytes processed=0, remaining=0, 
total=0)
  (...)
  ```

  This is because Nova doesn't log the actual time the migration is taking. It
  cycles to check the migration job status every 500ms and logs the number of
cycles divided by 2.

  Nova assumes that libvirt calls will report immediately, which was not
  the case. (In this particular example the compute node had issues and
  libvirt calls were taking a few seconds).

  This behavior can cause some confusion when operators are debugging
  issues.

  In my opinion Nova should log the real migration time.

To manage notifications about this bug go to:
ht

[Yahoo-eng-team] [Bug 1902216] [NEW] Can't define a cpu_model from a different architecture

2020-10-30 Thread Belmiro Moreira
Public bug reported:

"""
It would be great if Nova supports instances with a different architecture than 
the host.
My use case is to be run aarch64 guests in a x86_64 compute node.
"""

In order to create an aarch64 guest on an x86_64 compute node we need to define
the emulated CPU.
However, Nova doesn't allow defining a CPU model that doesn't match the
host architecture.

For example:
CONF.libvirt.virt_type=qemu
CONF.libvirt.cpu_model=cortex-a57
CONF.libvirt.cpu_mode=custom

It fails with:
nova.exception.InvalidCPUInfo: Configured CPU model: cortex-a57 is not correct, 
or your host CPU arch does not support this model. Please correct your config 
and try again.

The problem is related to this Nova check in driver.py:

    if cpu_info['arch'] not in (fields.Architecture.I686,
                                fields.Architecture.X86_64,
                                fields.Architecture.PPC64,
                                fields.Architecture.PPC64LE,
                                fields.Architecture.PPC):
        return model

Again, it relies on the host architecture (x86_64) rather than the guest
architecture.
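
A hypothetical sketch of the direction suggested here: validate the configured
cpu_model against the guest architecture (e.g. taken from the image's
architecture property) rather than the host architecture, so "cortex-a57" is
acceptable when emulating aarch64 on x86_64. The table of models is
illustrative:

```
MODELS_FOR_ARCH = {                     # illustrative, not libvirt's full list
    'x86_64': ['qemu64', 'Haswell-noTSX'],
    'aarch64': ['cortex-a57', 'cortex-a72'],
}


def check_cpu_model(cpu_model, guest_arch):
    """Accept a custom cpu_model if it is valid for the *guest* architecture."""
    if cpu_model not in MODELS_FOR_ARCH.get(guest_arch, []):
        raise ValueError('Configured CPU model %s is not valid for %s'
                         % (cpu_model, guest_arch))


check_cpu_model('cortex-a57', 'aarch64')  # passes, even on an x86_64 host
```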


Environment
===

Tested using the master branch (29/10/2020)

Other
=

I'm now opening target bugs for the generic issue reported in
https://bugs.launchpad.net/nova/+bug/1863728

** Affects: nova
 Importance: Undecided
 Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
 Status: New

** Changed in: nova
 Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1902216

Title:
  Can't define a cpu_model from a different architecture

Status in OpenStack Compute (nova):
  New

Bug description:
  """
  It would be great if Nova supports instances with a different architecture 
than the host.
  My use case is to be run aarch64 guests in a x86_64 compute node.
  """

  In order to create an aarch64 guest on an x86_64 compute node we need to
define the emulated CPU.
  However, Nova doesn't allow defining a CPU model that doesn't match the
host architecture.

  For example:
  CONF.libvirt.virt_type=qemu
  CONF.libvirt.cpu_model=cortex-a57
  CONF.libvirt.cpu_mode=custom

  It fails with:
  nova.exception.InvalidCPUInfo: Configured CPU model: cortex-a57 is not 
correct, or your host CPU arch does not support this model. Please correct your 
config and try again.

  The problem is related to this Nova check in driver.py:

      if cpu_info['arch'] not in (fields.Architecture.I686,
                                  fields.Architecture.X86_64,
                                  fields.Architecture.PPC64,
                                  fields.Architecture.PPC64LE,
                                  fields.Architecture.PPC):
          return model

  Again, it relies on the host architecture (x86_64) rather than the guest
  architecture.

  
  Environment
  ===

  Tested using the master branch (29/10/2020)

  Other
  =

  I'm now opening target bugs for the generic issue reported in
  https://bugs.launchpad.net/nova/+bug/1863728

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1902216/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1902205] [NEW] UEFI loader should consider the guest architecture not the host

2020-10-30 Thread Belmiro Moreira
Public bug reported:

"""
It would be great if Nova supports instances with a different architecture than 
the host.
An use case would be run aarch64 guests in a x86_64 compute node.
"""

In order to boot an aarch64 guest on an x86_64 host we need to use UEFI.
However, Nova always chooses the UEFI loader based on the host architecture.
The guest architecture should be considered instead.

In libvirt/driver.py:
"for lpath in DEFAULT_UEFI_LOADER_PATH[caps.host.cpu.arch]"

Environment
===

Tested using the master branch (29/10/2020)

Other
=

I'm now opening targeted bugs for this issue.
It was first reported as a generic bug in
https://bugs.launchpad.net/nova/+bug/1863728

** Affects: nova
     Importance: Undecided
 Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
 Status: New

** Description changed:

  """
  It would be great if Nova supports instances with a different architecture 
than the host.
  An use case would be run aarch64 guests in a x86_64 compute node.
  """
  
  In order to use boot an aarch64 guest in a x86_64 host we need to use UEFI.
  However, Nova always uses the UEFI loader considering the host architecture.
  The guest architecture should be considered instead.
  
  in livbvirt.driver.py:
  "for lpath in DEFAULT_UEFI_LOADER_PATH[caps.host.cpu.arch]"
  
  Environment
  ===
  
  Tested using the master branch (29/10/2020)
  
  Other
  =
  
  I'm now opening target bugs for this issue.
- It was first reported has a generic bug in https://bugs.launchpad
+ It was first reported has a generic bug in  
https://bugs.launchpad.net/nova/+bug/1863728

** Changed in: nova
 Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1902205

Title:
  UEFI loader should consider the guest architecture not the host

Status in OpenStack Compute (nova):
  New

Bug description:
  """
  It would be great if Nova supports instances with a different architecture 
than the host.
  An use case would be run aarch64 guests in a x86_64 compute node.
  """

  In order to boot an aarch64 guest on an x86_64 host we need to use UEFI.
  However, Nova always chooses the UEFI loader based on the host architecture.
  The guest architecture should be considered instead.

  In libvirt/driver.py:
  "for lpath in DEFAULT_UEFI_LOADER_PATH[caps.host.cpu.arch]"

  Environment
  ===

  Tested using the master branch (29/10/2020)

  Other
  =

  I'm now opening targeted bugs for this issue.
  It was first reported as a generic bug in
https://bugs.launchpad.net/nova/+bug/1863728

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1902205/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1902203] [NEW] Instance architecture should be reflected in the instance domain

2020-10-30 Thread Belmiro Moreira
Public bug reported:

"""
It would be great if Nova supports instances with a different architecture than 
the host.
An use case would be run aarch64 guests in a x86_64 compute node.
"""

The issue is that Nova always uses the host architecture when defining
the instance domain, not what is defined in the image's architecture property.
Also, because of this, the emulator is not correctly selected.

Almost all the pieces are already there!
- CONF.libvirt.hw_machine_type / or using the instance metadata
(it's defined as expected in the instance domain, I'm using "virt-4.0")

- CONF.libvirt.virt_type
(it's defined as expected in the instance domain, I'm using "qemu")

- Set the image architecture to "aarch64". Nova actually reads this
property from the image but doesn't use it.

===

The instance creation fails because:

Nova only sets "hvm" as the text of the <type> element in the domain
definition (no "arch" attribute), and then libvirt uses the host architecture
in the domain definition.
In my case this results in using the x86_64 emulator.

When hardcoding the right architecture in the guest.os_mach_type handling it
works as expected:

    if self.os_mach_type is not None:
        type_node.set("arch", 'aarch64')
        type_node.set("machine", self.os_mach_type)

The domain is then created correctly: the <type> element gets arch="aarch64"
(with machine "virt-4.0" and "hvm" as its text) and the emulator becomes
/usr/bin/qemu-system-aarch64.
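
A hypothetical, self-contained sketch of the missing piece: propagate the
image's architecture into the libvirt <os><type arch=...> element instead of
letting libvirt default to the host architecture. Element names follow
libvirt's domain XML; the helper itself is illustrative, not nova's code:

```
from xml.etree import ElementTree as ET


def build_os_type(image_arch, host_arch, machine_type=None):
    type_node = ET.Element('type')
    type_node.text = 'hvm'
    # Prefer the guest architecture declared on the image (e.g. "aarch64");
    # fall back to the host architecture when the image does not declare one.
    type_node.set('arch', image_arch or host_arch)
    if machine_type:
        type_node.set('machine', machine_type)
    return ET.tostring(type_node).decode()


print(build_os_type('aarch64', 'x86_64', 'virt-4.0'))
# <type arch="aarch64" machine="virt-4.0">hvm</type>
```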

Environment
===

Tested using the master branch (29/10/2020)

Other
=

I'm now opening targeted bugs for this issue.
It was first reported as a generic bug in
https://bugs.launchpad.net/nova/+bug/1863728

** Affects: nova
 Importance: Undecided
 Status: New

** Description changed:

  """
  It would be great if Nova supports instances with a different architecture 
than the host.
  An use case would be run aarch64 guests in a x86_64 compute node.
  """
  
  The issue is that nova always uses the architecture from the host when 
defining the instance domain and not what's defined in the image architecture.
  Also, because of this the emulator is not correctly defined.
  
  Almost all the pieces are already there!
- - CONF.libvirt.hw_machine_type / or using the instance metadata 
+ - CONF.libvirt.hw_machine_type / or using the instance metadata
  (it's defined as expected in the instance domain, I'm using "virt-4.0")
  
- - CONF.libvirt.virt_type 
+ - CONF.libvirt.virt_type
  (it's defined as expected in the instance domain, I'm using "qemu")
  
  - Defined the image architecture to "aarch64". Actually Nova reads this
  property from the image but doesn't use it.
  
  ===
  
  The instance creation fails because:
  
- Nova only uses in the domain definition: hvm
+ Nova only uses in the domain definition: 
+ hvm
+ 
  and then libvirt uses the host architecture in the domain definition.
  In my case this results in using the x86_64 emulator.
  
  When hardcoding the right architecture in the guest.os_mach_type it works as 
expected.
  if self.os_mach_type is not None:
- type_node.set("arch", 'aarch64')
- type_node.set("machine", self.os_mach_type)
+ type_node.set("arch", 'aarch64')
+ type_node.set("machine", self.os_mach_type)
  
  The domain is created correctly:
  hvm
  
  /usr/bin/qemu-system-aarch64
- 
  
  Environment
  ===
  
  Tested using the master branch (29/10/2020)
  
  Other
  =
  
  I'm now opening target bugs for this issue.
  It was first reported has a generic bug in 
https://bugs.launchpad.net/nova/+bug/1863728

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1902203

Title:
  Instance architecture should be reflected in the instance domain

Status in OpenStack Compute (nova):
  New

Bug description:
  """
  It would be great if Nova supports instances with a different architecture 
than the host.
  An use case would be run aarch64 guests in a x86_64 compute node.
  """

  The issue is that Nova always uses the host architecture when defining the
instance domain, not what is defined in the image's architecture property.
  Also, because of this, the emulator is not correctly selected.

  Almost all the pieces are already there!
  - CONF.libvirt.hw_machine_type / or using the instance metadata
  (it's defined as expected in the instance domain, I'm using "virt-4.0")

  - CONF.libvirt.virt_type
  (it's defined as expected in the instance domain, I'm using "qemu")

  - Set the image architecture to "aarch64". Nova actually reads
  this property from the image but doesn't use it.

  ===

  The instance creation fails because:

  Nova only sets "hvm" as the text of the <type> element in the domain
  definition (no "arch" attribute), and then libvirt uses the host
  architecture in the domain definition.
  In my case this results in using the x86_64 emulator.

  When hardcoding the right architecture in the guest.os_mach_type handling it
works as expected:

      if self.os_mach_type is not None:
          type_node.set("arch", 'aarch64')
          type_node.set("machine", self.os_mach_type)

  The domain is then created correctly: the <type> element gets arch="aarch64"
  (with machine "virt-4.0" and "hvm" as its text).

  

[Yahoo-eng-team] [Bug 1863728] [NEW] Nova can't create instances for a different arch

2020-02-18 Thread Belmiro Moreira
Public bug reported:

This is more a feature wish than a bug, but considering the use cases I'm
surprised that it's not supported by Nova.

*Support to create instances for a different architecture than the host
architecture*

My use case: Running ARM instances in x86_64 compute nodes.

This is not possible because Nova always assumes the host architecture.
Also, there are different assumptions for the different architectures.

Some examples:
- cpu_mode for AARCH64 is passthrough (not good if trying to emulate).
- Nova always checks the cpu_model against the host, so it is not possible to
define an ARM CPU.
- The architecture image property is not used when defining the instance
domain.
(...)

This is mostly for discussion and to see if the community is interested
in supporting this use case.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1863728

Title:
  Nova can't create instances for a different arch

Status in OpenStack Compute (nova):
  New

Bug description:
  This is more a feature wish than a bug, but considering the use cases
  I'm surprised that it's not supported by Nova.

  *Support to create instances for a different architecture than the
  host architecture*

  My use case: Running ARM instances in x86_64 compute nodes.

  This is not possible because Nova always assumes the host architecture.
  Also, there are different assumptions for the different architectures.

  Some examples:
  - cpu_mode for AARCH64 is passthrough (not good if trying to emulate).
  - Nova always checks the cpu_model against the host, so it is not possible to
define an ARM CPU.
  - The architecture image property is not used when defining the instance
domain.
  (...)

  This is mostly for discussion and to see if the community is
  interested in supporting this use case.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1863728/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1848514] [NEW] Booting from volume providing an image fails

2019-10-17 Thread Belmiro Moreira
Public bug reported:

Trying to create an instance (booting from volume when specifying an image) 
fails.
Running Stein (19.0.1)

###
When using:
###
nova boot --flavor FLAVOR_ID --block-device 
source=image,id=IMAGE_ID,dest=volume,size=10,shutdown=preserve,bootindex=0 
INSTANCE_NAME

###
nova-compute logs:
###

Instance failed block device setup Forbidden: Policy doesn't allow
volume:update_volume_admin_metadata to be performed. (HTTP 403)
(Request-ID: req-875cc6e1-ffe1-45dd-b942-944166c6040a)

The full trace:
http://paste.openstack.org/raw/784535/


This is definitely a policy issue!
Our cinder policy: "volume:update_volume_admin_metadata": "rule:admin_api"
(default)
Using a user with admin credentials works as expected!

Is this expected? We didn't identify this behaviour previously (before
Stein) using the same policy for "update_volume_admin_metadata".
Found an old similar report:
https://bugs.launchpad.net/nova/+bug/1661189

** Affects: nova
 Importance: Undecided
 Assignee: Surya Seetharaman (tssurya)
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1848514

Title:
  Booting from volume providing an image fails

Status in OpenStack Compute (nova):
  New

Bug description:
  Trying to create an instance (booting from volume when specifying an image) 
fails.
  Running Stein (19.0.1)

  ###
  When using:
  ###
  nova boot --flavor FLAVOR_ID --block-device 
source=image,id=IMAGE_ID,dest=volume,size=10,shutdown=preserve,bootindex=0 
INSTANCE_NAME

  ###
  nova-compute logs:
  ###

  Instance failed block device setup Forbidden: Policy doesn't allow
  volume:update_volume_admin_metadata to be performed. (HTTP 403)
  (Request-ID: req-875cc6e1-ffe1-45dd-b942-944166c6040a)

  The full trace:
  http://paste.openstack.org/raw/784535/

  
  This is definitely a policy issue!
  Our cinder policy: "volume:update_volume_admin_metadata": "rule:admin_api"
(default)
  Using a user with admin credentials works as expected!

  Is this expected? We didn't identify this behaviour previously
  (before Stein) using the same policy for
  "update_volume_admin_metadata".

  Found an old similar report:
  https://bugs.launchpad.net/nova/+bug/1661189

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1848514/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1837200] [NEW] Deleted images info should be obfuscated - OSSN-0075

2019-07-19 Thread Belmiro Moreira
Public bug reported:

Because of OSSN-0075 the Cloud Operator may choose to never purge the "images" 
table.
However, regulations/policy may require that deleted data is not kept.

For this case the deleted image records need to be obfuscated (except
the image id).
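
A minimal sketch of what such an obfuscation could look like, assuming direct
SQL access to the glance database (the column names follow the glance "images"
schema, but treat this as illustrative only, not a supported tool):

# Illustrative only: blank out identifying fields of soft-deleted images,
# keeping the image id so UUID reuse checks (OSSN-0075) still work.
import sqlalchemy as sa

# The connection URL is an assumption for the example.
engine = sa.create_engine('mysql+pymysql://glance:secret@localhost/glance')

with engine.begin() as conn:
    conn.execute(sa.text(
        "UPDATE images "
        "SET name = NULL, owner = NULL, checksum = NULL "
        "WHERE deleted = 1"))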

** Affects: glance
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to Glance.
https://bugs.launchpad.net/bugs/1837200

Title:
  Deleted images info should be obfuscated - OSSN-0075

Status in Glance:
  New

Bug description:
  Because of OSSN-0075 the Cloud Operator may choose to never purge the "images" 
table.
  However, regulations/policy may require that deleted data is not kept.

  For this case the deleted image records need to be obfuscated (except
  the image id).

To manage notifications about this bug go to:
https://bugs.launchpad.net/glance/+bug/1837200/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1817542] [NEW] nova instance-action fails if project_id=NULL

2019-02-25 Thread Belmiro Moreira
Public bug reported:

nova instance-action fails if project_id=NULL

Starting in api version 2.62 "an obfuscated hashed host id is returned"
To generate the host_id it uses utils.generate_hostid() that uses (in this 
case) the project_id and the host of the action.

However, we can have actions without a user_id/project_id defined.
For example, when something happens outside the nova API (the user shuts down 
the VM inside the guest OS).
In this case we have an action "stop", without a user_id/project_id.

When running 2.62 it fails when performing:
nova instance-action  

no issues if using:
--os-compute-api-version 2.60 

===
The trace in nova-api logs:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py", line 801, 
in wrapped
return f(*args, **kwargs)
  File 
"/usr/lib/python2.7/site-packages/nova/api/openstack/compute/instance_actions.py",
 line 169, in show
) for evt in events_raw]
  File 
"/usr/lib/python2.7/site-packages/nova/api/openstack/compute/instance_actions.py",
 line 69, in _format_event
project_id)
  File "/usr/lib/python2.7/site-packages/nova/utils.py", line 1295, in 
generate_hostid
data = (project_id + host).encode('utf-8')
TypeError: unsupported operand type(s) for +: 'NoneType' and 'unicode'
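
A minimal sketch of a defensive fix (a sketch only, not the actual nova patch):
treat a missing project_id as an empty string when building the hash input:

# Sketch of a generate_hostid-style helper that tolerates project_id=None.
import hashlib

def generate_hostid(host, project_id):
    # Use an empty string when the action has no project_id recorded.
    data = ((project_id or '') + host).encode('utf-8')
    return hashlib.sha224(data).hexdigest()

print(generate_hostid('compute-01', None))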

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: api

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1817542

Title:
  nova instance-action fails if project_id=NULL

Status in OpenStack Compute (nova):
  New

Bug description:
  nova instance-action fails if project_id=NULL

  Starting in api version 2.62 "an obfuscated hashed host id is returned"
  To generate the host_id it uses utils.generate_hostid() that uses (in this 
case) the project_id and the host of the action.

  However, we can have actions without a user_id/project_id defined.
  For example, when something happens outside the nova API (the user shuts down 
the VM inside the guest OS).
  In this case we have an action "stop", without a user_id/project_id.

  When running 2.62 it fails when performing:
  nova instance-action  

  no issues if using:
  --os-compute-api-version 2.60 

  ===
  The trace in nova-api logs:

  Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py", line 
801, in wrapped
  return f(*args, **kwargs)
File 
"/usr/lib/python2.7/site-packages/nova/api/openstack/compute/instance_actions.py",
 line 169, in show
  ) for evt in events_raw]
File 
"/usr/lib/python2.7/site-packages/nova/api/openstack/compute/instance_actions.py",
 line 69, in _format_event
  project_id)
File "/usr/lib/python2.7/site-packages/nova/utils.py", line 1295, in 
generate_hostid
  data = (project_id + host).encode('utf-8')
  TypeError: unsupported operand type(s) for +: 'NoneType' and 'unicode'

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1817542/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1816086] [NEW] Resource Tracker performance with Ironic driver

2019-02-15 Thread Belmiro Moreira
Public bug reported:

The problem is in rocky.

The resource tracker builds the resource provider tree and updates it 2 times 
in "_update_available_resource": 
once via "_init_compute_node" and once in "_update_available_resource" itself.

The problem is that the RP tree contains all the Ironic RPs and the whole
tree is flushed to placement (2 times, as described above) each time the
periodic task iterates over an Ironic RP.

In our case with 1700 Ironic nodes, the periodic task takes:
1700 x (2 x 7s) = ~6h
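
For reference, the back-of-the-envelope calculation behind that estimate
(about 7 seconds per placement update, two updates per node):

# Rough estimate of one resource tracker periodic pass over all Ironic nodes.
nodes = 1700
updates_per_node = 2
seconds_per_update = 7
total_hours = nodes * updates_per_node * seconds_per_update / 3600.0
print(round(total_hours, 1))  # ~6.6 hours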

+++

mitigations:
- shard nova-compute. Have several nova-computes dedicated to ironic.
Most of the current deployments only use 1 nova-compute to avoid resource 
shuffling/recreation between nova-computes.
Several nova-computes will be needed to accommodate the load.

- why do we need to flush the full resource provider tree to placement and 
not only the RP that is being considered?
As a workaround we are doing this now!

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1816086

Title:
  Resource Tracker performance with Ironic driver

Status in OpenStack Compute (nova):
  New

Bug description:
  The problem is in rocky.

  The resource tracker builds the resource provider tree and updates it 2 
times in "_update_available_resource": 
once via "_init_compute_node" and once in "_update_available_resource" itself.

  The problem is that the RP tree contains all the Ironic RPs and the whole
  tree is flushed to placement (2 times, as described above) each time the
  periodic task iterates over an Ironic RP.

  In our case with 1700 Ironic nodes, the periodic task takes:
  1700 x (2 x 7s) = ~6h

  +++

  mitigations:
  - shard nova-compute. Have several nova-computes dedicated to ironic.
  Most of the current deployments only use 1 nova-compute to avoid resource 
shuffling/recreation between nova-computes.
  Several nova-computes will be needed to accommodate the load.

  - why do we need to flush the full resource provider tree to placement and 
not only the RP that is being considered?
  As a workaround we are doing this now!

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1816086/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1816034] [NEW] Ironic flavor migration and default resource classes

2019-02-15 Thread Belmiro Moreira
Public bug reported:

The Ironic flavor migration to use resource classes happened in
Pike/Queens.

The flavors and the instances needed to be upgraded with the correct resource 
class.
This was done by an online data migration. 
Looking into Rocky code: ironic.driver._pike_flavor_migration
There is also an offline data migration using nova-manage.

These migrations added the node resource class into
instance_extra.flavor; however, I don't see that they also set the
standard resource classes (VCPU, MEMORY_MB, DISK_GB) to 0.

Looking into Rocky code there is also a TODO in _pike_flavor_migration:
"This code can be removed in Queens, and will need to be updated to also alter 
extra_specs to zero-out the old-style standard resource classes of VCPU, 
MEMORY_MB, and DISK_GB."

Currently all my Ironic instances have the correct node resource class
defined, but "old" instances (created before the flavor migration) don't
have VCPU, MEMORY_MB, DISK_GB set to 0, in instance_extra.flavor.
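
For reference, a baremetal flavor migrated the way the Ironic documentation
describes ends up with extra_specs along these lines (the custom resource
class name here is just an example):

# Expected extra_specs of a fully migrated baremetal flavor: the standard
# resource classes are zeroed out and only the custom class is requested
# (CUSTOM_BAREMETAL_GOLD is an example name).
extra_specs = {
    'resources:VCPU': '0',
    'resources:MEMORY_MB': '0',
    'resources:DISK_GB': '0',
    'resources:CUSTOM_BAREMETAL_GOLD': '1',
}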

In Rocky the resource tracker raises the following message:
"There was a conflict when trying to complete your request.\n\n Unable to 
allocate inventory: Inventory for 'VCPU' on resource provider  invalid.  ", 
"title": "Conflict"

because it tries to update the allocation but the inventory doesn't have
vcpu resources.


---
As mitigation we now have: "requires_allocation_refresh = False" in the Ironic 
Driver.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1816034

Title:
  Ironic flavor migration and default resource classes

Status in OpenStack Compute (nova):
  New

Bug description:
  The Ironic flavor migration to use resource classes happened in
  Pike/Queens.

  The flavors and the instances needed to be upgraded with the correct resource 
class.
  This was done by an online data migration. 
  Looking into Rocky code: ironic.driver._pike_flavor_migration
  There is also an offline data migration using nova-manage.

  These migrations added the node resource class into
  instance_extra.flavor; however, I don't see that they also set the
  standard resource classes (VCPU, MEMORY_MB, DISK_GB) to 0.

  Looking into Rocky code there is also a TODO in _pike_flavor_migration:
  "This code can be removed in Queens, and will need to be updated to also 
alter extra_specs to zero-out the old-style standard resource classes of VCPU, 
MEMORY_MB, and DISK_GB."

  Currently all my Ironic instances have the correct node resource class
  defined, but "old" instances (created before the flavor migration)
  don't have VCPU, MEMORY_MB, DISK_GB set to 0, in
  instance_extra.flavor.

  In Rocky the resource tracker raises the following message:
  "There was a conflict when trying to complete your request.\n\n Unable to 
allocate inventory: Inventory for 'VCPU' on resource provider  invalid.  ", 
"title": "Conflict"

  because it tries to update the allocation but the inventory doesn't
  have vcpu resources.

  
  ---
  As mitigation we now have: "requires_allocation_refresh = False" in the 
Ironic Driver.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1816034/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1810342] [NEW] API unexpected exception message

2019-01-02 Thread Belmiro Moreira
Public bug reported:

The "API unexpected exception" message tells the user to open a bug in 
launchpad and attach the log file if possible.
Usually a user doesn't have access to API logs and doesn't know about the nuts 
and bolts of OpenStack.
This error message has been confusing some of our users because it asks them 
not to contact the cloud provider support but instead a website that they don't 
know.

I would prefer to have only a simple error message like "API unexpected
exception" or instead, a configurable message where the cloud provider
can point their users to the correct support page.
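
A minimal sketch of what such a configurable message could look like (the
option name is hypothetical, not an existing nova option):

# Hypothetical option: let the operator point users at their own support page
# instead of the hard-coded launchpad text.
from oslo_config import cfg

CONF = cfg.CONF
CONF.register_opts([
    cfg.StrOpt('unexpected_exception_support_url',
               default='http://bugs.launchpad.net/nova/',
               help='URL shown to users when the API hits an unexpected '
                    'exception.'),
])

def support_message():
    return ('Unexpected API Error. Please report this at %s'
            % CONF.unexpected_exception_support_url)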

** Affects: nova
 Importance: Undecided
     Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
 Status: In Progress

** Changed in: nova
 Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

** Description changed:

- The "API unexpected exception" message tells the user to open a bug in 
launchpad
- and attach the log file if possible.
- Usually an user doesn't have access to API logs and doesn't know about the 
nuts 
- and bolts of OpenStack.
- This error message has been confusing some of our users because asks them to 
not
- contact the cloud provider support but instead a website that they don't know.
+ The "API unexpected exception" message tells the user to open a bug in 
launchpad and attach the log file if possible.
+ Usually an user doesn't have access to API logs and doesn't know about the 
nuts and bolts of OpenStack.
+ This error message has been confusing some of our users because asks them to 
not contact the cloud provider support but instead a website that they don't 
know.
  
- I would prefer to have only a simple error message like "API unexpected 
exception" 
- or instead, a configurable message where the cloud provider can point their 
users 
- to the correct support page.
+ I would prefer to have only a simple error message like "API unexpected
+ exception" or instead, a configurable message where the cloud provider
+ can point their users to the correct support page.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1810342

Title:
  API unexpected exception message

Status in OpenStack Compute (nova):
  In Progress

Bug description:
  The "API unexpected exception" message tells the user to open a bug in 
launchpad and attach the log file if possible.
  Usually a user doesn't have access to API logs and doesn't know about the 
nuts and bolts of OpenStack.
  This error message has been confusing some of our users because it asks them 
not to contact the cloud provider support but instead a website that they don't 
know.

  I would prefer to have only a simple error message like "API
  unexpected exception" or instead, a configurable message where the
  cloud provider can point their users to the correct support page.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1810342/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1810340] [NEW] Repetitive info messages from nova-compute

2019-01-02 Thread Belmiro Moreira
Public bug reported:

There are 2 repetitive info messages from nova-compute:
INFO nova.compute.resource_tracker Final resource view:
INFO nova.virt.libvirt.driver Libvirt baseline CPU 

By default they are logged every minute. In my view they should be
"debug" messages.

In large infrastructures that store log files for analytics, these messages
use significant storage space without bringing reasonable value.

** Affects: nova
 Importance: Undecided
 Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
 Status: In Progress

** Changed in: nova
 Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1810340

Title:
  Repetitive info messages from nova-compute

Status in OpenStack Compute (nova):
  In Progress

Bug description:
  There are 2 repetitive info messages from nova-compute:
  INFO nova.compute.resource_tracker Final resource view:
  INFO nova.virt.libvirt.driver Libvirt baseline CPU 

  By default they are logged every minute. In my view they should be
  "debug" messages.

  In large infrastructures that store log files for analytics, these messages
  use significant storage space without bringing reasonable value.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1810340/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1805989] [NEW] Weight policy to stack/spread instances and "max_placement_results"

2018-11-30 Thread Belmiro Moreira
Public bug reported:

Weights are applied by the scheduler.
This means that if "max_placement_results" is set to a number below the number 
of existing resources,
the weight policy will only be applied to the subset of allocation candidates 
retrieved by placement.

As a consequence we lose the ability to stack/spread instances.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: placement

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1805989

Title:
  Weight policy to stack/spread instances and "max_placement_results"

Status in OpenStack Compute (nova):
  New

Bug description:
  Weights are applied by the scheduler.
  This means that if "max_placement_results" is set to a number below the number 
of existing resources,
  the weight policy will only be applied to the subset of allocation candidates 
retrieved by placement.

  As a consequence we lose the ability to stack/spread instances.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1805989/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1805984] [NEW] Placement is not aware of disabled compute nodes

2018-11-29 Thread Belmiro Moreira
Public bug reported:

Placement doesn't know if a resource provider (in this particular case a
compute node) is disabled. This is only filtered by the scheduler using
the "ComputeFilter".

However, when using the option "max_placement_results" to restrict the
amount of placement results there is the possibility to get only
"disabled" allocation candidates from placement. The creation of new VMs
will end up in ERROR because there are "No Valid Hosts".

There are several use-cases when an operator may want to disable nodes
to avoid the creation of new VMs.

Related with: https://bugs.launchpad.net/nova/+bug/1708958

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: placement

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1805984

Title:
  Placement is not aware of disabled compute nodes

Status in OpenStack Compute (nova):
  New

Bug description:
  Placement doesn't know if a resource provider (in this particular case
  a compute node) is disabled. This is only filtered by the scheduler
  using the "ComputeFilter".

  However, when using the option "max_placement_results" to restrict the
  amount of placement results there is the possibility to get only
  "disabled" allocation candidates from placement. The creation of new
  VMs will end up in ERROR because there are "No Valid Hosts".

  There are several use-cases when an operator may want to disable nodes
  to avoid the creation of new VMs.

  Related with: https://bugs.launchpad.net/nova/+bug/1708958

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1805984/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1801897] [NEW] List AVZs can take several seconds

2018-11-06 Thread Belmiro Moreira
Public bug reported:

Getting the list of AVZs can take several seconds (~30 secs. in our case)
This is noticeable in Horizon when creating a new instance because the user 
can't select an AVZ until this completes.

workflow:
- get all services from all cells (~1 for us)
- fetch all aggregates which are tagged as an AVZ
- construct a dict of {'service['host']: avz.value}
- return a dict of {'avz_value': list of hosts}
- separate available and not available zones.

Reproducible in Queens, Rocky
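
To make the workflow above concrete, a simplified sketch of the dict building
it describes (plain data structures, not the actual nova objects):

# Simplified model of the AZ listing: hosts come from the service list of
# every cell, aggregate membership decides the zone, everything else falls
# into the default zone.
def list_availability_zones(hosts, az_aggregates, default_az='nova'):
    host_to_az = {}
    for az_name, members in az_aggregates.items():
        for host in members:
            host_to_az[host] = az_name
    zones = {}
    for host in hosts:
        zones.setdefault(host_to_az.get(host, default_az), []).append(host)
    return zones

print(list_availability_zones(
    ['cmp-1', 'cmp-2', 'cmp-3'], {'zone-a': ['cmp-1'], 'zone-b': ['cmp-2']}))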

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1801897

Title:
  List AVZs can take several seconds

Status in OpenStack Compute (nova):
  New

Bug description:
  Getting the list of AVZs can take several seconds (~30 secs. in our case)
  This is noticeable in Horizon when creating a new instance because the user 
can't select an AVZ until this completes.

  workflow:
  - get all services from all cells (~1 for us)
  - fetch all aggregates which are tagged as an AVZ
  - construct a dict of {'service['host']: avz.value}
  - return a dict of {'avz_value': list of hosts}
  - separate available and not available zones.

  Reproducible in Queens, Rocky

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1801897/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1796920] [NEW] Baremetal nodes should not be exposing non-custom-resource-class (vcpu, ram, disk)

2018-10-09 Thread Belmiro Moreira
Public bug reported:

Description
===
Baremetal nodes report CPU, RAM and DISK inventory.

The issue is that allocations for baremetal nodes are only done considering the 
custom_resource_class. This happens because baremetal flavors are set to not 
consume these resources.
See: 
https://docs.openstack.org/ironic/queens/install/configure-nova-flavors.html

If we use a flavor that doesn't include a custom_resource_class,
placement can include a baremetal node that is already deployed because CPU, 
RAM and disk are available (which results in an error from Ironic), or worse, 
the instance is created on a baremetal node (if it wasn't deployed yet).


Environment
===
Nova and Ironic running Queens release.

** Affects: nova
 Importance: Undecided
 Status: Invalid

** Affects: nova/pike
 Importance: High
 Status: Triaged

** Affects: nova/queens
 Importance: High
 Status: Triaged

** Affects: nova/rocky
 Importance: High
 Assignee: Matt Riedemann (mriedem)
 Status: Triaged


** Tags: ironic

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1796920

Title:
  Baremetal nodes should not be exposing non-custom-resource-class
  (vcpu, ram, disk)

Status in OpenStack Compute (nova):
  Invalid
Status in OpenStack Compute (nova) pike series:
  Triaged
Status in OpenStack Compute (nova) queens series:
  Triaged
Status in OpenStack Compute (nova) rocky series:
  Triaged

Bug description:
  Description
  ===
  Baremetal nodes report CPU, RAM and DISK inventory.

  The issue is that allocations for baremetal nodes are only done considering 
the custom_resource_class. This happens because baremetal flavors are set to 
not consume these resources.
  See: 
https://docs.openstack.org/ironic/queens/install/configure-nova-flavors.html

  If we use a flavor that doesn't include a custom_resource_class,
  placement can include a baremetal node that is already deployed because 
CPU, RAM and disk are available (which results in an error from Ironic), or 
worse, the instance is created on a baremetal node (if it wasn't deployed yet).

  
  Environment
  ===
  Nova and Ironic running Queens release.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1796920/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1771810] [NEW] Quota calculation connects to all available cells

2018-05-17 Thread Belmiro Moreira
Public bug reported:

Quota utilisation calculation connects to all cells DBs to get all consumed 
resources for a project.
When having several cells this can be inefficient and can fail if one of the 
cell DBs is not available.

To calculate the quota utilization of a project it should be enough to use
only the cells where the project has/had instances. This information is
available in the nova_api DB.
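
A rough sketch of the idea, with the nova objects reduced to plain functions
(illustrative only, not the actual nova code):

# Idea: look up the project's instance mappings in the API DB first, then
# only count usage in the cells that actually appear there.
def count_project_cores(project_id, get_instance_mappings, count_cores_in_cell):
    cells = {m['cell_uuid'] for m in get_instance_mappings(project_id)}
    return sum(count_cores_in_cell(cell, project_id) for cell in cells)

# Tiny usage example with in-memory stand-ins for the DB calls.
mappings = [{'cell_uuid': 'cell1'}, {'cell_uuid': 'cell1'}, {'cell_uuid': 'cell2'}]
usage = {'cell1': 40, 'cell2': 2}
print(count_project_cores(
    'demo',
    lambda project_id: mappings,
    lambda cell, project_id: usage[cell]))  # -> 42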

** Affects: nova
 Importance: Undecided
 Assignee: Surya Seetharaman (tssurya)
 Status: New


** Tags: cells quotas

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1771810

Title:
  Quota calculation connects to all available cells

Status in OpenStack Compute (nova):
  New

Bug description:
  Quota utilisation calculation connects to all cells DBs to get all consumed 
resources for a project.
  When having several cells this can be inefficient and can fail if one of the 
cell DBs is not available.

  To calculate the quota utilization of a project it should be enough to
  use only the cells where the project has/had instances. This
  information is available in the nova_api DB.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1771810/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1771806] [NEW] Ironic nova-compute failover creates new resource provider removing the resource_provider_aggregates link

2018-05-17 Thread Belmiro Moreira
Public bug reported:

When using the request_filter functionality, aggregates are mapped to 
placement_aggregates.
placement_provider_aggregates contains the resource providers mapped in 
aggregate_hosts.

The problem happens when a nova-compute for ironic fails and hosts are
automatically moved to a different nova-compute. In this case a new
compute_node entry is created, which originates a new resource provider.

As consequence the placement_provider_aggregates doesn't have the new
resource providers.

** Affects: nova
 Importance: Undecided
 Assignee: Surya Seetharaman (tssurya)
 Status: New


** Tags: ironic placement

** Tags removed: placem
** Tags added: ironic placement

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1771806

Title:
  Ironic nova-compute failover creates new resource provider removing
  the resource_provider_aggregates link

Status in OpenStack Compute (nova):
  New

Bug description:
  When using the request_filter functionality, aggregates are mapped to 
placement_aggregates.
  placement_provider_aggregates contains the resource providers mapped in 
aggregate_hosts.

  The problem happens when a nova-compute for ironic fails and hosts are
  automatically moved to a different nova-compute. In this case a new
  compute_node entry is created, which originates a new resource provider.

  As consequence the placement_provider_aggregates doesn't have the new
  resource providers.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1771806/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1768876] [NEW] Old instances can't get AVZ from metadata

2018-05-03 Thread Belmiro Moreira
Public bug reported:

Can't get AVZ for old instances:

curl http://169.254.169.254/latest/meta-data/placement/availability-zone 
None#

This is because the upcall to the nova_api DB was removed in commit 9f7bac2
and old instances may not have the AVZ defined.
Previously, the AVZ in the instance was only set if explicitly defined by the 
user.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1768876

Title:
  Old instances can't get AVZ from metadata

Status in OpenStack Compute (nova):
  New

Bug description:
  Can't get AVZ for old instances:

  curl http://169.254.169.254/latest/meta-data/placement/availability-zone 
  None#

  This is because the upcall to the nova_api DB was removed in commit 9f7bac2
  and old instances may not have the AVZ defined.
  Previously, the AVZ in the instance was only set if explicitly defined by
  the user.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1768876/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1767309] [NEW] Placement - Make association_refresh configurable

2018-04-27 Thread Belmiro Moreira
Public bug reported:

In Queens the provider-tree refresh happens every 5 min (also in master).
ASSOCIATION_REFRESH = 300

For large deployments this creates unnecessary load in placement.
This option should be configurable.

related with:
https://review.openstack.org/#/c/535517/

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1767309

Title:
  Placement - Make association_refresh configurable

Status in OpenStack Compute (nova):
  New

Bug description:
  In Queens the provider-tree refresh happens every 5 min (also in master).
  ASSOCIATION_REFRESH = 300

  For large deployments this creates unnecessary load in placement.
  This option should be configurable.

  related with:
  https://review.openstack.org/#/c/535517/

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1767309/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1767303] [NEW] Scheduler connects to all cells DBs to gather compute nodes info

2018-04-27 Thread Belmiro Moreira
Public bug reported:

The scheduler host_manager connects to all cells DBs to get compute node
info even if only a subset of compute node UUIDs is given by
placement.

This has a performance impact in large cloud deployments with several
cells.

Also related with:
https://review.openstack.org/#/c/539617/9/nova/scheduler/host_manager.py

{code}
def _get_computes_for_cells(self, context, cells, compute_uuids=None):
    for cell in cells:
        LOG.debug('Getting compute nodes and services for cell %(cell)s',
                  {'cell': cell.identity})
        with context_module.target_cell(context, cell) as cctxt:
            if compute_uuids is None:
                compute_nodes[cell.uuid].extend(
                    objects.ComputeNodeList.get_all(cctxt))
            else:
                compute_nodes[cell.uuid].extend(
                    objects.ComputeNodeList.get_all_by_uuids(
                        cctxt, compute_uuids))
            services.update(
                {service.host: service
                 for service in objects.ServiceList.get_by_binary(
                     cctxt, 'nova-compute',
                     include_disabled=True)})
    return compute_nodes, services
{code}

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1767303

Title:
  Scheduler connects to all cells DBs to gather compute nodes info

Status in OpenStack Compute (nova):
  New

Bug description:
  The scheduler host_manager connects to all cells DBs to get compute
  node info even if only a subset of compute node UUIDs is given by
  placement.

  This has a performance impact in large cloud deployments with several
  cells.

  Also related with:
  https://review.openstack.org/#/c/539617/9/nova/scheduler/host_manager.py

  {code}
  def _get_computes_for_cells(self, context, cells, compute_uuids=None):
      for cell in cells:
          LOG.debug('Getting compute nodes and services for cell %(cell)s',
                    {'cell': cell.identity})
          with context_module.target_cell(context, cell) as cctxt:
              if compute_uuids is None:
                  compute_nodes[cell.uuid].extend(
                      objects.ComputeNodeList.get_all(cctxt))
              else:
                  compute_nodes[cell.uuid].extend(
                      objects.ComputeNodeList.get_all_by_uuids(
                          cctxt, compute_uuids))
              services.update(
                  {service.host: service
                   for service in objects.ServiceList.get_by_binary(
                       cctxt, 'nova-compute',
                       include_disabled=True)})
      return compute_nodes, services
  {code}

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1767303/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1761197] [NEW] Not defined keypairs in instance_extra cellsV1 DBs

2018-04-04 Thread Belmiro Moreira
Public bug reported:

In Newton there was a data migration to fill the "keypair" in the instance_extra 
table.
The migration checks if an instance has a keypair and then adds the keypair 
entry in the instance_extra table. This works if the keypair still exists in 
the keypair table.

However, when running with cellsV1 the keypairs only exist in the top DB and the 
migration only works in the instance_extra table of that DB.
This means that in all cell DBs the instance_extra has the keypair not defined.

This is important when migrating to cellsV2 because we will rely on the
cell DBs.

We should have a migration that gets the keypairs from nova_api DB to
fill the keypair in instance_extra of the different cells DBs.

** Affects: nova
 Importance: Undecided
 Assignee: Surya Seetharaman (tssurya)
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1761197

Title:
  Not defined keypairs in instance_extra cellsV1 DBs

Status in OpenStack Compute (nova):
  New

Bug description:
  In Newton there was a data migration to fill the "keypair" in the 
instance_extra table.
  The migration checks if an instance has a keypair and then adds the keypair 
entry in the instance_extra table. This works if the keypair still exists in 
the keypair table.

  However, when running with cellsV1 the keypairs only exist in the top DB and 
the migration only works in the instance_extra table of that DB.
  This means that in all cell DBs the instance_extra has the keypair not 
defined.

  This is important when migrating to cellsV2 because we will rely on
  the cell DBs.

  We should have a migration that gets the keypairs from nova_api DB to
  fill the keypair in instance_extra of the different cells DBs.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1761197/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1761198] [NEW] "Orphan" request_specs and instance_mappings

2018-04-04 Thread Belmiro Moreira
Public bug reported:

request_specs and instance_mappings in nova_api DB are not removed when an 
instance is deleted.
In Queens they are removed when the instances are archived 
(https://review.openstack.org/#/c/515034/)

However, for the deployments that archived instances before running
Queens they will have request_specs and instance_mappings that are not
associated to any instance (they were already deleted).

We should have a nova-manage tool to clean these "orphan" records.

** Affects: nova
 Importance: Undecided
 Assignee: Surya Seetharaman (tssurya)
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1761198

Title:
  "Orphan" request_specs and instance_mappings

Status in OpenStack Compute (nova):
  New

Bug description:
  request_specs and instance_mappings in nova_api DB are not removed when an 
instance is deleted.
  In Queens they are removed when the instances are archived 
(https://review.openstack.org/#/c/515034/)

  However, for the deployments that archived instances before running
  Queens they will have request_specs and instance_mappings that are not
  associated to any instance (they were already deleted).

  We should have a nova-manage tool to clean these "orphan" records.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1761198/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1757472] [NEW] Required to define database/connection when running services for nova_api cell

2018-03-21 Thread Belmiro Moreira
Public bug reported:

Services in nova_api cell fail to run if database/connection is not defined.
These services should only use api_database/connection. 

In devstack database/connection is defined with the cell0 DB endpoint.
This shouldn't be required because the cell0 is set in nova_api DB.

** Affects: nova
 Importance: Undecided
 Assignee: Surya Seetharaman (tssurya)
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1757472

Title:
  Required to define database/connection when running  services for
  nova_api cell

Status in OpenStack Compute (nova):
  New

Bug description:
  Services in nova_api cell fail to run if database/connection is not defined.
  These services should only use api_database/connection. 

  In devstack database/connection is defined with the cell0 DB endpoint.
  This shouldn't be required because the cell0 is set in nova_api DB.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1757472/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1735353] [NEW] build_request not deleted when using cellsV1 and local nova_api DB

2017-11-30 Thread Belmiro Moreira
Public bug reported:

Description
===
build_request not deleted when using cellsV1 and local nova_api

Placement needs to be enabled in Newton.
CellsV1 installations can deploy a placement service per child cell in order to 
have more efficient scheduling during the transition to cellsV2. This requires a 
nova_api DB per cell.

With this configuration the "build_request" that was created in the top
nova_api DB is not deleted after the VM creation, because the deletion is
triggered in "conductor/manager.py", which runs in the child cell and points
to the local nova_api DB.

This leaves new VMs in BUILD state.


Expected result
===
build_request is removed from top nova_api DB.

Actual result
=
nova-cells tries to remove build_request from local cell nova_api DB.

Environment
===
Nova newton

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: cells

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1735353

Title:
  build_request not deleted when using cellsV1 and local nova_api DB

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===
  build_request not deleted when using cellsV1 and local nova_api

  Placement needs to be enabled in Newton.
  CellsV1 installations can deploy a placement service per child cell in order 
to have more efficient scheduling during the transition to cellsV2. This 
requires a nova_api DB per cell.

  With this configuration the "build_request" that was created in the
  top nova_api DB is not deleted after the VM creation, because the
  deletion is triggered in "conductor/manager.py", which runs in the child
  cell and points to the local nova_api DB.

  This leaves new VMs in BUILD state.

  
  Expected result
  ===
  build_request is removed from top nova_api DB.

  Actual result
  =
  nova-cells tries to remove build_request from local cell nova_api DB.

  Environment
  ===
  Nova newton

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1735353/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1727266] [NEW] archive_deleted_instances is not atomic for insert/delete

2017-10-25 Thread Belmiro Moreira
Public bug reported:

Description
===
Archive deleted instances first moves deleted rows to the shadow
tables and then deletes the rows from the original tables.
However, because it does 2 different selects (to get the rows to insert
and to delete) we can have the case that a row is not inserted in the
shadow table but removed from the original.
This can happen when there are new deleted rows between the insert and
delete.
Shouldn't we delete explicitly only the IDs that were inserted?


See:
insert = shadow_table.insert(inline=True).\
    from_select(columns,
                sql.select([table],
                           deleted_column != deleted_column.default.arg).
                order_by(column).limit(max_rows))
query_delete = sql.select([column],
                          deleted_column != deleted_column.default.arg).\
    order_by(column).limit(max_rows)

delete_statement = DeleteFromSelect(table, query_delete, column)

(...)

conn.execute(insert)
result_delete = conn.execute(delete_statement)
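
One way to make the two statements cover exactly the same rows, sketched in the
same SQLAlchemy style as above (illustrative only, not the actual nova fix):
select the candidate IDs first, then insert and delete by that explicit list
inside one transaction:

# Sketch: pin the archive batch to an explicit list of primary keys so the
# insert into the shadow table and the delete remove exactly the same rows.
id_query = sql.select([column],
                      deleted_column != deleted_column.default.arg).\
    order_by(column).limit(max_rows)
ids = [row[0] for row in conn.execute(id_query).fetchall()]

if ids:
    with conn.begin():
        conn.execute(shadow_table.insert(inline=True).from_select(
            columns,
            sql.select([table], column.in_(ids))))
        conn.execute(table.delete().where(column.in_(ids)))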

** Affects: nova
 Importance: Undecided
 Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
 Status: New

** Changed in: nova
 Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1727266

Title:
  archive_deleted_instances is not atomic for insert/delete

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===
  Archive deleted instances first moves deleted rows to the shadow
  tables and then deletes the rows from the original tables.
  However, because it does 2 different selects (to get the rows to insert
  and to delete) we can have the case that a row is not inserted in the
  shadow table but removed from the original.
  This can happen when there are new deleted rows between the insert and
  delete.
  Shouldn't we delete explicitly only the IDs that were inserted?

  
  See:
  insert = shadow_table.insert(inline=True).\
      from_select(columns,
                  sql.select([table],
                             deleted_column != deleted_column.default.arg).
                  order_by(column).limit(max_rows))
  query_delete = sql.select([column],
                            deleted_column != deleted_column.default.arg).\
      order_by(column).limit(max_rows)

  delete_statement = DeleteFromSelect(table, query_delete, column)

  (...)

  conn.execute(insert)
  result_delete = conn.execute(delete_statement)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1727266/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1726310] [NEW] nova doesn't list services if it can't connect to a cell DB

2017-10-23 Thread Belmiro Moreira
Public bug reported:

Description
===
nova doesn't list services if it can't connect to a child cell DB.

I would expect nova to show the services from all child DBs that it can connect to.
For the child DBs that it can't connect to, it can show the mandatory services 
(nova-conductor) with the status "not available" and the reason in the disabled 
reason field ("can't connect to the DB")


Steps to reproduce
==
Have at least 2 child cells.
Stop the DB in one of them.

"nova service-list" fails with "ERROR (ClientException): Unexpected API Error."
No information is given about what's causing the problem.

Expected result
===
List the services of the available cells and list the status of the mandatory 
services of the affected cells as "not available".


Actual result
=
$nova service-list
fails.


Environment
===
nova master (commit: 8d21d711000fff80eb367692b157d09b6532923f)

** Affects: nova
 Importance: Undecided
 Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
 Status: New


** Tags: cells

** Changed in: nova
 Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1726310

Title:
  nova doesn't list services if it can't connect to a cell DB

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===
  nova doesn't list services if it can't connect to a child cell DB.

  I would expect nova to show the services from all child DBs that it can 
connect to.
  For the child DBs that it can't connect to, it can show the mandatory services 
(nova-conductor) with the status "not available" and the reason in the disabled 
reason field ("can't connect to the DB")

  
  Steps to reproduce
  ==
  Have at least 2 child cells.
  Stop the DB in one of them.

  "nova service-list" fails with "ERROR (ClientException): Unexpected API 
Error."
  No information is given about what's causing the problem.

  Expected result
  ===
  List the services of the available cells and list the status of the mandatory 
services of the affected cells as "not available".

  
  Actual result
  =
  $nova service-list
  fails.

  
  Environment
  ===
  nova master (commit: 8d21d711000fff80eb367692b157d09b6532923f)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1726310/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1726301] [NEW] Nova should list instances even if it can't connect to a cell DB

2017-10-23 Thread Belmiro Moreira
Public bug reported:

Description
===
One of the goals of cells is to allow nova to scale and to have cells as failure 
domains.
However, if a cell DB goes down nova doesn't list any instance. Even if the 
project doesn't have any instance in the affected cell. This affects all users.

The behavior that I would expect is nova to show what's available from
the nova_api DB if a cell DB is not available. (UUIDs and can we look
into the request_spec?)

Steps to reproduce
==
Have at least 2 child cells.
Stop the DB in one of them.

"nova list" fails with "ERROR (ClientException): Unexpected API Error."
No more information is given to the user.

Expected result
===
List the project instances.
For the instances in the affected cell, list the available information in the 
nova_api DB.

Actual result
=
$nova list
fails without showing the project instances.

Environment
===
nova master (commit: 8d21d711000fff80eb367692b157d09b6532923f)

** Affects: nova
 Importance: Undecided
     Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
 Status: New


** Tags: cells

** Description changed:

  Description
  ===
  One of the goals of cells is to allow nova scale and to have cells as failure 
domains.
  However, if a cell DB goes down nova doesn't list any instance. Even if the 
project doesn't have any instance in the affected cell. This affects all users.
  
  The behavior that I would expect is nova to show what's available from
  the nova_api DB if a cell DB is not available. (UUIDs and can we look
  into the request_spec?)
  
- 
  Steps to reproduce
  ==
  Have at least 2 child cells.
- Stop the DB of one of them.
+ Stop the DB in one of them.
  
  "nova list" fails with "ERROR (ClientException): Unexpected API Error."
  Not given any more information to the user.
  
- 
  Expected result
  ===
  List the project instances.
- For the instances in the affect cell list the available information in the 
nova_api.
+ For the instances in the affect cell, list the available information in the 
nova_api.
  
  Actual result
  =
- $nova list 
- fails without showing the project instance.
+ $nova list
+ fails without showing the project instances.
  
  Environment
  ===
  nova master (commit: 8d21d711000fff80eb367692b157d09b6532923f)

** Changed in: nova
 Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1726301

Title:
  Nova should list instances even if it can't connect to a cell DB

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===
  One of the goals of cells is to allow nova to scale and to have cells as 
failure domains.
  However, if a cell DB goes down nova doesn't list any instance. Even if the 
project doesn't have any instance in the affected cell. This affects all users.

  The behavior that I would expect is nova to show what's available from
  the nova_api DB if a cell DB is not available. (UUIDs and can we look
  into the request_spec?)

  Steps to reproduce
  ==
  Have at least 2 child cells.
  Stop the DB in one of them.

  "nova list" fails with "ERROR (ClientException): Unexpected API Error."
  No more information is given to the user.

  Expected result
  ===
  List the project instances.
  For the instances in the affected cell, list the available information in the 
nova_api DB.

  Actual result
  =
  $nova list
  fails without showing the project instances.

  Environment
  ===
  nova master (commit: 8d21d711000fff80eb367692b157d09b6532923f)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1726301/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1681431] Re: "nova-manage db sync" fails from Mitaka to Newton because deleted compute nodes

2017-04-10 Thread Belmiro Moreira
*** This bug is a duplicate of bug 1665719 ***
https://bugs.launchpad.net/bugs/1665719

Already fixed #1665719

** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1681431

Title:
  "nova-manage db sync" fails from Mitaka to Newton because deleted
  compute nodes

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Description
  ===
  "nova-manage db sync" fails from Mitaka to Newton because deleted compute 
nodes

  DB migration from Mitaka to Newton fails in migration 330 with:
  "error: There are still XX unmigrated records in the compute_nodes table. 
Migration cannot continue until all records have been migrated."

  This migration checks if there are compute_nodes without a UUID.
  However, "nova-manage db online_data_migrations" in Mitaka only
  migrates non deleted compute_node entries.

  
  Steps to reproduce
  ==
  1) Have a nova Mitaka DB (319)
  2) Make sure you have a deleted entry (deleted>0) in "compute_nodes" table.
  3) Make sure all data migrations are done in Mitaka. ("nova-manage db 
online_data_migrations")
  4) Sync the DB for Newton. ("nova-manage db sync" in a Newton node)

  
  Expected result
  ===
  DB migrations succeed (334)

  
  Actual result
  =
  DB doesn't migrate (329)

  
  Environment
  ===
  Tested with "13.1.2" and "14.0.3".

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1681431/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1681431] [NEW] "nova-manage db sync" fails from Mitaka to Newton because deleted compute nodes

2017-04-10 Thread Belmiro Moreira
Public bug reported:

Description
===
"nova-manage db sync" fails from Mitaka to Newton because deleted compute nodes

DB migration from Mitaka to Newton fails in migration 330 with:
"error: There are still XX unmigrated records in the compute_nodes table. 
Migration cannot continue until all records have been migrated."

This migration checks if there are compute_nodes without a UUID.
However, "nova-manage db online_data_migrations" in Mitaka only migrates
non deleted compute_node entries.


Steps to reproduce
==
1) Have a nova Mitaka DB (319)
2) Make sure you have a deleted entry (deleted>0) in "compute_nodes" table.
3) Make sure all data migrations are done in Mitaka. ("nova-manage db 
online_data_migrations")
4) Sync the DB for Newton. ("nova-manage db sync" in a Newton node)


Expected result
===
DB migrations succeed (334)


Actual result
=
DB doesn't migrate (329)


Environment
===
Tested with "13.1.2" and "14.0.3".

** Affects: nova
 Importance: Undecided
 Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
 Status: New

** Changed in: nova
 Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1681431

Title:
  "nova-manage db sync" fails from Mitaka to Newton because deleted
  compute nodes

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===
  "nova-manage db sync" fails from Mitaka to Newton because deleted compute 
nodes

  DB migration from Mitaka to Newton fails in migration 330 with:
  "error: There are still XX unmigrated records in the compute_nodes table. 
Migration cannot continue until all records have been migrated."

  This migration checks if there are compute_nodes without a UUID.
  However, "nova-manage db online_data_migrations" in Mitaka only
  migrates non deleted compute_node entries.

  
  Steps to reproduce
  ==
  1) Have a nova Mitaka DB (319)
  2) Make sure you have a deleted entry (deleted>0) in "compute_nodes" table.
  3) Make sure all data migrations are done in Mitaka. ("nova-manage db 
online_data_migrations")
  4) Sync the DB for Newton. ("nova-manage db sync" in a Newton node)

  
  Expected result
  ===
  DB migrations succeed (334)

  
  Actual result
  =
  DB doesn't migrate (329)

  
  Environment
  ===
  Tested with "13.1.2" and "14.0.3".

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1681431/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1533380] [NEW] Creating multiple instances with a single request when using cells creates wrong instance names

2016-01-12 Thread Belmiro Moreira
Public bug reported:

When creating multiple instances with a single request the instance name has 
the format defined in the "multi_instance_display_name_template" option.
By default: multi_instance_display_name_template=%(name)s-%(count)d
When booting two instances (num-instances=2) with name=test, it is expected to 
have the following instance names:
test-1
test-2

However, if using cells (only considering 2 levels) we have the following names:
test-1-1
test-1-2
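
The doubled suffix comes from each cell level applying the template again; a
quick illustration:

# Each scheduling level re-applies the template to the already-expanded name.
template = '%(name)s-%(count)d'

name = template % {'name': 'test', 'count': 1}   # API level  -> 'test-1'
name = template % {'name': name, 'count': 1}     # child cell -> 'test-1-1'
print(name)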

Increasing the number of cell levels adds more hops to the instance name.
Changing the "multi_instance_display_name_template" to uuids has the same
problem.
For example: (consider a random uuid)
test--
test--
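
A minimal illustration of the reported behaviour, not Nova's actual code
path: the template is applied once per cell level, and the child cell
formats the name it received for the request, which already carries the
suffix added by the API cell.

template = "%(name)s-%(count)d"   # default multi_instance_display_name_template

def apply_template(name, count):
    return template % {"name": name, "count": count}

api_cell = [apply_template("test", i) for i in (1, 2)]        # ['test-1', 'test-2']
child_cell = [apply_template(api_cell[0], i) for i in (1, 2)]
print(child_cell)                                             # ['test-1-1', 'test-1-2']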

** Affects: nova
 Importance: Undecided
     Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
 Status: New


** Tags: cells

** Changed in: nova
 Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

** Description changed:

- When creating multiple instances with a single request the instance name has 
the
- format defined in the "multi_instance_display_name_template" option.
+ When creating multiple instances with a single request the instance name has 
the format defined in the "multi_instance_display_name_template" option.
  By default: multi_instance_display_name_template=%(name)s-%(count)d
- When booting two instances (num-instances=2) with the name=test is expected 
to have
- the following instance names:
+ When booting two instances (num-instances=2) with the name=test is expected 
to have the following instance names:
  test-1
  test-2
  
  However, if using cells (only considering 2 levels) we have the following 
names:
  test-1-1
  test-1-2
  
  Increasing the number of cell levels adds more hops in the instance name.
  Changing the "multi_instance_display_name_template" to uuids has the same 
problem.
  For example: (consider  a random uuid)
  test--
  test--

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1533380

Title:
  Creating multiple instances with a single request when using cells
  creates wrong instance names

Status in OpenStack Compute (nova):
  New

Bug description:
  When creating multiple instances with a single request, the instance name
  has the format defined in the "multi_instance_display_name_template" option.
  By default: multi_instance_display_name_template=%(name)s-%(count)d
  When booting two instances (num-instances=2) with name=test, the expected
  instance names are:
  test-1
  test-2

  However, if using cells (only considering 2 levels) we have the following 
names:
  test-1-1
  test-1-2

  Increasing the number of cell levels adds more hops to the instance name.
  Changing the "multi_instance_display_name_template" to uuids has the same
  problem.
  For example: (consider a random uuid)
  test--
  test--

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1533380/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1532562] [NEW] Cell capacities updates include available resources of compute nodes "down"

2016-01-10 Thread Belmiro Moreira
Public bug reported:

If a child cell has compute nodes that are enabled but have no recent
heartbeat update ("XXX" state in "nova-manage service list"), the child cell
continues to consider the available resources of these compute nodes when
updating the cell capacity.
This can be problematic when having several cells and trying to fill them
completely.
Requests are sent to the cell that can fit more instances of the requested
type; however, when compute nodes are "down" the requests will fail with
"No valid host" in that cell.

When updating the cell capacity the "disabled" compute nodes are excluded.
The same should happen if a compute node has not had a heartbeat update
within "CONF.service_down_time".

How to reproduce:
1) Have a cell environment with 2 child cells (A and B).
2) Have nova-cells running in "debug". Confirm that the "Received capacities
from child cell" A and B (in the top nova-cells log) match the number of
available resources.
4) Stop some compute nodes in cell A.
5) Confirm that the "Received capacities from child cell A" don't change.
6) Cell scheduler can send requests to cell A that can fail with "No valid 
host".

** Affects: nova
 Importance: Undecided
 Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
     Status: New


** Tags: cells

** Changed in: nova
 Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1532562

Title:
  Cell capacities updates include available resources of compute nodes
  "down"

Status in OpenStack Compute (nova):
  New

Bug description:
  If a child cell has compute nodes that are enabled but have no recent
  heartbeat update ("XXX" state in "nova-manage service list"), the child
  cell continues to consider the available resources of these compute nodes
  when updating the cell capacity.
  This can be problematic when having several cells and trying to fill them
  completely.
  Requests are sent to the cell that can fit more instances of the requested
  type; however, when compute nodes are "down" the requests will fail with
  "No valid host" in that cell.

  When updating the cell capacity the "disabled" compute nodes are excluded.
  The same should happen if a compute node has not had a heartbeat update
  within "CONF.service_down_time".

  How to reproduce:
  1) Have a cell environment with 2 child cells (A and B).
  2) Have nova-cells running in "debug". Confirm that the "Received
  capacities from child cell" A and B (in the top nova-cells log) match the
  number of available resources.
  4) Stop some compute nodes in cell A.
  5) Confirm that the "Received capacities from child cell A" don't change.
  6) Cell scheduler can send requests to cell A that can fail with "No valid 
host".

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1532562/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1524114] [NEW] nova-scheduler also loads deleted instances at startup

2015-12-08 Thread Belmiro Moreira
Public bug reported:

nova-scheduler is loading all instances (including deleted) at startup.

We experienced problems when each node has >6000 deleted instances, even when
using batches of 10 nodes.
Each query can take several minutes and transfer several GB of data.
This prevented nova-scheduler from connecting to RabbitMQ.


###
When nova-scheduler starts it calls "_async_init_instance_info()", which does
an "InstanceList.get_by_filters" in batches of 10 nodes. This uses
"instance_get_all_by_filters_sort"; however, "Deleted instances will be
returned by default, unless there's a filter that says otherwise".
Adding the filter {"deleted": False} fixes the problem.
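
A minimal sketch of that fix follows; "objects", "context" and "node_batch"
are assumed to be in scope as in _async_init_instance_info(), and the exact
keyword arguments of get_by_filters may differ between Nova releases.

# Minimal sketch of the fix described above (names assumed, not verified
# against a specific commit): exclude soft-deleted instances per host batch.
filters = {"host": node_batch, "deleted": False}
instances = objects.InstanceList.get_by_filters(
    context, filters, expected_attrs=[], use_slave=True)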

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: scheduler

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1524114

Title:
  nova-scheduler also loads deleted instances at startup

Status in OpenStack Compute (nova):
  New

Bug description:
  nova-scheduler is loading all instances (including deleted) at
  startup.

  We experienced problems when each node has >6000 deleted instances, even
  when using batches of 10 nodes.
  Each query can take several minutes and transfer several GB of data.
  This prevented nova-scheduler from connecting to RabbitMQ.

  
  ###
  When nova-scheduler starts it calls "_async_init_instance_info()", which
  does an "InstanceList.get_by_filters" in batches of 10 nodes. This uses
  "instance_get_all_by_filters_sort"; however, "Deleted instances will be
  returned by default, unless there's a filter that says otherwise".
  Adding the filter {"deleted": False} fixes the problem.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1524114/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1517006] [NEW] Can't create instances with flavors that have extra specs in a cell setup

2015-11-17 Thread Belmiro Moreira
Public bug reported:

In a cell setup, instances can't be created with flavors that have extra specs like:
hw:numa_nodes
hw:mem_page_size


nova-cell in the "child cell" fails with:

2015-11-17 10:51:50.574 ERROR nova.cells.scheduler 
[req-f7dc64e6-a545-4c2c-bc57-4e4a2e86cf58 demo demo] Couldn't communicate with 
cell 'cell'
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler Traceback (most recent call 
last):
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File 
"/opt/stack/nova/nova/cells/scheduler.py", line 186, in _build_instances
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler image, security_groups, 
block_device_mapping)
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File 
"/opt/stack/nova/nova/cells/scheduler.py", line 109, in _create_instances_here
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler 
instance.update(instance_values)
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File 
"/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 
727, in update
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler setattr(self, key, value)
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File 
"/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 
71, in setter
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler field_value = 
field.coerce(self, name, value)
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File 
"/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/fields.py", line 
189, in coerce
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler return 
self._type.coerce(obj, attr, value)
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File 
"/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/fields.py", line 
506, in coerce
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler 'valtype': obj_name})
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler ValueError: An object of 
type InstanceNUMATopology is required in field numa_topology, not a 
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler 
2015-11-17 10:51:50.574 ERROR nova.cells.scheduler 
[req-f7dc64e6-a545-4c2c-bc57-4e4a2e86cf58 demo demo] Couldn't communicate with 
any cells


Reproduce steps:
1) Setup nova in order to use cells.

2) Create a flavor with the extra spec "hw:numa_nodes"
nova flavor-create m1.nano.numa2 30 64 1 1
nova flavor-key 30 set hw:numa_nodes=1

3) Create an instance with the new flavor


Actual Result:
Instance status: ERROR
Instance task state: scheduling

Trace in "child cell".


Tested in devstack (master).
Tested in Kilo.
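
A sketch of the failure mode, not a patch: the cells scheduler passes
"instance_values" whose numa_topology is still a primitive rather than an
InstanceNUMATopology object, so Instance.update() cannot coerce it. A
conversion along these lines would be needed before calling update();
obj_from_primitive() is the standard oslo.versionedobjects deserializer,
while the exact shape of the primitive here is an assumption.

# Hedged sketch around nova/cells/scheduler.py _create_instances_here();
# "instance_values" and "instance" are the variables from the traceback.
from nova import objects

numa = instance_values.get("numa_topology")
if numa is not None and not isinstance(numa, objects.InstanceNUMATopology):
    instance_values["numa_topology"] = (
        objects.InstanceNUMATopology.obj_from_primitive(numa))
instance.update(instance_values)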

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: cells

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1517006

Title:
  Can't create instances with flavors that have extra specs in a cell
  setup

Status in OpenStack Compute (nova):
  New

Bug description:
  In a cell setup, instances can't be created with flavors that have extra
  specs like:
  hw:numa_nodes
  hw:mem_page_size

  
  nova-cell in the "child cell" fails with:

  2015-11-17 10:51:50.574 ERROR nova.cells.scheduler 
[req-f7dc64e6-a545-4c2c-bc57-4e4a2e86cf58 demo demo] Couldn't communicate with 
cell 'cell'
  2015-11-17 10:51:50.574 TRACE nova.cells.scheduler Traceback (most recent 
call last):
  2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File 
"/opt/stack/nova/nova/cells/scheduler.py", line 186, in _build_instances
  2015-11-17 10:51:50.574 TRACE nova.cells.scheduler image, 
security_groups, block_device_mapping)
  2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File 
"/opt/stack/nova/nova/cells/scheduler.py", line 109, in _create_instances_here
  2015-11-17 10:51:50.574 TRACE nova.cells.scheduler 
instance.update(instance_values)
  2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File 
"/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 
727, in update
  2015-11-17 10:51:50.574 TRACE nova.cells.scheduler setattr(self, key, 
value)
  2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File 
"/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 
71, in setter
  2015-11-17 10:51:50.574 TRACE nova.cells.scheduler field_value = 
field.coerce(self, name, value)
  2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File 
"/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/fields.py", line 
189, in coerce
  2015-11-17 10:51:50.574 TRACE nova.cells.scheduler return 
self._type.coerce(obj, attr, value)
  2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File 
"/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/fields.py", line 
506, in coerce
  2015-11-17 10:51:50.574 TRACE nova.cells.scheduler 'valtype': obj_name})
  2015-11-17 10:51:50.574 TRACE nova.cells.scheduler ValueError: An object of 
type InstanceNUMATopology is required in field numa_topology, not a 
  2015-11-17 

[Yahoo-eng-team] [Bug 1461777] [NEW] NUMA cell overcommit can leave NUMA cells unused

2015-06-04 Thread Belmiro Moreira
Public bug reported:

NUMA cell overcommit can leave NUMA cells unused

When no NUMA configuration is defined for the guest (no flavor extra specs),
nova identifies the NUMA topology of the host and tries to match the cpu 
placement to a NUMA cell (cpuset). 

The cpuset is selected randomly.
pin_cpuset = random.choice(viable_cells_cpus) #nova/virt/libvirt/driver.py

However, this can lead to NUMA cells not being used.
This is particularly noticeable when the flavor has the same number of vcpus
as a host NUMA cell and the host CPUs are not overcommitted
(cpu_allocation_ratio = 1).

###
Particular use case:

Compute nodes with the NUMA topology:
VirtNUMAHostTopology: {'cells': [{'mem': {'total': 12279, 'used': 0}, 
'cpu_usage': 0, 'cpus': '0,1,2,3,8,9,10,11', 'id': 0}, {'mem': {'total': 12288, 
'used': 0}, 'cpu_usage': 0, 'cpus': '4,5,6,7,12,13,14,15', 'id': 1}]}

No CPU overcommit: cpu_allocation_ratio = 1
Boot instances using a flavor with 8 vcpus. 
(No NUMA topology defined for the guest in the flavor)

In this particular case the host can have 2 instances (no cpu overcommit).
Both instances can be allocated (at random) the same cpuset from the 2
options:
<vcpu placement='static' cpuset='4-7,12-15'>8</vcpu>
<vcpu placement='static' cpuset='0-3,8-11'>8</vcpu>

As a consequence, half of the host CPUs are not used.
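
A minimal sketch of the behaviour described above and one possible
alternative; "viable_cells_cpus" mirrors the variable named in
nova/virt/libvirt/driver.py, and the per-cell usage counts are hypothetical.

import random

viable_cells_cpus = [[0, 1, 2, 3, 8, 9, 10, 11], [4, 5, 6, 7, 12, 13, 14, 15]]

# Current behaviour: every guest picks a cell at random, so both guests can
# land on the same cell.
pin_cpuset = random.choice(viable_cells_cpus)

# Usage-aware alternative (sketch): prefer the cell with the fewest CPUs
# already pinned.
pinned_per_cell = {0: 8, 1: 0}   # hypothetical usage after the first guest
least_used = min(range(len(viable_cells_cpus)), key=pinned_per_cell.get)
pin_cpuset = viable_cells_cpus[least_used]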


###
How to reproduce:

Using: nova 2014.2.2
(not tested in trunk however the code path looks similar)

1. set cpu_allocation_ratio = 1
2. Identify the NUMA topology of the compute node
3. Using a flavor with a number of vcpus that matches a NUMA cell in the
compute node, boot instances until the compute node is full.
4. Check the cpu placement cpuset used by each instance.

Notes: 
- at this point instances can end up with the same cpuset, leaving NUMA cells unused.
- the selection of the cpuset is random, so several tries may be needed.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: libvirt

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1461777

Title:
  NUMA cell overcommit can leave NUMA cells unused

Status in OpenStack Compute (Nova):
  New

Bug description:
  NUMA cell overcommit can leave NUMA cells unused

  When no NUMA configuration is defined for the guest (no flavor extra specs),
  nova identifies the NUMA topology of the host and tries to match the cpu 
  placement to a NUMA cell (cpuset). 

  The cpuset is selected randomly.
  pin_cpuset = random.choice(viable_cells_cpus) #nova/virt/libvirt/driver.py

  However, this can lead to NUMA cells not being used.
  This is particularly noticeable when the flavor has the same number of
  vcpus as a host NUMA cell and the host CPUs are not overcommitted
  (cpu_allocation_ratio = 1).

  ###
  Particular use case:

  Compute nodes with the NUMA topology:
  VirtNUMAHostTopology: {'cells': [{'mem': {'total': 12279, 'used': 0}, 
'cpu_usage': 0, 'cpus': '0,1,2,3,8,9,10,11', 'id': 0}, {'mem': {'total': 12288, 
'used': 0}, 'cpu_usage': 0, 'cpus': '4,5,6,7,12,13,14,15', 'id': 1}]}

  No CPU overcommit: cpu_allocation_ratio = 1
  Boot instances using a flavor with 8 vcpus. 
  (No NUMA topology defined for the guest in the flavor)

  In this particular case the host can have 2 instances (no cpu overcommit).
  Both instances can be allocated (at random) the same cpuset from the 2
  options:
  <vcpu placement='static' cpuset='4-7,12-15'>8</vcpu>
  <vcpu placement='static' cpuset='0-3,8-11'>8</vcpu>

  As a consequence, half of the host CPUs are not used.

  
  ###
  How to reproduce:

  Using: nova 2014.2.2
  (not tested in trunk however the code path looks similar)

  1. set cpu_allocation_ratio = 1
  2. Identify the NUMA topology of the compute node
  3. Using a flavor with a number of vcpus that matches a NUMA cell in the
  compute node, boot instances until the compute node is full.
  4. Check the cpu placement cpuset used by each instance.

  Notes: 
  - at this point instances can end up with the same cpuset, leaving NUMA cells unused.
  - the selection of the cpuset is random, so several tries may be needed.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1461777/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1454418] [NEW] Evacuate fails when using cells - AttributeError: 'NoneType' object has no attribute 'count'

2015-05-12 Thread Belmiro Moreira
Public bug reported:

nova version: 2014.2.2
Using cells (parent - child setup)


How to reproduce:

nova evacuate instance_uuid target_host
ERROR: The server has either erred or is incapable of performing the requested 
operation. (HTTP 500) (Request-ID: req-af20-182a-4acd-869a-1b23314b21d4)


LOG:

2015-05-12 23:17:27.274 8013 ERROR nova.api.openstack 
[req-af20-182a-4acd-869a-1b23314b21d4 None] Caught error: 'NoneType' object 
has no attribute 'count'
Traceback (most recent call last):

  File /usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py, 
line 134, in _dispatch_and_reply
incoming.message))

  File /usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py, 
line 177, in _dispatch
return self._do_dispatch(endpoint, method, ctxt, args)

  File /usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py, 
line 123, in _do_dispatch
result = getattr(endpoint, method)(ctxt, **new_args)

  File /usr/lib/python2.7/site-packages/nova/cells/manager.py, line 268, in 
service_get_by_compute_host
service = response.value_or_raise()

  File /usr/lib/python2.7/site-packages/nova/cells/messaging.py, line 406, in 
process
next_hop = self._get_next_hop()

  File /usr/lib/python2.7/site-packages/nova/cells/messaging.py, line 361, in 
_get_next_hop
dest_hops = target_cell.count(_PATH_CELL_SEP)

AttributeError: 'NoneType' object has no attribute 'count'
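
The failing line counts hops on a target_cell that is None. A defensive
check along these lines only illustrates where the crash happens, not a
proposed patch; CellRoutingInconsistency is the exception nova already uses
for routing errors.

# Sketch only, around nova/cells/messaging.py _get_next_hop(): fail with a
# routing error instead of an AttributeError when the cell path is missing.
from nova import exception

def count_dest_hops(target_cell, path_cell_sep):
    if not target_cell:
        reason = "no target cell set for this message"
        raise exception.CellRoutingInconsistency(reason=reason)
    return target_cell.count(path_cell_sep)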

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: cells

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1454418

Title:
  Evacuate fails when using cells - AttributeError: 'NoneType' object
  has no attribute 'count'

Status in OpenStack Compute (Nova):
  New

Bug description:
  nova version: 2014.2.2
  Using cells (parent - child setup)

  
  How to reproduce:

  nova evacuate instance_uuid target_host
  ERROR: The server has either erred or is incapable of performing the 
requested operation. (HTTP 500) (Request-ID: 
req-af20-182a-4acd-869a-1b23314b21d4)


  LOG:

  2015-05-12 23:17:27.274 8013 ERROR nova.api.openstack 
[req-af20-182a-4acd-869a-1b23314b21d4 None] Caught error: 'NoneType' object 
has no attribute 'count'
  Traceback (most recent call last):

File /usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py, 
line 134, in _dispatch_and_reply
  incoming.message))

File /usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py, 
line 177, in _dispatch
  return self._do_dispatch(endpoint, method, ctxt, args)

File /usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py, 
line 123, in _do_dispatch
  result = getattr(endpoint, method)(ctxt, **new_args)

File /usr/lib/python2.7/site-packages/nova/cells/manager.py, line 268, in 
service_get_by_compute_host
  service = response.value_or_raise()

File /usr/lib/python2.7/site-packages/nova/cells/messaging.py, line 406, 
in process
  next_hop = self._get_next_hop()

File /usr/lib/python2.7/site-packages/nova/cells/messaging.py, line 361, 
in _get_next_hop
  dest_hops = target_cell.count(_PATH_CELL_SEP)

  AttributeError: 'NoneType' object has no attribute 'count'

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1454418/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1448564] [NEW] Rescue using cells fails with: unexpected keyword argument 'expected_task_state'

2015-04-25 Thread Belmiro Moreira
Public bug reported:

Instance rescue gets stuck when using cells.

nova version: 2014.2.2
Using cells (parent - child setup)


How to reproduce:

nova rescue instance_uuid 
- the instance task state stays in rescuing.
- nova cells log of the child shows:

2015-04-26 01:26:09.475 20672 ERROR nova.cells.messaging 
[req-162b3318-70c3-4290-8e09-ffb9fbcef19d None] Error processing message 
locally: save() got an unexpected keyword argument 'expected_task_state'
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging Traceback (most recent 
call last):
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File 
/usr/lib/python2.7/site-packages/nova/cells/messaging.py, line 199, in 
_process_locally
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging resp_value = 
self.msg_runner._process_message_locally(self)
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File 
/usr/lib/python2.7/site-packages/nova/cells/messaging.py, line 1293, in 
_process_message_locally
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging return fn(message, 
**message.method_kwargs)
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File 
/usr/lib/python2.7/site-packages/nova/cells/messaging.py, line 698, in 
run_compute_api_method
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging return 
fn(message.ctxt, *args, **method_info['method_kwargs'])
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File 
/usr/lib/python2.7/site-packages/nova/compute/api.py, line 224, in wrapped
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging return func(self, 
context, target, *args, **kwargs)
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File 
/usr/lib/python2.7/site-packages/nova/compute/api.py, line 214, in inner
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging return 
function(self, context, instance, *args, **kwargs)
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File 
/usr/lib/python2.7/site-packages/nova/compute/api.py, line 195, in inner
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging return f(self, 
context, instance, *args, **kw)
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File 
/usr/lib/python2.7/site-packages/nova/compute/api.py, line 2750, in rescue
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging 
instance.save(expected_task_state=[None])
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging TypeError: save() got 
an unexpected keyword argument 'expected_task_state'
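
As the traceback shows, the child-cell compute API calls
instance.save(expected_task_state=[None]) but the save() of the instance
object it received does not accept that keyword. The shim below only
illustrates the mismatch; it is a hedged sketch, not a proposed fix.

# Illustration of the signature mismatch described above, not a patch:
# retry the save without the keyword the proxied object does not accept.
def save_compat(instance, **kwargs):
    try:
        return instance.save(**kwargs)
    except TypeError:
        kwargs.pop("expected_task_state", None)
        return instance.save(**kwargs)

# e.g. save_compat(instance, expected_task_state=[None])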

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: cells

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1448564

Title:
  Rescue using cells fails with: unexpected keyword argument
  'expected_task_state'

Status in OpenStack Compute (Nova):
  New

Bug description:
  Instance rescue gets stuck when using cells.

  nova version: 2014.2.2
  Using cells (parent - child setup)

  
  How to reproduce:

  nova rescue instance_uuid 
  - the instance task state stays in rescuing.
  - nova cells log of the child shows:

  2015-04-26 01:26:09.475 20672 ERROR nova.cells.messaging 
[req-162b3318-70c3-4290-8e09-ffb9fbcef19d None] Error processing message 
locally: save() got an unexpected keyword argument 'expected_task_state'
  2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging Traceback (most 
recent call last):
  2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File 
/usr/lib/python2.7/site-packages/nova/cells/messaging.py, line 199, in 
_process_locally
  2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging resp_value = 
self.msg_runner._process_message_locally(self)
  2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File 
/usr/lib/python2.7/site-packages/nova/cells/messaging.py, line 1293, in 
_process_message_locally
  2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging return 
fn(message, **message.method_kwargs)
  2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File 
/usr/lib/python2.7/site-packages/nova/cells/messaging.py, line 698, in 
run_compute_api_method
  2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging return 
fn(message.ctxt, *args, **method_info['method_kwargs'])
  2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File 
/usr/lib/python2.7/site-packages/nova/compute/api.py, line 224, in wrapped
  2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging return 
func(self, context, target, *args, **kwargs)
  2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File 
/usr/lib/python2.7/site-packages/nova/compute/api.py, line 214, in inner
  2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging return 
function(self, context, instance, *args, **kwargs)
  2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File 
/usr/lib/python2.7/site-packages/nova/compute/api.py, line 195, in inner
  

[Yahoo-eng-team] [Bug 1417027] [NEW] No disable reason defined for new services when enable_new_services=False

2015-02-02 Thread Belmiro Moreira
Public bug reported:

When a service is added and enable_new_services=False there is no disable
reason specified.
Services can be disabled for several reasons and admins can use the API to
specify a reason. However, having services disabled with no reason specified
creates additional checks on the operators' side, and these grow with the
deployment size.
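
A hypothetical sketch of the behaviour this asks for: record a default
reason when a new service is auto-disabled, so operators can tell these
apart from manually disabled services. The field names follow the
os-services API; the exact hook point in service creation is an assumption.

# Hypothetical sketch, not the actual nova service-creation code path;
# "host" and "CONF" are assumed to be in scope.
service_values = {
    "host": host,
    "binary": "nova-compute",
    "disabled": not CONF.enable_new_services,
    "disabled_reason": (None if CONF.enable_new_services
                        else "AUTO: new service disabled on creation"),
}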

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1417027

Title:
  No disable reason defined for new services when
  enable_new_services=False

Status in OpenStack Compute (Nova):
  New

Bug description:
  When a service is added and enable_new_services=False there is no disable
  reason specified.
  Services can be disabled for several reasons and admins can use the API to
  specify a reason. However, having services disabled with no reason specified
  creates additional checks on the operators' side, and these grow with the
  deployment size.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1417027/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1414480] [NEW] Cell type in “nova-manage cell create” is different from what is used in nova.conf

2015-01-25 Thread Belmiro Moreira
Public bug reported:

The cell_type option is defined in nova.conf as “api” or “compute”.
However, when creating a cell using “nova-manage”, the cell type “parent” or
“child” is expected.
The nova-manage cell_type should be consistent with what is allowed in nova.conf.
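
A minimal illustration of the mismatch; the translation table below is
hypothetical and simply shows how "nova-manage cell create" could accept the
nova.conf vocabulary as well.

# Hypothetical mapping between the two vocabularies; not existing nova code.
CONF_TO_MANAGE_CELL_TYPE = {"api": "parent", "compute": "child"}

def normalize_cell_type(value):
    """Accept either spelling and return the one nova-manage expects."""
    return CONF_TO_MANAGE_CELL_TYPE.get(value, value)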

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: cells

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1414480

Title:
  Cell type in “nova-manage cell create” is different from what is used
  in nova.conf

Status in OpenStack Compute (Nova):
  New

Bug description:
  The cell_type option is defined in nova.conf as “api” or “compute”.
  However, when creating a cell using “nova-manage”, the cell type “parent”
  or “child” is expected.
  The nova-manage cell_type should be consistent with what is allowed in nova.conf.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1414480/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1369518] [NEW] Server Group Anti/Affinity functionality doesn't work with cells

2014-09-15 Thread Belmiro Moreira
Public bug reported:

Server Groups don't work with cells.
Tested in Icehouse.

Using the API the server group is created in the top cell and is not
propagated to the children cells.
At this point booting a VM fails because the schedulers in the children
cells are not aware of the server group.

Creating the entries manually in the children cells' databases avoids the
scheduling failure; however, the anti/affinity policy is still not enforced
correctly.
Server group members are only updated in the TOP cell. Schedulers at the
children cells are not aware of the members in the group (empty table in the
children), so anti/affinity is not respected.

** Affects: nova
 Importance: Wishlist
 Status: Confirmed


** Tags: cells

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1369518

Title:
  Server Group Anti/Affinity functionality doesn't work with cells

Status in OpenStack Compute (Nova):
  Confirmed

Bug description:
  Server Groups don't work with cells.
  Tested in Icehouse.

  Using the API the server group is created in the top cell and is not
  propagated to the children cells.
  At this point booting a VM fails because the schedulers in the children
  cells are not aware of the server group.

  Creating the entries manually in the children cells' databases avoids the
  scheduling failure; however, the anti/affinity policy is still not
  enforced correctly.
  Server group members are only updated in the TOP cell. Schedulers at the
  children cells are not aware of the members in the group (empty table in
  the children), so anti/affinity is not respected.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1369518/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1334278] [NEW] limits with tenant parameter returns wrong maxTotal* values

2014-06-25 Thread Belmiro Moreira
Public bug reported:

When querying for the absolute limits of a specific tenant
the maxTotal* values reported aren't correct.

How to reproduce:
for example using devstack...

OS_TENANT_NAME=demo (11b2b129994844798c98f437d9809a9c)
OS_USERNAME=demo

$nova absolute-limits

+-------------------------+-------+
| Name                    | Value |
+-------------------------+-------+
| maxServerMeta           | 128   |
| maxPersonality          | 5     |
| maxImageMeta            | 128   |
| maxPersonalitySize      | 10240 |
| maxTotalRAMSize         | 1000  |
| maxSecurityGroupRules   | 20    |
| maxTotalKeypairs        | 100   |
| totalRAMUsed            | 128   |
| maxSecurityGroups       | 10    |
| totalFloatingIpsUsed    | 0     |
| totalInstancesUsed      | 2     |
| totalSecurityGroupsUsed | 1     |
| maxTotalFloatingIps     | 10    |
| maxTotalInstances       | 10    | <---
| totalCoresUsed          | 2     |
| maxTotalCores           | 10    | <---
+-------------------------+-------+

OS_TENANT_NAME=admin (b0f08277004b43aab516ae7dbf36ff51)
OS_USERNAME=admin

$nova absolute-limits

+-------------------------+--------+
| Name                    | Value  |
+-------------------------+--------+
| maxServerMeta           | 128    |
| maxPersonality          | 5      |
| maxImageMeta            | 128    |
| maxPersonalitySize      | 10240  |
| maxTotalRAMSize         | 151200 |
| maxSecurityGroupRules   | 20     |
| maxTotalKeypairs        | 100    |
| totalRAMUsed            | 1152   |
| maxSecurityGroups       | 10     |
| totalFloatingIpsUsed    | 0      |
| totalInstancesUsed      | 18     |
| totalSecurityGroupsUsed | 1      |
| maxTotalFloatingIps     | 10     |
| maxTotalInstances       | 30     |
| totalCoresUsed          | 18     |
| maxTotalCores           | 30     |
+-------------------------+--------+

$nova absolute-limits --tenant 11b2b129994844798c98f437d9809a9c

+-------------------------+--------+
| Name                    | Value  |
+-------------------------+--------+
| maxServerMeta           | 128    |
| maxPersonality          | 5      |
| maxImageMeta            | 128    |
| maxPersonalitySize      | 10240  |
| maxTotalRAMSize         | 151200 |
| maxSecurityGroupRules   | 20     |
| maxTotalKeypairs        | 100    |
| totalRAMUsed            | 128    |
| maxSecurityGroups       | 10     |
| totalFloatingIpsUsed    | 0      |
| totalInstancesUsed      | 2      |
| totalSecurityGroupsUsed | 1      |
| maxTotalFloatingIps     | 10     |
| maxTotalInstances       | 30     | <---
| totalCoresUsed          | 2      |
| maxTotalCores           | 30     | <---
+-------------------------+--------+

Note: arrows mark the wrong values.
It seems that maxTotal* shows the limits of the tenant making the request,
not those of the tenant specified with --tenant, as would be expected.

tested in havana and icehouse-1
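
A hedged illustration of the expected behaviour, not the actual limits
controller: when an admin passes a tenant, the quota engine should be asked
about that project rather than the caller's project. The names follow the
nova quota API loosely.

# Hedged sketch: resolve the target project from the request parameters
# before asking the quota engine for limits and usages.
from nova import quota

QUOTAS = quota.QUOTAS

def project_limits(context, req_params):
    target_project = req_params.get("tenant_id") or context.project_id
    return QUOTAS.get_project_quotas(context, target_project, usages=True)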

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1334278

Title:
  limits with tenant parameter returns wrong maxTotal* values

Status in OpenStack Compute (Nova):
  New

Bug description:
  When querying for the absolute limits of a specific tenant
  the maxTotal* values reported aren't correct.

  How to reproduce:
  for example using devstack...

  OS_TENANT_NAME=demo (11b2b129994844798c98f437d9809a9c)
  OS_USERNAME=demo

  $nova absolute-limits

  +-------------------------+-------+
  | Name                    | Value |
  +-------------------------+-------+
  | maxServerMeta           | 128   |
  | maxPersonality          | 5     |
  | maxImageMeta            | 128   |
  | maxPersonalitySize      | 10240 |
  | maxTotalRAMSize         | 1000  |
  | maxSecurityGroupRules   | 20    |
  | maxTotalKeypairs        | 100   |
  | totalRAMUsed            | 128   |
  | maxSecurityGroups       | 10    |
  | totalFloatingIpsUsed    | 0     |
  | totalInstancesUsed      | 2     |
  | totalSecurityGroupsUsed | 1     |
  | maxTotalFloatingIps     | 10    |
  | maxTotalInstances       | 10    | <---
  | totalCoresUsed          | 2     |
  | maxTotalCores           | 10    | <---
  +-------------------------+-------+

  OS_TENANT_NAME=admin (b0f08277004b43aab516ae7dbf36ff51)
  OS_USERNAME=admin

  $nova absolute-limits

  +-------------------------+--------+
  | Name                    | Value  |
  +-------------------------+--------+
  | maxServerMeta           | 128    |
  | maxPersonality          | 5      |
  | maxImageMeta            | 128    |
  | maxPersonalitySize      | 10240  |
  | maxTotalRAMSize         | 151200 |
  | maxSecurityGroupRules   | 20     |
  | maxTotalKeypairs        | 100    |
  | totalRAMUsed            | 1152   |
  | maxSecurityGroups       | 10     |
  | totalFloatingIpsUsed    | 0      |
  | 

[Yahoo-eng-team] [Bug 1307223] [NEW] If target_cell path not valid instance stays in BUILD status

2014-04-13 Thread Belmiro Moreira
Public bug reported:

Using cells and the target_cell filter.
With the scheduler hint target_cell, if the path is not valid the instance
stays in the "scheduling" task state.

nova cells shows the following trace:

2014-04-13 20:25:40.237 ERROR nova.cells.messaging 
[req-8bc1d2a7-92aa-48b6-afda-42f255e43904 demo demo] Error locating next hop 
for message: Inconsistency in cell routing: Unknown child when routing to 
region!other
2014-04-13 20:25:40.237 TRACE nova.cells.messaging Traceback (most recent call 
last):
2014-04-13 20:25:40.237 TRACE nova.cells.messaging   File 
/opt/stack/nova/nova/cells/messaging.py, line 406, in process
2014-04-13 20:25:40.237 TRACE nova.cells.messaging next_hop = 
self._get_next_hop()
2014-04-13 20:25:40.237 TRACE nova.cells.messaging   File 
/opt/stack/nova/nova/cells/messaging.py, line 387, in _get_next_hop
2014-04-13 20:25:40.237 TRACE nova.cells.messaging raise 
exception.CellRoutingInconsistency(reason=reason)
2014-04-13 20:25:40.237 TRACE nova.cells.messaging CellRoutingInconsistency: 
Inconsistency in cell routing: Unknown child when routing to region!other
2014-04-13 20:25:40.237 TRACE nova.cells.messaging 

Expected:
the instance status changes to ERROR.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: cells

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1307223

Title:
  If target_cell path not valid instance stays in BUILD status

Status in OpenStack Compute (Nova):
  New

Bug description:
  Using cells and the target_cell filter.
  With the scheduler hint target_cell, if the path is not valid the instance
  stays in the "scheduling" task state.

  nova cells shows the following trace:

  2014-04-13 20:25:40.237 ERROR nova.cells.messaging 
[req-8bc1d2a7-92aa-48b6-afda-42f255e43904 demo demo] Error locating next hop 
for message: Inconsistency in cell routing: Unknown child when routing to 
region!other
  2014-04-13 20:25:40.237 TRACE nova.cells.messaging Traceback (most recent 
call last):
  2014-04-13 20:25:40.237 TRACE nova.cells.messaging   File 
/opt/stack/nova/nova/cells/messaging.py, line 406, in process
  2014-04-13 20:25:40.237 TRACE nova.cells.messaging next_hop = 
self._get_next_hop()
  2014-04-13 20:25:40.237 TRACE nova.cells.messaging   File 
/opt/stack/nova/nova/cells/messaging.py, line 387, in _get_next_hop
  2014-04-13 20:25:40.237 TRACE nova.cells.messaging raise 
exception.CellRoutingInconsistency(reason=reason)
  2014-04-13 20:25:40.237 TRACE nova.cells.messaging CellRoutingInconsistency: 
Inconsistency in cell routing: Unknown child when routing to region!other
  2014-04-13 20:25:40.237 TRACE nova.cells.messaging 

  Expected:
  the instance status changes to ERROR.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1307223/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1286527] [NEW] Quota usages update should check all usage in tenant not only per user

2014-03-01 Thread Belmiro Moreira
Public bug reported:

After the Grizzly -> Havana upgrade the quota_usages table was
wiped out due to bug #1245746.

Quota_usages is then updated after a user creates/deletes an instance.
The problem is that quota_usages is updated per user in a tenant.

For tenants that are shared by different users this means that users that
hadn't created instances previously are able to use the full quota of the
tenant.

Example:
instance quota for tenant_X = 10
user_a and user_b can create instances in tenant_X

 - user_a creates 8 instances;
 - user_b has no instances;
 - Grizzly -> Havana upgrade (quota_usages wipe)
 - user_b is able to create 10 instances
This is problematic for clouds that rely on tenant quotas and do not bill
users directly.

Even though the previous example is associated with bug #1245746,
this can happen whenever a user's quota usage for a tenant gets out of sync.


Quota usages should be updated and synced considering all resources in the
tenant, and not only the resources of the user that is making the request.
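
A hedged sketch of the recount behaviour argued for here: compute usage
across the whole project rather than per (project, user). It assumes direct
SQLAlchemy access to the instances table and is not the actual quota sync
code.

# Hedged sketch, not nova's quota sync code: recount usage for the whole
# project so a per-user quota_usages row cannot drift past the tenant total.
from sqlalchemy import create_engine, text

def project_instance_usage(db_url, project_id):
    engine = create_engine(db_url)
    with engine.connect() as conn:
        row = conn.execute(text(
            "SELECT COUNT(*), COALESCE(SUM(vcpus), 0), "
            "       COALESCE(SUM(memory_mb), 0) "
            "FROM instances WHERE project_id = :p AND deleted = 0"),
            {"p": project_id}).fetchone()
    instances, cores, ram = row
    return {"instances": instances, "cores": cores, "ram": ram}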

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1286527

Title:
  Quota usages update should check all usage in tenant not only per user

Status in OpenStack Compute (Nova):
  New

Bug description:
  After the Grizzly -> Havana upgrade the quota_usages table was
  wiped out due to bug #1245746.

  Quota_usages is then updated after a user creates/deletes an instance.
  The problem is that quota_usages is updated per user in a tenant.

  For tenants that are shared by different users this means that users that
  hadn't created instances previously are able to use the full quota of the
  tenant.

  Example:
  instance quota for tenant_X = 10
  user_a and user_b can create instances in tenant_X

   - user_a creates 8 instances;
   - user_b has no instances;
   - Grizzly -> Havana upgrade (quota_usages wipe)
   - user_b is able to create 10 instances
  This is problematic for clouds that rely on tenant quotas and do not bill
  users directly.

  Even though the previous example is associated with bug #1245746,
  this can happen whenever a user's quota usage for a tenant gets out of
  sync.

  Quota usages should be updated and synced considering all resources in the
  tenant, and not only the resources of the user that is making the request.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1286527/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1282709] [NEW] Instance names always include the first uuid in cell environment when creating multiple instances

2014-02-20 Thread Belmiro Moreira
Public bug reported:

When launching multiple instances using the nova API in a cell environment
(parent-child setup) the display_name always has the uuid of the first
instance.

Example:
1) instance_name-uuid-1-uuid-1
2) instance_name-uuid-1-uuid-2
3) instance_name-uuid-1-uuid-3
4) instance_name-uuid-1-uuid-4

Expected:
1) instance_name-uuid-1
2) instance_name-uuid-2
3) instance_name-uuid-3
4) instance_name-uuid-4

How to reproduce:
* Have cell environment (default devstack with cells enabled is enough)
* nova boot --image image_uuid --flavor flavor_name --num-instances 4  
instance_name
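
A minimal illustration (not Nova code) of the reported behaviour, assuming
the default template of that release is "%(name)s-%(uuid)s": the child cell
re-applies the template to a display_name that was already formatted in the
API cell, and it does so using the first instance's name.

template = "%(name)s-%(uuid)s"   # assumed default template
uuids = ["uuid-1", "uuid-2", "uuid-3", "uuid-4"]

api_cell = [template % {"name": "instance_name", "uuid": u} for u in uuids]
child_cell = [template % {"name": api_cell[0], "uuid": u} for u in uuids]
print(child_cell)
# ['instance_name-uuid-1-uuid-1', ..., 'instance_name-uuid-1-uuid-4']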

** Affects: nova
 Importance: Undecided
 Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
 Status: New


** Tags: cells

** Description changed:

- When launching multiple instances using nova api in a cell environment 
(parent-child setup) 
+ When launching multiple instances using nova api in a cell environment 
(parent-child setup)
  the hostnames always have the uuid of the first instance.
  
  Example:
  1) instance_name-uuid-1-uuid-1
- 
  2) instance_name-uuid-1-uuid-2
- 
  3) instance_name-uuid-1-uuid-3
- 
  4) instance_name-uuid-1-uuid-4
  
  Expected:
  1) instance_name-uuid-1
- 
  2) instance_name-uuid-2
- 
  3) instance_name-uuid-3
- 
  4) instance_name-uuid-4
  
  How to reproduce:
- 1) Have cell environment (default devstack with cells enabled is enough)
- 2) nova boot --image image_uuid --flavor flavor_name --num-instances 4  
instance_name
+ * Have cell environment (default devstack with cells enabled is enough)
+ * nova boot --image image_uuid --flavor flavor_name --num-instances 4  
instance_name

** Description changed:

- When launching multiple instances using nova api in a cell environment 
(parent-child setup)
- the hostnames always have the uuid of the first instance.
+ When launching multiple instances using nova api in a cell environment
+ (parent-child setup) the display_name always have the uuid of the first
+ instance.
  
  Example:
  1) instance_name-uuid-1-uuid-1
  2) instance_name-uuid-1-uuid-2
  3) instance_name-uuid-1-uuid-3
  4) instance_name-uuid-1-uuid-4
  
  Expected:
  1) instance_name-uuid-1
  2) instance_name-uuid-2
  3) instance_name-uuid-3
  4) instance_name-uuid-4
  
  How to reproduce:
  * Have cell environment (default devstack with cells enabled is enough)
  * nova boot --image image_uuid --flavor flavor_name --num-instances 4  
instance_name

** Changed in: nova
 Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1282709

Title:
  Instance names always include the first uuid in cell environment when
  creating multiple instances

Status in OpenStack Compute (Nova):
  New

Bug description:
  When launching multiple instances using the nova API in a cell environment
  (parent-child setup) the display_name always has the uuid of the
  first instance.

  Example:
  1) instance_name-uuid-1-uuid-1
  2) instance_name-uuid-1-uuid-2
  3) instance_name-uuid-1-uuid-3
  4) instance_name-uuid-1-uuid-4

  Expected:
  1) instance_name-uuid-1
  2) instance_name-uuid-2
  3) instance_name-uuid-3
  4) instance_name-uuid-4

  How to reproduce:
  * Have cell environment (default devstack with cells enabled is enough)
  * nova boot --image image_uuid --flavor flavor_name --num-instances 4  
instance_name

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1282709/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1274169] [NEW] Nova libvirt driver uses the instance type ID instead of the flavor ID when creating instances - problematic with cells

2014-01-29 Thread Belmiro Moreira
Public bug reported:

With cells, the same flavor needs to be created manually in all available
cells using the nova API. If for some reason we need to delete a flavor in a
cell, the “instance_types” tables will then be out of sync (different
internal IDs for the same flavors).

This blocks instance creation using the libvirt driver because the instance
type ID for the flavor in the top cell will be different from the one in the
child.

I would expect nova to use the “flavor ID” that is defined by the admin
instead of the internal instance type ID.
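
A hedged illustration of the distinction described above, not driver code:
"flavorid" is the admin-visible flavor ID, while "id" is the per-cell
autoincrement primary key that can diverge between cells. The values below
are hypothetical.

flavor_top = {"id": 7, "flavorid": "30", "name": "m1.example"}
flavor_child = {"id": 12, "flavorid": "30", "name": "m1.example"}

assert flavor_top["flavorid"] == flavor_child["flavorid"]  # stable across cells
assert flavor_top["id"] != flavor_child["id"]              # can diverge per cell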

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1274169

Title:
  Nova libvirt driver uses the instance type ID instead of the flavor
  ID when creating instances - problematic with cells

Status in OpenStack Compute (Nova):
  New

Bug description:
  With cells, the same flavor needs to be created manually in all available
  cells using the nova API. If for some reason we need to delete a flavor in
  a cell, the “instance_types” tables will then be out of sync (different
  internal IDs for the same flavors).

  This blocks instance creation using the libvirt driver because the
  instance type ID for the flavor in the top cell will be different from the
  one in the child.

  I would expect nova to use the “flavor ID” that is defined by the admin
  instead of the internal instance type ID.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1274169/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1274325] [NEW] Security-groups not working with cells using nova-network

2014-01-29 Thread Belmiro Moreira
Public bug reported:

Security groups are not working with cells using nova-network.
Only the API cell database is updated when adding rules. These are not
propagated to the children cells.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: cells

** Description changed:

- Security groups are not working with cells using nova-network
+ Security groups are not working with cells using nova-network.
  Only cell API database is updated when adding rules. These are not propagated 
into the children cells.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1274325

Title:
  Security-groups not working with cells using nova-network

Status in OpenStack Compute (Nova):
  New

Bug description:
  Security groups are not working with cells using nova-network.
  Only the API cell database is updated when adding rules. These are not
  propagated to the children cells.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1274325/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1164408] Re: Snapshot doesn't get hypervisor_type and vm_mode properties

2013-05-13 Thread Belmiro Moreira
** Changed in: nova
   Status: Triaged => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1164408

Title:
  Snapshot doesn't get hypervisor_type and vm_mode properties

Status in OpenStack Compute (Nova):
  Invalid

Bug description:
  When a snapshot is created it only gets some properties from the base image.
  In fact it only gets the architecture property (if defined).

  This is a problem when using the scheduler filter ImagePropertiesFilter
  because it can also filter by the hypervisor_type and vm_mode properties.
  I believe we can assume that if the base image has requirements on
  architecture, hypervisor_type or vm_mode, the snapshot should have them
  too...

  I'm using the LibvirtDriver.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1164408/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp