On 12/17/2015 9:24 AM, Matt Riedemann wrote:


On 12/17/2015 8:51 AM, Andrea Rosa wrote:

The communication with cinder is async, Nova doesn't wait or check if
the detach on cinder side has been executed correctly.

Yeah, I guess nova gets the 202 back:

http://logs.openstack.org/18/258118/2/check/gate-tempest-dsvm-full-ceph/7a5290d/logs/screen-n-cpu.txt.gz#_2015-12-16_03_30_43_990



Should nova be waiting for detach to complete before it tries deleting
the volume (in the case that delete_on_termination=True in the bdm)?

Should nova be waiting (regardless of volume delete) for the volume
detach to complete - or timeout and fail the instance delete if it
doesn't?

I'll revisit this change next year trying to look at the problem in a
different way.
Thank you all for your time and all the suggestions.
--
Andrea Rosa

__________________________________________________________________________

OpenStack Development Mailing List (not for usage questions)
Unsubscribe:
openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


I had a quick discussion with hemna this morning and he confirmed that
nova should be waiting for os-detach to complete before we try to delete
the volume, because if the volume status isn't 'available' the delete
will fail.
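To make the "wait for detach" idea concrete, here is a minimal sketch of polling the volume status until it reports 'available', with a timeout. It is not nova's code: `wait_for_volume_available`, `VolumeDetachTimeout`, and the `get_status` callable are hypothetical names used only for illustration, and the timeout and interval defaults are arbitrary.

```python
import time


class VolumeDetachTimeout(Exception):
    """Raised when the volume does not become 'available' in time."""


def wait_for_volume_available(get_status, volume_id, timeout=60.0,
                              interval=1.0, sleep=time.sleep):
    """Poll get_status(volume_id) until it returns 'available'.

    get_status is a hypothetical callable standing in for whatever
    call returns the current cinder volume status as a string.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(volume_id)
        if status == 'available':
            return status
        # Give up once the deadline has passed; the caller decides
        # whether that should fail the instance delete.
        if time.monotonic() >= deadline:
            raise VolumeDetachTimeout(
                'volume %s still %r after %.0fs'
                % (volume_id, status, timeout))
        sleep(interval)
```

The caller would only attempt the volume delete after this returns, and would treat `VolumeDetachTimeout` as a detach failure rather than proceeding to the delete.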

Also, if nova is hitting a failure to delete the volume it's swallowing
it by passing raise_exc=False to _cleanup_volumes here [1]. Then we go
on our merry way and delete the bdms in the nova database [2]. But I'd
think at that point we're orphaning volumes in cinder that think they
are still attached.
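One way to avoid orphaning volumes would be to track which deletes failed and only drop the bdm records for the rest. The sketch below is an assumption about how that could look, not the actual _cleanup_volumes: `delete_volumes_tracking_failures`, the dict-shaped bdms, and the `delete_volume` callable are all hypothetical.

```python
def delete_volumes_tracking_failures(bdms, delete_volume):
    """Try to delete each volume flagged delete_on_termination.

    bdms is a list of dicts with 'volume_id' and
    'delete_on_termination' keys (a simplified, hypothetical shape).
    delete_volume is a hypothetical callable that raises on failure.

    Returns (safe_to_remove, failed): bdms whose records can be
    dropped, and (bdm, exception) pairs for deletes that failed.
    """
    safe_to_remove = []
    failed = []
    for bdm in bdms:
        if not bdm.get('delete_on_termination'):
            # The volume is meant to outlive the instance, so there
            # is nothing to delete on the cinder side.
            safe_to_remove.append(bdm)
            continue
        try:
            delete_volume(bdm['volume_id'])
            safe_to_remove.append(bdm)
        except Exception as exc:
            # Record the failure instead of swallowing it, so the
            # caller can keep the bdm and retry or report it.
            failed.append((bdm, exc))
    return safe_to_remove, failed
```

With something like this, the caller could delete only the `safe_to_remove` records and log or re-raise for the `failed` ones, instead of removing every bdm regardless of what cinder did.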

If this is passing today it's probably just luck that we're getting the
volume detached fast enough before we try to delete it.

[1]
https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2425-L2426

[2]
https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L909


I've confirmed that we definitely race in the gate between detaching the volume and deleting it: we fail to delete the volume about 28K times a week in the gate [1].

I've opened a bug [2] to track fixing this.

[1] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22Failed%20to%20delete%20volume%5C%22%20AND%20message:%5C%22due%20to%5C%22%20AND%20tags:%5C%22screen-n-cpu.txt%5C%22
[2] https://bugs.launchpad.net/nova/+bug/1527623

--

Thanks,

Matt Riedemann

