Thanks to Andrea for clarifying this complex problem and its long history. As he says, we have talked about many solutions, and in the end we settled on using nova-manage to call terminate_connection and clean up the bdm entry. It's simple and clear IMO, and it has no bad impact on the Nova API.
2015-12-03 1:37 GMT+08:00 Rosa, Andrea (HP Cloud Services) <[email protected]>:
> Hi
>
> thanks Sean for bringing this point, I have been working on the change and
> on the (abandoned) spec.
> I'll try here to summarize all the discussions we had and what we decided.
>
>> From: Sean Dague [mailto:[email protected]]
>> Sent: 02 December 2015 13:31
>> To: OpenStack Development Mailing List (not for usage questions)
>> Subject: [openstack-dev] [nova] what are the key errors with volume detach
>>
>> This patch to add a bunch of logic to nova-manage for forcing volume detach
>> raised a bunch of questions
>> https://review.openstack.org/#/c/184537/24/nova/cmd/manage.py,cm
>
> On this specific review there are some valid concerns that I am happy to
> address, but first we need to understand if we want this change.
> FWIW I think it is still a valid change, please see below.
>
>> In thinking about this for the last day, I think the real concern is that
>> we have so many safety checks on volume delete that, if we failed with a
>> partially set up volume, we have too many safety latches to tear it down
>> again.
>>
>> Do we have some detailed bugs about how that happens? Is it possible to
>> just fix DELETE to work correctly even when we're in these odd states?
>
> In a simplified view of a volume detach we can say that the Nova code does:
> 1 detach the volume from the instance
> 2 inform Cinder about the detach and call terminate_connection on the
>   Cinder API
> 3 delete the bdm record in the Nova DB
>
> If 2 fails the volume gets stuck in a "detaching" status and any further
> attempt to delete or detach the volume will fail:
> "Delete for volume <volume_id> failed: Volume <volume_id> is still attached,
> detach volume first. (HTTP 400)"
>
> And if you try to detach:
> "ERROR (BadRequest): Invalid input received: Invalid volume: Unable to detach
> volume. Volume status must be 'in-use' and attach_status must be 'attached'
> to detach. Currently: status: 'detaching', attach_status: 'attached.' (HTTP
> 400)"
>
> At the moment the only way to clean up the situation is to hack the Nova DB
> to delete the bdm record and do some hacking on the Cinder side as well.
> We wanted a way to clean up the situation without the manual hack to the
> Nova DB.
>
> Solution proposed #1
> Move the deletion of the bdm record so that it happens before calling
> Cinder. I thought that was OK, as from the Nova side we would be done: no
> leaked bdm, and the problem would be just on the Cinder side. But I was
> wrong.
> We have to call terminate_connection, otherwise the device may show up again
> on the Nova host; for example, that is true for iSCSI volumes:
> "if an iSCSI session from the compute host to the storage backend still
> exists (because other volumes are connected), then the volume you just
> removed will show back up on the next scsi bus rescan."
> The key point here is that Nova must call terminate_connection, because only
> Nova has the "connector info" needed for that call, so Cinder can't fix it
> on its own.
>
> Solution proposed #2
> Then I thought, OK, so let's expose a new Nova API called "force delete
> volume" which skips all the checks and allows detaching a volume in
> "detaching" status. I thought it was OK, but I was wrong (again).
> The main concern here is that we do not want to have the concept of "force
> delete": the user already asked for the volume to be detached, and the call
> should be idempotent and just work.
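(Before the rest of Andrea's recap, a small illustration of the 1/2/3 sequence he describes above, since it is the heart of the problem. The sketch below is heavily simplified, with made-up parameter names, and is not the real compute manager code; it only shows the shape of the flow. If step 2 throws, nothing resets the volume status on the Cinder side, which is exactly the stuck "detaching" state quoted above.)

    # Heavily simplified view of Nova's detach path (illustrative only).
    # 'detach_from_guest' stands for the hypervisor-level detach, 'cinder'
    # is a python-cinderclient Client, 'bdm' the block_device_mapping
    # record, and 'connector' the connector info from the compute host.

    def detach_volume(detach_from_guest, cinder, bdm, connector):
        # 1. Detach the device from the guest.
        detach_from_guest(bdm)

        # 2. Tell Cinder the attachment is gone and tear down the export.
        #    If either call fails, the volume is left in 'detaching' and
        #    Cinder then refuses both further detach and delete requests.
        cinder.volumes.terminate_connection(bdm.volume_id, connector)
        cinder.volumes.detach(bdm.volume_id)

        # 3. Only now is the bdm record removed from the Nova DB.
        bdm.destroy()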
> So adding a new API would just add technical debt to the REST API for a
> buggy/weak interaction between the Cinder API and Nova; in other words, we
> would be adding a Nova API to fix a bug in Cinder, which is very odd.
>
> Solution proposed #3
> OK, so the solution is to fix the Cinder API and make the interaction
> between the Nova volume manager and that API robust.
> This time I was right (YAY), but as you can imagine this fix is not going to
> be an easy one, and after talking with the Cinder guys they clearly told me
> that it is going to be a massive change to the Cinder API and it is unlikely
> to land in the N(utella) or O(melette) release.
>
> Solution proposed #4: the trade-off
> This is the solution I am proposing in the patch you mentioned. The idea is
> to have a temporary solution which gives operators a handy tool to easily
> fix a problem which can occur quite often.
> The solution should be something with a low impact on the Nova code and easy
> to remove once we have the proper solution for the root cause.
> "quick", "useful", "dirty", "tool", "trade-off", "db"... we call it
> nova-manage!
> The idea is to put a new method in compute/api.py which skips all the checks
> on the volume status and goes ahead with calling detach_volume on the
> compute manager to detach the volume, call terminate_connection and clean
> up the bdm entry.
> nova-manage will have a command to call that new method directly.
>
> That is a recap that I hope helps explain how we decided to use nova-manage
> instead of the other solutions. It was a bit long, but I tried to condense
> the comments from 53 spec patches + 24 code change patches (and counting).
>
> PS: I added the cinder tag as they are interested in this change as well.
>
> Thanks
> --
> Andrea Rosa
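(To close the loop for anyone skimming: solution #4 boils down to something like the sketch below. It is illustrative only, with made-up parameter names, and is not the actual patch under review; it is essentially steps 2 and 3 of the normal flow sketched earlier, replayed without any volume status checks, driven from a new nova-manage command.)

    # Sketch of the proposed cleanup path (illustrative only, not the real
    # patch). 'cinder' is a python-cinderclient Client with admin
    # credentials, 'connector' is the connector info from the compute host,
    # and 'bdm' is the leftover block_device_mapping record; the
    # hypervisor-side detach is assumed to have already been attempted by
    # the compute manager.

    def force_detach(cinder, volume_id, connector, bdm):
        """Best-effort cleanup for a volume stuck in 'detaching'."""
        # Tear down the export for this host; without this the device can
        # reappear on the next SCSI bus rescan (the iSCSI case above).
        cinder.volumes.terminate_connection(volume_id, connector)

        # Mark the attachment as gone on the Cinder side (os-detach).
        cinder.volumes.detach(volume_id)

        # Finally remove the leftover bdm row on the Nova side.
        bdm.destroy()

No rows are edited by hand, which is the point of the tool: it avoids the manual hack to the Nova DB mentioned earlier in the thread.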
