>There may be multiple API servers; global state in an API server seems fraught 
>with issues.

No, the state would be in the DB (it would either be a task_state of DELETING 
or some new "delete_started_at" timestamp).
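
For what it's worth, a rough sketch of the shape I have in mind - the db_api / 
compute_driver names and DELETE_TIMEOUT below are just placeholders for 
illustration, not the real Nova interfaces:

import datetime

# db_api and compute_driver are placeholder interfaces, not Nova's real ones.
DELETE_TIMEOUT = datetime.timedelta(minutes=10)  # assumed / configurable


def mark_delete_requested(db_api, instance_uuid):
    """API side: record the delete in the DB rather than in process memory."""
    db_api.instance_update(instance_uuid, {
        'task_state': 'deleting',
        'delete_started_at': datetime.datetime.utcnow(),
    })
    # Any further delete request for this instance can now be no-op'd by
    # checking the same DB record, whichever API server it lands on.


def reap_stuck_deletes(db_api, compute_driver, now=None):
    """Periodic task on the compute manager: force-complete deletes that
    have been stuck in the 'deleting' state for longer than the timeout."""
    now = now or datetime.datetime.utcnow()
    for inst in db_api.instances_with_task_state('deleting'):
        started = inst.get('delete_started_at')
        if started and now - started > DELETE_TIMEOUT:
            compute_driver.destroy(inst)
            db_api.instance_destroy(inst['uuid'])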

I agree that i) is nice and simple - it just has the minor risks that the 
delete itself could hang, and/or that we might find some other issues with bits 
of the code that can't currently cope with the instance being deleted from 
underneath them.

-----Original Message-----
From: Robert Collins [mailto:robe...@robertcollins.net] 
Sent: 25 October 2013 12:21
To: OpenStack Development Mailing List
Subject: Re: [openstack-dev] [nova] Thoughts please on how to address a problem 
with multiple deletes leading to a nova-compute thread pool problem

On 25 October 2013 23:46, Day, Phil <philip....@hp.com> wrote:
> Hi Folks,
>
> We're very occasionally seeing problems where a thread processing a create 
> hangs (and we've seen this when talking to Cinder and Glance).  Whilst those 
> issues need to be hunted down in their own right, they do show up what seems 
> to me to be a weakness in the processing of delete requests that I'd like to 
> get some feedback on.
>
> Delete is the one operation that is allowed regardless of the instance state 
> (since it's a one-way operation, and users should always be able to free up 
> their quota).   However, when we get a create thread hung in one of these 
> states, the delete requests will also block when they hit the manager, as 
> they are synchronized on the uuid.   Because the user making the delete 
> request doesn't see anything happen, they tend to submit more delete 
> requests.   The service is still up, so these go to the compute manager as 
> well, and eventually all of the threads will be waiting for the lock, and 
> the compute manager will stop consuming new messages.
>
> The problem isn't limited to deletes - although in most cases the change of 
> state in the API means that you have to keep making different calls to get 
> past the state-checker logic when an instance is stuck in another state.   
> Users also seem to be more impatient with deletes, as they are trying to 
> free up quota for other things.
>
> So while I know that we should never get a thread into a hung state in the 
> first place, I was wondering about one of the following approaches to address 
> just the delete case:
>
> i) Change the delete call on the manager so it doesn't wait for the uuid 
> lock.  Deletes should be coded so that they work regardless of the state of 
> the VM, and other actions should be able to cope with a delete being 
> performed from under them.  There is of course no guarantee that the delete 
> itself won't block as well.

I like this.

> ii) Record in the API server that a delete has been started (maybe enough to 
> use the task state being set to DELETING in the API if we're sure this 
> doesn't get cleared), and add a periodic task in the compute manager to check 
> for and delete instances that are in a "DELETING" state for more than some 
> timeout. Then the API, knowing that the delete will be processed eventually, 
> can just no-op any further delete requests.

There may be multiple API servers; global state in an API server seems fraught 
with issues.

> iii) Add some hook into the ServiceGroup API so that the timer could depend 
> on getting a free thread from the compute manager pool (i.e. run some no-op 
> task) - so that if there are no free threads then the service becomes down. 
> That would (eventually) stop the scheduler from sending new requests to it, 
> and make deletes be processed in the API server, but won't of course help 
> with commands for other instances on the same host.

This seems a little kludgy to me.

> iv) Move away from having a general topic and thread pool for all requests, 
> and start a listener on an instance-specific topic for each running instance 
> on a host (leaving the general topic and pool just for creates and other 
> non-instance calls like the hypervisor API).   Then a blocked task would only 
> affect requests for a specific instance.

That seems to suggest one topic per instance? Aieee. I don't think that solves 
the problem anyway, because either a) you end up with a tonne of threads, or 
b) you have a multiplexing thread with the same potential issue.

You could more simply just have a dedicated thread pool for deletes, and have 
no thread limit on the pool. Of course, this will fail when you OOM :). You 
could do a dict with instance -> thread for deletes instead, without creating 
lots of queues.
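
Roughly something like this (plain threads just for illustration - in practice 
it would be eventlet greenthreads, and the names here are placeholders rather 
than real Nova code):

import threading

# Placeholder names, not Nova's actual compute-manager code.
_delete_threads = {}
_delete_threads_lock = threading.Lock()


def spawn_delete(instance_uuid, do_delete):
    """Run each delete on its own thread, outside the main worker pool.
    A repeated delete request for the same instance becomes a no-op."""
    with _delete_threads_lock:
        existing = _delete_threads.get(instance_uuid)
        if existing and existing.is_alive():
            return existing  # a delete is already in flight for this instance

        def _run():
            try:
                do_delete(instance_uuid)
            finally:
                with _delete_threads_lock:
                    _delete_threads.pop(instance_uuid, None)

        thread = threading.Thread(target=_run, daemon=True)
        _delete_threads[instance_uuid] = thread
        thread.start()
        return thread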

> I'm tending towards ii) as a simple and pragmatic solution in the near term, 
> although I like both iii) and iv) as generally good enhancements - but iv) in 
> particular feels like a pretty seismic change.
>

My inclination would be (i) - make deletes non-blocking and idempotent, with 
lazy cleanup if resources take a while to tear down.
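
A minimal sketch of that shape, just to illustrate (the driver / db_api names 
and the exception are placeholders, not Nova's actual code):

import collections

# Placeholder names throughout; drained later by a lazy-cleanup pass.
cleanup_queue = collections.deque()


class InstanceNotFound(Exception):
    """Stand-in for the 'instance is already gone' error."""


def delete_instance(instance_uuid, driver, db_api):
    """Deliberately takes no per-instance lock, so it cannot queue up behind
    a hung create; 'already deleted' counts as success."""
    try:
        driver.destroy(instance_uuid)        # tear down the VM itself
    except InstanceNotFound:
        pass                                 # someone beat us to it - fine
    except Exception:
        # Teardown failed; defer it to the lazy-cleanup pass rather than
        # failing the delete.
        cleanup_queue.append(instance_uuid)
    db_api.instance_destroy(instance_uuid)   # free the user's quota either way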

-Rob

--
Robert Collins <rbtcoll...@hp.com>
Distinguished Technologist
HP Converged Cloud

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
