I've been chasing something weird I saw in devstack: when creating hundreds of instances in a single request, past some limit things blow up in an unexpected way during scheduling and all of the instances are put into ERROR state. Given the environment I was running in, this shouldn't have been happening, and today we figured out what was actually going on. To summarize, we retry scheduling requests on RPC timeout, so you can end up with scheduler_max_attempts greenthreads running concurrently, each trying to schedule 1000 instances, and melt your scheduler.
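
To make the amplification concrete, here is a rough back-of-the-envelope model of the retry behavior described above. It is only a sketch, not nova code: the function name, its parameters, and the numbers are hypothetical, and it assumes the scheduler keeps working on an attempt after the caller has timed out and resent the request. It also ignores that concurrent attempts slow each other down, which makes things worse in practice.

```python
def attempts_in_flight(schedule_seconds, rpc_timeout, max_attempts):
    """Estimate how many scheduling attempts overlap for one request.

    Hypothetical model: the caller resends the request every time
    rpc_timeout expires, up to max_attempts sends in total, while the
    scheduler is still busy with the earlier attempts.
    """
    attempts = 1           # the original request
    elapsed = rpc_timeout  # time at which the first timeout fires
    while elapsed < schedule_seconds and attempts < max_attempts:
        attempts += 1      # a timeout fired, so the request is resent
        elapsed += rpc_timeout
    return attempts


def placements_attempted(instances, schedule_seconds, rpc_timeout, max_attempts):
    """Total instance placements the scheduler is asked to work on."""
    return instances * attempts_in_flight(
        schedule_seconds, rpc_timeout, max_attempts)
```

With illustrative values (1000 instances, 200 seconds to schedule them, a 60 second RPC timeout, and 3 attempts), attempts_in_flight(200, 60, 3) gives 3, so the scheduler ends up working on 3000 placements for a request that asked for 1000.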

I've started a spec which goes into the details of the actual issue:

https://review.openstack.org/#/c/510235/

It also proposes a solution, but I don't feel it's the greatest solution, so there are also some alternatives in there.

I'm really interested in operator feedback on this, because I assume people are already dealing with this kind of problem in production and have had to come up with ways to solve it.

--

Thanks,

Matt

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev