On 10/9/15, 6:29 PM, "Clint Byrum" <cl...@fewbar.com> wrote:

>Excerpts from Chris Friesen's message of 2015-10-09 17:33:38 -0700:
>> On 10/09/2015 03:36 PM, Ian Wells wrote:
>> > On 9 October 2015 at 12:50, Chris Friesen <chris.frie...@windriver.com
>> > <mailto:chris.frie...@windriver.com>> wrote:
>> >
>> >     Has anybody looked at why 1 instance is too slow and what it would
>> >     take to make 1 scheduler instance work fast enough? This does not
>> >     preclude the use of concurrency for finer grain tasks in the
>> >     background.
>> >
>> >     Currently we pull data on all (!) of the compute nodes out of the
>> >     database via a series of RPC calls, then evaluate the various filters
>> >     in python code.
>> >
>> >
>> > I'll say again: the database seems to me to be the problem here.  Not to
>> > mention, you've just explained that they are in practice holding all the
>> > data in memory in order to do the work, so the benefit we're getting here
>> > is really an N-to-1-to-M pattern with a DB in the middle (the store-to-DB
>> > is rather secondary, in fact), and that without incremental updates to
>> > the receivers.
>> 
>> I don't see any reason why you couldn't have an in-memory scheduler.
>> 
>> Currently the database serves as the persistent storage for the resource
>> usage, so if we take it out of the picture I imagine you'd want to have
>> some way of querying the compute nodes for their current state when the
>> scheduler first starts up.
>> 
>> I think the current code uses the fact that objects are remotable via the 
>> conductor, so changing that to do explicit posts to a known scheduler topic 
>> would take some work.
>> 
>
>Funny enough, I think that's exactly what Josh's "just use Zookeeper"
>message is about. Except in memory, it is "in an observable storage
>location".
>
>Instead of having the scheduler do all of the compute node inspection
>and querying though, you have the nodes push their stats into something
>like Zookeeper or consul, and then have schedulers watch those stats
>for changes to keep their in-memory version of the data up to date. So
>when you bring a new one online, you don't have to query all the nodes,
>you just scrape the data store, which all of these stores (etcd, consul,
>ZK) are built to support atomically querying and watching at the same
>time, so you can have a reasonable expectation of correctness.
>
>Even if you figured out how to make the in-memory scheduler crazy fast,
>there's still value in concurrency for other reasons. No matter how
>fast you make the scheduler, you'll be slave to the response time of
>a single scheduling request. If you take 1ms to schedule each node
>(including just reading the request and pushing out your scheduling
>result!) you will never achieve greater than 1000/s. 1ms is way lower
>than it's going to take just to shove a tiny message into RabbitMQ or
>even 0mq.

That is not what I have seen. Measurements I did, and ones done by others, show 
between 5000 and 10000 sends *per sec* (depending on mirroring, up to 1KB msg 
size) using oslo messaging/kombu over rabbitMQ.
And this is unmodified/highly unoptimized oslo messaging code.
If you remove the oslo messaging layer, you get 25000 to 45000 msg/sec with 
kombu/rabbitMQ (which shows how inefficient the oslo messaging layer itself is).
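
For reference, the kind of raw kombu publish loop behind those numbers would 
look roughly like this (a sketch only, not the actual benchmark code; the 
broker URL, exchange and queue names are placeholders):

import time
from kombu import Connection, Exchange, Queue

# Placeholder names for this sketch; tune message count/size as needed.
exchange = Exchange('sched_bench', type='direct')
queue = Queue('sched_bench_q', exchange, routing_key='bench')
payload = 'x' * 1024      # ~1KB message, matching the size mentioned above
n = 100000

with Connection('amqp://guest:guest@localhost//') as conn:
    producer = conn.Producer(serializer='json')
    # First publish declares the exchange/queue so messages are routed.
    producer.publish(payload, exchange=exchange, routing_key='bench',
                     declare=[queue])
    start = time.time()
    for _ in range(n):
        producer.publish(payload, exchange=exchange, routing_key='bench')
    elapsed = time.time() - start
    print('%d msgs in %.2fs -> %.0f msg/sec' % (n, elapsed, n / elapsed))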


> So I'm pretty sure this is o-k for small clouds, but would be
>a disaster for a large, busy cloud.

It all depends on how many sched/sec the "large busy cloud" actually needs...

>
>If, however, you can have 20 schedulers that all take 10ms on average,
>and have the occasional lock contention for a resource counter resulting
>in 100ms, now you're at 2000/s minus the lock contention rate. This
>strategy would scale better with the number of compute nodes, since
>more nodes means more distinct locks, so you can scale out the number
>of running servers separate from the number of scheduling requests.

How many compute nodes are we talking about max? How many schedulings per second 
is the requirement? And where are we today with the latest nova scheduler?
My point is that without these numbers we could end up under-shooting, 
over-shooting or over-engineering, and paying the cost of maintaining that extra 
complexity over the lifetime of OpenStack.

I'll just make up some numbers for the sake of this discussion:

- the latest nova scheduler can do only 100 sched/sec with 1 instance (so the 
10ms average you bring up may not be that unrealistic);
- the requirement is a sustained 500 sched/sec worst case with 10K nodes (that 
is 5% of 10K, and today we can barely launch 100 VM/sec sustained).
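
Spelling out the arithmetic on those made-up numbers (assumptions, not 
measurements):

nodes = 10000
churn = 0.05                 # worst case: 5% of nodes need scheduling per second
required = nodes * churn     # -> 500 sched/sec required
per_instance = 100           # assumed single nova-scheduler throughput today
deployed = 3                 # schedulers most people deploy

print(required / per_instance)   # 5.0 -> need 5x today's single-instance rate
print(per_instance * deployed)   # 300 -> 3 instances fall short even with perfect scaling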

Are we going to achieve 5x with just 3 instances, which is what most people 
deploy? Not likely.
Will a more elaborate distributed infra/DLM like consul/zk/etcd get us to that 
500 mark with 3 instances? Maybe, but at the expense of added complexity in the 
overall solution (a sketch of that approach is below).
Can we instead optimize the nova scheduler so that a single instance does 
500/sec? Maybe, and if we succeed we end up with a much simpler solution.
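
For concreteness, here is a minimal sketch of the "push stats and watch" 
pattern Clint describes, using the kazoo ZooKeeper client. This is not a 
proposal for the actual nova code; the znode path and stat fields are made up:

import json
import socket

from kazoo.client import KazooClient

STATS_PATH = '/compute_stats'    # invented path for illustration

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()
zk.ensure_path(STATS_PATH)

def publish_my_stats(free_ram_mb, free_vcpus):
    """Compute node side: push current usage as an ephemeral znode."""
    node = '%s/%s' % (STATS_PATH, socket.gethostname())
    data = json.dumps({'free_ram_mb': free_ram_mb,
                       'free_vcpus': free_vcpus}).encode('utf-8')
    if zk.exists(node):
        zk.set(node, data)
    else:
        # Ephemeral: the entry vanishes automatically if the node dies.
        zk.create(node, data, ephemeral=True)

# Scheduler side: keep an in-memory view current via watches, so a newly
# started scheduler scrapes the store instead of querying every compute node.
host_state = {}

@zk.ChildrenWatch(STATS_PATH)
def refresh_hosts(children):
    for host in children:
        data, _ = zk.get('%s/%s' % (STATS_PATH, host))
        host_state[host] = json.loads(data.decode('utf-8'))

A real implementation would presumably use per-node DataWatches rather than 
re-reading every child on each change, but the shape is the same.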


I'm not saying that high concurrency and distributed schedulers are not the 
solution, and maybe we really do need a distributed one, but it'd be good to 
have some numbers to frame the discussion.



__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
