Re: [openstack-dev] A simple way to improve nova scheduler

2013-09-27 Thread Soren Hansen
2013/9/26 Joe Gordon :
>> Yes, when moving beyond simple flavours, the idea as initially proposed
>> falls apart.  I see two ways to fix that:
>>
>>  * Don't move beyond simple flavours. Seriously. Amazon have been pretty
>>    darn successful with just their simple instance types.
> Who says we have to support one scheduler model?  I can see room for several
> scheduler models that have different tradeoffs, such as performance/scale
> vs. features.

Sure. I didn't necessarily mean removing the support for richer
instance configurations, but simply making it easy to disable them and
thus enable the O(1) scheduler.




-- 
Soren Hansen | http://linux2go.dk/
Ubuntu Developer | http://www.ubuntu.com/
OpenStack Developer  | http://www.openstack.org/

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-09-26 Thread Joe Gordon
On Thu, Sep 26, 2013 at 1:53 PM, Soren Hansen  wrote:

> Hey, sorry for necroposting. I completely missed this thread when it was
active, but Russell just pointed it out to me on Twitter earlier today and
> I couldn't help myself.
>
>
> 2013/7/19 Sandy Walsh :
> > On 07/19/2013 05:01 PM, Boris Pavlovic wrote:
> > Sorry, I was commenting on Soren's suggestion from way back (essentially
> > listening on a separate exchange for each unique flavor ... so no
> > scheduler was needed at all). It was a great idea, but fell apart rather
> > quickly.
>
> I don't recall we ever really had the discussion, but it's been a while :)
>
> Yes, when moving beyond simple flavours, the idea as initially proposed
> falls apart.  I see two ways to fix that:
>
>  * Don't move beyond simple flavours. Seriously. Amazon have been pretty
>    darn successful with just their simple instance types.
>

Who says we have to support one scheduler model?  I can see room for
several scheduler models that have different tradeoffs, such as
performance/scale vs. features.


>
>  * If you must make things complicated, use fanout to send a reservation
>    request:
>
>    - Send out reservation requests to everyone listening (*)
>
>    - Compute nodes able to accommodate the request reserve the resources
>      in question and respond directly to the requestor. Those unable to
>      accommodate the request do nothing.
>
>    - Requestor (scheduler, API server, whatever) picks a winner amongst
>      the respondents and broadcasts a message announcing the winner of
>      the request.
>
>    - The winning node acknowledges acceptance of the task to the
>      requestor and gets to work.
>
>    - Every other node that responded also sees the broadcast and cancels
>      the reservation.
>
>    - Reservations time out after 5 seconds, so a lost broadcast doesn't
>      result in reserved-but-never-used resources.
>
>    - If no one has volunteered to accept the reservation request within a
>      couple of seconds, broadcast wider.
>
> (*) "Everyone listening" isn't necessarily every node. Maybe you have
> topics for nodes that are at less than 10% utilisation, one for less
> than 25% utilisation, etc. First broadcast to those at 10% or less, move
> on to 25%, etc.
>
> This is just off the top of my head. I'm sure it can be improved upon. A
> lot. My point is just that there are plenty of alternatives to the
> omniscient schedulers that we've been used to for 3 years now.
>
> --
> Soren Hansen | http://linux2go.dk/
> Ubuntu Developer | http://www.ubuntu.com/
> OpenStack Developer  | http://www.openstack.org/
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-09-26 Thread Soren Hansen
Hey, sorry for necroposting. I completely missed this thread when it was
active, but Russell just pointed it out to me on Twitter earlier today and
I couldn't help myself.


2013/7/19 Sandy Walsh :
> On 07/19/2013 05:01 PM, Boris Pavlovic wrote:
> Sorry, I was commenting on Soren's suggestion from way back (essentially
> listening on a separate exchange for each unique flavor ... so no
> scheduler was needed at all). It was a great idea, but fell apart rather
> quickly.

I don't recall we ever really had the discussion, but it's been a while :)

Yes, when moving beyond simple flavours, the idea as initially proposed
falls apart.  I see two ways to fix that:

 * Don't move beyond simple flavours. Seriously. Amazon have been pretty
   darn successful with just their simple instance types.

 * If you must make things complicated, use fanout to send a reservation
   request:

   - Send out reservation requests to everyone listening (*)

   - Compute nodes able to accommodate the request reserve the resources
     in question and respond directly to the requestor. Those unable to
     accommodate the request do nothing.

   - Requestor (scheduler, API server, whatever) picks a winner amongst
     the respondents and broadcasts a message announcing the winner of
     the request.

   - The winning node acknowledges acceptance of the task to the
     requestor and gets to work.

   - Every other node that responded also sees the broadcast and cancels
     the reservation.

   - Reservations time out after 5 seconds, so a lost broadcast doesn't
     result in reserved-but-never-used resources.

   - If no one has volunteered to accept the reservation request within a
     couple of seconds, broadcast wider.

(*) "Everyone listening" isn't necessarily every node. Maybe you have
topics for nodes that are at less than 10% utilisation, one for less
than 25% utilisation, etc. First broadcast to those at 10% or less, move
on to 25%, etc.

This is just off the top of my head. I'm sure it can be improved upon. A
lot. My point is just that there are plenty of alternatives to the
omniscient schedulers that we've been used to for 3 years now.
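
To make the protocol concrete, here is a toy sketch of the reservation
flow in plain Python, with in-memory objects standing in for the message
bus. All names (ComputeNode, try_reserve, schedule, ...) are illustrative
assumptions, not Nova code:

    import time
    import uuid

    RESERVATION_TTL = 5  # seconds; lost broadcasts expire on their own

    class ComputeNode:
        def __init__(self, name, free_ram_mb):
            self.name = name
            self.free_ram_mb = free_ram_mb
            self.reservations = {}  # request_id -> (ram_mb, expires_at)

        def try_reserve(self, request_id, ram_mb):
            """Reserve resources if we can accommodate the request."""
            self._expire_reservations()
            reserved = sum(r for r, _ in self.reservations.values())
            if self.free_ram_mb - reserved < ram_mb:
                return False  # can't accommodate: stay silent
            self.reservations[request_id] = (ram_mb,
                                             time.time() + RESERVATION_TTL)
            return True  # i.e. respond directly to the requestor

        def on_winner_announced(self, request_id, winner_name):
            """Broadcast handler: the winner commits, losers cancel."""
            if winner_name == self.name:
                ram_mb, _ = self.reservations.pop(request_id)
                self.free_ram_mb -= ram_mb  # accept the task, get to work
            else:
                self.reservations.pop(request_id, None)

        def _expire_reservations(self):
            now = time.time()
            self.reservations = {k: v for k, v in self.reservations.items()
                                 if v[1] > now}

    def schedule(nodes, ram_mb):
        """Requestor side: fan out, collect volunteers, announce a winner."""
        request_id = str(uuid.uuid4())
        volunteers = [n for n in nodes if n.try_reserve(request_id, ram_mb)]
        if not volunteers:
            return None  # here the real thing would broadcast wider
        winner = volunteers[0]  # any policy works; first reply is simplest
        for node in nodes:
            node.on_winner_announced(request_id, winner.name)
        return winner

    nodes = [ComputeNode("c1", 4096), ComputeNode("c2", 8192)]
    print(schedule(nodes, ram_mb=2048).name)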

-- 
Soren Hansen | http://linux2go.dk/
Ubuntu Developer | http://www.ubuntu.com/
OpenStack Developer  | http://www.openstack.org/

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-31 Thread Wang, Shane
Thank you for your comments and questions, Boris.
We will test it asap and get back to you.

Thanks.
--
Shane
From: Boris Pavlovic [mailto:bo...@pavlovic.me]
Sent: Wednesday, July 31, 2013 7:00 PM
To: OpenStack Development Mailing List
Subject: Re: [openstack-dev] A simple way to improve nova scheduler

Hi Shane,

Thanks for implementing this new approach.
Yes, I agree that it solves the problems with "JOIN".

But now I am worried about a new problem: db.compute_node_update() now
rewrites a "TEXT"-typed field on every update, which means it is likely
to work really slowly.

So I have a question about the testing: did you test just the joins, or
the joins with parallel N/60 updates/sec of compute_node_update() calls?

Also, we will need Russell's confirmation to merge such a big change
right before the release.
Russell, what do you think?

From what I know, since we don't have a clear solution for this issue,
the community agreed that it would be discussed at the coming summit.


Best regards,
Boris Pavlovic
--
Mirantis Inc.



On Wed, Jul 31, 2013 at 9:36 AM, Wang, Shane
<shane.w...@intel.com> wrote:
Hi,

I have a patchset ready for your review https://review.openstack.org/#/c/38802/
This patchset is to remove table compute_node_stats and add one more column 
"stats" in table compute_nodes as JSON dict. With that, compute_node_get_all() 
doesn't need to join another table when nova schedulers call it.

My team has done some preliminary tests. The query time could be reduced to 
~1.32 seconds from ~16.89 seconds, where we assume there are 10K compute nodes 
and each node has 20 stats records in compute_node_stats.

Thank you for your review, and what do you think?

Thanks.
--
Shane
From: Joshua Harlow [mailto:harlo...@yahoo-inc.com]
Sent: Thursday, July 25, 2013 5:36 AM
To: OpenStack Development Mailing List; Boris Pavlovic

Subject: Re: [openstack-dev] A simple way to improve nova scheduler

As far as the send-only-when-you-have-to idea: that reminds me of this piece 
of work that could be resurrected, which slowed down the periodic updates when 
nothing was changing.

https://review.openstack.org/#/c/26291/

Could be brought back, the concept still feels useful imho. But maybe not to 
others :-P

From: Boris Pavlovic <bo...@pavlovic.me>
Reply-To: OpenStack Development Mailing List 
<openstack-dev@lists.openstack.org>
Date: Wednesday, July 24, 2013 12:12 PM
To: OpenStack Development Mailing List 
<openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] A simple way to improve nova scheduler

Hi Mike,

On Wed, Jul 24, 2013 at 1:01 AM, Mike Wilson 
<geekinu...@gmail.com> wrote:
Again I can only speak for qpid, but it's not really a big load on the qpidd 
server itself. I think the issue is that the updates come in serially into each 
scheduler that you have running. We don't process those quickly enough for it 
to do any good, which is why the lookup from db. You can see this for yourself 
using the fake hypervisor, launch yourself a bunch of simulated nova-compute, 
launch a nova-scheduler on the same host and even with 1k or so you will notice 
the latency between the update being sent and the update actually meaning 
anything for the scheduler.

I think a few points that have been brought up could mitigate this quite a bit. 
My personal view is the following:

-Only update when you have to (ie. 10k nodes all sending update every periodic 
interval is heavy, only send when you have to)
-Don't fanout to schedulers, update a single scheduler which in turn updates a 
shared store that is fast such as memcache

I guess that effectively is what you are proposing with the added twist of the 
shared store.


Absolutely agree with this. Especially with using memcached (or redis) as 
common storage for all schedulers.

Best regards,
Boris Pavlovic
---
Mirantis Inc.


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-31 Thread Boris Pavlovic
Hi Shane,

Thanks for implementing this new approach.
Yes, I agree that it solves the problems with "JOIN".

But now I am worried about a new problem: db.compute_node_update() now
rewrites a "TEXT"-typed field on every update, which means it is likely
to work really slowly.

So I have a question about the testing: did you test just the joins, or
the joins with parallel N/60 updates/sec of compute_node_update() calls?
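
A rough sketch of the kind of mixed read/write benchmark meant here; the
harness and names are hypothetical, not Nova code:

    import threading
    import time

    def benchmark_get_all(get_all, update_one, n_nodes, duration=60.0):
        """Time get_all() while n_nodes/60 updates land per second, i.e.
        every node reporting once per 60-second periodic interval."""
        stop = threading.Event()

        def writer():
            interval = 60.0 / n_nodes  # spread N updates over each minute
            i = 0
            while not stop.is_set():
                update_one(i % n_nodes)  # stand-in for compute_node_update()
                i += 1
                time.sleep(interval)

        threading.Thread(target=writer, daemon=True).start()
        timings = []
        deadline = time.time() + duration
        while time.time() < deadline:
            started = time.time()
            get_all()  # stand-in for compute_node_get_all()
            timings.append(time.time() - started)
        stop.set()
        return sum(timings) / len(timings)  # mean read latency under load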

Also, we will need Russell's confirmation to merge such a big change
right before the release.
Russell, what do you think?

From what I know, since we don't have a clear solution for this issue,
the community agreed that it would be discussed at the coming summit.


Best regards,
Boris Pavlovic
--
Mirantis Inc.




On Wed, Jul 31, 2013 at 9:36 AM, Wang, Shane  wrote:

>   Hi,
>
> I have a patchset ready for your review
> https://review.openstack.org/#/c/38802/
>
> This patchset is to remove table compute_node_stats and add one more
> column "stats" in table compute_nodes as JSON dict. With that,
> compute_node_get_all() doesn't need to join another table when nova
> schedulers call it.
>
> My team has done some preliminary tests. The query time could be reduced
> to ~1.32 seconds from ~16.89 seconds, where we assume there are 10K
> compute nodes and each node has 20 stats records in compute_node_stats.
>
> Thank you for your review, and what do you think?
>
> Thanks.
> --
> Shane
>
> From: Joshua Harlow [mailto:harlo...@yahoo-inc.com]
> Sent: Thursday, July 25, 2013 5:36 AM
> To: OpenStack Development Mailing List; Boris Pavlovic
> Subject: Re: [openstack-dev] A simple way to improve nova scheduler
>
> As far as the send-only-when-you-have-to idea: that reminds me of this
> piece of work that could be resurrected, which slowed down the periodic
> updates when nothing was changing.
>
> https://review.openstack.org/#/c/26291/
>
> Could be brought back, the concept still feels useful imho. But maybe not
> to others :-P
>
> From: Boris Pavlovic <bo...@pavlovic.me>
> Reply-To: OpenStack Development Mailing List
> <openstack-dev@lists.openstack.org>
> Date: Wednesday, July 24, 2013 12:12 PM
> To: OpenStack Development Mailing List
> <openstack-dev@lists.openstack.org>
> Subject: Re: [openstack-dev] A simple way to improve nova scheduler
>
> Hi Mike,
>
> On Wed, Jul 24, 2013 at 1:01 AM, Mike Wilson <geekinu...@gmail.com> wrote:
>
> Again I can only speak for qpid, but it's not really a big load on the
> qpidd server itself. I think the issue is that the updates come in
> serially into each scheduler that you have running. We don't process
> those quickly enough for it to do any good, which is why the lookup from
> db. You can see this for yourself using the fake hypervisor: launch
> yourself a bunch of simulated nova-compute, launch a nova-scheduler on
> the same host, and even with 1k or so you will notice the latency between
> the update being sent and the update actually meaning anything for the
> scheduler.
>
> I think a few points that have been brought up could mitigate this quite
> a bit. My personal view is the following:
>
> - Only update when you have to (i.e. 10k nodes all sending an update
>   every periodic interval is heavy; only send when you have to)
> - Don't fanout to schedulers; update a single scheduler which in turn
>   updates a shared store that is fast, such as memcache
>
> I guess that effectively is what you are proposing, with the added twist
> of the shared store.
>
> Absolutely agree with this. Especially with using memcached (or redis) as
> common storage for all schedulers.
>
> Best regards,
> Boris Pavlovic
> ---
> Mirantis Inc.
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-30 Thread Wang, Shane
Hi,

I have a patchset ready for your review https://review.openstack.org/#/c/38802/
This patchset is to remove table compute_node_stats and add one more column 
"stats" in table compute_nodes as JSON dict. With that, compute_node_get_all() 
doesn't need to join another table when nova schedulers call it.

My team has done some preliminary tests. The query time could be reduced to 
~1.32 seconds from ~16.89 seconds, where we assume there are 10K compute nodes 
and each node has 20 stats records in compute_node_stats.
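
For readers following along, a minimal sketch of the schema idea, assuming
SQLAlchemy; the model below is an illustration, not the actual patch:

    import json
    from sqlalchemy import Column, Integer, String, Text, create_engine
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class ComputeNode(Base):
        __tablename__ = 'compute_nodes'
        id = Column(Integer, primary_key=True)
        hypervisor_hostname = Column(String(255))
        # Stats folded into one JSON-encoded TEXT column, replacing the
        # per-stat rows of compute_node_stats (and the JOIN to read them).
        stats = Column(Text, default='{}')

    engine = create_engine('sqlite:///:memory:')
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        session.add(ComputeNode(
            hypervisor_hostname='compute-1',
            stats=json.dumps({'num_instances': 3, 'io_workload': 1})))
        session.commit()
        # compute_node_get_all() equivalent: a single-table scan, no JOIN.
        for node in session.query(ComputeNode).all():
            print(node.hypervisor_hostname, json.loads(node.stats))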

Thank you for your review, and what do you think?

Thanks.
--
Shane
From: Joshua Harlow [mailto:harlo...@yahoo-inc.com]
Sent: Thursday, July 25, 2013 5:36 AM
To: OpenStack Development Mailing List; Boris Pavlovic
Subject: Re: [openstack-dev] A simple way to improve nova scheduler

As far as the send-only-when-you-have-to idea: that reminds me of this piece 
of work that could be resurrected, which slowed down the periodic updates when 
nothing was changing.

https://review.openstack.org/#/c/26291/

Could be brought back, the concept still feels useful imho. But maybe not to 
others :-P

From: Boris Pavlovic <bo...@pavlovic.me>
Reply-To: OpenStack Development Mailing List 
<openstack-dev@lists.openstack.org>
Date: Wednesday, July 24, 2013 12:12 PM
To: OpenStack Development Mailing List 
<openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] A simple way to improve nova scheduler

Hi Mike,

On Wed, Jul 24, 2013 at 1:01 AM, Mike Wilson 
<geekinu...@gmail.com> wrote:
Again I can only speak for qpid, but it's not really a big load on the qpidd 
server itself. I think the issue is that the updates come in serially into each 
scheduler that you have running. We don't process those quickly enough for it 
to do any good, which is why the lookup from db. You can see this for yourself 
using the fake hypervisor, launch yourself a bunch of simulated nova-compute, 
launch a nova-scheduler on the same host and even with 1k or so you will notice 
the latency between the update being sent and the update actually meaning 
anything for the scheduler.

I think a few points that have been brought up could mitigate this quite a bit. 
My personal view is the following:

-Only update when you have to (ie. 10k nodes all sending update every periodic 
interval is heavy, only send when you have to)
-Don't fanout to schedulers, update a single scheduler which in turn updates a 
shared store that is fast such as memcache

I guess that effectively is what you are proposing with the added twist of the 
shared store.


Absolutely agree with this. Especially with using memcached (or redis) as 
common storage for all schedulers.

Best regards,
Boris Pavlovic
---
Mirantis Inc.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-26 Thread Clint Byrum
Excerpts from Joe Gordon's message of 2013-07-24 11:43:46 -0700:
> On Wed, Jul 24, 2013 at 12:24 PM, Russell Bryant  wrote:
> 
> > On 07/23/2013 06:00 PM, Clint Byrum wrote:
> > > This is really interesting work, thanks for sharing it with us. The
> > > discussion that has followed has brought up some thoughts I've had for
> > > a while about this choke point in what is supposed to be an extremely
> > > scalable cloud platform (OpenStack).
> > >
> > > I feel like the discussions have all been centered around making "the"
> > > scheduler(s) intelligent.  There seems to be a commonly held belief that
> > > scheduling is a single step, and should be done with as much knowledge
> > > of the system as possible by a well informed entity.
> > >
> > > Can you name for me one large scale system that has a single entity,
> > > human or computer, that knows everything about the system and can make
> > > good decisions quickly?
> > >
> > > This problem is screaming to be broken up, de-coupled, and distributed.
> > >
> > > I keep asking myself these questions:
> > >
> > > Why are all of the compute nodes informing all of the schedulers?
> >
>  >
> > > Why are all of the schedulers expecting to know about all of the compute
> > nodes?
> >
> 
> So the scheduler can try to find the globally optimum solution, see below.
> 

Right, that seems like a costly requirement that most won't need.

> > >
> > > Can we break this problem up into simpler problems and distribute the
> > load to
> > > the entire system?
> > >
> > > This has been bouncing around in my head for a while now, but as a
> > > shallow observer of nova dev, I feel like there are some well known
> > > scaling techniques which have not been brought up. Here is my idea,
> > > forgive me if I have glossed over something or missed a huge hole:
> > >
> > > * Schedulers break up compute nodes by hash table, only caring about
> > >   those in their hash table.
> > > * Schedulers, upon claiming a compute node by hash table, poll compute
> > >   node directly for its information.
> >
> 
> For people who want to schedule on information that is constantly changing
> (such as CPU load, memory usage, etc.), how often would you poll?
> 

That's a great question. The initial poll is mostly "how are you
now?". After that I'm not sure polling would be the best strategy,
so perhaps a broadcast topic per-scheduler would still make sense.
And perhaps that broadcast topic would be enough to not even need the
initial handshake.

> > > * Requests to boot go into fanout.
> > > * Schedulers get request and try to satisfy using only their own compute
> > >   nodes.
> > > * Failure to boot results in re-insertion in the fanout.
> >
> 
> With this model we lose the ability to find the globally optimum host to
> schedule on, and can only find a locally optimal solution.  Which sounds
> like a reasonable scale trade-off.  Going forward I can imagine nova
> having several different schedulers for different requirements.  Someone
> who is deploying at massive scale will probably accept an optimal
> solution (and a scheduler that scales better), but someone with a smaller
> cloud will want the globally optimum solution.
> 

What you may have missed on first pass is that if you just have 1
scheduler, you do have globally optimum scheduling. So it is not lost;
it is just factored out as you add schedulers. It is also quite simple
to know when to add schedulers: when your boot request latency gets
too high.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-24 Thread Joshua Harlow
As far as the send-only-when-you-have-to idea: that reminds me of this piece 
of work that could be resurrected, which slowed down the periodic updates when 
nothing was changing.

https://review.openstack.org/#/c/26291/

Could be brought back, the concept still feels useful imho. But maybe not to 
others :-P
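
The review above isn't reproduced here, but the general idea might look
something like this sketch (hypothetical names, not the patch itself):

    import hashlib
    import json

    _last_sent = {}

    def maybe_report_state(host, stats, publish):
        """Skip the periodic update entirely when nothing has changed."""
        fingerprint = hashlib.sha1(
            json.dumps(stats, sort_keys=True).encode()).hexdigest()
        if _last_sent.get(host) == fingerprint:
            return False  # same state as last tick: stay quiet
        _last_sent[host] = fingerprint
        publish(host, stats)  # whatever transport carries the update
        return True

In practice you would still want an occasional keepalive so consumers can
tell a quiet node from a dead one.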

From: Boris Pavlovic <bo...@pavlovic.me>
Reply-To: OpenStack Development Mailing List 
<openstack-dev@lists.openstack.org>
Date: Wednesday, July 24, 2013 12:12 PM
To: OpenStack Development Mailing List 
<openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] A simple way to improve nova scheduler

Hi Mike,


On Wed, Jul 24, 2013 at 1:01 AM, Mike Wilson 
<geekinu...@gmail.com> wrote:
Again I can only speak for qpid, but it's not really a big load on the qpidd 
server itself. I think the issue is that the updates come in serially into each 
scheduler that you have running. We don't process those quickly enough for it 
to do any good, which is why the lookup from db. You can see this for yourself 
using the fake hypervisor, launch yourself a bunch of simulated nova-compute, 
launch a nova-scheduler on the same host and even with 1k or so you will notice 
the latency between the update being sent and the update actually meaning 
anything for the scheduler.

I think a few points that have been brought up could mitigate this quite a bit. 
My personal view is the following:

-Only update when you have to (ie. 10k nodes all sending update every periodic 
interval is heavy, only send when you have to)
-Don't fanout to schedulers, update a single scheduler which in turn updates a 
shared store that is fast such as memcache

I guess that effectively is what you are proposing with the added twist of the 
shared store.


Absolutely agree with this. Especially with using memcached (or redis) as 
common storage for all schedulers.

Best regards,
Boris Pavlovic
---
Mirantis Inc.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-24 Thread Boris Pavlovic
Hi Mike,


On Wed, Jul 24, 2013 at 1:01 AM, Mike Wilson  wrote:

> Again I can only speak for qpid, but it's not really a big load on the
> qpidd server itself. I think the issue is that the updates come in serially
> into each scheduler that you have running. We don't process those quickly
> enough for it to do any good, which is why the lookup from db. You can see
> this for yourself using the fake hypervisor, launch yourself a bunch of
> simulated nova-compute, launch a nova-scheduler on the same host and even
> with 1k or so you will notice the latency between the update being sent and
> the update actually meaning anything for the scheduler.
>
> I think a few points that have been brought up could mitigate this quite a
> bit. My personal view is the following:
>
> -Only update when you have to (ie. 10k nodes all sending update every
> periodic interval is heavy, only send when you have to)
> -Don't fanout to schedulers, update a single scheduler which in turn
> updates a shared store that is fast such as memcache
>
> I guess that effectively is what you are proposing with the added twist of
> the shared store.
>


Absolutely agree with this. Especially with using memcached (or redis) as
common storage for all schedulers.

Best regards,
Boris Pavlovic
---
Mirantis Inc.
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-24 Thread Joe Gordon
On Wed, Jul 24, 2013 at 12:24 PM, Russell Bryant  wrote:

> On 07/23/2013 06:00 PM, Clint Byrum wrote:
> > This is really interesting work, thanks for sharing it with us. The
> > discussion that has followed has brought up some thoughts I've had for
> > a while about this choke point in what is supposed to be an extremely
> > scalable cloud platform (OpenStack).
> >
> > I feel like the discussions have all been centered around making "the"
> > scheduler(s) intelligent.  There seems to be a commonly held belief that
> > scheduling is a single step, and should be done with as much knowledge
> > of the system as possible by a well informed entity.
> >
> > Can you name for me one large scale system that has a single entity,
> > human or computer, that knows everything about the system and can make
> > good decisions quickly?
> >
> > This problem is screaming to be broken up, de-coupled, and distributed.
> >
> > I keep asking myself these questions:
> >
> > Why are all of the compute nodes informing all of the schedulers?
>
 >
> > Why are all of the schedulers expecting to know about all of the compute
> nodes?
>

So the scheduler can try to find the globally optimum solution, see below.


> >
> > Can we break this problem up into simpler problems and distribute the
> load to
> > the entire system?
> >
> > This has been bouncing around in my head for a while now, but as a
> > shallow observer of nova dev, I feel like there are some well known
> > scaling techniques which have not been brought up. Here is my idea,
> > forgive me if I have glossed over something or missed a huge hole:
> >
> > * Schedulers break up compute nodes by hash table, only caring about
> >   those in their hash table.
> > * Schedulers, upon claiming a compute node by hash table, poll compute
> >   node directly for its information.
>

For people who want to schedule on information that is constantly changing
(such as CPU load, memory usage, etc.), how often would you poll?


> > * Requests to boot go into fanout.
> > * Schedulers get request and try to satisfy using only their own compute
> >   nodes.
> > * Failure to boot results in re-insertion in the fanout.
>

With this model we lose the ability to find the globally optimum host to
schedule on, and can only find a locally optimal solution.  Which sounds
like a reasonable scale trade-off.  Going forward I can imagine nova
having several different schedulers for different requirements.  Someone
who is deploying at massive scale will probably accept an optimal
solution (and a scheduler that scales better), but someone with a smaller
cloud will want the globally optimum solution.


> >
> > This gives up the certainty that the scheduler will find a compute node
> > for a boot request on the first try. It is also possible that a request
> > gets unlucky and takes a long time to find the one scheduler that has
> > the one last "X" resource that it is looking for. There are some further
> > optimization strategies that can be employed (like queues based on hashes
> > already tried.. etc).
> >
> > Anyway, I don't see any point in trying to hot-rod the intelligent
> > scheduler to go super fast, when we can just optimize for having many
> > many schedulers doing the same body of work without blocking and without
> > pounding a database.
>
> These are some *very* good observations.  I'd like all of the nova folks
> interested in this area to give some deep consideration to this type of
> approach.
>
>
I agree an approach like this is very interesting and is something worth
exploring, especially at the summit.   There are some clear pros and cons
to an approach like this.  For example this will scale better, but cannot
find the optimum node to schedule on.  My question is, at what scale does
it make sense to adopt an approach like this?  And how can we improve our
current scheduler to scale better, not that it will ever scale better than
the idea proposed here.

While talking about scale there are some other big issues, such as RPC,
that need to be sorted out as well.


>  --
> Russell Bryant
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-24 Thread Russell Bryant
On 07/23/2013 06:00 PM, Clint Byrum wrote:
> This is really interesting work, thanks for sharing it with us. The
> discussion that has followed has brought up some thoughts I've had for
> a while about this choke point in what is supposed to be an extremely
> scalable cloud platform (OpenStack).
> 
> I feel like the discussions have all been centered around making "the"
> scheduler(s) intelligent.  There seems to be a commonly held belief that
> scheduling is a single step, and should be done with as much knowledge
> of the system as possible by a well informed entity.
> 
> Can you name for me one large scale system that has a single entity,
> human or computer, that knows everything about the system and can make
> good decisions quickly?
> 
> This problem is screaming to be broken up, de-coupled, and distributed.
> 
> I keep asking myself these questions:
> 
> Why are all of the compute nodes informing all of the schedulers?
> 
> Why are all of the schedulers expecting to know about all of the compute 
> nodes?
> 
> Can we break this problem up into simpler problems and distribute the load to
> the entire system?
> 
> This has been bouncing around in my head for a while now, but as a
> shallow observer of nova dev, I feel like there are some well known
> scaling techniques which have not been brought up. Here is my idea,
> forgive me if I have glossed over something or missed a huge hole:
> 
> * Schedulers break up compute nodes by hash table, only caring about
>   those in their hash table.
> * Schedulers, upon claiming a compute node by hash table, poll compute
>   node directly for its information.
> * Requests to boot go into fanout.
> * Schedulers get request and try to satisfy using only their own compute
>   nodes.
> * Failure to boot results in re-insertion in the fanout.
> 
> This gives up the certainty that the scheduler will find a compute node
> for a boot request on the first try. It is also possible that a request
> gets unlucky and takes a long time to find the one scheduler that has
> the one last "X" resource that it is looking for. There are some further
> optimization strategies that can be employed (like queues based on hashes
> already tried.. etc).
> 
> Anyway, I don't see any point in trying to hot-rod the intelligent
> scheduler to go super fast, when we can just optimize for having many
> many schedulers doing the same body of work without blocking and without
> pounding a database.

These are some *very* good observations.  I'd like all of the nova folks
interested in this area to give some deep consideration to this type of
approach.

-- 
Russell Bryant

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-23 Thread Joshua Harlow
I like the idea clint.

It appears to me that the kind of scheduler 'buckets' that are being
established allow for different kinds of policies around how accurate and
how 'global' the deployer wants scheduling to be (which might be
differing policies depending on the deployer). All of these kinds of
reasons get even more problematic when you start to do cross-resource
scheduling (volumes near compute nodes), which is, I think, why there
were proposals for a kind of unified scheduling 'framework' (its own
project?) that focuses on this type of work. Said project still seems
appropriate in my mind (and is desperately needed to handle the
cross-resource scheduling concerns).

- https://etherpad.openstack.org/UnifiedResourcePlacement

I'm unsure what the nova folks (and other projects that have similar
scheduling concepts) think about such a thing existing, but from the last
summit there was talk about possibly figuring out how to do that. It is of
course a lot of refactoring (and cross-project refactoring) to get there,
but it seems like it would be very beneficial if all projects involved
with resource scheduling could use a single 'thing' to update resource
information and to ask for scheduling decisions (i.e., providing a list of
desired resources and getting back where those resources are, a
reservation on those resources, with a later commit of those resources,
so that the resources are freed if the process asking for them fails).
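
A toy model of that reserve/commit contract, in a single process with
hypothetical names (a real service would be distributed and persistent):

    import uuid

    class PlacementService:
        """Minimal reserve/commit/rollback resource accounting."""

        def __init__(self, capacity):
            self.free = dict(capacity)  # resource name -> units available
            self.pending = {}           # reservation id -> claimed units

        def reserve(self, claim):
            """Ask for a set of resources; get a reservation handle back."""
            if any(self.free.get(r, 0) < n for r, n in claim.items()):
                return None  # cannot satisfy the whole claim
            for r, n in claim.items():
                self.free[r] -= n
            rid = str(uuid.uuid4())
            self.pending[rid] = claim
            return rid

        def commit(self, rid):
            """The requester succeeded; the claim becomes permanent."""
            self.pending.pop(rid)

        def rollback(self, rid):
            """The requester failed; free the reserved resources."""
            for r, n in self.pending.pop(rid).items():
                self.free[r] += n

    svc = PlacementService({'vcpus': 8, 'ram_mb': 16384})
    rid = svc.reserve({'vcpus': 2, 'ram_mb': 2048})
    svc.commit(rid)  # or svc.rollback(rid) if the boot failed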

-Josh

On 7/23/13 3:00 PM, "Clint Byrum"  wrote:

>Excerpts from Boris Pavlovic's message of 2013-07-19 07:52:55 -0700:
>> Hi all,
>> 
>> 
>> 
>> At Mirantis, Alexey Ovtchinnikov and I are working on nova scheduler
>> improvements.
>> 
>> As far as we can see the problem, now scheduler has two major issues:
>> 
>> 
>> 1) Scalability. Factors that contribute to bad scalability are these:
>> 
>> *) Each compute node every periodic task interval (60 sec by default)
>> updates resources state in DB.
>> 
>> *) On every boot request scheduler has to fetch information about all
>> compute nodes from DB.
>> 
>> 2) Flexibility. Flexibility perishes due to problems with:
>> 
>> *) Adding new complex resources (such as big lists of complex objects
>> e.g.
>> required by PCI Passthrough
>> https://review.openstack.org/#/c/34644/5/nova/db/sqlalchemy/models.py)
>> 
>> *) Using different sources of data in Scheduler for example from cinder
>>or
>> ceilometer.
>> 
>> (as required by Volume Affinity Filter
>> https://review.openstack.org/#/c/29343/)
>> 
>> 
>> We found a simple way to mitigate these issues by avoiding DB usage for
>> host state storage.
>> 
>> 
>> A more detailed discussion of the problem state and one of a possible
>> solution can be found here:
>> 
>> 
>> https://docs.google.com/document/d/1_DRv7it_mwalEZzLy5WO92TJcummpmWL4NWsWf0UWiQ/edit#
>> 
>
>This is really interesting work, thanks for sharing it with us. The
>discussion that has followed has brought up some thoughts I've had for
>a while about this choke point in what is supposed to be an extremely
>scalable cloud platform (OpenStack).
>
>I feel like the discussions have all been centered around making "the"
>scheduler(s) intelligent.  There seems to be a commonly held belief that
>scheduling is a single step, and should be done with as much knowledge
>of the system as possible by a well informed entity.
>
>Can you name for me one large scale system that has a single entity,
>human or computer, that knows everything about the system and can make
>good decisions quickly?
>
>This problem is screaming to be broken up, de-coupled, and distributed.
>
>I keep asking myself these questions:
>
>Why are all of the compute nodes informing all of the schedulers?
>
>Why are all of the schedulers expecting to know about all of the compute
>nodes?
>
>Can we break this problem up into simpler problems and distribute the
>load to
>the entire system?
>
>This has been bouncing around in my head for a while now, but as a
>shallow observer of nova dev, I feel like there are some well known
>scaling techniques which have not been brought up. Here is my idea,
>forgive me if I have glossed over something or missed a huge hole:
>
>* Schedulers break up compute nodes by hash table, only caring about
>  those in their hash table.
>* Schedulers, upon claiming a compute node by hash table, poll compute
>  node directly for its information.
>* Requests to boot go into fanout.
>* Schedulers get request and try to satisfy using only their own compute
>  nodes.
>* Failure to boot results in re-insertion in the fanout.
>
>This gives up the certainty that the scheduler will find a compute node
>for a boot request on the first try. It is also possible that a request
>gets unlucky and takes a long time to find the one scheduler that has
>the one last "X" resource that it is looking for. There are some further
>optimization strategies that can be employed (like queues based on hashes
>already tried.. etc).
>
>Anyway, I don't see any point in trying to hot-rod the intelligent
>scheduler to go super fast, when we can just optimize for having many
>many schedulers doing the same body of work without blocking and without
>pounding a database.

Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-23 Thread Clint Byrum
Excerpts from Boris Pavlovic's message of 2013-07-19 07:52:55 -0700:
> Hi all,
> 
> 
> 
> At Mirantis, Alexey Ovtchinnikov and I are working on nova scheduler
> improvements.
> 
> As far as we can see the problem, now scheduler has two major issues:
> 
> 
> 1) Scalability. Factors that contribute to bad scalability are these:
> 
> *) Each compute node every periodic task interval (60 sec by default)
> updates resources state in DB.
> 
> *) On every boot request scheduler has to fetch information about all
> compute nodes from DB.
> 
> 2) Flexibility. Flexibility perishes due to problems with:
> 
> *) Adding new complex resources (such as big lists of complex objects e.g.
> required by PCI Passthrough
> https://review.openstack.org/#/c/34644/5/nova/db/sqlalchemy/models.py)
> 
> *) Using different sources of data in Scheduler for example from cinder or
> ceilometer.
> 
> (as required by Volume Affinity Filter
> https://review.openstack.org/#/c/29343/)
> 
> 
> We found a simple way to mitigate these issues by avoiding DB usage for
> host state storage.
> 
> 
> A more detailed discussion of the problem state and one of a possible
> solution can be found here:
> 
> https://docs.google.com/document/d/1_DRv7it_mwalEZzLy5WO92TJcummpmWL4NWsWf0UWiQ/edit#
> 

This is really interesting work, thanks for sharing it with us. The
discussion that has followed has brought up some thoughts I've had for
a while about this choke point in what is supposed to be an extremely
scalable cloud platform (OpenStack).

I feel like the discussions have all been centered around making "the"
scheduler(s) intelligent.  There seems to be a commonly held belief that
scheduling is a single step, and should be done with as much knowledge
of the system as possible by a well informed entity.

Can you name for me one large scale system that has a single entity,
human or computer, that knows everything about the system and can make
good decisions quickly?

This problem is screaming to be broken up, de-coupled, and distributed.

I keep asking myself these questions:

Why are all of the compute nodes informing all of the schedulers?

Why are all of the schedulers expecting to know about all of the compute nodes?

Can we break this problem up into simpler problems and distribute the load to
the entire system?

This has been bouncing around in my head for a while now, but as a
shallow observer of nova dev, I feel like there are some well known
scaling techniques which have not been brought up. Here is my idea,
forgive me if I have glossed over something or missed a huge hole:

* Schedulers break up compute nodes by hash table, only caring about
  those in their hash table.
* Schedulers, upon claiming a compute node by hash table, poll compute
  node directly for its information.
* Requests to boot go into fanout.
* Schedulers get request and try to satisfy using only their own compute
  nodes.
* Failure to boot results in re-insertion in the fanout.

This gives up the certainty that the scheduler will find a compute node
for a boot request on the first try. It is also possible that a request
gets unlucky and takes a long time to find the one scheduler that has
the one last "X" resource that it is looking for. There are some further
optimization strategies that can be employed (like queues based on hashes
already tried.. etc).

Anyway, I don't see any point in trying to hot-rod the intelligent
scheduler to go super fast, when we can just optimize for having many
many schedulers doing the same body of work without blocking and without
pounding a database.
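
A toy sketch of the hash-partitioning idea in plain Python (illustrative
names; simple modulo hashing rather than a proper consistent-hash ring):

    import hashlib

    def owner(node_name, n_schedulers):
        """Map a compute node to the scheduler responsible for it."""
        digest = hashlib.md5(node_name.encode()).hexdigest()
        return int(digest, 16) % n_schedulers

    class Scheduler:
        def __init__(self, index, n_schedulers):
            self.index = index
            self.n = n_schedulers
            self.my_nodes = {}  # node name -> last polled state

        def claim(self, all_nodes, poll):
            """Only care about nodes in our bucket; poll them directly."""
            for node in all_nodes:
                if owner(node, self.n) == self.index:
                    self.my_nodes[node] = poll(node)  # no DB involved

        def try_boot(self, ram_mb):
            """Try to satisfy a fanout boot request from our own nodes."""
            for node, state in self.my_nodes.items():
                if state['free_ram_mb'] >= ram_mb:
                    state['free_ram_mb'] -= ram_mb
                    return node
            return None  # caller re-inserts the request into the fanout

    poll = lambda node: {'free_ram_mb': 8192}  # stand-in for a direct poll
    schedulers = [Scheduler(i, 4) for i in range(4)]
    for s in schedulers:
        s.claim(['compute-%d' % n for n in range(16)], poll)
    print(schedulers[0].try_boot(ram_mb=2048))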

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-23 Thread Mike Wilson
Again I can only speak for qpid, but it's not really a big load on the
qpidd server itself. I think the issue is that the updates come in serially
into each scheduler that you have running. We don't process those quickly
enough for it to do any good, which is why the lookup from db. You can see
this for yourself using the fake hypervisor, launch yourself a bunch of
simulated nova-compute, launch a nova-scheduler on the same host and even
with 1k or so you will notice the latency between the update being sent and
the update actually meaning anything for the scheduler.

I think a few points that have been brought up could mitigate this quite a
bit. My personal view is the following:

-Only update when you have to (ie. 10k nodes all sending update every
periodic interval is heavy, only send when you have to)
-Don't fanout to schedulers, update a single scheduler which in turn
updates a shared store that is fast such as memcache

I guess that effectively is what you are proposing with the added twist of
the shared store.
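
A minimal sketch of that shared-store idea, assuming the pymemcache client
and made-up key names (the race on the index key is ignored for brevity):

    import json
    from pymemcache.client.base import Client

    mc = Client(('127.0.0.1', 11211))

    def on_compute_update(host, stats):
        """A single consumer drains the update topic and writes host state
        into the shared store instead of fanning out to every scheduler."""
        mc.set('host_state:%s' % host, json.dumps(stats).encode(),
               expire=120)  # stale entries age out if a node goes quiet
        hosts = json.loads(mc.get('host_index') or b'[]')
        if host not in hosts:
            hosts.append(host)  # unsynchronized read-modify-write: sketch only
            mc.set('host_index', json.dumps(hosts).encode())

    def get_all_host_states():
        """Any scheduler reads the shared store, not the nova DB."""
        hosts = json.loads(mc.get('host_index') or b'[]')
        states = mc.get_many(['host_state:%s' % h for h in hosts])
        return {key: json.loads(value) for key, value in states.items()}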

-Mike


On Tue, Jul 23, 2013 at 2:25 PM, Boris Pavlovic  wrote:

> Joe,
> Sure we will.
>
> Mike,
> Thanks for sharing information about scalability problems; the
> presentation was great.
> Also, could you say what you think: is 150 req/sec a big load for
> qpid or rabbit? I think it is just nothing.
>
>
> Best regards,
> Boris Pavlovic
> ---
> Mirantis Inc.
>
>
>
> On Wed, Jul 24, 2013 at 12:17 AM, Joe Gordon wrote:
>
>>
>>
>>
>> On Tue, Jul 23, 2013 at 1:09 PM, Boris Pavlovic wrote:
>>
>>> Ian,
>>>
>>> There are serious scalability and performance problems with DB usage in
>>> the current scheduler.
>>> Rapid updates + joins make the current solution absolutely not scalable.
>>>
>>> The Bluehost example just shows, personally for me, a trivial thing. (It
>>> just won't work.)
>>>
>>> Tomorrow we will add another graphic:
>>> avg user req/sec in the current and our approaches.
>>>
>>
>> Will you be releasing your code to generate the results? Without that the
>> graphic isn't very useful
>>
>>
>>> I hope it will help you to better understand the situation.
>>>
>>>
>>> Joshua,
>>>
>>> Our current discussion is about whether we could safely remove
>>> information about compute nodes from Nova.
>>> Both our approach and yours will remove data from the nova DB.
>>>
>>> Also your approach adds much more:
>>> 1) network load
>>> 2) latency
>>> 3) one more service (memcached)
>>>
>>> So I am not sure that it is better than just sending the information
>>> directly to the scheduler.
>>>
>>>
>>> Best regards,
>>> Boris Pavlovic
>>> ---
>>> Mirantis Inc.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jul 23, 2013 at 11:56 PM, Joe Gordon wrote:
>>>

 On Jul 23, 2013 3:44 PM, "Ian Wells"  wrote:
 >
 > > * periodic updates can overwhelm things.  Solution: remove unneeded
 updates,
 > > most scheduling data only changes when an instance does some state
 change.
 >
 > It's not clear that periodic updates do overwhelm things, though.
 > Boris ran the tests.  Apparently 10k nodes updating once a minute
 > extend the read query by ~10% (the main problem being the read query
 > is abysmal in the first place).  I don't know how much of the rest of
 > the infrastructure was involved in his test, though (RabbitMQ,
 > Conductor).

 A great openstack at scale talk, that covers the scheduler
 http://www.bluehost.com/blog/bluehost/bluehost-presents-operational-case-study-at-openstack-summit-2111

 >
 > There are reasonably solid reasons why we would want an alternative to
 > the DB backend, but I'm not sure the update rate is one of them.   If
 > we were going for an alternative the obvious candidate to my mind
 > would be something like ZooKeeper (particularly since in some setups
 > it's already a channel between the compute hosts and the control
 > server).
 > --
 > Ian.
 >
 > ___
 > OpenStack-dev mailing list
 > OpenStack-dev@lists.openstack.org
 > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


>>>
>>> ___
>>> OpenStack-dev mailing list
>>> OpenStack-dev@lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>
>>>
>>
>> ___
>> OpenStack-dev mailing list
>> OpenStack-dev@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>>
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-23 Thread Boris Pavlovic
Joe,
Sure we will.

Mike,
Thanks for sharing information about scalability problems; the presentation
was great.
Also, could you say what you think: is 150 req/sec a big load for qpid
or rabbit? I think it is just nothing.


Best regards,
Boris Pavlovic
---
Mirantis Inc.



On Wed, Jul 24, 2013 at 12:17 AM, Joe Gordon  wrote:

>
>
>
> On Tue, Jul 23, 2013 at 1:09 PM, Boris Pavlovic  wrote:
>
>> Ian,
>>
>> There are serious scalability and performance problems with DB usage in
>> the current scheduler.
>> Rapid updates + joins make the current solution absolutely not scalable.
>>
>> The Bluehost example just shows, personally for me, a trivial thing. (It
>> just won't work.)
>>
>> Tomorrow we will add another graphic:
>> avg user req/sec in the current and our approaches.
>>
>
> Will you be releasing your code to generate the results? Without that the
> graphic isn't very useful
>
>
>> I hope it will help you to better understand the situation.
>>
>>
>> Joshua,
>>
>> Our current discussion is about whether we could safely remove
>> information about compute nodes from Nova.
>> Both our approach and yours will remove data from the nova DB.
>>
>> Also your approach adds much more:
>> 1) network load
>> 2) latency
>> 3) one more service (memcached)
>>
>> So I am not sure that it is better than just sending the information
>> directly to the scheduler.
>>
>>
>> Best regards,
>> Boris Pavlovic
>> ---
>> Mirantis Inc.
>>
>>
>>
>>
>>
>>
>> On Tue, Jul 23, 2013 at 11:56 PM, Joe Gordon wrote:
>>
>>>
>>> On Jul 23, 2013 3:44 PM, "Ian Wells"  wrote:
>>> >
>>> > > * periodic updates can overwhelm things.  Solution: remove unneeded
>>> updates,
>>> > > most scheduling data only changes when an instance does some state
>>> change.
>>> >
>>> > It's not clear that periodic updates do overwhelm things, though.
>>> > Boris ran the tests.  Apparently 10k nodes updating once a minute
>>> > extend the read query by ~10% (the main problem being the read query
>>> > is abysmal in the first place).  I don't know how much of the rest of
>>> > the infrastructure was involved in his test, though (RabbitMQ,
>>> > Conductor).
>>>
>>> A great openstack at scale talk, that covers the scheduler
>>> http://www.bluehost.com/blog/bluehost/bluehost-presents-operational-case-study-at-openstack-summit-2111
>>>
>>> >
>>> > There are reasonably solid reasons why we would want an alternative to
>>> > the DB backend, but I'm not sure the update rate is one of them.   If
>>> > we were going for an alternative the obvious candidate to my mind
>>> > would be something like ZooKeeper (particularly since in some setups
>>> > it's already a channel between the compute hosts and the control
>>> > server).
>>> > --
>>> > Ian.
>>> >
>>> > ___
>>> > OpenStack-dev mailing list
>>> > OpenStack-dev@lists.openstack.org
>>> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>
>>> ___
>>> OpenStack-dev mailing list
>>> OpenStack-dev@lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>
>>>
>>
>> ___
>> OpenStack-dev mailing list
>> OpenStack-dev@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>>
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-23 Thread Joe Gordon
On Tue, Jul 23, 2013 at 1:09 PM, Boris Pavlovic  wrote:

> Ian,
>
> There are serious scalability and performance problems with DB usage in
> the current scheduler.
> Rapid updates + joins make the current solution absolutely not scalable.
>
> The Bluehost example just shows, personally for me, a trivial thing. (It
> just won't work.)
>
> Tomorrow we will add another graphic:
> avg user req/sec in the current and our approaches.
>

Will you be releasing your code to generate the results? Without that the
graphic isn't very useful


> I hope it will help you to better understand the situation.
>
>
> Joshua,
>
> Our current discussion is about whether we could safely remove
> information about compute nodes from Nova.
> Both our approach and yours will remove data from the nova DB.
>
> Also your approach adds much more:
> 1) network load
> 2) latency
> 3) one more service (memcached)
>
> So I am not sure that it is better than just sending the information
> directly to the scheduler.
>
>
> Best regards,
> Boris Pavlovic
> ---
> Mirantis Inc.
>
>
>
>
>
>
> On Tue, Jul 23, 2013 at 11:56 PM, Joe Gordon wrote:
>
>>
>> On Jul 23, 2013 3:44 PM, "Ian Wells"  wrote:
>> >
>> > > * periodic updates can overwhelm things.  Solution: remove unneeded
>> updates,
>> > > most scheduling data only changes when an instance does some state
>> change.
>> >
>> > It's not clear that periodic updates do overwhelm things, though.
>> > Boris ran the tests.  Apparently 10k nodes updating once a minute
>> > extend the read query by ~10% (the main problem being the read query
>> > is abysmal in the first place).  I don't know how much of the rest of
>> > the infrastructure was involved in his test, though (RabbitMQ,
>> > Conductor).
>>
>> A great openstack at scale talk, that covers the scheduler
>> http://www.bluehost.com/blog/bluehost/bluehost-presents-operational-case-study-at-openstack-summit-2111
>>
>> >
>> > There are reasonably solid reasons why we would want an alternative to
>> > the DB backend, but I'm not sure the update rate is one of them.   If
>> > we were going for an alternative the obvious candidate to my mind
>> > would be something like ZooKeeper (particularly since in some setups
>> > it's already a channel between the compute hosts and the control
>> > server).
>> > --
>> > Ian.
>> >
>> > ___
>> > OpenStack-dev mailing list
>> > OpenStack-dev@lists.openstack.org
>> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>> ___
>> OpenStack-dev mailing list
>> OpenStack-dev@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>>
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-23 Thread Mike Wilson
Just some added info for that talk, we are using qpid as our messaging
backend. I have no data for RabbitMQ, but our schedulers are _always_
behind on processing updates. It may be different with rabbit.

-Mike


On Tue, Jul 23, 2013 at 1:56 PM, Joe Gordon  wrote:

>
> On Jul 23, 2013 3:44 PM, "Ian Wells"  wrote:
> >
> > > * periodic updates can overwhelm things.  Solution: remove unneeded
> updates,
> > > most scheduling data only changes when an instance does some state
> change.
> >
> > It's not clear that periodic updates do overwhelm things, though.
> > Boris ran the tests.  Apparently 10k nodes updating once a minute
> > extend the read query by ~10% (the main problem being the read query
> > is abysmal in the first place).  I don't know how much of the rest of
> > the infrastructure was involved in his test, though (RabbitMQ,
> > Conductor).
>
> A great openstack at scale talk, that covers the scheduler
> http://www.bluehost.com/blog/bluehost/bluehost-presents-operational-case-study-at-openstack-summit-2111
>
> >
> > There are reasonably solid reasons why we would want an alternative to
> > the DB backend, but I'm not sure the update rate is one of them.   If
> > we were going for an alternative the obvious candidate to my mind
> > would be something like ZooKeeper (particularly since in some setups
> > it's already a channel between the compute hosts and the control
> > server).
> > --
> > Ian.
> >
> > ___
> > OpenStack-dev mailing list
> > OpenStack-dev@lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-23 Thread Boris Pavlovic
Ian,

There are serious scalability and performance problems with DB usage in
the current scheduler.
Rapid updates + joins make the current solution absolutely not scalable.

The Bluehost example just shows, personally for me, a trivial thing. (It
just won't work.)

Tomorrow we will add another graphic:
avg user req/sec in the current and our approaches.

I hope it will help you to better understand the situation.


Joshua,

Our current discussion is about whether we could safely remove
information about compute nodes from Nova.
Both our approach and yours will remove data from the nova DB.

Also your approach adds much more:
1) network load
2) latency
3) one more service (memcached)

So I am not sure that it is better than just sending the information
directly to the scheduler.


Best regards,
Boris Pavlovic
---
Mirantis Inc.






On Tue, Jul 23, 2013 at 11:56 PM, Joe Gordon  wrote:

>
> On Jul 23, 2013 3:44 PM, "Ian Wells"  wrote:
> >
> > > * periodic updates can overwhelm things.  Solution: remove unneeded
> updates,
> > > most scheduling data only changes when an instance does some state
> change.
> >
> > It's not clear that periodic updates do overwhelm things, though.
> > Boris ran the tests.  Apparently 10k nodes updating once a minute
> > extend the read query by ~10% (the main problem being the read query
> > is abysmal in the first place).  I don't know how much of the rest of
> > the infrastructure was involved in his test, though (RabbitMQ,
> > Conductor).
>
> A great openstack at scale talk, that covers the scheduler
> http://www.bluehost.com/blog/bluehost/bluehost-presents-operational-case-study-at-openstack-summit-2111
>
> >
> > There are reasonably solid reasons why we would want an alternative to
> > the DB backend, but I'm not sure the update rate is one of them.   If
> > we were going for an alternative the obvious candidate to my mind
> > would be something like ZooKeeper (particularly since in some setups
> > it's already a channel between the compute hosts and the control
> > server).
> > --
> > Ian.
> >
> > ___
> > OpenStack-dev mailing list
> > OpenStack-dev@lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-23 Thread Joe Gordon
On Jul 23, 2013 3:44 PM, "Ian Wells"  wrote:
>
> > * periodic updates can overwhelm things.  Solution: remove unneeded
> > updates; most scheduling data only changes when an instance does some
> > state change.
>
> It's not clear that periodic updates do overwhelm things, though.
> Boris ran the tests.  Apparently 10k nodes updating once a minute
> extend the read query time by ~10% (the main problem being that the
> read query is abysmal in the first place).  I don't know how much of
> the rest of the infrastructure was involved in his test, though
> (RabbitMQ, Conductor).

A great OpenStack-at-scale talk that covers the scheduler:
http://www.bluehost.com/blog/bluehost/bluehost-presents-operational-case-study-at-openstack-summit-2111

>
> There are reasonably solid reasons why we would want an alternative to
> the DB backend, but I'm not sure the update rate is one of them.   If
> we were going for an alternative the obvious candidate to my mind
> would be something like ZooKeeper (particularly since in some setups
> it's already a channel between the compute hosts and the control
> server).
> --
> Ian.
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-23 Thread Ian Wells
> * periodic updates can overwhelm things.  Solution: remove unneeded updates;
> most scheduling data only changes when an instance does some state change.

It's not clear that periodic updates do overwhelm things, though.
Boris ran the tests.  Apparently 10k nodes updating once a minute
extend the read query time by ~10% (the main problem being that the read
query is abysmal in the first place).  I don't know how much of the rest
of the infrastructure was involved in his test, though (RabbitMQ,
Conductor).

There are reasonably solid reasons why we would want an alternative to
the DB backend, but I'm not sure the update rate is one of them.   If
we were going for an alternative the obvious candidate to my mind
would be something like ZooKeeper (particularly since in some setups
it's already a channel between the compute hosts and the control
server).
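For what it's worth, a minimal sketch of what the ZooKeeper alternative
could look like using the kazoo client (the paths and payload format here
are my own illustrative assumptions, not a proposed design):

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

def register_compute(hostname, resources):
    # Ephemeral node: it disappears automatically if the compute host
    # dies, so schedulers never see stale hosts.
    zk.create('/compute/%s' % hostname, json.dumps(resources).encode(),
              ephemeral=True, makepath=True)

def read_all_hosts():
    # Scheduler side: the "state of the world" without touching the DB.
    hosts = {}
    for child in zk.get_children('/compute'):
        data, _stat = zk.get('/compute/%s' % child)
        hosts[child] = json.loads(data)
    return hosts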
-- 
Ian.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-23 Thread Joe Gordon
On Jul 22, 2013 7:13 PM, "Joshua Harlow"  wrote:
>
> An interesting idea, I'm not sure how useful it is but it could be.
>
> If you think of the compute node capability information as an 'event
stream' then you could imagine using something like apache flume (
http://flume.apache.org/) or storm (http://storm-project.net/) to be able
to sit on this stream and perform real-time analytics of said stream to
update how scheduling can be performed. Maybe the MQ or ceilometer can be
the same 'stream' source but it doesn't seem like it is needed to 'tie' the
impl to those methods. If you consider compute nodes as producers of said
data and then hook a real-time processing engine on top that can adjust
some scheduling database used by a scheduler, then it seems like you could
vary how often compute nodes produce said stream info, and where and how said
stream info is stored and analyzed, which will allow you to then adjust how
'real-time' you want said compute scheduling capability information to be
up to date.

Interesting idea, but I am not sure if it's the right solution.  There are
two known issues today:
* periodic updates can overwhelm things.  Solution: remove unneeded
updates; most scheduling data only changes when an instance does some state
change.
* according to Boris, doing a get-all-hosts from the DB doesn't scale.
Solution: there are several possibilities.

Neither scaling issue today is helped by Flume.  But this concept may be
useful in the future.

>
> It just seems that real-time processing is a similar model to what is
needed here.
>
> Maybe something like that is where this should end up?
>
> -Josh
>
> From: Joe Gordon 
> Reply-To: OpenStack Development Mailing List <
openstack-dev@lists.openstack.org>
> Date: Monday, July 22, 2013 3:47 PM
> To: OpenStack Development Mailing List 
>
> Subject: Re: [openstack-dev] A simple way to improve nova scheduler
>
>
>
>
> On Mon, Jul 22, 2013 at 5:16 AM, Boris Pavlovic  wrote:
>>
>> Joe,
>>
>> >> Speaking of Chris Behrens: "Relying on anything but the DB for
current memory free, etc, is just too laggy… so we need to stick with it,
IMO."
http://lists.openstack.org/pipermail/openstack-dev/2013-June/010485.html
>>
>> It doesn't scale, uses tons of resources, works slowly, and is hard to
extend.
>> Also, the mechanism of getting free and used memory is done by the virt
layer.
>> And the only thing that could be laggy is RPC (but it is also used by
the compute node update).
>
>
> You say it doesn't scale and uses tons of resources; can you show how to
reproduce your findings?  Also, just because the current implementation of
the scheduler is non-optimal doesn't mean that no-DB is the only solution; I
am interested in seeing other possible solutions before going down such a
drastically different road (no-DB).  For example: pushing more of the logic
into the DB instead of searching through all compute nodes in Python space,
or looking at removing the periodic updates altogether, or ???.
>
>>
>>
>>
>> >> * How do you bring a new scheduler up in an existing deployment and
make it get the full state of the system?
>>
>> You should wait for one periodic task interval, and you will get full
>> information about all compute nodes.
>
>
> Sure, that may work; we need to add logic to handle this.
>
>>
>> >> *  Broadcasting RPC updates from compute nodes to the scheduler means
every scheduler has to process  the same RPC message.  And if a deployment
hits the point where the number of compute updates is consuming 99 percent
of the scheduler's time just adding another scheduler won't fix anything as
it will get bombarded too.
>>
>>
>> If we are speaking about numbers, you can see our doc, where they are
counted.
>> If we have 10k nodes it will make only 150 RPC calls/sec (which is
nothing for the CPU). By the way, we will also remove those 150 calls/s
from the conductor. One more thing: in a 10k-node deployment I think we
will spend almost all our time waiting on the DB (compute_node_get_all()).
And when we call this method we have to process all the data gathered over
60 sec. (So in numbers, the scheduler side does 60 * requests_per_sec
times the work of our approach, which means that above 1 request per
second the current approach produces more CPU load.)
>
>
> There are deployments in production (Bluehost) that are already bigger
than 10k nodes; AFAIK the last numbers I heard were 16k nodes, and they
didn't use our scheduler at all. So a better upper limit would be something
like 30k nodes.  At that scale we get 500 RPC broadcasts per second
(assuming a 60-second periodic update) from periodic updates, plus updates
from state changes.  If we assume only 1% of compu

Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-23 Thread Joshua Harlow
Or another idea:

Have each compute node write into redis (thus avoiding saturating the MQ & 
broker/DB with capabilities information) under 2 keys, one that is updated over 
longer periods and one that is updated frequently.

- Possibly like the following

compute-$hostname.slow
compute-$hostname.fast

Now schedulers can either pull from said slow key to get less frequent updates,
or they can subscribe (yes, redis has a subscribe model) to get updates about
the 'fast' information, which will be more accurate.

Since this information is pretty transient, it doesn't seem like we need to use 
a DB and since the MQ is used for control traffic it doesn't seem so good to 
use the MQ for this transient information either.

When a new scheduler comes online, it can basically query the database for
the compute hostnames, then query redis (slow or fast keys) and set up its
own internal state accordingly.

Since redis can be scaled/partitioned pretty easily it seems like it could be a 
useful way to store this type of information.

Thoughts?
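A minimal sketch of how the two-key scheme could look with redis-py (the
payload format and the pub/sub channel name are illustrative assumptions on
my part, not an agreed design):

import json
import redis

r = redis.Redis()

def publish_state(hostname, slow_caps, fast_caps):
    # Compute-node side: write both keys, and also publish the fast update
    # on a channel so subscribed schedulers see it immediately.
    r.set('compute-%s.slow' % hostname, json.dumps(slow_caps))
    r.set('compute-%s.fast' % hostname, json.dumps(fast_caps))
    r.publish('compute-fast-updates', json.dumps({hostname: fast_caps}))

def follow_fast_updates(host_states):
    # Scheduler side: subscribe and fold updates into an in-memory view.
    pubsub = r.pubsub()
    pubsub.subscribe('compute-fast-updates')
    for message in pubsub.listen():
        if message['type'] == 'message':
            host_states.update(json.loads(message['data']))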

From: Joshua Harlow <harlo...@yahoo-inc.com>
Reply-To: OpenStack Development Mailing List <openstack-dev@lists.openstack.org>
Date: Monday, July 22, 2013 4:12 PM
To: OpenStack Development Mailing List <openstack-dev@lists.openstack.org>,
Joe Gordon <joe.gord...@gmail.com>
Subject: Re: [openstack-dev] A simple way to improve nova scheduler

An interesting idea, I'm not sure how useful it is but it could be.

If you think of the compute node capability information as an 'event stream' 
then you could imagine using something like apache flume 
(http://flume.apache.org/) or storm (http://storm-project.net/) to be able to 
sit on this stream and perform real-time analytics of said stream to update how 
scheduling can be performed. Maybe the MQ or ceilometer can be the same 
'stream' source but it doesn't seem like it is needed to 'tie' the impl to 
those methods. If you consider compute nodes as producers of said data and then
hook a real-time processing engine on top that can adjust some scheduling
database used by a scheduler, then it seems like you could vary how often compute
nodes produce said stream info, and where and how said stream info is stored and
analyzed, which will allow you to then adjust how 'real-time' you want said
compute scheduling capability information to be up to date.

It just seems that real-time processing is a similar model to what is needed here.

Maybe something like that is where this should end up?

-Josh

From: Joe Gordon <joe.gord...@gmail.com>
Reply-To: OpenStack Development Mailing List <openstack-dev@lists.openstack.org>
Date: Monday, July 22, 2013 3:47 PM
To: OpenStack Development Mailing List <openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] A simple way to improve nova scheduler




On Mon, Jul 22, 2013 at 5:16 AM, Boris Pavlovic <bo...@pavlovic.me> wrote:
Joe,

>> Speaking of Chris Behrens: "Relying on anything but the DB for current
>> memory free, etc, is just too laggy… so we need to stick with it, IMO."
>> http://lists.openstack.org/pipermail/openstack-dev/2013-June/010485.html

It doesn't scale, uses tons of resources, works slowly, and is hard to extend.
Also, the mechanism of getting free and used memory is done by the virt layer.
And the only thing that could be laggy is RPC (but it is also used by the
compute node update).

You say it doesn't scale and uses tons of resources; can you show how to
reproduce your findings?  Also, just because the current implementation of the
scheduler is non-optimal doesn't mean that no-DB is the only solution; I am
interested in seeing other possible solutions before going down such a
drastically different road (no-DB).  For example: pushing more of the logic
into the DB instead of searching through all compute nodes in Python space,
or looking at removing the periodic updates altogether, or ???.



>> * How do you bring a new scheduler up in an existing deployment and make it 
>> get the full state of the system?

You should wait for one periodic task interval, and you will get full information
about all compute nodes.

Sure, that may work; we need to add logic to handle this.


>> *  Broadcasting RPC updates from compute nodes to the scheduler means every 
>> scheduler has to process  the same RPC message.  And if a deployment hits 
>> the point where the number of compute updates is consuming 99 percent of the 
>> scheduler's time just adding another scheduler won't fix anything as it will 
>> get bombarded too.


If we are speaking about numbers, you can see our doc, where they are
counted.
If we have 10k nodes it will make only 150 RPC calls/sec (which is nothing
for the CPU). By the way we way we w

Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-22 Thread Joshua Harlow
An interesting idea, I'm not sure how useful it is but it could be.

If you think of the compute node capability information as an 'event stream' 
then you could imagine using something like apache flume 
(http://flume.apache.org/) or storm (http://storm-project.net/) to be able to 
sit on this stream and perform real-time analytics of said stream to update how 
scheduling can be performed. Maybe the MQ or ceilometer can be the same 
'stream' source but it doesn't seem like it is needed to 'tie' the impl to 
those methods. If you consider compute nodes as producers of said data and then
hook a real-time processing engine on top that can adjust some scheduling
database used by a scheduler, then it seems like you could vary how often compute
nodes produce said stream info, and where and how said stream info is stored and
analyzed, which will allow you to then adjust how 'real-time' you want said
compute scheduling capability information to be up to date.

It just seems that real-time processing is a similar model to what is needed here.

Maybe something like that is where this should end up?
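A tiny, framework-agnostic sketch of the stream idea above (the event shape
and helper names are my own assumptions; a real deployment would use
Flume/Storm topologies rather than a plain loop):

def fold_capability_stream(events, store):
    # events: iterable of dicts like {'host': ..., 'free_ram_mb': ...},
    # drained from whatever transport carries the stream.
    # store: dict-like scheduling database keyed by host.
    for event in events:
        store[event['host']] = event  # latest-wins, near-real-time view

def best_hosts(store, ram_needed_mb, n=5):
    # A trivial "analytics" query over the folded stream: the n hosts
    # with the most free RAM that fit the request.
    fits = [s for s in store.values() if s['free_ram_mb'] >= ram_needed_mb]
    return sorted(fits, key=lambda s: s['free_ram_mb'], reverse=True)[:n]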

-Josh

From: Joe Gordon <joe.gord...@gmail.com>
Reply-To: OpenStack Development Mailing List <openstack-dev@lists.openstack.org>
Date: Monday, July 22, 2013 3:47 PM
To: OpenStack Development Mailing List <openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] A simple way to improve nova scheduler




On Mon, Jul 22, 2013 at 5:16 AM, Boris Pavlovic <bo...@pavlovic.me> wrote:
Joe,

>> Speaking of Chris Behrens: "Relying on anything but the DB for current
>> memory free, etc, is just too laggy… so we need to stick with it, IMO."
>> http://lists.openstack.org/pipermail/openstack-dev/2013-June/010485.html

It doesn't scale, uses tons of resources, works slowly, and is hard to extend.
Also, the mechanism of getting free and used memory is done by the virt layer.
And the only thing that could be laggy is RPC (but it is also used by the
compute node update).

You say it doesn't scale and uses tons of resources; can you show how to
reproduce your findings?  Also, just because the current implementation of the
scheduler is non-optimal doesn't mean that no-DB is the only solution; I am
interested in seeing other possible solutions before going down such a
drastically different road (no-DB).  For example: pushing more of the logic
into the DB instead of searching through all compute nodes in Python space,
or looking at removing the periodic updates altogether, or ???.



>> * How do you bring a new scheduler up in an existing deployment and make it 
>> get the full state of the system?

You should wait for one periodic task interval, and you will get full information
about all compute nodes.

Sure, that may work; we need to add logic to handle this.


>> *  Broadcasting RPC updates from compute nodes to the scheduler means every 
>> scheduler has to process  the same RPC message.  And if a deployment hits 
>> the point where the number of compute updates is consuming 99 percent of the 
>> scheduler's time just adding another scheduler won't fix anything as it will 
>> get bombarded too.


If we are speaking about numbers, you can see our doc, where they are
counted.
If we have 10k nodes it will make only 150 RPC calls/sec (which is nothing
for the CPU). By the way, we will also remove those 150 calls/s from the
conductor. One more thing: in a 10k-node deployment I think we will spend
almost all our time waiting on the DB (compute_node_get_all()). And when we
call this method we have to process all the data gathered over 60 sec. (So
in numbers, the scheduler side does 60 * requests_per_sec times the work of
our approach, which means that above 1 request per second the current
approach produces more CPU load.)

There are deployments in production (Bluehost) that are already bigger than
10k nodes; AFAIK the last numbers I heard were 16k nodes, and they didn't
use our scheduler at all. So a better upper limit would be something like
30k nodes.  At that scale we get 500 RPC broadcasts per second (assuming a
60-second periodic update) from periodic updates, plus updates from state
changes.  If we assume only 1% of compute nodes have instances changing
state at any given second, that is an additional 300 RPC broadcasts to the
schedulers per second.  So now we have 800 per second.  How many RPC
updates (from compute node to scheduler) per second can a single Python
thread handle without DB access? With DB access?

As for your second point, I don't follow; can you elaborate?






>> Also OpenStack is already deeply invested in using the central DB model for 
>> the state of the 'world' and while I am not against changing that, I think 
>> we should evaluate that switch in a larger context.

Step by step. As first step

Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-22 Thread Joe Gordon
On Mon, Jul 22, 2013 at 5:16 AM, Boris Pavlovic  wrote:

> Joe,
>
> >> Speaking of Chris Behrens: "Relying on anything but the DB for current
> memory free, etc, is just too laggy… so we need to stick with it, IMO."
> http://lists.openstack.org/pipermail/openstack-dev/2013-June/010485.html
>
> It doesn't scale, uses tons of resources, works slowly, and is hard to
> extend.
> Also, the mechanism of getting free and used memory is done by the virt
> layer.
> And the only thing that could be laggy is RPC (but it is also used by the
> compute node update)
>

You say it doesn't scale and uses tons of resources; can you show how to
reproduce your findings?  Also, just because the current implementation of the
scheduler is non-optimal doesn't mean that no-DB is the only solution; I am
interested in seeing other possible solutions before going down such a
drastically different road (no-DB).  For example: pushing more of the logic
into the DB instead of searching through all compute nodes in Python space,
or looking at removing the periodic updates altogether, or ???.


>
>
> >> * How do you bring a new scheduler up in an existing deployment and
> make it get the full state of the system?
>
> You should wait for one periodic task interval, and you will get full
> information about all compute nodes.
>

Sure, that may work; we need to add logic to handle this.


> >> *  Broadcasting RPC updates from compute nodes to the scheduler means
> every scheduler has to process  the same RPC message.  And if a deployment
> hits the point where the number of compute updates is consuming 99 percent
> of the scheduler's time just adding another scheduler won't fix anything as
> it will get bombarded too.
>
>
> If we are speaking about numbers, you can see our doc, where they are
> counted.
> If we have 10k nodes it will make only 150 RPC calls/sec (which is
> nothing for the CPU). By the way, we will also remove those 150 calls/s
> from the conductor. One more thing: in a 10k-node deployment I think we
> will spend almost all our time waiting on the DB (compute_node_get_all()).
> And when we call this method we have to process all the data gathered
> over 60 sec. (So in numbers, the scheduler side does 60 * requests_per_sec
> times the work of our approach, which means that above 1 request per
> second the current approach produces more CPU load.)
>

There are deployments in production (Bluehost) that are already bigger than
10k nodes; AFAIK the last numbers I heard were 16k nodes, and they didn't
use our scheduler at all. So a better upper limit would be something like
30k nodes.  At that scale we get 500 RPC broadcasts per second (assuming a
60-second periodic update) from periodic updates, plus updates from state
changes.  If we assume only 1% of compute nodes have instances changing
state at any given second, that is an additional 300 RPC broadcasts to the
schedulers per second.  So now we have 800 per second.  How many RPC
updates (from compute node to scheduler) per second can a single Python
thread handle without DB access? With DB access?
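To make the arithmetic explicit, a quick back-of-envelope sketch of those
figures (assuming the 60-second interval and the 1%-per-second state-change
rate stated above):

nodes = 30000
periodic_rate = nodes / 60        # 500 broadcasts/sec from periodic updates
state_change_rate = nodes * 0.01  # 300 broadcasts/sec from state changes
total = periodic_rate + state_change_rate  # 800 broadcasts/sec, seen by
                                           # every scheduler individually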

As for your second point, I don't follow; can you elaborate?





>
>
> >> Also OpenStack is already deeply invested in using the central DB model
> for the state of the 'world' and while I am not against changing that, I
> think we should evaluate that switch in a larger context.
>
> Step by step. As a first step we could just remove the compute_node_get_all
> method, which will make our OpenStack much more scalable and fast.
>

Yes, step by step is how to fix something.  But before going in this
direction it is worth having a larger discussion of how we *want* things to
look and what direction we should be moving in.  If we want to use this
model, we should consider where else it can help, other repercussions, etc.


>
> By the way, see the answers to your comments in the doc one more time.
>
> Best regards,
> Boris Pavlovic
>
> Mirantis Inc.
>
>
>
>
>
> On Sat, Jul 20, 2013 at 3:14 AM, Joe Gordon  wrote:
>
>>
>>
>>
>> On Fri, Jul 19, 2013 at 3:13 PM, Sandy Walsh 
>> wrote:
>>
>>>
>>>
>>> On 07/19/2013 05:36 PM, Boris Pavlovic wrote:
>>> > Sandy,
>>> >
>>> > I don't think that we have such problems here.
>>> > Because scheduler doesn't pool compute_nodes.
>>> > The situation is another compute_nodes notify scheduler about their
>>> > state. (instead of updating their state in DB)
>>> >
>>> > So for example if scheduler send request to compute_node, compute_node
>>> > is able to run rpc call to schedulers immediately (not after 60sec).
>>> >
>>> > So there is almost no races.
>>>
>>> There are races that occur between the eventlet request threads. This is
>>> why the scheduler has been switched to single threaded and we can only
>>> run one scheduler.
>>>
>>> This problem may have been eliminated with the work that Chris Behrens
>>> and Brian Elliott were doing, but I'm not sure.
>>>
>>
>>
>> Speaking of Chris Behrens: "Relying on anything but the DB for current
>> memory free, etc, is just too laggy… so we need to stick with it, IMO."
>> http://lists.openstack

Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-22 Thread Russell Bryant
On 07/22/2013 12:51 PM, John Garbutt wrote:
> On 22 July 2013 13:23, Boris Pavlovic  wrote:
>> I see only one race condition (in the current solution we have the same
>> situation): between the request to the compute node and the data being
>> updated in the DB, we could use the wrong state of the compute node.
>> By the way, it is fixed by retry.
> 
> This race turns out to be a big deal when there are bursts of VM.start 
> requests.
> 
> I am currently thinking about ways we can look to eliminate this one.
> Hoping to have a design summit session on that.

Cool.  In addition to retries, it's somewhat mitigated by using the
scheduler_host_subset_size option to reduce the chance that multiple
schedulers choose the same host.

# New instances will be scheduled on a host chosen randomly
# from a subset of the N best hosts. This property defines the
# subset size that a host is chosen from. A value of 1 chooses
# the first host returned by the weighing functions. This
# value must be at least 1. Any value less than 1 will be
# ignored, and 1 will be used instead (integer value)
#scheduler_host_subset_size=1
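A minimal sketch of what that option does conceptually (illustrative only,
not Nova's actual code):

import random

def pick_host(weighed_hosts, subset_size=1):
    # weighed_hosts: hosts sorted best-first by the weighing functions.
    # Choosing randomly among the top N makes concurrent schedulers less
    # likely to all pick (and race on) the same "best" host.
    subset = weighed_hosts[:max(1, subset_size)]
    return random.choice(subset)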

-- 
Russell Bryant

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-22 Thread John Garbutt
On 22 July 2013 13:23, Boris Pavlovic  wrote:
> I see only one race condition (in the current solution we have the same
> situation): between the request to the compute node and the data being
> updated in the DB, we could use the wrong state of the compute node.
> By the way, it is fixed by retry.

This race turns out to be a big deal when there are bursts of VM.start requests.

I am currently thinking about ways we can look to eliminate this one.
Hoping to have a design summit session on that.

John

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-22 Thread Jiang, Yunhong
The laggy thing is that currently the resource tracker updates the usage
information whenever a resource changes, not only in periodic tasks. If you
really want current results with periodic updates, you have to do some
in-memory management, and you even need to sync between the different
scheduler controllers, as stated in
http://lists.openstack.org/pipermail/openstack-dev/2013-June/010490.html .

-jyh

From: Boris Pavlovic [mailto:bo...@pavlovic.me]
Sent: Monday, July 22, 2013 5:17 AM
To: OpenStack Development Mailing List
Subject: Re: [openstack-dev] A simple way to improve nova scheduler

Joe,

>> Speaking of Chris Behrens: "Relying on anything but the DB for current
>> memory free, etc, is just too laggy... so we need to stick with it, IMO."
>> http://lists.openstack.org/pipermail/openstack-dev/2013-June/010485.html

It doesn't scale, uses tons of resources, works slowly, and is hard to extend.
Also, the mechanism of getting free and used memory is done by the virt layer.
And the only thing that could be laggy is RPC (but it is also used by the
compute node update).


>> * How do you bring a new scheduler up in an existing deployment and make it 
>> get the full state of the system?

You should wait for one periodic task interval, and you will get full information
about all compute nodes.

>> *  Broadcasting RPC updates from compute nodes to the scheduler means every 
>> scheduler has to process  the same RPC message.  And if a deployment hits 
>> the point where the number of compute updates is consuming 99 percent of the 
>> scheduler's time just adding another scheduler won't fix anything as it will 
>> get bombarded too.


If we are speaking about numbers, you can see our doc, where they are
counted.
If we have 10k nodes it will make only 150 RPC calls/sec (which is nothing
for the CPU). By the way, we will also remove those 150 calls/s from the
conductor. One more thing: in a 10k-node deployment I think we will spend
almost all our time waiting on the DB (compute_node_get_all()). And when we
call this method we have to process all the data gathered over 60 sec. (So
in numbers, the scheduler side does 60 * requests_per_sec times the work of
our approach, which means that above 1 request per second the current
approach produces more CPU load.)


>> Also OpenStack is already deeply invested in using the central DB model for 
>> the state of the 'world' and while I am not against changing that, I think 
>> we should evaluate that switch in a larger context.

Step by step. As a first step we could just remove the compute_node_get_all
method, which will make our OpenStack much more scalable and fast.


By the way, see the answers to your comments in the doc one more time.

Best regards,
Boris Pavlovic

Mirantis Inc.




On Sat, Jul 20, 2013 at 3:14 AM, Joe Gordon <joe.gord...@gmail.com> wrote:


On Fri, Jul 19, 2013 at 3:13 PM, Sandy Walsh <sandy.wa...@rackspace.com>
wrote:


On 07/19/2013 05:36 PM, Boris Pavlovic wrote:
> Sandy,
>
> I don't think that we have such problems here.
> Because the scheduler doesn't poll compute_nodes.
> The situation is another compute_nodes notify scheduler about their
> state. (instead of updating their state in DB)
>
> So for example if scheduler send request to compute_node, compute_node
> is able to run rpc call to schedulers immediately (not after 60sec).
>
> So there is almost no races.
There are races that occur between the eventlet request threads. This is
why the scheduler has been switched to single threaded and we can only
run one scheduler.

This problem may have been eliminated with the work that Chris Behrens
and Brian Elliott were doing, but I'm not sure.


Speaking of Chris Behrens: "Relying on anything but the DB for current memory
free, etc, is just too laggy... so we need to stick with it, IMO."
http://lists.openstack.org/pipermail/openstack-dev/2013-June/010485.html

Although there is some elegance to the proposal here, I have some concerns.

If just using RPC broadcasts from compute to schedulers to keep track of 
things, we get two issues:

* How do you bring a new scheduler up in an existing deployment and make it get 
the full state of the system?
* Broadcasting RPC updates from compute nodes to the scheduler means every 
scheduler has to process  the same RPC message.  And if a deployment hits the 
point where the number of compute updates is consuming 99 percent of the 
scheduler's time just adding another scheduler won't fix anything as it will 
get bombarded too.

Also OpenStack is already deeply invested in using the central DB model for the 
state of the 'world' and while I am not against changing that, I think we 
should evaluate that switch in a larger context.



But certainly, the old approach of having the compute node broadcast
status every N se

Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-22 Thread Russell Bryant
On 07/22/2013 10:43 AM, Boris Pavlovic wrote:
> Russell,
> 
> To get information about "all" compute nodes we should wait one periodic
> task interval (60 seconds by default).
> So starting will take a while.
> 
> But I don't think that this is a big problem:
> 1) we are already able to wait each time for heavy and long (> a few
> seconds) DB queries;
> 2) if we have more than one scheduler, we are always able to turn them
> off and replace them one by one.
> (I don't think that running old and new schedulers side by side for 5
> minutes will break anything.)
> 
> Also, as a first step that could be done to speed up the scheduler:
> we could just remove db.compute_node_get_all() and send RPC calls
> directly to the schedulers.
> I think the patch set that changes this will be pretty small
> (~100-150 lines of code) and doesn't require big changes in the current
> scheduler implementation.

In any case, I think it's too late in the Havana cycle to be introducing
a new effort like this.  It will have to wait for Icehouse.  We should
plan to have a design summit session on it, as well.

-- 
Russell Bryant

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-22 Thread Boris Pavlovic
Russell,

To get information about "all" compute nodes we should wait one periodic
task interval (60 seconds by default).
So starting will take a while.

But I don't think that this is a big problem:
1) we are already able to wait each time for heavy and long (> a few
seconds) DB queries;
2) if we have more than one scheduler, we are always able to turn them off
and replace them one by one.
(I don't think that running old and new schedulers side by side for 5
minutes will break anything.)

Also, as a first step that could be done to speed up the scheduler:
we could just remove db.compute_node_get_all() and send RPC calls directly
to the schedulers.
I think the patch set that changes this will be pretty small
(~100-150 lines of code) and doesn't require big changes in the current
scheduler implementation.
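A hypothetical sketch of that first step (all names are illustrative, not
actual Nova or oslo.messaging APIs): compute nodes fan their state out
straight to every scheduler, which keeps an in-memory view instead of
calling compute_node_get_all().

class ComputeStateReporter(object):
    """Compute-node side: push state to the schedulers, not the DB."""
    def __init__(self, rpc_client, hostname):
        self.rpc = rpc_client   # assumed to expose a fanout cast
        self.host = hostname

    def periodic_update(self, resources):
        # Every scheduler listening on the fanout exchange receives this.
        self.rpc.fanout_cast('scheduler', 'update_host_state',
                             host=self.host, resources=resources)

class SchedulerHostStateCache(object):
    """Scheduler side: in-memory replacement for compute_node_get_all()."""
    def __init__(self):
        self.host_states = {}

    def update_host_state(self, host, resources):
        self.host_states[host] = resources

    def all_hosts(self):
        return list(self.host_states.values())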


Best regards,
Boris Pavlovic

Mirantis Inc.



On Mon, Jul 22, 2013 at 5:50 PM, Russell Bryant  wrote:

> On 07/22/2013 08:16 AM, Boris Pavlovic wrote:
> >>> * How do you bring a new scheduler up in an existing deployment and
> make it get the full state of the system?
> >
> > You should wait for one periodic task interval, and you will get full
> > information about all compute nodes.
>
> This also affects upgrading a scheduler.  Also consider a continuous
> deployment setup.  Every time you update a scheduler, it's not usable
> for (periodic task interval) seconds/minutes?
>
> --
> Russell Bryant
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-22 Thread Russell Bryant
On 07/22/2013 08:16 AM, Boris Pavlovic wrote:
>>> * How do you bring a new scheduler up in an existing deployment and make it 
>>> get the full state of the system?
> 
> You should wait for one periodic task interval, and you will get full
> information about all compute nodes.

This also affects upgrading a scheduler.  Also consider a continuous
deployment setup.  Every time you update a scheduler, it's not usable
for (periodic task interval) seconds/minutes?

-- 
Russell Bryant

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-22 Thread Boris Pavlovic
Sandy,

I see only one race condition (in the current solution we have the same
situation): between the request to the compute node and the data being
updated in the DB, we could use the wrong state of the compute node.
By the way, it is fixed by retry.

I don't see any new races produced by the new approach without the DB.
Could you point to a line or method that will produce races?
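For what it's worth, a minimal sketch of the retry that masks this race
(hypothetical names; the real Nova flow goes through resource claims and
the reschedule logic):

class ResourcesExhausted(Exception):
    pass

class NoValidHost(Exception):
    pass

def boot_with_retry(scheduler, compute_api, request, max_retries=3):
    # The scheduler may pick a host from slightly stale state; the compute
    # node re-checks on claim, and a failed claim triggers a reschedule.
    for _ in range(max_retries):
        host = scheduler.select_host(request)
        try:
            return compute_api.claim_and_build(host, request)
        except ResourcesExhausted:
            continue  # another request won the race for this host
    raise NoValidHost()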

Best regards,
Boris Pavlovic





On Sat, Jul 20, 2013 at 2:13 AM, Sandy Walsh wrote:

>
>
> On 07/19/2013 05:36 PM, Boris Pavlovic wrote:
> > Sandy,
> >
> > I don't think that we have such problems here.
> > Because the scheduler doesn't poll compute_nodes.
> > The situation is another compute_nodes notify scheduler about their
> > state. (instead of updating their state in DB)
> >
> > So for example if scheduler send request to compute_node, compute_node
> > is able to run rpc call to schedulers immediately (not after 60sec).
> >
> > So there is almost no races.
>
> There are races that occur between the eventlet request threads. This is
> why the scheduler has been switched to single threaded and we can only
> run one scheduler.
>
> This problem may have been eliminated with the work that Chris Behrens
> and Brian Elliott were doing, but I'm not sure.
>
> But certainly, the old approach of having the compute node broadcast
> status every N seconds is not suitable and was eliminated a long time ago.
>
> >
> >
> > Best regards,
> > Boris Pavlovic
> >
> > Mirantis Inc.
> >
> >
> >
> > On Sat, Jul 20, 2013 at 12:23 AM, Sandy Walsh wrote:
> >
> >
> >
> > On 07/19/2013 05:01 PM, Boris Pavlovic wrote:
> > > Sandy,
> > >
> > > Hm, I don't know that algorithm. But our approach doesn't have
> > > exponential exchange.
> > > I don't think that in a 10k-node cloud we will have a problem with
> > > 150 RPC calls/sec. Even at 100k we will have only 1.5k RPC calls/sec.
> > > Moreover, compute nodes currently update their state in the DB through
> > > the conductor, which produces the same count of RPC calls.
> > >
> > > So I don't see any explosion here.
> >
> > Sorry, I was commenting on Soren's suggestion from way back
> (essentially
> > listening on a separate exchange for each unique flavor ... so no
> > scheduler was needed at all). It was a great idea, but fell apart
> rather
> > quickly.
> >
> > The existing approach the scheduler takes is expensive (asking the db
> > for state of all hosts) and polling the compute nodes might be
> do-able,
> > but you're still going to have latency problems waiting for the
> > responses (the states are invalid nearly immediately, especially if a
> > fill-first scheduling algorithm is used). We ran into this problem
> > before in an earlier scheduler implementation. The round-tripping
> kills.
> >
> > We have a lot of really great information on Host state in the form
> of
> > notifications right now. I think having a service (or notification
> > driver) listening for these and keeping the HostState
> > incrementally
> > updated (and reported back to all of the schedulers via the fanout
> > queue) would be a better approach.
> >
> > -S
> >
> >
> > >
> > > Best regards,
> > > Boris Pavlovic
> > >
> > > Mirantis Inc.
> > >
> > >
> > > On Fri, Jul 19, 2013 at 11:47 PM, Sandy Walsh
> > > <sandy.wa...@rackspace.com> wrote:
> > >
> > >
> > >
> > > On 07/19/2013 04:25 PM, Brian Schott wrote:
> > > > I think Soren suggested this way back in Cactus to use MQ
> > for compute
> > > > node state rather than database and it was a good idea then.
> > >
> > > The problem with that approach was the number of queues went
> > exponential
> > > as soon as you went beyond simple flavors. Add Capabilities or
> > other
> > > criteria and you get an explosion of exchanges to listen to.
> > >
> > >
> > >
> > > > On Jul 19, 2013, at 10:52 AM, Boris Pavlovic
> > > > <bo...@pavlovic.me> wrote:
> > > >> Hi all,
> > > >>
> > > >>
> > > >> In Mirantis Alexey Ovtchinnikov and me are working on nova
> > scheduler
> > > >> improvements.
> > > >>
> > > >> As far as we can see the problem, now scheduler has two
> > major issues:
> > > >>
> > > >> 1) Scalability. Factors that contribute to bad scalability
> > are these:
> > > >> *) Each compute node every periodic task interval (60 sec
> > by default)
> > > >> updates resources state in DB.
> > > >> *) On every boot request scheduler has to fetch 

Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-22 Thread Boris Pavlovic
Joe,

>> Speaking of Chris Behrens: "Relying on anything but the DB for current
memory free, etc, is just too laggy… so we need to stick with it, IMO."
http://lists.openstack.org/pipermail/openstack-dev/2013-June/010485.html

It doesn't scale, uses tons of resources, works slowly, and is hard to extend.
Also, the mechanism of getting free and used memory is done by the virt layer.
And the only thing that could be laggy is RPC (but it is also used by the
compute node update).


>> * How do you bring a new scheduler up in an existing deployment and make
it get the full state of the system?

You should wait for one periodic task interval, and you will get full
information about all compute nodes.

>> *  Broadcasting RPC updates from compute nodes to the scheduler means
every scheduler has to process  the same RPC message.  And if a deployment
hits the point where the number of compute updates is consuming 99 percent
of the scheduler's time just adding another scheduler won't fix anything as
it will get bombarded too.


If we are speaking about numbers, you can see our doc, where they are
counted.
If we have 10k nodes it will make only 150 RPC calls/sec (which is nothing
for the CPU). By the way, we will also remove those 150 calls/s from the
conductor. One more thing: in a 10k-node deployment I think we will spend
almost all our time waiting on the DB (compute_node_get_all()). And when we
call this method we have to process all the data gathered over 60 sec. (So
in numbers, the scheduler side does 60 * requests_per_sec times the work of
our approach, which means that above 1 request per second the current
approach produces more CPU load.)


>> Also OpenStack is already deeply invested in using the central DB model
for the state of the 'world' and while I am not against changing that, I
think we should evaluate that switch in a larger context.

Step by step. As a first step we could just remove the compute_node_get_all
method, which will make our OpenStack much more scalable and fast.


By the way, see the answers to your comments in the doc one more time.

Best regards,
Boris Pavlovic

Mirantis Inc.





On Sat, Jul 20, 2013 at 3:14 AM, Joe Gordon  wrote:

>
>
>
> On Fri, Jul 19, 2013 at 3:13 PM, Sandy Walsh wrote:
>
>>
>>
>> On 07/19/2013 05:36 PM, Boris Pavlovic wrote:
>> > Sandy,
>> >
>> > I don't think that we have such problems here.
>> > Because the scheduler doesn't poll compute_nodes.
>> > The situation is another compute_nodes notify scheduler about their
>> > state. (instead of updating their state in DB)
>> >
>> > So for example if scheduler send request to compute_node, compute_node
>> > is able to run rpc call to schedulers immediately (not after 60sec).
>> >
>> > So there is almost no races.
>>
>> There are races that occur between the eventlet request threads. This is
>> why the scheduler has been switched to single threaded and we can only
>> run one scheduler.
>>
>> This problem may have been eliminated with the work that Chris Behrens
>> and Brian Elliott were doing, but I'm not sure.
>>
>
>
> Speaking of Chris Behrens: "Relying on anything but the DB for current
> memory free, etc, is just too laggy… so we need to stick with it, IMO."
> http://lists.openstack.org/pipermail/openstack-dev/2013-June/010485.html
>
> Although there is some elegance to the proposal here, I have some concerns.
>
> If just using RPC broadcasts from compute to schedulers to keep track of
> things, we get two issues:
>
> * How do you bring a new scheduler up in an existing deployment and make
> it get the full state of the system?
> * Broadcasting RPC updates from compute nodes to the scheduler means every
> scheduler has to process  the same RPC message.  And if a deployment hits
> the point where the number of compute updates is consuming 99 percent of
> the scheduler's time just adding another scheduler won't fix anything as it
> will get bombarded too.
>
> Also OpenStack is already deeply invested in using the central DB model
> for the state of the 'world' and while I am not against changing that, I
> think we should evaluate that switch in a larger context.
>
>
>
>>
>> But certainly, the old approach of having the compute node broadcast
>> status every N seconds is not suitable and was eliminated a long time ago.
>>
>> >
>> >
>> > Best regards,
>> > Boris Pavlovic
>> >
>> > Mirantis Inc.
>> >
>> >
>> >
>> > On Sat, Jul 20, 2013 at 12:23 AM, Sandy Walsh
>> > <sandy.wa...@rackspace.com> wrote:
>> >
>> >
>> >
>> > On 07/19/2013 05:01 PM, Boris Pavlovic wrote:
>> > > Sandy,
>> > >
>> > > Hm I don't know that algorithm. But our approach doesn't have
>> > > exponential exchange.
>> > > I don't think that in 10k nodes cloud we will have a problems
>> with 150
>> > > RPC call/sec. Even in 100k we will have only 1.5k RPC call/sec.
>> > > More then (compute nodes update their state in DB through
>> conductor
>> > > which produce the same count of RPC calls).
>> > >
>> > > So I don't see

Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-19 Thread Joe Gordon
On Fri, Jul 19, 2013 at 3:13 PM, Sandy Walsh wrote:

>
>
> On 07/19/2013 05:36 PM, Boris Pavlovic wrote:
> > Sandy,
> >
> > I don't think that we have such problems here.
> > Because the scheduler doesn't poll compute_nodes.
> > The situation is another compute_nodes notify scheduler about their
> > state. (instead of updating their state in DB)
> >
> > So for example if scheduler send request to compute_node, compute_node
> > is able to run rpc call to schedulers immediately (not after 60sec).
> >
> > So there is almost no races.
>
> There are races that occur between the eventlet request threads. This is
> why the scheduler has been switched to single threaded and we can only
> run one scheduler.
>
> This problem may have been eliminated with the work that Chris Behrens
> and Brian Elliott were doing, but I'm not sure.
>


Speaking of Chris Behrens: "Relying on anything but the DB for current
memory free, etc, is just too laggy… so we need to stick with it, IMO."
http://lists.openstack.org/pipermail/openstack-dev/2013-June/010485.html

Although there is some elegance to the proposal here, I have some concerns.

If just using RPC broadcasts from compute to schedulers to keep track of
things, we get two issues:

* How do you bring a new scheduler up in an existing deployment and make it
get the full state of the system?
* Broadcasting RPC updates from compute nodes to the scheduler means every
scheduler has to process  the same RPC message.  And if a deployment hits
the point where the number of compute updates is consuming 99 percent of
the scheduler's time just adding another scheduler won't fix anything as it
will get bombarded too.

Also OpenStack is already deeply invested in using the central DB model for
the state of the 'world' and while I am not against changing that, I think
we should evaluate that switch in a larger context.



>
> But certainly, the old approach of having the compute node broadcast
> status every N seconds is not suitable and was eliminated a long time ago.
>
> >
> >
> > Best regards,
> > Boris Pavlovic
> >
> > Mirantis Inc.
> >
> >
> >
> > On Sat, Jul 20, 2013 at 12:23 AM, Sandy Walsh wrote:
> >
> >
> >
> > On 07/19/2013 05:01 PM, Boris Pavlovic wrote:
> > > Sandy,
> > >
> > > Hm, I don't know that algorithm. But our approach doesn't have
> > > exponential exchange.
> > > I don't think that in a 10k-node cloud we will have a problem with
> > > 150 RPC calls/sec. Even at 100k we will have only 1.5k RPC calls/sec.
> > > Moreover, compute nodes currently update their state in the DB through
> > > the conductor, which produces the same count of RPC calls.
> > >
> > > So I don't see any explosion here.
> >
> > Sorry, I was commenting on Soren's suggestion from way back
> (essentially
> > listening on a separate exchange for each unique flavor ... so no
> > scheduler was needed at all). It was a great idea, but fell apart
> rather
> > quickly.
> >
> > The existing approach the scheduler takes is expensive (asking the db
> > for state of all hosts) and polling the compute nodes might be
> do-able,
> > but you're still going to have latency problems waiting for the
> > responses (the states are invalid nearly immediately, especially if a
> > fill-first scheduling algorithm is used). We ran into this problem
> > before in an earlier scheduler implementation. The round-tripping
> kills.
> >
> > We have a lot of really great information on Host state in the form
> of
> > notifications right now. I think having a service (or notification
> > driver) listening for these and keeping the HostState
> > incrementally
> > updated (and reported back to all of the schedulers via the fanout
> > queue) would be a better approach.
> >
> > -S
> >
> >
> > >
> > > Best regards,
> > > Boris Pavlovic
> > >
> > > Mirantis Inc.
> > >
> > >
> > > On Fri, Jul 19, 2013 at 11:47 PM, Sandy Walsh
> > > <sandy.wa...@rackspace.com> wrote:
> > >
> > >
> > >
> > > On 07/19/2013 04:25 PM, Brian Schott wrote:
> > > > I think Soren suggested this way back in Cactus to use MQ
> > for compute
> > > > node state rather than database and it was a good idea then.
> > >
> > > The problem with that approach was the number of queues went
> > exponential
> > > as soon as you went beyond simple flavors. Add Capabilities or
> > other
> > > criteria and you get an explosion of exchanges to listen to.
> > >
> > >
> > >
> > > > On Jul 19, 2013, at 10:52 AM, Boris Pavlovic
> > > > <bo...@pavlovic.me> wrote:

Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-19 Thread Sandy Walsh


On 07/19/2013 05:36 PM, Boris Pavlovic wrote:
> Sandy,
> 
> I don't think that we have such problems here.
> Because the scheduler doesn't poll compute_nodes.
> The situation is another compute_nodes notify scheduler about their
> state. (instead of updating their state in DB)
> 
> So for example if scheduler send request to compute_node, compute_node
> is able to run rpc call to schedulers immediately (not after 60sec).
> 
> So there is almost no races.

There are races that occur between the eventlet request threads. This is
why the scheduler has been switched to single threaded and we can only
run one scheduler.

This problem may have been eliminated with the work that Chris Behrens
and Brian Elliott were doing, but I'm not sure.

But certainly, the old approach of having the compute node broadcast
status every N seconds is not suitable and was eliminated a long time ago.

> 
> 
> Best regards,
> Boris Pavlovic
> 
> Mirantis Inc. 
> 
> 
> 
> On Sat, Jul 20, 2013 at 12:23 AM, Sandy Walsh wrote:
> 
> 
> 
> On 07/19/2013 05:01 PM, Boris Pavlovic wrote:
> > Sandy,
> >
> > Hm, I don't know that algorithm. But our approach doesn't have
> > exponential exchange.
> > I don't think that in a 10k-node cloud we will have a problem with 150
> > RPC calls/sec. Even at 100k we will have only 1.5k RPC calls/sec.
> > Moreover, compute nodes currently update their state in the DB through
> > the conductor, which produces the same count of RPC calls.
> >
> > So I don't see any explosion here.
> 
> Sorry, I was commenting on Soren's suggestion from way back (essentially
> listening on a separate exchange for each unique flavor ... so no
> scheduler was needed at all). It was a great idea, but fell apart rather
> quickly.
> 
> The existing approach the scheduler takes is expensive (asking the db
> for state of all hosts) and polling the compute nodes might be do-able,
> but you're still going to have latency problems waiting for the
> responses (the states are invalid nearly immediately, especially if a
> fill-first scheduling algorithm is used). We ran into this problem
> before in an earlier scheduler implementation. The round-tripping kills.
> 
> We have a lot of really great information on Host state in the form of
> notifications right now. I think having a service (or notification
> driver) listening for these and keeping the HostState incrementally
> updated (and reported back to all of the schedulers via the fanout
> queue) would be a better approach.
> 
> -S
> 
> 
> >
> > Best regards,
> > Boris Pavlovic
> >
> > Mirantis Inc.
> >
> >
> > On Fri, Jul 19, 2013 at 11:47 PM, Sandy Walsh
> > <sandy.wa...@rackspace.com> wrote:
> >
> >
> >
> > On 07/19/2013 04:25 PM, Brian Schott wrote:
> > > I think Soren suggested this way back in Cactus to use MQ
> for compute
> > > node state rather than database and it was a good idea then.
> >
> > The problem with that approach was the number of queues went
> exponential
> > as soon as you went beyond simple flavors. Add Capabilities or
> other
> > criteria and you get an explosion of exchanges to listen to.
> >
> >
> >
> > > On Jul 19, 2013, at 10:52 AM, Boris Pavlovic
> > > <bo...@pavlovic.me> wrote:
> > >> Hi all,
> > >>
> > >>
> > >> In Mirantis Alexey Ovtchinnikov and me are working on nova
> scheduler
> > >> improvements.
> > >>
> > >> As far as we can see, the scheduler now has two major issues:
> > >>
> > >> 1) Scalability. Factors that contribute to bad scalability
> are these:
> > >> *) Each compute node every periodic task interval (60 sec
> by default)
> > >> updates resources state in DB.
> > >> *) On every boot request scheduler has to fetch information
> about all
> > >> compute nodes from DB.
> > >>
> > >> 2) Flexibility. Flexibility perishes due to problems with:
> > >> *) Adding new complex resources (such as big lists of complex
> > objects
> > >> e.g. required by PCI Passthrough
> > >>
> >
> https://review.openstack.org/#/c/34644/5/nova/db/sqlalchemy/models.py)
> > >> *) Using different sources of data in Scheduler for example
> from
> > >> cinder or ceilometer.
> > >> (as required by Volume Affinity Filter
> > >> https://review.openstack.org/#/c/29343/)
> > >>
> > >>

Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-19 Thread Sandy Walsh


On 07/19/2013 04:25 PM, Brian Schott wrote:
> I think Soren suggested this way back in Cactus to use MQ for compute
> node state rather than database and it was a good idea then. 

The problem with that approach was the number of queues went exponential
as soon as you went beyond simple flavors. Add Capabilities or other
criteria and you get an explosion of exchanges to listen to.



> On Jul 19, 2013, at 10:52 AM, Boris Pavlovic wrote:
> 
>> Hi all, 
>>
>>
>> In Mirantis Alexey Ovtchinnikov and me are working on nova scheduler
>> improvements.
>>
>> As far as we can see, the scheduler now has two major issues:
>>
>> 1) Scalability. Factors that contribute to bad scalability are these:
>> *) Each compute node every periodic task interval (60 sec by default)
>> updates resources state in DB.
>> *) On every boot request scheduler has to fetch information about all
>> compute nodes from DB.
>>
>> 2) Flexibility. Flexibility perishes due to problems with:
>> *) Adding new complex resources (such as big lists of complex objects
>> e.g. required by PCI Passthrough
>> https://review.openstack.org/#/c/34644/5/nova/db/sqlalchemy/models.py)
>> *) Using different sources of data in Scheduler for example from
>> cinder or ceilometer.
>> (as required by Volume Affinity Filter
>> https://review.openstack.org/#/c/29343/)
>>
>>
>> We found a simple way to mitigate these issues by avoiding DB usage
>> for host state storage.
>>
>> A more detailed discussion of the problem and one possible
>> solution can be found here:
>>
>> https://docs.google.com/document/d/1_DRv7it_mwalEZzLy5WO92TJcummpmWL4NWsWf0UWiQ/edit#
>>
>>
>> Best regards,
>> Boris Pavlovic
>>
>> Mirantis Inc.
>>
>> ___
>> OpenStack-dev mailing list
>> OpenStack-dev@lists.openstack.org
>> 
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 
> 
> 
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-19 Thread Boris Pavlovic
Sandy,

Hm, I don't know that algorithm. But our approach doesn't have exponential
exchange.
I don't think that in a 10k-node cloud we will have a problem with 150 RPC
calls/sec. Even at 100k we will have only 1.5k RPC calls/sec.
Moreover, compute nodes currently update their state in the DB through the
conductor, which produces the same count of RPC calls.

So I don't see any explosion here.

Best regards,
Boris Pavlovic

Mirantis Inc.


On Fri, Jul 19, 2013 at 11:47 PM, Sandy Walsh wrote:

>
>
> On 07/19/2013 04:25 PM, Brian Schott wrote:
> > I think Soren suggested this way back in Cactus to use MQ for compute
> > node state rather than database and it was a good idea then.
>
> The problem with that approach was the number of queues went exponential
> as soon as you went beyond simple flavors. Add Capabilities or other
> criteria and you get an explosion of exchanges to listen to.
>
>
>
> > On Jul 19, 2013, at 10:52 AM, Boris Pavlovic wrote:
> >
> >> Hi all,
> >>
> >>
> >> In Mirantis Alexey Ovtchinnikov and me are working on nova scheduler
> >> improvements.
> >>
> >> As far as we can see, the scheduler now has two major issues:
> >>
> >> 1) Scalability. Factors that contribute to bad scalability are these:
> >> *) Each compute node every periodic task interval (60 sec by default)
> >> updates resources state in DB.
> >> *) On every boot request scheduler has to fetch information about all
> >> compute nodes from DB.
> >>
> >> 2) Flexibility. Flexibility perishes due to problems with:
> >> *) Adding new complex resources (such as big lists of complex objects
> >> e.g. required by PCI Passthrough
> >> https://review.openstack.org/#/c/34644/5/nova/db/sqlalchemy/models.py)
> >> *) Using different sources of data in Scheduler for example from
> >> cinder or ceilometer.
> >> (as required by Volume Affinity Filter
> >> https://review.openstack.org/#/c/29343/)
> >>
> >>
> >> We found a simple way to mitigate these issues by avoiding DB usage
> >> for host state storage.
> >>
> >> A more detailed discussion of the problem and one possible
> >> solution can be found here:
> >>
> >>
> https://docs.google.com/document/d/1_DRv7it_mwalEZzLy5WO92TJcummpmWL4NWsWf0UWiQ/edit#
> >>
> >>
> >> Best regards,
> >> Boris Pavlovic
> >>
> >> Mirantis Inc.
> >>
> >> ___
> >> OpenStack-dev mailing list
> >> OpenStack-dev@lists.openstack.org
> >> 
> >> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >
> >
> >
> > ___
> > OpenStack-dev mailing list
> > OpenStack-dev@lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-19 Thread Boris Pavlovic
Sandy,

I don't think that we have such problems here,
because the scheduler doesn't poll compute_nodes.
The situation is different: compute_nodes notify the scheduler about their
state (instead of updating their state in the DB).

So, for example, if the scheduler sends a request to a compute_node, the
compute_node is able to make an RPC call to the schedulers immediately (not
after 60 sec).

So there are almost no races.


Best regards,
Boris Pavlovic

Mirantis Inc.



On Sat, Jul 20, 2013 at 12:23 AM, Sandy Walsh wrote:

>
>
> On 07/19/2013 05:01 PM, Boris Pavlovic wrote:
> > [snip]
>
> Sorry, I was commenting on Soren's suggestion from way back (essentially
> listening on a separate exchange for each unique flavor ... so no
> scheduler was needed at all). It was a great idea, but fell apart rather
> quickly.
>
> The existing approach the scheduler takes is expensive (asking the db
> for state of all hosts), and polling the compute nodes might be doable,
> but you're still going to have latency problems waiting for the
> responses (the states are invalid nearly immediately, especially if a
> fill-first scheduling algorithm is used). We ran into this problem
> before in an earlier scheduler implementation. The round-tripping kills.
>
> We have a lot of really great information on host state in the form of
> notifications right now. I think having a service (or notification
> driver) listening for these and keeping the HostState incrementally
> updated (and reported back to all of the schedulers via the fanout
> queue) would be a better approach.
>
> -S

Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-19 Thread Sandy Walsh


On 07/19/2013 05:01 PM, Boris Pavlovic wrote:
> Sandy,
> 
> Hm, I don't know that algorithm. But our approach doesn't involve an
> exponential message exchange.
> I don't think a 10k-node cloud will have any problem with roughly 170
> RPC calls/sec, and even at 100k nodes we would see only about 1.7k RPC
> calls/sec.
> Moreover, compute nodes already update their state in the DB through
> the conductor, which produces the same number of RPC calls.
> 
> So I don't see any explosion here.

Sorry, I was commenting on Soren's suggestion from way back (essentially
listening on a separate exchange for each unique flavor ... so no
scheduler was needed at all). It was a great idea, but fell apart rather
quickly.

The existing approach the scheduler takes is expensive (asking the db
for state of all hosts), and polling the compute nodes might be doable,
but you're still going to have latency problems waiting for the
responses (the states are invalid nearly immediately, especially if a
fill-first scheduling algorithm is used). We ran into this problem
before in an earlier scheduler implementation. The round-tripping kills.
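
To make the fill-first point concrete, a toy example (hypothetical
numbers): two schedulers working from the same stale snapshot will
deterministically collide on the same host.

    # Fill-first picks the most-utilised host that still fits, so both
    # schedulers pick host-a, which only has room for one instance; the
    # second boot fails and has to be retried. Spread-first would send
    # both to host-c, which has room for both.
    snapshot = {"host-a": 2048, "host-b": 8192, "host-c": 16384}  # free MB
    flavor_ram = 2048

    def fill_first(free_by_host):
        fits = dict((h, f) for h, f in free_by_host.items()
                    if f >= flavor_ram)
        return min(fits, key=fits.get)  # most utilised host that fits

    print(fill_first(snapshot))  # host-a (scheduler 1, stale view)
    print(fill_first(snapshot))  # host-a (scheduler 2, same stale view)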

We have a lot of really great information on host state in the form of
notifications right now. I think having a service (or notification
driver) listening for these and keeping the HostState incrementally
updated (and reported back to all of the schedulers via the fanout
queue) would be a better approach.
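
A rough sketch of what such a listener could look like (the event names
and payload fields here are assumptions for illustration, not the exact
notification schema):

    class HostStateAggregator(object):
        def __init__(self, rpc):
            self.rpc = rpc      # hypothetical RPC/fanout connection
            self.hosts = {}     # host name -> latest known state dict

        def on_notification(self, event_type, payload):
            # Incremental update: apply only the delta carried by each
            # event instead of re-reading every host row from the DB.
            host = payload["host"]
            state = self.hosts.setdefault(host, {"free_ram_mb": 0})
            if event_type == "compute.instance.create.end":
                state["free_ram_mb"] -= payload["memory_mb"]
            elif event_type == "compute.instance.delete.end":
                state["free_ram_mb"] += payload["memory_mb"]
            elif event_type == "compute.node.update":
                # Authoritative periodic refresh from the node itself.
                state.update(payload)

        def publish(self):
            # Fan the (small) aggregate back out to all schedulers so
            # each one works from warm in-memory state.
            self.rpc.fanout_cast(topic="scheduler.host_state_sync",
                                 msg=self.hosts)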

-S



___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] A simple way to improve nova scheduler

2013-07-19 Thread Brian Schott
I think Soren suggested this back in Cactus: use MQ for compute-node
state rather than the database. It was a good idea then.

On Jul 19, 2013, at 10:52 AM, Boris Pavlovic wrote:

> Hi all, 
> 
> 
> At Mirantis, Alexey Ovtchinnikov and I are working on nova scheduler
> improvements.
> 
> As far as we can see, the scheduler currently has two major issues:
> 
> 1) Scalability. Factors that contribute to bad scalability are these:
>   *) Every periodic task interval (60 sec by default), each compute node 
> updates its resource state in the DB.
>   *) On every boot request, the scheduler has to fetch information about 
> all compute nodes from the DB.
> 
> 2) Flexibility. Flexibility suffers due to problems with:
>   *) Adding new complex resources (such as big lists of complex objects, 
> e.g. as required by PCI Passthrough 
> https://review.openstack.org/#/c/34644/5/nova/db/sqlalchemy/models.py)
>   *) Using different sources of data in the scheduler, for example from 
> cinder or ceilometer (as required by the Volume Affinity Filter 
> https://review.openstack.org/#/c/29343/)
> 
> 
> We found a simple way to mitigate these issues by avoiding DB usage for 
> host state storage.
> 
> A more detailed discussion of the problem and one possible solution can 
> be found here:
> 
> https://docs.google.com/document/d/1_DRv7it_mwalEZzLy5WO92TJcummpmWL4NWsWf0UWiQ/edit#
> 
> 
> Best regards,
> Boris Pavlovic
> 
> Mirantis Inc. 
> 

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev