Re: [openstack-dev] Scheduler proposal

2015-10-16 Thread Julien Danjou
On Fri, Oct 16 2015, Joshua Harlow wrote:

> Another idea is to use numpy and start representing filters as linear
> equations, then use something like
> https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.solve.html#numpy.linalg.solve
> to solve linear equations given some data.
>
> Another idea, turn each filter into a constraint equation (which it sorta is
> anyway) and use a known fast constraint solver on that data...
>
> Lots of ideas here that can be possible, likely endless :)

Already pasted it on Twitter, but just in case, OptaPlanner:

  http://community.redhat.com/blog/2014/11/smart-vm-scheduling-in-ovirt-clusters/

-- 
Julien Danjou
-- Free Software hacker
-- https://julien.danjou.info




Re: [openstack-dev] Scheduler proposal

2015-10-16 Thread Clint Byrum
Excerpts from Ed Leafe's message of 2015-10-15 11:56:24 -0700:
> Wow, I seem to have unleashed a bunch of pent-up frustration in the 
> community! It's great to see everyone coming forward with their ideas and 
> insights for improving the way Nova (and, by extension, all of OpenStack) can 
> potentially scale.
> 
> I do have a few comments on the discussion:
> 
> 1) This isn't a proposal to simply add some sort of DLM to Nova as a magic 
> cure-all. The concerns about Nova's ability to scale have to do a lot more 
> with the overall internal communication design.
> 

In this, we agree.

> 2) I really liked the comment about "made-up numbers". It's so true: we are 
> all impressed by such examples of speed that we sometimes forget whether 
> speeding up X will improve the overall process to any significant degree. The 
> purpose of my original email back in July, and the question I asked at the 
> Nova midcycle, is if we could get some numbers that would be a target to 
> shoot for with any of these experiments. Sure, I could come up with a test 
> that shows a zillion transactions per second, but if that doesn't result in a 
> cloud being able to schedule more efficiently, what's the point?
>

Speed is only 1 dimension. Efficiency and simplicity are two others that
I think are harder to quantify, but are also equally important in any
component of OpenStack.

> 3) I like the idea of something like ZooKeeper, but my concern is how to 
> efficiently query the data. If, for example, we had records for 100K compute 
> nodes, would it be possible to do the equivalent of "SELECT * FROM resources 
> WHERE resource_type = 'compute' AND free_ram_mb >= 2048 AND …" - well, you 
> get the idea. Are complex data queries possible in ZK? I haven't been able to 
> find that information anywhere.
>

You don't do complex queries, because you have all of the data in RAM,
in an efficient in-RAM format. Even if each record is 50KB, we can do
100,000 of them in 5GB. That's a drop in the bucket.

> 4) It is true that even in a very large deployment, it is possible to keep 
> all the relevant data needed for scheduling in memory. My concern is how to 
> efficiently search that data, much like in the ZK scenario.
> 

There are a bunch of ways to do this. My favorite is to have filter
plugins in the scheduler define what they need to index, and then
build a B-tree for each filter as each record arrives in the main data
structure. When scheduling requests come in, they simply walk through
each B-tree and turn that into a set. Then read each piece of the set
out of the main structure and sort based on whichever you want (less
full for load balancing, most full for efficient stacking).
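A rough sketch of that flow, just to illustrate the walk/intersect/sort
steps (a plain sorted list stands in for a real B-tree, and the field
names are made up, so treat it as a sketch rather than an implementation):

    import bisect
    from operator import itemgetter

    class RangeIndex(object):
        """Poor-man's B-tree: a sorted list of (value, host) pairs."""
        def __init__(self):
            self._items = []

        def insert(self, value, host):
            # A real index would also handle updates/removals per host.
            bisect.insort(self._items, (value, host))

        def at_least(self, minimum):
            """Set of hosts whose indexed value is >= minimum."""
            i = bisect.bisect_left(self._items, (minimum, ''))
            return set(host for _, host in self._items[i:])

    hosts = {}                # the main data structure: host -> full record
    ram_index = RangeIndex()  # one index per thing the filters care about
    disk_index = RangeIndex()

    def on_host_update(record):
        hosts[record['host']] = record
        ram_index.insert(record['free_ram_mb'], record['host'])
        disk_index.insert(record['free_disk_gb'], record['host'])

    def schedule(ram_mb, disk_gb, stack=True):
        # Walk each index and intersect the resulting sets...
        candidates = ram_index.at_least(ram_mb) & disk_index.at_least(disk_gb)
        # ...then sort: most full first for stacking, least full first for
        # load balancing.
        records = [hosts[h] for h in candidates]
        records.sort(key=itemgetter('free_ram_mb'), reverse=not stack)
        return records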

> 5) Concerns about Cassandra running with OpenJDK instead of the Oracle JVM 
> are troubling. I sent an email about this to one of the people I know at 
> DataStax, but so far have not received a response. And while it would be 
> great to have people contribute to OpenJDK to make it compatible, keep in 
> mind that that would be an ongoing commitment, not just a one-time effort.
> 

There are a few avenues to success with Cassandra but I don't think any
of them pass very close to OpenStack's current neighborhood.

> 6) I remember discussions back in the Austin-Bexar time frame about what 
> Thierry referred to as 'flavor-based schedulers', and they were immediately 
> discounted as not sophisticated enough to handle the sort of complex 
> scheduling requests that were expected. I'd be interested in finding out from 
> the big cloud providers what percentage of their requests would fall into 
> this simple structure, and what percent are more complicated than that. 
> Having hosts listening to queues that they know they can satisfy removes the 
> raciness from the process, although it would require some additional handling 
> for the situation where no host accepts the request. Still, it has the 
> advantage of being dead simple. Unfortunately, this would probably require a 
> bigger architectural change than integrating Cassandra into the Scheduler 
> would.
> 

No host accepting the request means your cloud is, more or less, full. If
you have flavors that aren't proper factors of smaller flavors, this
will indeed happen even when it isn't 100% utilized. If you have other
constraints that you allow your users to specify, then you are letting
them dictate how your hardware is utilized, which I think is a foolhardy
business decision. This is no different than any other manufacturing batch
size problem: sometimes parts of your process are under utilized, and
you have to make choices about rejecting certain workloads if they will
end up costing you more than you're willing to pay for the happy customer.

Note that the "efficient stacking" model I talked about can't really
work in the queue-based approach. If you want to fill up the most full
hosts before filling more, you need some awareness of what host is most
full and the compute nodes can't really 

Re: [openstack-dev] Scheduler proposal

2015-10-16 Thread Alec Hothan (ahothan)





On 10/15/15, 11:11 PM, "Clint Byrum"  wrote:

>Excerpts from Ed Leafe's message of 2015-10-15 11:56:24 -0700:
>> Wow, I seem to have unleashed a bunch of pent-up frustration in the 
>> community! It's great to see everyone coming forward with their ideas and 
>> insights for improving the way Nova (and, by extension, all of OpenStack) 
>> can potentially scale.
>> 
>> I do have a few comments on the discussion:
>> 
>> 1) This isn't a proposal to simply add some sort of DLM to Nova as a magic 
>> cure-all. The concerns about Nova's ability to scale have to do a lot more 
>> with the overall internal communication design.
>> 
>
>In this, we agree.
>
>> 2) I really liked the comment about "made-up numbers". It's so true: we are 
>> all impressed by such examples of speed that we sometimes forget whether 
>> speeding up X will improve the overall process to any significant degree. 
>> The purpose of my original email back in July, and the question I asked at 
>> the Nova midcycle, is if we could get some numbers that would be a target to 
>> shoot for with any of these experiments. Sure, I could come up with a test 
>> that shows a zillion transactions per second, but if that doesn't result in 
>> a cloud being able to schedule more efficiently, what's the point?
>>
>
>Speed is only 1 dimension. Efficiency and simplicity are two others that
>I think are harder to quantify, but are also equally important in any
>component of OpenStack.

Monty did suggest a goal of 100K nodes - which I think is a moon-expedition 
kind of goal given how far we are from it, but a goal nevertheless ;-)
OpenStack does not provide any number today beyond "massive scale", and that 
can be a problem for designers and implementors.
I think OpenStack is now mature enough that it has to worry seriously about 
scale, and we have a very long way to go at that level.

I agree on the importance of simplicity and efficiency. But let's also add 
operational requirements such as ease of deployment and ease of 
troubleshooting. It is more difficult for Ops to deal with too many different 
technologies under the cover.
My concern is that we may not have sufficient oversight (from the TC) for this 
kind of project to keep it within reasonable complexity for the given 
requirements, and that is hard to achieve when the requirements are very vague.
It looks like the main area where we might need faster nova scheduling would be 
the big deployments that use nova networking (thousands of nodes), simply 
because neutron deployments may not have enough nodes to require such a rate. 
And nobody seems to know what the targeted rate is (schedules per second) or 
what the exact problem to solve is (by exact I mean: what numbers do we have 
to show that the current nova scheduling is too slow or does not scale?).


>
>> 3) I like the idea of something like ZooKeeper, but my concern is how to 
>> efficiently query the data. If, for example, we had records for 100K compute 
>> nodes, would it be possible to do the equivalent of "SELECT * FROM resources 
>> WHERE resource_type = 'compute' AND free_ram_mb >= 2048 AND …" - well, you 
>> get the idea. Are complex data queries possible in ZK? I haven't been able 
>> to find that information anywhere.
>>
>
>You don't do complex queries, because you have all of the data in RAM,
>in an efficient in-RAM format. Even if each record is 50KB, we can do
>100,000 of them in 5GB. That's a drop in the bucket.

Yes

>
>> 4) It is true that even in a very large deployment, it is possible to keep 
>> all the relevant data needed for scheduling in memory. My concern is how to 
>> efficiently search that data, much like in the ZK scenario.
>> 
>
>There are a bunch of ways to do this. My favorite is to have filter
>plugins in the scheduler define what they need to index, and then
>build a B-tree for each filter as each record arrives in the main data
>structure. When scheduling requests come in, they simply walk through
>each B-tree and turn that into a set. Then read each piece of the set
>out of the main structure and sort based on whichever you want (less
>full for load balancing, most full for efficient stacking).

There are clearly things you should be doing to scale properly. Python is not 
very speedy but can be made good enough at scale using the proper algorithms 
(such as the one you propose).
Furthermore, it can be made to run much faster - close to native speed - with 
proper design and the use of the right libraries. So there is a lot that can 
be done to speed things up without necessarily having to increase complexity 
and scale out everything.


>
>> 5) Concerns about Cassandra running with OpenJDK instead of the Oracle JVM 
>> are troubling. I sent an email about this to one of the people I know at 
>> DataStax, but so far have not received a response. And while it would be 
>> great to have people contribute to OpenJDK to make it compatible, keep in 
>> mind that that would be an ongoing 

Re: [openstack-dev] Scheduler proposal

2015-10-16 Thread Joshua Harlow

Clint Byrum wrote:

Excerpts from Ed Leafe's message of 2015-10-15 11:56:24 -0700:

Wow, I seem to have unleashed a bunch of pent-up frustration in the
community! It's great to see everyone coming forward with their
ideas and insights for improving the way Nova (and, by extension,
all of OpenStack) can potentially scale.

I do have a few comments on the discussion:

1) This isn't a proposal to simply add some sort of DLM to Nova as
a magic cure-all. The concerns about Nova's ability to scale have
to do a lot more with the overall internal communication design.



In this, we agree.


2) I really liked the comment about "made-up numbers". It's so
true: we are all impressed by such examples of speed that we
sometimes forget whether speeding up X will improve the overall
process to any significant degree. The purpose of my original email
back in July, and the question I asked at the Nova midcycle, is if
we could get some numbers that would be a target to shoot for with
any of these experiments. Sure, I could come up with a test that
shows a zillion transactions per second, but if that doesn't result
in a cloud being able to schedule more efficiently, what's the
point?



Speed is only 1 dimension. Efficiency and simplicity are two others
that I think are harder to quantify, but are also equally important
in any component of OpenStack.


3) I like the idea of something like ZooKeeper, but my concern is
how to efficiently query the data. If, for example, we had records
for 100K compute nodes, would it be possible to do the equivalent
of "SELECT * FROM resources WHERE resource_type = 'compute' AND
free_ram_mb>= 2048 AND …" - well, you get the idea. Are complex
data queries possible in ZK? I haven't been able to find that
information anywhere.



You don't do complex queries, because you have all of the data in
RAM, in an efficient in-RAM format. Even if each record is 50KB, we
can do 100,000 of them in 5GB. That's a drop in the bucket.


4) It is true that even in a very large deployment, it is possible
to keep all the relevant data needed for scheduling in memory. My
concern is how to efficiently search that data, much like in the ZK
scenario.



There are a bunch of ways to do this. My favorite is to have filter
plugins in the scheduler define what they need to index, and then
build a B-tree for each filter as each record arrives in the main
data structure. When scheduling requests come in, they simply walk
through each B-tree and turn that into a set. Then read each piece of
the set out of the main structure and sort based on whichever you
want (less full for load balancing, most full for efficient
stacking).


Another idea is to use numpy and start representing filters as linear 
equations, then use something like 
https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.solve.html#numpy.linalg.solve 
to solve linear equations given some data.


Another idea, turn each filter into a constraint equation (which it 
sorta is anyway) and use a known fast constraint solver on that data...


Lots of ideas here that can be possible, likely endless :)
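For instance, a toy sketch of the "filters as constraints over a host 
matrix" idea (made-up resource columns, and plain vectorized inequality 
checks rather than a real constraint solver):

    import numpy as np

    # Each row is a host, each column a resource:
    # [free_ram_mb, free_disk_gb, free_vcpus]
    hosts = np.array([
        [2048,  80,  4],
        [ 512,  10,  1],
        [8192, 200, 16],
    ], dtype=float)

    # A request as a lower bound per resource; each "filter" is then just
    # one element-wise constraint, AND-ed across columns in one operation.
    demand = np.array([1024, 40, 2], dtype=float)
    feasible = np.all(hosts >= demand, axis=1)

    # Weighing example: prefer the most-packed feasible host (stacking) by
    # scoring the headroom left after placement (units mixed for brevity).
    headroom = (hosts - demand).sum(axis=1)
    headroom[~feasible] = np.inf
    best = int(np.argmin(headroom))
    print(feasible, best)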




5) Concerns about Cassandra running with OpenJDK instead of the
Oracle JVM are troubling. I sent an email about this to one of the
people I know at DataStax, but so far have not received a response.
And while it would be great to have people contribute to OpenJDK to
make it compatible, keep in mind that that would be an ongoing
commitment, not just a one-time effort.



There are a few avenues to success with Cassandra but I don't think
any of them pass very close to OpenStack's current neighborhood.


6) I remember discussions back in the Austin-Bexar time frame about
what Thierry referred to as 'flavor-based schedulers', and they
were immediately discounted as not sophisticated enough to handle
the sort of complex scheduling requests that were expected. I'd be
interested in finding out from the big cloud providers what
percentage of their requests would fall into this simple structure,
and what percent are more complicated than that. Having hosts
listening to queues that they know they can satisfy removes the
raciness from the process, although it would require some
additional handling for the situation where no host accepts the
request. Still, it has the advantage of being dead simple.
Unfortunately, this would probably require a bigger architectural
change than integrating Cassandra into the Scheduler would.



No host accepting the request means your cloud is, more or less,
full. If you have flavors that aren't proper factors of smaller
flavors, this will indeed happen even when it isn't 100% utilized. If
you have other constraints that you allow your users to specify, then
you are letting them dictate how your hardware is utilized, which I
think is a foolhardy business decision. This is no different than any
other manufacturing batch size problem: sometimes parts of your
process are under utilized, and you have to make choices about
rejecting 

Re: [openstack-dev] Scheduler proposal

2015-10-15 Thread Joshua Harlow

Ed Leafe wrote:

Wow, I seem to have unleashed a bunch of pent-up frustration in the
community! It's great to see everyone coming forward with their ideas
and insights for improving the way Nova (and, by extension, all of
OpenStack) can potentially scale.

I do have a few comments on the discussion:

1) This isn't a proposal to simply add some sort of DLM to Nova as a
magic cure-all. The concerns about Nova's ability to scale have to do
a lot more with the overall internal communication design.

2) I really liked the comment about "made-up numbers". It's so true:
we are all impressed by such examples of speed that we sometimes
forget whether speeding up X will improve the overall process to any
significant degree. The purpose of my original email back in July,
and the question I asked at the Nova midcycle, is if we could get
some numbers that would be a target to shoot for with any of these
experiments. Sure, I could come up with a test that shows a zillion
transactions per second, but if that doesn't result in a cloud being
able to schedule more efficiently, what's the point?

3) I like the idea of something like ZooKeeper, but my concern is how
to efficiently query the data. If, for example, we had records for
100K compute nodes, would it be possible to do the equivalent of
"SELECT * FROM resources WHERE resource_type = 'compute' AND
free_ram_mb>= 2048 AND …" - well, you get the idea. Are complex data
queries possible in ZK? I haven't been able to find that information
anywhere.


The idea is that you wouldn't do these queries against any remote source 
in the first place. Instead a scheduler would get notified (via a 
concept like 
http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#sc_zkDataMode_watches) 
when a hypervisor updates its data in zookeeper (or another equivalent 
system); when that notification happens the scheduler reads the data and 
updates some *local* data source with that information (this could be an 
in-memory dict, a local sqlite, or something else better optimized for fast 
searching), and from that point on that local source is used to do queries 
on. This way a hypervisor (compute-node) is performing *nearly* the 
equivalent of a push notification (like on your phone) to schedulers.
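A minimal sketch of that watch-and-cache pattern, assuming the kazoo 
ZooKeeper client and a made-up /computes layout (one JSON znode per 
hypervisor):

    import json

    from kazoo.client import KazooClient

    local_cache = {}   # hypervisor name -> last known resource record
    watched = set()

    zk = KazooClient(hosts='127.0.0.1:2181')
    zk.start()
    zk.ensure_path('/computes')

    def watch_node(name):
        # DataWatch fires once right away and then on every data change,
        # so the cache stays current without the scheduler ever polling.
        @zk.DataWatch('/computes/%s' % name)
        def _update(data, stat):
            if data is not None:
                local_cache[name] = json.loads(data.decode('utf-8'))
            else:
                local_cache.pop(name, None)   # znode gone: host went away

    @zk.ChildrenWatch('/computes')
    def _new_hosts(children):
        # Fires whenever hypervisors register or unregister under /computes.
        for name in children:
            if name not in watched:
                watched.add(name)
                watch_node(name)

    # Scheduling queries then run purely against local_cache, e.g.:
    # [h for h, r in local_cache.items() if r['free_ram_mb'] >= 2048]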




4) It is true that even in a very large deployment, it is possible to
keep all the relevant data needed for scheduling in memory. My
concern is how to efficiently search that data, much like in the ZK
scenario.


See above.



5) Concerns about Cassandra running with OpenJDK instead of the
Oracle JVM are troubling. I sent an email about this to one of the
people I know at DataStax, but so far have not received a response.
And while it would be great to have people contribute to OpenJDK to
make it compatible, keep in mind that that would be an ongoing
commitment, not just a one-time effort.

6) I remember discussions back in the Austin-Bexar time frame about
what Thierry referred to as 'flavor-based schedulers', and they were
immediately discounted as not sophisticated enough to handle the sort
of complex scheduling requests that were expected. I'd be interested
in finding out from the big cloud providers what percentage of their
requests would fall into this simple structure, and what percent are
more complicated than that. Having hosts listening to queues that
they know they can satisfy removes the raciness from the process,
although it would require some additional handling for the situation
where no host accepts the request. Still, it has the advantage of
being dead simple. Unfortunately, this would probably require a
bigger architectural change than integrating Cassandra into the
Scheduler would.


Another discussion that also should get talked about, but is again much 
larger in scope: https://review.openstack.org/#/c/210549/ (still WIP but 
the idea/problem/issue hopefully is clear).




I hope that those of us who will be at the Tokyo Summit and are
interested in these ideas can get together for an informal
discussion, and come up with some ideas for grand experiments and
reality checks. ;-)

BTW, I started playing around with some ideas, and thought that if
anyone wanted to also try Cassandra, I'd write up a quick how-to for
setting up a small cluster:
http://blog.leafe.com/small-scale-cassandra/. Using docker images
makes it a breeze!


-- Ed Leafe








Re: [openstack-dev] Scheduler proposal

2015-10-15 Thread Ed Leafe
Wow, I seem to have unleashed a bunch of pent-up frustration in the community! 
It's great to see everyone coming forward with their ideas and insights for 
improving the way Nova (and, by extension, all of OpenStack) can potentially 
scale.

I do have a few comments on the discussion:

1) This isn't a proposal to simply add some sort of DLM to Nova as a magic 
cure-all. The concerns about Nova's ability to scale have to do a lot more with 
the overall internal communication design.

2) I really liked the comment about "made-up numbers". It's so true: we are all 
impressed by such examples of speed that we sometimes forget whether speeding 
up X will improve the overall process to any significant degree. The purpose of 
my original email back in July, and the question I asked at the Nova midcycle, 
is if we could get some numbers that would be a target to shoot for with any of 
these experiments. Sure, I could come up with a test that shows a zillion 
transactions per second, but if that doesn't result in a cloud being able to 
schedule more efficiently, what's the point?

3) I like the idea of something like ZooKeeper, but my concern is how to 
efficiently query the data. If, for example, we had records for 100K compute 
nodes, would it be possible to do the equivalent of "SELECT * FROM resources 
WHERE resource_type = 'compute' AND free_ram_mb >= 2048 AND …" - well, you get 
the idea. Are complex data queries possible in ZK? I haven't been able to find 
that information anywhere.

4) It is true that even in a very large deployment, it is possible to keep all 
the relevant data needed for scheduling in memory. My concern is how to 
efficiently search that data, much like in the ZK scenario.

5) Concerns about Cassandra running with OpenJDK instead of the Oracle JVM are 
troubling. I sent an email about this to one of the people I know at DataStax, 
but so far have not received a response. And while it would be great to have 
people contribute to OpenJDK to make it compatible, keep in mind that that 
would be an ongoing commitment, not just a one-time effort.

6) I remember discussions back in the Austin-Bexar time frame about what 
Thierry referred to as 'flavor-based schedulers', and they were immediately 
discounted as not sophisticated enough to handle the sort of complex scheduling 
requests that were expected. I'd be interested in finding out from the big 
cloud providers what percentage of their requests would fall into this simple 
structure, and what percent are more complicated than that. Having hosts 
listening to queues that they know they can satisfy removes the raciness from 
the process, although it would require some additional handling for the 
situation where no host accepts the request. Still, it has the advantage of 
being dead simple. Unfortunately, this would probably require a bigger 
architectural change than integrating Cassandra into the Scheduler would.

I hope that those of us who will be at the Tokyo Summit and are interested in 
these ideas can get together for an informal discussion, and come up with some 
ideas for grand experiments and reality checks. ;-)

BTW, I started playing around with some ideas, and thought that if anyone 
wanted to also try Cassandra, I'd write up a quick how-to for setting up a 
small cluster: http://blog.leafe.com/small-scale-cassandra/. Using docker 
images makes it a breeze!


-- Ed Leafe









Re: [openstack-dev] Scheduler proposal

2015-10-14 Thread Dulko, Michal
On Tue, 2015-10-13 at 08:47 -0700, Joshua Harlow wrote:
> Well great!
> 
> When is that going to be accessible :-P
> 
> Dulko, Michal wrote:
> > On Mon, 2015-10-12 at 10:58 -0700, Joshua Harlow wrote:
> >> Just a related thought/question. It really seems we (as a community)
> >> need some kind of scale testing ground. Internally at yahoo we were/are
> >> going to use a 200 hypervisor cluster for some of this and then expand
> >> that into 200 * X by using nested virtualization and/or fake drivers and
> >> such. But this is a 'lab' that not everyone can have, and therefore
> >> isn't suited toward community work IMHO. Has there been any thought on
> >> such a 'lab' that is directly in the community, perhaps trystack.org can
> >> be this? (users get free VMs, but then we can tell them this area is a
> >> lab, so don't expect things to always work, free isn't free after all...)
> >>
> >> With such a lab, there could be these kinds of experiments, graphs,
> >> tweaks and such...
> >
> > https://www.mirantis.com/blog/intel-rackspace-want-cloud/
> >
> > "The plan is to build out an OpenStack developer cloud that consists of
> > two 1,000 node clusters available for use by anyone in the OpenStack
> > community for scaling, performance, and code testing. Rackspace plans to
> > have the cloud available within the next six months."
> >
> > Stuff you've described is actually being worked on for a few months. :)

Judging from the 6-month ETA and the fact that the work started in August, it
seems that the answer is: the beginning of 2016.


Re: [openstack-dev] Scheduler proposal

2015-10-14 Thread Thomas Goirand
On 10/12/2015 07:10 PM, Monty Taylor wrote:
> On 10/12/2015 12:43 PM, Clint Byrum wrote:
>> Excerpts from Thomas Goirand's message of 2015-10-12 05:57:26 -0700:
>>> On 10/11/2015 02:53 AM, Davanum Srinivas wrote:
 Thomas,

 i am curious as well. AFAIK, cassandra works well with OpenJDK. Can you
 please elaborate on what your concerns are for #1?

 Thanks,
 Dims
>>>
>>> s/works well/works/
>>>
>>> Upstream doesn't test against OpenJDK, and they close bugs without
>>> fixing them when it only affects OpenJDK and it isn't grave. I know this
>>> from one of the upstream from Cassandra, who is also a Debian developer.
>>> Because of this state of things, he gave up on packaging Cassandra in
>>> Debian (and for other reasons too, like not having enough time to work
>>> on the packaging).
>>>
>>> I trust what this Debian developer told me. If I remember correctly,
>>> it's Eric Evans  (ie, the author of the ITP at
>>> https://bugs.debian.org/585905) that I'm talking about.
>>>
>>
>> Indeed, I once took a crack at packaging it for Debian/Ubuntu too.
>> There's a reason 'apt-cache search cassandra' returns 0 results on Debian
>> and Ubuntu.
> 
> There is a different reason too - which is that (at least at one point
> in the past) upstream expressed frustration with the idea of distro
> packages of Cassandra because it led to people coming to them with
> complaints about the software which had been fixed in newer versions but
> which, because of distro support policies, were not present in the
> user's software version. (I can sympathize)

This is free software. We don't need to ask for permission from upstream
first.

Thomas Goirand (zigo)




Re: [openstack-dev] Scheduler proposal

2015-10-13 Thread Dulko, Michal
On Mon, 2015-10-12 at 10:58 -0700, Joshua Harlow wrote:
> Just a related thought/question. It really seems we (as a community) 
> need some kind of scale testing ground. Internally at yahoo we were/are 
> going to use a 200 hypervisor cluster for some of this and then expand 
> that into 200 * X by using nested virtualization and/or fake drivers and 
> such. But this is a 'lab' that not everyone can have, and therefore 
> isn't suited toward community work IMHO. Has there been any thought on 
> such a 'lab' that is directly in the community, perhaps trystack.org can 
> be this? (users get free VMs, but then we can tell them this area is a 
> lab, so don't expect things to always work, free isn't free after all...)
> 
> With such a lab, there could be these kinds of experiments, graphs, 
> tweaks and such...

https://www.mirantis.com/blog/intel-rackspace-want-cloud/

"The plan is to build out an OpenStack developer cloud that consists of
two 1,000 node clusters available for use by anyone in the OpenStack
community for scaling, performance, and code testing. Rackspace plans to
have the cloud available within the next six months."

Stuff you've described is actually being worked on for a few months. :)


Re: [openstack-dev] Scheduler proposal

2015-10-13 Thread Dulko, Michal
On Mon, 2015-10-12 at 10:13 -0700, Clint Byrum wrote:
> Zookeeper sits in a very different space from Cassandra. I have had good
> success with it on OpenJDK as well.
> 
> That said, we need to maybe go through some feature/risk matrices and
> compare to etcd and Consul (this might be good to do as part of filling
> out the DLM spec). The jvm issues go away with both of those, but then
> we get to deal with Go issues.
> 
> Also, ZK has one other advantage over those: It is already in Debian and
> Ubuntu, making access for developers much easier.

What about RHEL/CentOS? Maybe I'm mistaken, but I think these two
don't have it packaged.


Re: [openstack-dev] Scheduler proposal

2015-10-13 Thread Jeremy Stanley
On 2015-10-12 20:49:44 -0700 (-0700), Joshua Harlow wrote:
> Does the openstack foundation have access to a scaling area that
> can be used by the community for this kind of experimental work?

The OpenStack Foundation has a staff of fewer than 20 full-time
employees, with a primary focus on event planning and preserving the
community's trademarks. If instead you mean the member companies who
make up the OpenStack Foundation, then I agree with the other reply
on the thread that it sounds like the effort already underway at
Intel and Rackspace.

> It seems like infra or others should be able to make that possible?
[...]

The Infrastructure team is in the process of standing up a
community-managed deployment of OpenStack, but it's not even within
an order of magnitude of being 1k host scale (and at that, it's
still a multi-cycle plan just to reach viability).
-- 
Jeremy Stanley



Re: [openstack-dev] Scheduler proposal

2015-10-13 Thread Alec Hothan (ahothan)





On 10/12/15, 12:05 PM, "Monty Taylor"  wrote:

>On 10/12/2015 02:45 PM, Joshua Harlow wrote:
>> Alec Hothan (ahothan) wrote:
>>>
>>>
>>>
>>>
>
>I want to do 100k hypervisors. No, that's not hyperbole.
>
>Also, I do not think that ZK/consul/etcd are very costly for small 
>deployments. Given the number of simple dev-oriented projects that start 
>with "so install ZK/consul/etcd" I think they've all proven their 
>ability to scale _down_ - and I'm also pretty sure all of them have 
>installations that clear 100k nodes.
>
>This:
>
>to produce the ubiquitous Open Source Cloud Computing platform that will 
>meet the needs of public and private clouds regardless of size, by being 
>simple to implement and massively scalable.
>
>is what we're doing.
>
>Our mission is NOT "produce a mid-range cloud that is too complex for 
>small deployments and tops out before you get to big ones"
>
>I don't think "handle massive clouds" has ever NOT been on the list of 
>stated goals. (that mission statement has not changed since we started 
>the project - although I agree with Joe, it's in need of an update- 
>there is no mention of users)

Then it would be great to have an official statement from the TC about the 
scale objectives, and if possible to put some numbers on them; "massive cloud" 
is ambiguous for folks who actually have to make sure they scale to spec.
It should say, for example, "OpenStack should scale from 1 node to 100K 
nodes" - as long as everybody is fully aware of how far we are today from 
that lofty goal.
This clearly will have an impact on how we need to design services and how we 
should change the way we test them. It will be tricky to get a 1,000 node 
lab up and running just for OpenStack developers; it is just not practical. 
The only practical way will be to do proper unit testing at scale (e.g. 
emulate a 10K node cloud for unit testing any given service).
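For example, a trivial, made-up illustration of the kind of in-process 
emulation I mean (fake host records only, no real hypervisors, and a 
deliberately naive filter):

    import random
    import time

    # Fake the inventory a 10K node cloud would report.
    random.seed(42)
    fake_hosts = [
        {'host': 'node-%05d' % i,
         'free_ram_mb': random.choice([2048, 4096, 8192, 16384]),
         'free_vcpus': random.randint(0, 32)}
        for i in range(10000)
    ]

    def naive_filter(hosts, ram_mb, vcpus):
        return [h for h in hosts
                if h['free_ram_mb'] >= ram_mb and h['free_vcpus'] >= vcpus]

    # Time a burst of scheduling decisions against the emulated inventory;
    # the point is repeatable "schedules per second" numbers in a test,
    # not a statement about any real deployment.
    start = time.time()
    for _ in range(100):
        candidates = naive_filter(fake_hosts, 4096, 2)
    elapsed = time.time() - start
    print('%d candidates, %.1f schedules/sec' % (len(candidates),
                                                 100 / elapsed))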


>
>BTW - Infra currently runs against clouds rate-limited at 
>roughly 10 api calls / second. That's just one tenant - but it's a 
>perfectly manageable rate. Now, if the cloud could continue to add nodes 
>and users without that rate degrading I think we'd be in really good shape.

I think that rate limit only applies to REST APIs; I don't think there is any 
rate limit for oslo messaging.
Even just 10 API calls per second per tenant can be a challenge with a large 
number of tenants. I don't think there is any provision today, for example, to 
ensure fairness across tenants.






Re: [openstack-dev] Scheduler proposal

2015-10-13 Thread Joshua Harlow

Jeremy Stanley wrote:

On 2015-10-12 20:49:44 -0700 (-0700), Joshua Harlow wrote:

Does the openstack foundation have access to a scaling area that
can be used by the community for this kind of experimental work?


The OpenStack Foundation has a staff of fewer than 20 full-time
employees, with a primary focus on event planning and preserving the
community's trademarks. If instead you mean the member companies who
make up the OpenStack Foundation, then I agree with the other reply
on the thread that it sounds like the effort already underway at
Intel and Rackspace.


Sure, that is its *current* primary focus, but this could be an addition.

I've also been thinking that long-term cross-project changes should 
really be guided by the foundation as well. Something akin to 
keeping long-term changes (ones that require years of work, such as 
cross-project quotas, or...) on track even when member companies come and 
go (because IMHO expecting otherwise leaves things halfway done, or not 
done at all).





It seems like infra or others should be able to make that possible?

[...]

The Infrastructure team is in the process of standing up a
community-managed deployment of OpenStack, but it's not even within
an order of magnitude of being 1k host scale (and at that, it's
still a multi-cycle plan just to reach viability).


Well go big or go home :-P

-Josh



Re: [openstack-dev] Scheduler proposal

2015-10-13 Thread Joshua Harlow

Well great!

When is that going to be accessible :-P

Dulko, Michal wrote:

On Mon, 2015-10-12 at 10:58 -0700, Joshua Harlow wrote:

Just a related thought/question. It really seems we (as a community)
need some kind of scale testing ground. Internally at yahoo we were/are
going to use a 200 hypervisor cluster for some of this and then expand
that into 200 * X by using nested virtualization and/or fake drivers and
such. But this is a 'lab' that not everyone can have, and therefore
isn't suited toward community work IMHO. Has there been any thought on
such a 'lab' that is directly in the community, perhaps trystack.org can
be this? (users get free VMs, but then we can tell them this area is a
lab, so don't expect things to always work, free isn't free after all...)

With such a lab, there could be these kinds of experiments, graphs,
tweaks and such...


https://www.mirantis.com/blog/intel-rackspace-want-cloud/

"The plan is to build out an OpenStack developer cloud that consists of
two 1,000 node clusters available for use by anyone in the OpenStack
community for scaling, performance, and code testing. Rackspace plans to
have the cloud available within the next six months."

Stuff you've described is actually being worked on for a few months. :)


Re: [openstack-dev] Scheduler proposal

2015-10-13 Thread Ian Wells
On 12 October 2015 at 21:18, Clint Byrum  wrote:

> We _would_ keep a local cache of the information in the schedulers. The
> centralized copy of it is to free the schedulers from the complexity of
> having to keep track of it as state, rather than as a cache. We also don't
> have to provide a way for on-demand stat fetching to seed scheduler 0.
>

I'm not sure that actually changes.  On restart of a scheduler, it wouldn't
have enough knowledge to schedule, but the other schedulers are unaffected and
can service requests while it waits for data.  Using ZK, that takes fewer
seconds because it can get a braindump, but during that window in either
case the system works at (n-1)/n capacity, assuming queries are only done in
memory.

Also, you were seeming to tout the ZK option would take less memory, but it
seems it would take more.  You can't schedule without a relatively complete
set of information or some relatively intricate query language, which I
didn't think ZK was up to (but I'm open to correction there, certainly).
That implies that when you notify a scheduler of a change to the data
model, it's going to grab the fresh data and keep it locally.


> > Also, the notification path here is that the compute host notifies ZK and
> > ZK notifies many schedulers, assuming they're all capable of handling all
> > queries.  That is in fact N * (M+1) messages, which is slightly more than
> > if there's no central node, as it happens.  There are fewer *channels*,
> but
> > more messages.  (I feel like I'm overlooking something here, but I can't
> > pick out the flaw...)  Yes, RMQ will suck at this - but then let's talk
> > about better messaging rather than another DB type.
> >
>
> You're calling transactions messages, and that's not really fair to
> messaging or transactions. :)
>

I was actually talking about the number of messages crossing the network.
Your point is that the transaction with ZK is heavier weight than the
update processing at the schedulers, I think.  But then removing ZK as a
nexus removes that transaction, so both the number of messages and the
number of transactions goes down.

However, it's important to note that in
> this situation, compute nodes do not have to send anything anywhere if
> nothing has changed, which is very likely the case for "full" compute
> nodes, and certainly will save many many redundant messages.


Now that's a fair comment, certainly, and would drastically reduce the
number of messages in the system if we can keep the nodes from updating
just because their free memory has changed by a couple of pages.


> Forgive me
> if nova already makes this optimization somehow, it didn't seem to when
> I was tinkering a year ago.
>

Not as far as I know, it doesn't.

There is also the complexity of designing a scheduler which is fault
> tolerant and scales economically. What we have now will overtax the
> message bus and the database as the number of compute nodes increases.
> We want to get O(1) complexity out of that, but we're getting O(N)
> right now.
>

O(N) will work providing O is small. ;)

I think our cost currently lies in doing 1 MySQL DB update per node per
minute, and one really quite mad query per schedule.  I agree that ZK would
be less costly for that in both respects, which is really more about
lowering O than N.  I'm wondering if we can do better still, that's all,
but we both agree that this approach would work.
-- 
Ian.


Re: [openstack-dev] Scheduler proposal

2015-10-13 Thread Joshua Harlow

Clint Byrum wrote:

Excerpts from Ian Wells's message of 2015-10-13 09:24:42 -0700:

On 12 October 2015 at 21:18, Clint Byrum  wrote:


We _would_ keep a local cache of the information in the schedulers. The
centralized copy of it is to free the schedulers from the complexity of
having to keep track of it as state, rather than as a cache. We also don't
have to provide a way for on-demand stat fetching to seed scheduler 0.


I'm not sure that actually changes.  On restart of a scheduler, it wouldn't
have enough knowledge to schedule, but the other schedulers are unaffected and
can service requests while it waits for data.  Using ZK, that takes fewer
seconds because it can get a braindump, but during that window in either
case the system works at (n-1)/n capacity, assuming queries are only done in
memory.



Yeah, I'd put this as a 3 on the 1-10 scale of optimizations. Not a
reason to do it, but an assessment that it improves the efficiency of
starting new schedulers. It also has the benefit that if you do choose
to just run 1 scheduler, you can just start a new one and it will walk
the tree and start scheduling immediately thereafter.


Also, you were seeming to tout the ZK option would take less memory, but it
seems it would take more.  You can't schedule without a relatively complete
set of information or some relatively intricate query language, which I
didn't think ZK was up to (but I'm open to correction there, certainly).
That implies that when you notify a scheduler of a change to the data
model, it's going to grab the fresh data and keep it locally.



If I did that, I was being unclear and I'm sorry for that. I do think
the cache of potential scheduling targets and stats should fit in RAM
easily for even 100,000 nodes, including indexes for fast lookups.
The intermediary is entirely to alleviate the need for complicated sync
protocols to be implemented in the scheduler and compute agent. RAM is
cheap, time is not.


+1

Servers come with many tens or hundreds of gigabytes of memory nowadays, and 
if we locally cache with various levels of indexing (perhaps even using 
some other db-like library to help here) then I'd hope we can fit as 
many nodes as we desire.





Also, the notification path here is that the compute host notifies ZK and
ZK notifies many schedulers, assuming they're all capable of handling all
queries.  That is in fact N * (M+1) messages, which is slightly more than
if there's no central node, as it happens.  There are fewer *channels*,

but

more messages.  (I feel like I'm overlooking something here, but I can't
pick out the flaw...)  Yes, RMQ will suck at this - but then let's talk
about better messaging rather than another DB type.


You're calling transactions messages, and that's not really fair to
messaging or transactions. :)


I was actually talking about the number of messages crossing the network.
Your point is that the transaction with ZK is heavier weight than the
update processing at the schedulers, I think.  But then removing ZK as a
nexus removes that transaction, so both the number of messages and the
number of transactions goes down.



Point taken and agreed.


However, it's important to note that in

this situation, compute nodes do not have to send anything anywhere if
nothing has changed, which is very likely the case for "full" compute
nodes, and certainly will save many many redundant messages.


Now that's a fair comment, certainly, and would drastically reduce the
number of messages in the system if we can keep the nodes from updating
just because their free memory has changed by a couple of pages.



Indeed, an optimization like this is actually orthogonal to the management
of the corpus of state from all hosts. Hosts should in fact be able
to optimize for this already. Of course, then you lose the heartbeat..
which might be more valuable than the savings in communication load.


Forgive me
if nova already makes this optimization somehow, it didn't seem to when
I was tinkering a year ago.


Not as far as I know, it doesn't.

There is also the complexity of designing a scheduler which is fault

tolerant and scales economically. What we have now will overtax the
message bus and the database as the number of compute nodes increases.
We want to get O(1) complexity out of that, but we're getting O(N)
right now.


O(N) will work providing O is small. ;)

I think our cost currently lies in doing 1 MySQL DB update per node per
minute, and one really quite mad query per schedule.  I agree that ZK would
be less costly for that in both respects, which is really more about
lowering O than N.  I'm wondering if we can do better still, that's all,
but we both agree that this approach would work.


Right, I think it is worth an experiment if for no other reason than
MySQL can't really go much faster for this. We could move the mad query
out into RAM, but then we get the problem of how to keep a useful dataset
in RAM and we're back to syncing or polling the 

Re: [openstack-dev] Scheduler proposal

2015-10-13 Thread Jeremy Stanley
On 2015-10-13 09:15:02 -0700 (-0700), Clint Byrum wrote:
> Excerpts from Jeremy Stanley's message of 2015-10-13 06:13:32 -0700:
[...]
> > it's not even within an order of magnitude of being 1k host
> > scale (and at that, it's still a multi-cycle plan just to reach
> > viability).
> 
> Infra-cloud currently has about 200 total real servers donated by
> HP.
[...]

I stand corrected--it _is_ within an order of magnitude of being 1k
host scale!

Anyway, my point was that to build and manage something similar for
scalability experiments would require a lot of extra hardware,
people and time to implement and manage.
-- 
Jeremy Stanley



Re: [openstack-dev] Scheduler proposal

2015-10-13 Thread Jeremy Stanley
On 2015-10-13 10:17:26 -0700 (-0700), Joshua Harlow wrote:
[...]
> Interesting, doesn't the foundation have money? I was under the
> assumption it does (but I'm not a finance person); seeing that the
> membership fee to become a member afaik is not cheap, and there
> seems to be quite a-lot of members
> (https://www.openstack.org/foundation/companies/) one could
> speculate that resources (compute, lab clouds, people to help
> manage all of these) shouldn't really be a problem...

Yep, there is some money. I have no idea whether there is surplus
sufficient to sustain facilities and staff for a 1k-node service
provider with no customer revenue, but I have my doubts. On the
other hand it might be easier for data center space, hardware and
humans to be donated from 0.1% of 5 different 200k-node service
providers since they already have experience doing that at some
economy of scale (where the OpenStack Foundation does not).

> Anyway, perhaps this is for another conversation...

Indeed, and one better had with the OpenStack Foundation Board of
Directors rather than the developer community/Infra team.
-- 
Jeremy Stanley



Re: [openstack-dev] Scheduler proposal

2015-10-13 Thread Clint Byrum
Excerpts from Jeremy Stanley's message of 2015-10-13 06:13:32 -0700:
> On 2015-10-12 20:49:44 -0700 (-0700), Joshua Harlow wrote:
> > Does the openstack foundation have access to a scaling area that
> > can be used by the community for this kind of experimental work?
> 
> The OpenStack Foundation has a staff of fewer than 20 full-time
> employees, with a primary focus on event planning and preserving the
> community's trademarks. If instead you mean the member companies who
> make up the OpenStack Foundation, then I agree with the other reply
> on the thread that it sounds like the effort already underway at
> Intel and Rackspace.
> 
> > It seems like infra or others should be able to make that possible?
> [...]
> 
> The Infrastructure team is in the process of standing up a
> community-managed deployment of OpenStack, but it's not even within
> an order of magnitude of being 1k host scale (and at that, it's
> still a multi-cycle plan just to reach viability).

Infra-cloud currently has about 200 total real servers donated by HP.
The primary focus is on adding nodes for nodepool, so that we can
keep ahead of the milestone surges and general widening of the scope
of OpenStack. Doing it the same way infra does their other apps also
means that infra can fix this cloud when it isn't suited to their needs,
instead of having to work around public cloud quirks.

However, it was always a secondary goal of infra-cloud to provide a cloud
that is 100% visible to the entire community, including operators, so
that the community can collaborate on improving said cloud which should
drive quality.

We're currently going very slow, mostly because there are basically
3 people working about 5-30 percent of their time on it. As the needs
listed above grow, I imagine infra-cloud will rise in our priorities.

If a member company were to donate _more_ nodes in a single place so that
we could push the bounds of a single region/az/cell, that would be great,
but I don't think those could be capitalized on without a donation of
more staff to the infra team as well.



Re: [openstack-dev] Scheduler proposal

2015-10-13 Thread Clint Byrum
Excerpts from Ian Wells's message of 2015-10-13 09:24:42 -0700:
> On 12 October 2015 at 21:18, Clint Byrum  wrote:
> 
> > We _would_ keep a local cache of the information in the schedulers. The
> > centralized copy of it is to free the schedulers from the complexity of
> > having to keep track of it as state, rather than as a cache. We also don't
> > have to provide a way for on-demand stat fetching to seed scheduler 0.
> >
> 
> I'm not sure that actually changes.  On restart of a scheduler, it wouldn't
> have enough knowledge to schedule, but the other schedulers are unaffected
> and can service requests while it waits for data.  Using ZK, that takes
> fewer seconds because it can get a braindump, but during that window in
> either case the system works at (n-1)/n capacity, assuming queries are only
> done in memory.
> 

Yeah, I'd put this as a 3 on the 1-10 scale of optimizations. Not a
reason to do it, but an assessment that it improves the efficiency of
starting new schedulers. It also has the benefit that if you do choose
to just run 1 scheduler, you can just start a new one and it will walk
the tree and start scheduling immediately thereafter.

> Also, you were seeming to tout the ZK option would take less memory, but it
> seems it would take more.  You can't schedule without a relatively complete
> set of information or some relatively intricate query language, which I
> didn't think ZK was up to (but I'm open to correction there, certainly).
> That implies that when you notify a scheduler of a change to the data
> model, it's going to grab the fresh data and keep it locally.
> 

If I did that, I was being unclear and I'm sorry for that. I do think
the cache of potential scheduling targets and stats should fit in RAM
easily for even 100,000 nodes, including indexes for fast lookups.
The intermediary is entirely to alleviate the need for complicated sync
protocols to be implemented in the scheduler and compute agent. RAM is
cheap, time is not.

> > > Also, the notification path here is that the compute host notifies ZK and
> > > ZK notifies many schedulers, assuming they're all capable of handling all
> > > queries.  That is in fact N * (M+1) messages, which is slightly more than
> > > if there's no central node, as it happens.  There are fewer *channels*,
> > but
> > > more messages.  (I feel like I'm overlooking something here, but I can't
> > > pick out the flaw...)  Yes, RMQ will suck at this - but then let's talk
> > > about better messaging rather than another DB type.
> > >
> >
> > You're calling transactions messages, and that's not really fair to
> > messaging or transactions. :)
> >
> 
> I was actually talking about the number of messages crossing the network.
> Your point is that the transaction with ZK is heavier weight than the
> update processing at the schedulers, I think.  But then removing ZK as a
> nexus removes that transaction, so both the number of messages and the
> number of transactions goes down.
> 

Point taken and agreed.

> However, it's important to note that in
> > this situation, compute nodes do not have to send anything anywhere if
> > nothing has changed, which is very likely the case for "full" compute
> > nodes, and certainly will save many many redundant messages.
> 
> 
> Now that's a fair comment, certainly, and would drastically reduce the
> number of messages in the system if we can keep the nodes from updating
> just because their free memory has changed by a couple of pages.
> 

Indeed, an optimization like this is actually orthogonal to the management
of the corpus of state from all hosts. Hosts should in fact be able
to optimize for this already. Of course, then you lose the heartbeat..
which might be more valuable than the savings in communication load.
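Something like this on the compute side would do it - a hypothetical helper, 
sketched only, with a periodic report kept so the heartbeat isn't lost:

    import time

    REPORT_INTERVAL = 60    # heartbeat: always report at least this often
    RAM_DELTA_MB = 256      # ignore free-RAM jitter smaller than this

    class StateReporter(object):
        """Send stats only when they changed meaningfully or the
        heartbeat interval has elapsed."""

        def __init__(self, send):
            self._send = send       # callable that publishes to ZK/RPC/...
            self._last_sent = None
            self._last_time = 0.0

        def maybe_report(self, stats):
            now = time.time()
            heartbeat_due = now - self._last_time >= REPORT_INTERVAL
            changed = (
                self._last_sent is None or
                stats['running_vms'] != self._last_sent['running_vms'] or
                abs(stats['free_ram_mb'] -
                    self._last_sent['free_ram_mb']) >= RAM_DELTA_MB
            )
            if changed or heartbeat_due:
                self._send(stats)
                self._last_sent = dict(stats)
                self._last_time = now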

> > Forgive me
> > if nova already makes this optimization somehow, it didn't seem to when
> > I was tinkering a year ago.
> >
> 
> Not as far as I know, it doesn't.
> 
> There is also the complexity of designing a scheduler which is fault
> > tolerant and scales economically. What we have now will overtax the
> > message bus and the database as the number of compute nodes increases.
> > We want to get O(1) complexity out of that, but we're getting O(N)
> > right now.
> >
> 
> O(N) will work providing O is small. ;)
> 
> I think our cost currently lies in doing 1 MySQL DB update per node per
> minute, and one really quite mad query per schedule.  I agree that ZK would
> be less costly for that in both respects, which is really more about
> lowering O than N.  I'm wondering if we can do better still, that's all,
> but we both agree that this approach would work.

Right, I think it is worth an experiment if for no other reason than
MySQL can't really go much faster for this. We could move the mad query
out into RAM, but then we get the problem of how to keep a useful dataset
in RAM and we're back to syncing or polling the database hard.


Re: [openstack-dev] Scheduler proposal

2015-10-13 Thread Joshua Harlow

Clint Byrum wrote:

Excerpts from Jeremy Stanley's message of 2015-10-13 06:13:32 -0700:

On 2015-10-12 20:49:44 -0700 (-0700), Joshua Harlow wrote:

Does the openstack foundation have access to a scaling area that
can be used by the community for this kind of experimental work?

The OpenStack Foundation has a staff of fewer than 20 full-time
employees, with a primary focus on event planning and preserving the
community's trademarks. If instead you mean the member companies who
make up the OpenStack Foundation, then I agree with the other reply
on the thread that it sounds like the effort already underway at
Intel and Rackspace.


It seems like infra or others should be able to make that possible?

[...]

The Infrastructure team is in the process of standing up a
community-managed deployment of OpenStack, but it's not even within
an order of magnitude of being 1k host scale (and at that, it's
still a multi-cycle plan just to reach viability).


Infra-cloud currently has about 200 total real servers donated by HP.
The primary focus is on adding nodes for nodepool, so that we can
keep ahead of the milestone surges and general widening of the scope
of OpenStack. Doing it the same way infra does their other apps also
means that infra can fix this cloud when it isn't suited to their needs,
instead of having to work around public cloud quirks.

However, it was always a secondary goal of infra-cloud to provide a cloud
that is 100% visible to the entire community, including operators, so
that the community can collaborate on improving said cloud which should
drive quality.

We're currently going very slow, mostly because there are basically
3 people working about 5-30 percent of their time on it. As the needs
listed above grow, I imagine infra-cloud will rise in our priorities.

If a member company were to donate _more_ nodes in a single place so that
we could push the bounds of a single region/az/cell, that would be great,
but I don't think those could be capitalized on without a donation of
more staff to the infra team as well.


Interesting, doesn't the foundation have money? I was under the 
assumption it does (but I'm not a finance person); seeing that the 
membership fee to become a member afaik is not cheap, and there seems to 
be quite a lot of members 
(https://www.openstack.org/foundation/companies/) one could speculate 
that resources (compute, lab clouds, people to help manage all of 
these) shouldn't really be a problem...


Anyway, perhaps this is for another conversation...

-Josh



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




Re: [openstack-dev] Scheduler proposal

2015-10-13 Thread Clint Byrum
Excerpts from Dulko, Michal's message of 2015-10-13 03:49:44 -0700:
> On Mon, 2015-10-12 at 10:13 -0700, Clint Byrum wrote:
> > Zookeeper sits in a very different space from Cassandra. I have had good
> > success with it on OpenJDK as well.
> > 
> > That said, we need to maybe go through some feature/risk matrices and
> > compare to etcd and Consul (this might be good to do as part of filling
> > out the DLM spec). The JVM issues go away with both of those, but then
> > we get to deal with Go issues.
> > 
> > Also, ZK has one other advantage over those: It is already in Debian and
> > Ubuntu, making access for developers much easier.
> 
> What about RHEL/CentOS? Maybe I'm mistaken, but I think these two
> don't have it packaged.

I don't know about RHEL/CentOS, but Fedora packages Zookeeper:

https://apps.fedoraproject.org/packages/zookeeper/sources

Seems like that can be spun into RHEL/CentOS relatively easily, perhaps
through EPEL?

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Thierry Carrez
Adam Lawson wrote:
> I have a quick question: how is Amazon doing this? When choosing a next
> path forward that reliably scales, it would be interesting to know how this
> is already being done.

Well, those who know probably would be sued if they told.

Since they have a limited set of instance types and very limited
placement options, my bet would be that they do flavor-based scheduling
("let compute nodes grab node reservation requests directly
out of flavor based queues based on their own current observation of
their ability to service it" in Clint's own words).

This is the most efficient way to scale: you no longer rely on a
specific scheduler trying to keep an up-to-date view of your compute
nodes resource availability. As long as you are ready to abandon fancy
placement features, you can get simple, reliable and scalable
(non-)scheduling.
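As a toy, in-process sketch of that pattern (a real deployment would use
per-flavor message queues and a proper claim protocol, not Python Queue
objects; the numbers are made up):

    import queue
    import threading

    flavor_queues = {'m1.small': queue.Queue()}

    def compute_worker(host, capacity_mb, flavor, ram_mb):
        # Each node only asks for work while it can still fit that flavor,
        # based purely on its own local view of its free resources.
        free = capacity_mb
        q = flavor_queues[flavor]
        while free >= ram_mb:
            req = q.get()
            free -= ram_mb
            print('%s claimed %s (%d MB left)' % (host, req, free))
            q.task_done()

    for host in ('node1', 'node2'):
        threading.Thread(target=compute_worker,
                         args=(host, 8192, 'm1.small', 2048),
                         daemon=True).start()

    for i in range(6):
        flavor_queues['m1.small'].put('req-%d' % i)
    flavor_queues['m1.small'].join()   # demand fits, so this returns

No filtering, no central view: whichever node takes the request has
already decided it can run it.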

Personally as we explore the options we have in that space, I'd like to
consider options that still enable us to plug such a no-scheduler
solution without too much trouble. Just for those of us who are ready to
make that trade-off :)

-- 
Thierry Carrez (ttx)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Thierry Carrez
Clint Byrum wrote:
> Excerpts from Joshua Harlow's message of 2015-10-10 17:43:40 -0700:
>> I'm curious is there any more detail about #1 below anywhere online?
>>
>> Does cassandra use some features of the JVM that the openJDK version 
>> doesn't support? Something else?
> 
> This about sums it up:
> 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StartupChecks.java#L153-L155
> 
> // There is essentially no QA done on OpenJDK builds, and
> // clusters running OpenJDK have seen many heap and load issues.
> logger.warn("OpenJDK is not recommended. Please upgrade to the newest 
> Oracle Java release");

Or:
https://twitter.com/mipsytipsy/status/596697501991702528

This is one of the reasons I'm generally negative about Java solutions
(Cassandra or Zookeeper): the free software JVM is still not on par with
the non-free one, so we indirectly force our users to use a non-free
dependency. I've been there before often enough to hear "did you
reproduce that bug under the {Sun,Oracle} JVM" quite a few times.

When the Java solution is the only solution for a problem space that
might still be a good trade-off (compared to reinventing the wheel for
example), but to share state or distribute locks, there are some pretty
good other options out there that don't suffer from the same fundamental
problem...

-- 
Thierry Carrez (ttx)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Joshua Harlow

Thierry Carrez wrote:

Clint Byrum wrote:

Excerpts from Joshua Harlow's message of 2015-10-10 17:43:40 -0700:

I'm curious is there any more detail about #1 below anywhere online?

Does cassandra use some features of the JVM that the openJDK version
doesn't support? Something else?

This about sums it up:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StartupChecks.java#L153-L155

 // There is essentially no QA done on OpenJDK builds, and
 // clusters running OpenJDK have seen many heap and load issues.
 logger.warn("OpenJDK is not recommended. Please upgrade to the newest Oracle 
Java release");


Or:
https://twitter.com/mipsytipsy/status/596697501991702528

This is one of the reasons I'm generally negative about Java solutions
(Cassandra or Zookeeper): the free software JVM is still not on par with
the non-free one, so we indirectly force our users to use a non-free
dependency. I've been there before often enough to hear "did you
reproduce that bug under the {Sun,Oracle} JVM" quite a few times.


I'd be happy to 'fight' for (and even fix) any issues found with 
zookeeper + openjdk if needed; that twitter posting hopefully ended up 
in a bug being filed at https://issues.apache.org/jira/browse/ZOOKEEPER/ 
and hopefully things getting fixed...




When the Java solution is the only solution for a problem space that
might still be a good trade-off (compared to reinventing the wheel for
example), but to share state or distribute locks, there are some pretty
good other options out there that don't suffer from the same fundamental
problem...



IMHO it's the only 'mature' solution so far; but of course maturity is a 
relative thing (look at the project age, version number of zookeeper vs 
etcd, consul for a general idea around this); in general I'd really like 
the TC and the foundation to help make the right decision here, because 
this kind of choice affects the long-term future (and health) of 
openstack as a whole (or I believe it does).


-Josh

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Clint Byrum
Excerpts from Thomas Goirand's message of 2015-10-12 05:57:26 -0700:
> On 10/11/2015 02:53 AM, Davanum Srinivas wrote:
> > Thomas,
> > 
> > i am curious as well. AFAIK, cassandra works well with OpenJDK. Can you
> > please elaborate what your concerns are for #1?
> > 
> > Thanks,
> > Dims
> 
> s/works well/works/
> 
> Upstream doesn't test against OpenJDK, and they close bugs without
> fixing them when it only affects OpenJDK and it isn't grave. I know this
> from one of the upstream developers of Cassandra, who is also a Debian developer.
> Because of this state of things, he gave up on packaging Cassandra in
> Debian (and for other reasons too, like not having enough time to work
> on the packaging).
> 
> I trust what this Debian developer told me. If I remember correctly,
> it's Eric Evans  (ie, the author of the ITP at
> https://bugs.debian.org/585905) that I'm talking about.
> 

Indeed, I once took a crack at packaging it for Debian/Ubuntu too.
There's a reason 'apt-cache search cassandra' returns 0 results on Debian
and Ubuntu.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Thomas Goirand
On 10/11/2015 02:53 AM, Davanum Srinivas wrote:
> Thomas,
> 
> i am curious as well. AFAIK, cassandra works well with OpenJDK. Can you
> please elaborate what your concerns are for #1?
> 
> Thanks,
> Dims

s/works well/works/

Upstream doesn't test against OpenJDK, and they close bugs without
fixing them when it only affects OpenJDK and it isn't grave. I know this
from one of the upstream developers of Cassandra, who is also a Debian developer.
Because of this state of things, he gave up on packaging Cassandra in
Debian (and for other reasons too, like not having enough time to work
on the packaging).

I trust what this Debian developer told me. If I remember correctly,
it's Eric Evans  (ie, the author of the ITP at
https://bugs.debian.org/585905) that I'm talking about.

On 10/12/2015 01:19 AM, Amrith Kumar wrote:
> This is not a requirement by any means. See [3].
>
http://stackoverflow.com/questions/21487354/does-latest-cassandra-support-openjdk

A *hard* requirement, probably not. But this doesn't mean that Cassandra
works *well* on OpenJDK.

Anyway, I'd prefer if nobody trusted me, and that this was seriously
checked.

On 10/11/2015 08:53 AM, Clint Byrum wrote:
>
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StartupChecks.java#L153-L155
>
> // There is essentially no QA done on OpenJDK builds, and
> // clusters running OpenJDK have seen many heap and load issues.
> logger.warn("OpenJDK is not recommended. Please upgrade to the
> newest Oracle Java release");

Ah! Thanks for finding out the details I was missing... :)
With these kinds of problems, I don't think anyone would like to take the
responsibility to upload Cassandra to either Debian or Ubuntu. At least
*I* wouldn't.

That being said, maybe the issue that Clint quoted is fixable. Probably
the issue is that nobody really cares... (yet?)

Cheers,

Thomas Goirand (zigo)


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Jean-Daniel Bonnetot
Hi everyone,

What do you think about this proposal?
http://www.slideshare.net/viggates/openstack-india-meetupscheduler

It seems they found a real solution for a scaling scheduler.
The good idea is to move the intelligence onto the compute nodes.
Synchronisation is only needed for anti-affinity and things like that, which
can be managed in another way.

—
Jean-Daniel Bonnetot
http://www.ovh.com
@pilgrimstack



> On 12 Oct 2015 at 12:30, Thierry Carrez wrote:
> 
> Clint Byrum wrote:
>> Excerpts from Joshua Harlow's message of 2015-10-10 17:43:40 -0700:
>>> I'm curious is there any more detail about #1 below anywhere online?
>>> 
>>> Does cassandra use some features of the JVM that the openJDK version 
>>> doesn't support? Something else?
>> 
>> This about sums it up:
>> 
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StartupChecks.java#L153-L155
>> 
>>// There is essentially no QA done on OpenJDK builds, and
>>// clusters running OpenJDK have seen many heap and load issues.
>>logger.warn("OpenJDK is not recommended. Please upgrade to the newest 
>> Oracle Java release");
> 
> Or:
> https://twitter.com/mipsytipsy/status/596697501991702528
> 
> This is one of the reasons I'm generally negative about Java solutions
> (Cassandra or Zookeeper): the free software JVM is still not on par with
> the non-free one, so we indirectly force our users to use a non-free
> dependency. I've been there before often enough to hear "did you
> reproduce that bug under the {Sun,Oracle} JVM" quite a few times.
> 
> When the Java solution is the only solution for a problem space that
> might still be a good trade-off (compared to reinventing the wheel for
> example), but to share state or distribute locks, there are some pretty
> good other options out there that don't suffer from the same fundamental
> problem...
> 
> -- 
> Thierry Carrez (ttx)
> 
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Monty Taylor

On 10/12/2015 12:43 PM, Clint Byrum wrote:

Excerpts from Thomas Goirand's message of 2015-10-12 05:57:26 -0700:

On 10/11/2015 02:53 AM, Davanum Srinivas wrote:

Thomas,

i am curious as well. AFAIK, cassandra works well with OpenJDK. Can you
please elaborate what your concerns are for #1?

Thanks,
Dims


s/works well/works/

Upstream doesn't test against OpenJDK, and they close bugs without
fixing them when it only affects OpenJDK and it isn't grave. I know this
from one of the upstream developers of Cassandra, who is also a Debian developer.
Because of this state of things, he gave up on packaging Cassandra in
Debian (and for other reasons too, like not having enough time to work
on the packaging).

I trust what this Debian developer told me. If I remember correctly,
it's Eric Evans  (ie, the author of the ITP at
https://bugs.debian.org/585905) that I'm talking about.



Indeed, I once took a crack at packaging it for Debian/Ubuntu too.
There's a reason 'apt-cache search cassandra' returns 0 results on Debian
and Ubuntu.


There is a different reason too - which is that (at least at one point 
in the past) upstream expressed frustration with the idea of distro 
packages of Cassandra because it led to people coming to them with 
complaints about the software which had been fixed in newer versions but 
which, because of distro support policies, were not present in the 
user's software version. (I can sympathize)


I think they've been an excellent case study in how there is an 
impedance mismatch sometimes between the value that distros provide and 
the needs of particular communities. That's not a negative thought 
towards either of them - just that it's not purely limited to them.


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Clint Byrum
Excerpts from Joshua Harlow's message of 2015-10-12 08:35:20 -0700:
> Thierry Carrez wrote:
> > Clint Byrum wrote:
> >> Excerpts from Joshua Harlow's message of 2015-10-10 17:43:40 -0700:
> >>> I'm curious is there any more detail about #1 below anywhere online?
> >>>
> >>> Does cassandra use some features of the JVM that the openJDK version
> >>> doesn't support? Something else?
> >> This about sums it up:
> >>
> >> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StartupChecks.java#L153-L155
> >>
> >>  // There is essentially no QA done on OpenJDK builds, and
> >>  // clusters running OpenJDK have seen many heap and load issues.
> >>  logger.warn("OpenJDK is not recommended. Please upgrade to the newest 
> >> Oracle Java release");
> >
> > Or:
> > https://twitter.com/mipsytipsy/status/596697501991702528
> >
> > This is one of the reasons I'm generally negative about Java solutions
> > (Cassandra or Zookeeper): the free software JVM is still not on par with
> > the non-free one, so we indirectly force our users to use a non-free
> > dependency. I've been there before often enough to hear "did you
> > reproduce that bug under the {Sun,Oracle} JVM" quite a few times.
> 
> I'd be happy to 'fight' for (and even fix) any issues found with 
> zookeeper + openjdk if needed; that twitter posting hopefully ended up 
> in a bug being filed at https://issues.apache.org/jira/browse/ZOOKEEPER/ 
> and hopefully things getting fixed...
> 
> >
> > When the Java solution is the only solution for a problem space that
> > might still be a good trade-off (compared to reinventing the wheel for
> > example), but to share state or distribute locks, there are some pretty
> > good other options out there that don't suffer from the same fundamental
> > problem...
> >
> 
> IMHO it's the only 'mature' solution so far; but of course maturity is a 
> relative thing (look at the project age, version number of zookeeper vs 
> etcd, consul for a general idea around this); in general I'd really like 
> the TC and the foundation to help make the right decision here, because 
> this kind of choice affects the long-term future (and health) of 
> openstack as a whole (or I believe it does).
> 

Zookeeper sits in a very different space from Cassandra. I have had good
success with it on OpenJDK as well.

That said, we need to maybe go through some feature/risk matrices and
compare to etcd and Consul (this might be good to do as part of filling
out the DLM spec). The JVM issues go away with both of those, but then
we get to deal with Go issues.

Also, ZK has one other advantage over those: It is already in Debian and
Ubuntu, making access for developers much easier.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Clint Byrum
Excerpts from Boris Pavlovic's message of 2015-10-11 01:14:08 -0700:
> Clint,
> 
> There are many PROs and CONs to both approaches.
> 
> Reinventing the wheel (in this case it's quite a simple task) gives more
> flexibility and doesn't require
> the use of ZK/Consul (which will simplify integration with the current
> system).
> 
> Using ZK/Consul for a POC may save a lot of time, and we would also be
> delegating part of the work
> to other communities (which may lead to better supported/working code).
> 
> By the way, some of the parts (like sync of schedulers) are stuck in review
> in the Nova project.
> 
> Basically for a POC we can use anything, and using ZK/Consul may reduce
> the resources needed for development,
> which is good.
> 

Awesome, I think we are aligned.

So, let's try and come up with a set of next steps to see a POC.

1) Let's try and get some numbers at the upper bounds of the current
scheduler with one and multiple schedulers. We can actually turn this
into a gate test harness, as we don't _actually_ care about the vms,
so this is an excellent use for the fake virt driver. In addition to
"where it breaks", I'd also like to see graphs of what it does to the
database and MQ bus. This aligns with the performance discussions that
will be happening as a sub-group of the large operators group, so I
think we can gather support for such an effort there. (A rough sketch of
such a timing harness follows below, after item 3.)

2) Let's resolve which backend thing to use in the DLM spec. I have a
strong desire to consider the needs of DLM and the needs of scheduling
together. If the DLM discussion is tied, or nearly tied, on a few
choices, but one of the choices is better for the scheduler, it may
help the discussion. It may also hurt if one is more desirable for DLM,
and one is more desirable for scheduling. My gut says that they'll all
be suitable for both of these tasks, and it will boil down to binary
access and operator preference.

3) POC goes to the first person with free time. It's been my experience
that people come free at somewhat unexpected intervals, and I don't
want anyone to wait too long for consensus. So if anyone who agrees
with this direction gets time, I say, write a spec, get it out there,
and experiment with code.
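Regarding 1), a rough sketch of such a timing harness (boot() stands in
for whatever call submits one instance request against a cloud whose
computes run the fake virt driver; nothing here is existing gate code):

    import time
    from concurrent.futures import ThreadPoolExecutor

    def measure_schedule_rate(boot, count=1000, concurrency=20):
        """Fire `count` boot requests and report how many are placed per
        second; run it against one scheduler, then several, and compare."""
        start = time.time()
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            list(pool.map(lambda i: boot('scale-test-%d' % i), range(count)))
        return count / (time.time() - start)

The interesting output is less the single number than how it moves as we
add schedulers and compute nodes, alongside the DB and MQ graphs.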

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Joshua Harlow

Clint Byrum wrote:

Excerpts from Boris Pavlovic's message of 2015-10-11 01:14:08 -0700:

Clint,

There are many PROs and CONs to both approaches.

Reinventing the wheel (in this case it's quite a simple task) gives more
flexibility and doesn't require
the use of ZK/Consul (which will simplify integration with the current
system).

Using ZK/Consul for a POC may save a lot of time, and we would also be
delegating part of the work
to other communities (which may lead to better supported/working code).

By the way, some of the parts (like sync of schedulers) are stuck in review
in the Nova project.

Basically for a POC we can use anything, and using ZK/Consul may reduce
the resources needed for development,
which is good.



Awesome, I think we are aligned.

So, let's try and come up with a set of next steps to see a POC.

1) Let's try and get some numbers at the upper bounds of the current
scheduler with one and multiple schedulers. We can actually turn this
into a gate test harness, as we don't _actually_ care about the vms,
so this is an excellent use for the fake virt driver. In addition to
"where it breaks", I'd also like to see graphs of what it does to the
database and MQ bus. This aligns with the performance discussions that
will be happening as a sub-group of the large operators group, so I
think we can gather support for such an effort there.


Just a related thought/question. It really seems we (as a community) 
need some kind of scale testing ground. Internally at yahoo we were/are 
going to use a 200 hypervisor cluster for some of this and then expand 
that into 200 * X by using nested virtualization and/or fake drivers and 
such. But this is a 'lab' that not everyone can have, and therefore 
isn't suited toward community work IMHO. Has there been any thought on 
such a 'lab' that is directly in the community, perhaps trystack.org can 
be this? (users get free VMs, but then we can tell them this area is a 
lab, so don't expect things to always work, free isn't free after all...)


With such a lab, there could be these kinds of experiments, graphs, 
tweaks and such...




2) Let's resolve which backend thing to use in the DLM spec. I have a
strong desire to consider the needs of DLM and the needs of scheduling
together. If the DLM discussion is tied, or nearly tied, on a few
choices, but one of the choices is better for the scheduler, it may
help the discussion. It may also hurt if one is more desirable for DLM,
and one is more desirable for scheduling. My gut says that they'll all
be suitable for both of these tasks, and it will boil down to binary
access and operator preference.

3) POC goes to the first person with free time. It's been my experience
that people come free at somewhat unexpected intervals, and I don't
want anyone to wait too long for consensus. So if anyone who agrees
with this direction gets time, I say, write a spec, get it out there,
and experiment with code.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Alec Hothan (ahothan)





On 10/10/15, 11:35 PM, "Clint Byrum"  wrote:

>Excerpts from Alec Hothan (ahothan)'s message of 2015-10-09 21:19:14 -0700:
>> 
>> On 10/9/15, 6:29 PM, "Clint Byrum"  wrote:
>> 
>> >Excerpts from Chris Friesen's message of 2015-10-09 17:33:38 -0700:
>> >> On 10/09/2015 03:36 PM, Ian Wells wrote:
>> >> > On 9 October 2015 at 12:50, Chris Friesen > >> > > wrote:
>> >> >
>> >> > Has anybody looked at why 1 instance is too slow and what it would 
>> >> > take to
>> >> >
>> >> > make 1 scheduler instance work fast enough? This does not 
>> >> > preclude the
>> >> > use of
>> >> > concurrency for finer grain tasks in the background.
>> >> >
>> >> >
>> >> > Currently we pull data on all (!) of the compute nodes out of the 
>> >> > database
>> >> > via a series of RPC calls, then evaluate the various filters in 
>> >> > python code.
>> >> >
>> >> >
>> >> > I'll say again: the database seems to me to be the problem here.  Not to
>> >> > mention, you've just explained that they are in practice holding all 
>> >> > the data in
>> >> > memory in order to do the work so the benefit we're getting here is 
>> >> > really a
>> >> > N-to-1-to-M pattern with a DB in the middle (the store-to-DB is rather
>> >> > secondary, in fact), and that without incremental updates to the 
>> >> > receivers.
>> >> 
>> >> I don't see any reason why you couldn't have an in-memory scheduler.
>> >> 
>> >> Currently the database serves as the persistent storage for the resource 
>> >> usage, 
>> >> so if we take it out of the picture I imagine you'd want to have some way 
>> >> of 
>> >> querying the compute nodes for their current state when the scheduler 
>> >> first 
>> >> starts up.
>> >> 
>> >> I think the current code uses the fact that objects are remotable via the 
>> >> conductor, so changing that to do explicit posts to a known scheduler 
>> >> topic 
>> >> would take some work.
>> >> 
>> >
>> >Funny enough, I think thats exactly what Josh's "just use Zookeeper"
>> >message is about. Except in memory, it is "in an observable storage
>> >location".
>> >
>> >Instead of having the scheduler do all of the compute node inspection
>> >and querying though, you have the nodes push their stats into something
>> >like Zookeeper or consul, and then have schedulers watch those stats
>> >for changes to keep their in-memory version of the data up to date. So
>> >when you bring a new one online, you don't have to query all the nodes,
>> >you just scrape the data store, which all of these stores (etcd, consul,
>> >ZK) are built to support atomically querying and watching at the same
>> >time, so you can have a reasonable expectation of correctness.
>> >
>> >Even if you figured out how to make the in-memory scheduler crazy fast,
>> >There's still value in concurrency for other reasons. No matter how
>> >fast you make the scheduler, you'll be slave to the response time of
>> >a single scheduling request. If you take 1ms to schedule each node
>> >(including just reading the request and pushing out your scheduling
>> >result!) you will never achieve greater than 1000/s. 1ms is way lower
>> >than it's going to take just to shove a tiny message into RabbitMQ or
>> >even 0mq.
>> 
>> That is not what I have seen, measurements that I did or done by others show 
>> between 5000 and 1 send *per sec* (depending on mirroring, up to 1KB msg 
>> size) using oslo messaging/kombu over rabbitMQ.
>
>You're quoting throughput of RabbitMQ, but how many threads were
>involved? An in-memory scheduler that was multi-threaded would need to
>implement synchronization at a fairly granular level to use the same
>in-memory store, and we're right back to the extreme need for efficient
>concurrency in the design, though with much better latency on the
>synchronization.

These were single-threaded tests and you're correct that if you had multiple 
threads trying to send something you'd have some inefficiency.
However I'd question the likelihood of that happening as it is very likely that 
most of the cpu time will be spent outside of oslo messaging code.

Furthermore, Python does not need multiple threads to go faster. As a matter of 
fact, for in-memory operations, it could end up being slower because of the 
inherent design of the interpreter (and there are many independent measurements 
that have shown it).


>
>> And this is unmodified/highly unoptimized oslo messaging code.
>> If you remove the oslo messaging layer, you get 25000 to 45000 msg/sec with 
>> kombu/rabbitMQ (which shows how inefficient is oslo messaging layer itself)
>> 
>> > So I'm pretty sure this is o-k for small clouds, but would be
>> >a disaster for a large, busy cloud.
>> 
>> It all depends on how many sched/sec for the "large busy cloud"...
>> 
>
>I think there are two interesting things to discern. Of course, the
>exact rate would be great to have as a target, 

Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Joshua Harlow

Alec Hothan (ahothan) wrote:





On 10/10/15, 11:35 PM, "Clint Byrum"  wrote:


Excerpts from Alec Hothan (ahothan)'s message of 2015-10-09 21:19:14 -0700:

On 10/9/15, 6:29 PM, "Clint Byrum"  wrote:


Excerpts from Chris Friesen's message of 2015-10-09 17:33:38 -0700:

On 10/09/2015 03:36 PM, Ian Wells wrote:

On 9 October 2015 at 12:50, Chris Friesen wrote:

 Has anybody looked at why 1 instance is too slow and what it would take to

 make 1 scheduler instance work fast enough? This does not preclude the
 use of
 concurrency for finer grain tasks in the background.


 Currently we pull data on all (!) of the compute nodes out of the database
 via a series of RPC calls, then evaluate the various filters in python 
code.


I'll say again: the database seems to me to be the problem here.  Not to
mention, you've just explained that they are in practice holding all the data in
memory in order to do the work so the benefit we're getting here is really a
N-to-1-to-M pattern with a DB in the middle (the store-to-DB is rather
secondary, in fact), and that without incremental updates to the receivers.

I don't see any reason why you couldn't have an in-memory scheduler.

Currently the database serves as the persistent storage for the resource usage,
so if we take it out of the picture I imagine you'd want to have some way of
querying the compute nodes for their current state when the scheduler first
starts up.

I think the current code uses the fact that objects are remotable via the
conductor, so changing that to do explicit posts to a known scheduler topic
would take some work.


Funny enough, I think thats exactly what Josh's "just use Zookeeper"
message is about. Except in memory, it is "in an observable storage
location".

Instead of having the scheduler do all of the compute node inspection
and querying though, you have the nodes push their stats into something
like Zookeeper or consul, and then have schedulers watch those stats
for changes to keep their in-memory version of the data up to date. So
when you bring a new one online, you don't have to query all the nodes,
you just scrape the data store, which all of these stores (etcd, consul,
ZK) are built to support atomically querying and watching at the same
time, so you can have a reasonable expectation of correctness.

Even if you figured out how to make the in-memory scheduler crazy fast,
There's still value in concurrency for other reasons. No matter how
fast you make the scheduler, you'll be slave to the response time of
a single scheduling request. If you take 1ms to schedule each node
(including just reading the request and pushing out your scheduling
result!) you will never achieve greater than 1000/s. 1ms is way lower
than it's going to take just to shove a tiny message into RabbitMQ or
even 0mq.

That is not what I have seen, measurements that I did or done by others show 
between 5000 and 1 send *per sec* (depending on mirroring, up to 1KB msg 
size) using oslo messaging/kombu over rabbitMQ.

You're quoting throughput of RabbitMQ, but how many threads were
involved? An in-memory scheduler that was multi-threaded would need to
implement synchronization at a fairly granular level to use the same
in-memory store, and we're right back to the extreme need for efficient
concurrency in the design, though with much better latency on the
synchronization.


These were single-threaded tests and you're correct that if you had multiple 
threads trying to send something you'd have some inefficiency.
However I'd question the likelihood of that happening as it is very likely that 
most of the cpu time will be spent outside of oslo messaging code.

Furthermore, Python does not need multiple threads to go faster. As a matter of 
fact, for in-memory operations, it could end up being slower because of the 
inherent design of the interpreter (and there are many independent measurements 
that have shown it).



And this is unmodified/highly unoptimized oslo messaging code.
If you remove the oslo messaging layer, you get 25000 to 45000 msg/sec with 
kombu/rabbitMQ (which shows how inefficient is oslo messaging layer itself)


So I'm pretty sure this is o-k for small clouds, but would be
a disaster for a large, busy cloud.

It all depends on how many sched/sec for the "large busy cloud"...


I think there are two interesting things to discern. Of course, the
exact rate would be great to have as a target, but operational security
and just plain secrecy of business models will probably prevent us from
getting at many of these requirements.


I don't think that is the case. We have no visibility because nobody has really 
thought about these numbers. Ops should be ok to provide some rough requirement 
numbers if asked (everybody is in the same boat).



The second is the complexity model of scaling. We can just think about
the 

Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Monty Taylor

On 10/12/2015 02:45 PM, Joshua Harlow wrote:

Alec Hothan (ahothan) wrote:





On 10/10/15, 11:35 PM, "Clint Byrum"  wrote:


Excerpts from Alec Hothan (ahothan)'s message of 2015-10-09 21:19:14
-0700:

On 10/9/15, 6:29 PM, "Clint Byrum"  wrote:


Excerpts from Chris Friesen's message of 2015-10-09 17:33:38 -0700:

On 10/09/2015 03:36 PM, Ian Wells wrote:

On 9 October 2015 at 12:50, Chris
Friesen wrote:

 Has anybody looked at why 1 instance is too slow and what it
would take to

 make 1 scheduler instance work fast enough? This does
not preclude the
 use of
 concurrency for finer grain tasks in the background.


 Currently we pull data on all (!) of the compute nodes out
of the database
 via a series of RPC calls, then evaluate the various filters
in python code.


I'll say again: the database seems to me to be the problem here.
Not to
mention, you've just explained that they are in practice holding
all the data in
memory in order to do the work so the benefit we're getting here
is really a
N-to-1-to-M pattern with a DB in the middle (the store-to-DB is
rather
secondary, in fact), and that without incremental updates to the
receivers.

I don't see any reason why you couldn't have an in-memory scheduler.

Currently the database serves as the persistent storage for the
resource usage,
so if we take it out of the picture I imagine you'd want to have
some way of
querying the compute nodes for their current state when the
scheduler first
starts up.

I think the current code uses the fact that objects are remotable
via the
conductor, so changing that to do explicit posts to a known
scheduler topic
would take some work.


Funny enough, I think thats exactly what Josh's "just use Zookeeper"
message is about. Except in memory, it is "in an observable storage
location".

Instead of having the scheduler do all of the compute node inspection
and querying though, you have the nodes push their stats into
something
like Zookeeper or consul, and then have schedulers watch those stats
for changes to keep their in-memory version of the data up to date. So
when you bring a new one online, you don't have to query all the
nodes,
you just scrape the data store, which all of these stores (etcd,
consul,
ZK) are built to support atomically querying and watching at the same
time, so you can have a reasonable expectation of correctness.

Even if you figured out how to make the in-memory scheduler crazy
fast,
There's still value in concurrency for other reasons. No matter how
fast you make the scheduler, you'll be slave to the response time of
a single scheduling request. If you take 1ms to schedule each node
(including just reading the request and pushing out your scheduling
result!) you will never achieve greater than 1000/s. 1ms is way lower
than it's going to take just to shove a tiny message into RabbitMQ or
even 0mq.

That is not what I have seen, measurements that I did or done by
others show between 5000 and 1 send *per sec* (depending on
mirroring, up to 1KB msg size) using oslo messaging/kombu over
rabbitMQ.

You're quoting throughput of RabbitMQ, but how many threads were
involved? An in-memory scheduler that was multi-threaded would need to
implement synchronization at a fairly granular level to use the same
in-memory store, and we're right back to the extreme need for efficient
concurrency in the design, though with much better latency on the
synchronization.


These were single-threaded tests and you're correct that if you had
multiple threads trying to send something you'd have some inefficiency.
However I'd question the likelihood of that happening as it is very
likely that most of the cpu time will be spent outside of oslo
messaging code.

Furthermore, Python does not need multiple threads to go faster. As a
matter of fact, for in-memory operations, it could end up being slower
because of the inherent design of the interpreter (and there are many
independent measurements that have shown it).



And this is unmodified/highly unoptimized oslo messaging code.
If you remove the oslo messaging layer, you get 25000 to 45000
msg/sec with kombu/rabbitMQ (which shows how inefficient is oslo
messaging layer itself)


So I'm pretty sure this is o-k for small clouds, but would be
a disaster for a large, busy cloud.

It all depends on how many sched/sec for the "large busy cloud"...


I think there are two interesting things to discern. Of course, the
exact rate would be great to have as a target, but operational security
and just plain secrecy of business models will probably prevent us from
getting at many of these requirements.


I don't think that is the case. We have no visibility because nobody
has really thought about these numbers. Ops should be ok to provide
some rough requirement numbers if asked (everybody is in the same boat).



The second is the complexity model of 

Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Alec Hothan (ahothan)





On 10/12/15, 11:45 AM, "Joshua Harlow"  wrote:

>Alec Hothan (ahothan) wrote:
>>
>>
>>
>>
>> On 10/10/15, 11:35 PM, "Clint Byrum"  wrote:
>>
>>> Excerpts from Alec Hothan (ahothan)'s message of 2015-10-09 21:19:14 -0700:
 On 10/9/15, 6:29 PM, "Clint Byrum"  wrote:

> Excerpts from Chris Friesen's message of 2015-10-09 17:33:38 -0700:
>> On 10/09/2015 03:36 PM, Ian Wells wrote:
>>> On 9 October 2015 at 12:50, Chris Friesen>> >  wrote:
>>>
>>>  Has anybody looked at why 1 instance is too slow and what it would 
>>> take to
>>>
>>>  make 1 scheduler instance work fast enough? This does not 
>>> preclude the
>>>  use of
>>>  concurrency for finer grain tasks in the background.
>>>
>>>
>>>  Currently we pull data on all (!) of the compute nodes out of the 
>>> database
>>>  via a series of RPC calls, then evaluate the various filters in 
>>> python code.
>>>
>>>
>>> I'll say again: the database seems to me to be the problem here.  Not to
>>> mention, you've just explained that they are in practice holding all 
>>> the data in
>>> memory in order to do the work so the benefit we're getting here is 
>>> really a
>>> N-to-1-to-M pattern with a DB in the middle (the store-to-DB is rather
>>> secondary, in fact), and that without incremental updates to the 
>>> receivers.
>> I don't see any reason why you couldn't have an in-memory scheduler.
>>
>> Currently the database serves as the persistent storage for the resource 
>> usage,
>> so if we take it out of the picture I imagine you'd want to have some 
>> way of
>> querying the compute nodes for their current state when the scheduler 
>> first
>> starts up.
>>
>> I think the current code uses the fact that objects are remotable via the
>> conductor, so changing that to do explicit posts to a known scheduler 
>> topic
>> would take some work.
>>
> Funny enough, I think thats exactly what Josh's "just use Zookeeper"
> message is about. Except in memory, it is "in an observable storage
> location".
>
> Instead of having the scheduler do all of the compute node inspection
> and querying though, you have the nodes push their stats into something
> like Zookeeper or consul, and then have schedulers watch those stats
> for changes to keep their in-memory version of the data up to date. So
> when you bring a new one online, you don't have to query all the nodes,
> you just scrape the data store, which all of these stores (etcd, consul,
> ZK) are built to support atomically querying and watching at the same
> time, so you can have a reasonable expectation of correctness.
>
> Even if you figured out how to make the in-memory scheduler crazy fast,
> There's still value in concurrency for other reasons. No matter how
> fast you make the scheduler, you'll be slave to the response time of
> a single scheduling request. If you take 1ms to schedule each node
> (including just reading the request and pushing out your scheduling
> result!) you will never achieve greater than 1000/s. 1ms is way lower
> than it's going to take just to shove a tiny message into RabbitMQ or
> even 0mq.
 That is not what I have seen, measurements that I did or done by others 
 show between 5000 and 1 send *per sec* (depending on mirroring, up to 
 1KB msg size) using oslo messaging/kombu over rabbitMQ.
>>> You're quoting throughput of RabbitMQ, but how many threads were
>>> involved? An in-memory scheduler that was multi-threaded would need to
>>> implement synchronization at a fairly granular level to use the same
>>> in-memory store, and we're right back to the extreme need for efficient
>>> concurrency in the design, though with much better latency on the
>>> synchronization.
>>
>> These were single-threaded tests and you're correct that if you had multiple 
>> threads trying to send something you'd have some inefficiency.
>> However I'd question the likelihood of that happening as it is very likely 
>> that most of the cpu time will be spent outside of oslo messaging code.
>>
>> Furthermore, Python does not need multiple threads to go faster. As a matter 
>> of fact, for in-memory operations, it could end up being slower because of 
>> the inherent design of the interpreter (and there are many independent 
>> measurements that have shown it).
>>
>>
 And this is unmodified/highly unoptimized oslo messaging code.
 If you remove the oslo messaging layer, you get 25000 to 45000 msg/sec 
 with kombu/rabbitMQ (which shows how inefficient is oslo messaging layer 
 itself)

> So I'm pretty sure this is o-k for small clouds, but would be
> a 

Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Joshua Harlow

Alec Hothan (ahothan) wrote:





On 10/12/15, 11:45 AM, "Joshua Harlow"  wrote:


Alec Hothan (ahothan) wrote:




On 10/10/15, 11:35 PM, "Clint Byrum"   wrote:


Excerpts from Alec Hothan (ahothan)'s message of 2015-10-09 21:19:14 -0700:

On 10/9/15, 6:29 PM, "Clint Byrum"   wrote:


Excerpts from Chris Friesen's message of 2015-10-09 17:33:38 -0700:

On 10/09/2015 03:36 PM, Ian Wells wrote:

On 9 October 2015 at 12:50, Chris Friesen wrote:

  Has anybody looked at why 1 instance is too slow and what it would take to

  make 1 scheduler instance work fast enough? This does not preclude the
  use of
  concurrency for finer grain tasks in the background.


  Currently we pull data on all (!) of the compute nodes out of the database
  via a series of RPC calls, then evaluate the various filters in python 
code.


I'll say again: the database seems to me to be the problem here.  Not to
mention, you've just explained that they are in practice holding all the data in
memory in order to do the work so the benefit we're getting here is really a
N-to-1-to-M pattern with a DB in the middle (the store-to-DB is rather
secondary, in fact), and that without incremental updates to the receivers.

I don't see any reason why you couldn't have an in-memory scheduler.

Currently the database serves as the persistent storage for the resource usage,
so if we take it out of the picture I imagine you'd want to have some way of
querying the compute nodes for their current state when the scheduler first
starts up.

I think the current code uses the fact that objects are remotable via the
conductor, so changing that to do explicit posts to a known scheduler topic
would take some work.


Funny enough, I think thats exactly what Josh's "just use Zookeeper"
message is about. Except in memory, it is "in an observable storage
location".

Instead of having the scheduler do all of the compute node inspection
and querying though, you have the nodes push their stats into something
like Zookeeper or consul, and then have schedulers watch those stats
for changes to keep their in-memory version of the data up to date. So
when you bring a new one online, you don't have to query all the nodes,
you just scrape the data store, which all of these stores (etcd, consul,
ZK) are built to support atomically querying and watching at the same
time, so you can have a reasonable expectation of correctness.

Even if you figured out how to make the in-memory scheduler crazy fast,
There's still value in concurrency for other reasons. No matter how
fast you make the scheduler, you'll be slave to the response time of
a single scheduling request. If you take 1ms to schedule each node
(including just reading the request and pushing out your scheduling
result!) you will never achieve greater than 1000/s. 1ms is way lower
than it's going to take just to shove a tiny message into RabbitMQ or
even 0mq.

That is not what I have seen, measurements that I did or done by others show 
between 5000 and 1 send *per sec* (depending on mirroring, up to 1KB msg 
size) using oslo messaging/kombu over rabbitMQ.

You're quoting throughput of RabbitMQ, but how many threads were
involved? An in-memory scheduler that was multi-threaded would need to
implement synchronization at a fairly granular level to use the same
in-memory store, and we're right back to the extreme need for efficient
concurrency in the design, though with much better latency on the
synchronization.

These were single-threaded tests and you're correct that if you had multiple 
threads trying to send something you'd have some inefficiency.
However I'd question the likelihood of that happening as it is very likely that 
most of the cpu time will be spent outside of oslo messaging code.

Furthermore, Python does not need multiple threads to go faster. As a matter of 
fact, for in-memory operations, it could end up being slower because of the 
inherent design of the interpreter (and there are many independent measurements 
that have shown it).



And this is unmodified/highly unoptimized oslo messaging code.
If you remove the oslo messaging layer, you get 25000 to 45000 msg/sec with 
kombu/rabbitMQ (which shows how inefficient is oslo messaging layer itself)


So I'm pretty sure this is o-k for small clouds, but would be
a disaster for a large, busy cloud.

It all depends on how many sched/sec for the "large busy cloud"...


I think there are two interesting things to discern. Of course, the
exact rate would be great to have as a target, but operational security
and just plain secrecy of business models will probably prevent us from
getting at many of these requirements.

I don't think that is the case. We have no visibility because nobody has really 
thought about these numbers. Ops should be ok to provide some rough requirement 
numbers if asked 

Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Joshua Harlow

Ian Wells wrote:

On 10 October 2015 at 23:47, Clint Byrum wrote:

>  Per before, my suggestion was that every scheduler tries to
maintain a copy
>  of the cloud's state in memory (in much the same way, per the previous
>  example, as every router on the internet tries to make a route
table out of
>  what it learns from BGP).  They don't have to be perfect.  They
don't have
>  to be in sync.  As long as there's some variability in the
decision making,
>  they don't have to update when another scheduler schedules
something (and
>  you can make the compute node send an immediate update when a new
VM is
>  run, anyway).  They all stand a good chance of scheduling VMs well
>  simultaneously.
>

I'm quite in favor of eventual consistency and retries. Even if we had
a system of perfect updating of all state records everywhere, it would
break sometimes and I'd still want to not trust any record of state as
being correct for the entire distributed system. However, there is an
efficiency win gained by staying _close_ to correct. It is actually a
function of the expected entropy. The more concurrent schedulers, the
more entropy there will be to deal with.


... and the fewer the servers in total, the larger the entropy as a
proportion of the whole system (if that's a thing, it's a long time
since I did physical chemistry).  But consider the use cases:

1. I have a small cloud, I run two schedulers for redundancy.  There's a
good possibility that, when the cloud is loaded, the schedulers make
poor decisions occasionally.  We'd have to consider how likely that was,
certainly.

2. I have a large cloud, and I run 20 schedulers for redundancy.
There's a good chance that a scheduler is out of date on its
information.  But there could be several hundred hosts willing to
satisfy a scheduling request, and even of the ones with incorrect
information a low chance that any of those are close to the threshold
where they won't run the VM in question, so good odds it will pick a
host that's happy to satisfy the request.


>  But to be fair, we're throwing made up numbers around at this
point.  Maybe
>  it's time to work out how to test this for scale in a harness -
which is
>  the bit of work we all really need to do this properly, or there's
no proof
>  we've actually helped - and leave people to code their ideas up?

I'm working on adding meters for rates and amounts of messages and
queries that the system does right now for performance purposes. Rally
though, is the place where I'd go to ask "how fast can we schedule
things
right now?".


My only concern is that we're testing a real cloud at scale and I
haven't got any more firstborn to sell for hardware, so I wonder if we
can fake up a compute node in our test harness.


Does the openstack foundation have access to a scaling area that can be 
used by the community for this kind of experimental work? It seems like 
infra or others should be able to make that possible? Maybe we could 
sacrifice a summit and instead of spending the money on that we (as a 
community) could spend the money on a really nice scale lab for the 
community ;)



--
Ian.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Ian Wells
On 11 October 2015 at 00:23, Clint Byrum  wrote:

> I'm in, except I think this gets simpler with an intermediary service
> like ZK/Consul to keep track of this 1GB of data and replace the need
> for 6, and changes the implementation of 5 to "updates its record and
> signals its presence".
>

OK, so we're not keeping a copy of the information in the schedulers,
saving us 5GB of information, but we are notifying the schedulers of the
updated information so that they can update their copies?

Also, the notification path here is that the compute host notifies ZK and
ZK notifies many schedulers, assuming they're all capable of handling all
queries.  That is in fact N * (M+1) messages, which is slightly more than
if there's no central node, as it happens.  There are fewer *channels*, but
more messages.  (I feel like I'm overlooking something here, but I can't
pick out the flaw...)  Yes, RMQ will suck at this - but then let's talk
about better messaging rather than another DB type.

Again, the saving here seems to be that a freshly started scheduler can get
an infodump rather than waiting 60s to be useful.  I wonder if that's
necessary.
-- 
Ian.
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Ian Wells
On 10 October 2015 at 23:47, Clint Byrum  wrote:

> > Per before, my suggestion was that every scheduler tries to maintain a
> copy
> > of the cloud's state in memory (in much the same way, per the previous
> > example, as every router on the internet tries to make a route table out
> of
> > what it learns from BGP).  They don't have to be perfect.  They don't
> have
> > to be in sync.  As long as there's some variability in the decision
> making,
> > they don't have to update when another scheduler schedules something (and
> > you can make the compute node send an immediate update when a new VM is
> > run, anyway).  They all stand a good chance of scheduling VMs well
> > simultaneously.
> >
>
> I'm quite in favor of eventual consistency and retries. Even if we had
> a system of perfect updating of all state records everywhere, it would
> break sometimes and I'd still want to not trust any record of state as
> being correct for the entire distributed system. However, there is an
> efficiency win gained by staying _close_ to correct. It is actually a
> function of the expected entropy. The more concurrent schedulers, the
> more entropy there will be to deal with.
>

... and the fewer the servers in total, the larger the entropy as a
proportion of the whole system (if that's a thing, it's a long time since I
did physical chemistry).  But consider the use cases:

1. I have a small cloud, I run two schedulers for redundancy.  There's a
good possibility that, when the cloud is loaded, the schedulers make poor
decisions occasionally.  We'd have to consider how likely that was,
certainly.

2. I have a large cloud, and I run 20 schedulers for redundancy.  There's a
good chance that a scheduler is out of date on its information.  But there
could be several hundred hosts willing to satisfy a scheduling request, and
even for the ones with incorrect information there's a low chance that any of
those are close to the threshold where they won't run the VM in question, so
good odds it will pick a host that's happy to satisfy the request.


> But to be fair, we're throwing made up numbers around at this point.
> Maybe
> > it's time to work out how to test this for scale in a harness - which is
> > the bit of work we all really need to do this properly, or there's no
> proof
> > we've actually helped - and leave people to code their ideas up?
>
> I'm working on adding meters for rates and amounts of messages and
> queries that the system does right now for performance purposes. Rally
> though, is the place where I'd go to ask "how fast can we schedule things
> right now?".
>

My only concern is that we're testing a real cloud at scale and I haven't
got any more firstborn to sell for hardware, so I wonder if we can fake up
a compute node in our test harness.
-- 
Ian.
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-12 Thread Clint Byrum
Excerpts from Ian Wells's message of 2015-10-12 19:43:48 -0700:
> On 11 October 2015 at 00:23, Clint Byrum  wrote:
> 
> > I'm in, except I think this gets simpler with an intermediary service
> > like ZK/Consul to keep track of this 1GB of data and replace the need
> > for 6, and changes the implementation of 5 to "updates its record and
> > signals its presence".
> >
> 
> OK, so we're not keeping a copy of the information in the schedulers,
> saving us 5GB of information, but we are notifying the schedulers of the
> updated information to that they can update their copies?
> 

We _would_ keep a local cache of the information in the schedulers. The
centralized copy of it is to free the schedulers from the complexity of
having to keep track of it as state, rather than as a cache. We also don't
have to provide a way for on-demand stat fetching to seed scheduler 0.

> Also, the notification path here is that the compute host notifies ZK and
> ZK notifies many schedulers, assuming they're all capable of handling all
> queries.  That is in fact N * (M+1) messages, which is slightly more than
> if there's no central node, as it happens.  There are fewer *channels*, but
> more messages.  (I feel like I'm overlooking something here, but I can't
> pick out the flaw...)  Yes, RMQ will suck at this - but then let's talk
> about better messaging rather than another DB type.
> 

You're calling transactions messages, and that's not really fair to
messaging or transactions. :)

If N==Number of Schedulers, then the transaction which records a change
in available resources for a compute node results in 1 transaction, and
N "watches" to the schedulers. However, it's important to note that in
this situation, compute nodes do not have to send anything anywhere if
nothing has changed, which is very likely the case for "full" compute
nodes, and certainly will save many many redundant messages. Forgive me
if nova already makes this optimization somehow, it didn't seem to when
I was tinkering a year ago.
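
To make that concrete, here is a minimal sketch of the compute-node side of
that optimization, assuming a kazoo/ZooKeeper backend and a made-up
/nova/compute_nodes layout (illustrative only, not existing Nova code): the
node serializes its stats and only touches its znode when the payload
actually changes, so idle or "full" nodes generate no traffic at all.

import json

from kazoo.client import KazooClient

client = KazooClient(hosts='127.0.0.1:2181')
client.start()
client.ensure_path('/nova/compute_nodes')

last_sent = None

def report(host_name, stats):
    # Called from the periodic resource audit, or right after a VM
    # starts or stops on this node.
    global last_sent
    payload = json.dumps(stats, sort_keys=True).encode('utf-8')
    if payload == last_sent:
        return  # nothing changed, so send nothing at all
    path = '/nova/compute_nodes/%s' % host_name
    if client.exists(path):
        client.set(path, payload)    # fires the schedulers' watches
    else:
        client.create(path, payload)
    last_sent = payload

report('compute-1', {'free_ram_mb': 30720, 'free_disk_gb': 400, 'vms': 3})
report('compute-1', {'free_ram_mb': 30720, 'free_disk_gb': 400, 'vms': 3})  # no-op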

> Again, the saving here seems to be that a freshly started scheduler can get
> an infodump rather than waiting 60s to be useful.  I wonder if that's
> necessary.

There is also the complexity of designing a scheduler which is fault
tolerant and scales economically. What we have now will overtax the
message bus and the database as the number of compute nodes increases.
We want to get O(1) complexity out of that, but we're getting O(N)
right now.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-11 Thread Clint Byrum
Excerpts from Alec Hothan (ahothan)'s message of 2015-10-09 21:19:14 -0700:
> 
> On 10/9/15, 6:29 PM, "Clint Byrum"  wrote:
> 
> >Excerpts from Chris Friesen's message of 2015-10-09 17:33:38 -0700:
> >> On 10/09/2015 03:36 PM, Ian Wells wrote:
> >> > On 9 October 2015 at 12:50, Chris Friesen wrote:
> >> >
> >> > Has anybody looked at why 1 instance is too slow and what it would 
> >> > take to
> >> >
> >> > make 1 scheduler instance work fast enough? This does not 
> >> > preclude the
> >> > use of
> >> > concurrency for finer grain tasks in the background.
> >> >
> >> >
> >> > Currently we pull data on all (!) of the compute nodes out of the 
> >> > database
> >> > via a series of RPC calls, then evaluate the various filters in 
> >> > python code.
> >> >
> >> >
> >> > I'll say again: the database seems to me to be the problem here.  Not to
> >> > mention, you've just explained that they are in practice holding all the 
> >> > data in
> >> > memory in order to do the work so the benefit we're getting here is 
> >> > really a
> >> > N-to-1-to-M pattern with a DB in the middle (the store-to-DB is rather
> >> > secondary, in fact), and that without incremental updates to the 
> >> > receivers.
> >> 
> >> I don't see any reason why you couldn't have an in-memory scheduler.
> >> 
> >> Currently the database serves as the persistant storage for the resource 
> >> usage, 
> >> so if we take it out of the picture I imagine you'd want to have some way 
> >> of 
> >> querying the compute nodes for their current state when the scheduler 
> >> first 
> >> starts up.
> >> 
> >> I think the current code uses the fact that objects are remotable via the 
> >> conductor, so changing that to do explicit posts to a known scheduler 
> >> topic 
> >> would take some work.
> >> 
> >
> >Funny enough, I think thats exactly what Josh's "just use Zookeeper"
> >message is about. Except in memory, it is "in an observable storage
> >location".
> >
> >Instead of having the scheduler do all of the compute node inspection
> >and querying though, you have the nodes push their stats into something
> >like Zookeeper or consul, and then have schedulers watch those stats
> >for changes to keep their in-memory version of the data up to date. So
> >when you bring a new one online, you don't have to query all the nodes,
> >you just scrape the data store, which all of these stores (etcd, consul,
> >ZK) are built to support atomically querying and watching at the same
> >time, so you can have a reasonable expectation of correctness.
> >
> >Even if you figured out how to make the in-memory scheduler crazy fast,
> >There's still value in concurrency for other reasons. No matter how
> >fast you make the scheduler, you'll be slave to the response time of
> >a single scheduling request. If you take 1ms to schedule each node
> >(including just reading the request and pushing out your scheduling
> >result!) you will never achieve greater than 1000/s. 1ms is way lower
> >than it's going to take just to shove a tiny message into RabbitMQ or
> >even 0mq.
> 
> That is not what I have seen, measurements that I did or done by others show 
> between 5000 and 1 send *per sec* (depending on mirroring, up to 1KB msg 
> size) using oslo messaging/kombu over rabbitMQ.

You're quoting throughput of RabbitMQ, but how many threads were
involved? An in-memory scheduler that was multi-threaded would need to
implement synchronization at a fairly granular level to use the same
in-memory store, and we're right back to the extreme need for efficient
concurrency in the design, though with much better latency on the
synchronization.
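
As an illustration of what that granular synchronization could look like:
one lock per host in the shared in-memory store, so concurrent scheduler
threads only contend when they race for the same host (names and numbers are
made up, this is not proposed Nova code).

import threading

hosts = {
    'host-1': {'free_ram_mb': 4096, 'lock': threading.Lock()},
    'host-2': {'free_ram_mb': 8192, 'lock': threading.Lock()},
}

def try_claim(host_name, ram_mb):
    host = hosts[host_name]
    with host['lock']:                       # only this host is locked
        if host['free_ram_mb'] >= ram_mb:
            host['free_ram_mb'] -= ram_mb    # claim the resources
            return True
        return False                         # caller retries another host

# Two schedulers racing for *different* hosts never block each other.
print(try_claim('host-1', 2048))   # True
print(try_claim('host-1', 4096))   # False, only 2048 MB left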

> And this is unmodified/highly unoptimized oslo messaging code.
> If you remove the oslo messaging layer, you get 25000 to 45000 msg/sec with 
> kombu/rabbitMQ (which shows how inefficient is oslo messaging layer itself)
> 
> > So I'm pretty sure this is o-k for small clouds, but would be
> >a disaster for a large, busy cloud.
> 
> It all depends on how many sched/sec for the "large busy cloud"...
> 

I think there are two interesting things to discern. Of course, the
exact rate would be great to have as a target, but operational security
and just plain secrecy of business models will probably prevent us from
getting at many of these requirements.

The second is the complexity model of scaling. We can just think about
the actual cost benefit of running 1, 3, and more schedulers and come up
with some rough numbers for a lower bound for scheduler performance
that would make sense.

> >
> >If, however, you can have 20 schedulers that all take 10ms on average,
> >and have the occasional lock contention for a resource counter resulting
> >in 100ms, now you're at 2000/s minus the lock contention rate. This
> >strategy would scale better with the number of compute nodes, since
> >more nodes means more distinct locks, so you can scale out the number
> >of running servers separate from the number of scheduling requests.

Re: [openstack-dev] Scheduler proposal

2015-10-11 Thread Boris Pavlovic
2Everybody,

Just curious why we need such complexity.


Let's take a look from the other side:
1) Information about all hosts (even in the case of 100k hosts) will be less
than 1 GB
2) Usually servers that run the scheduler service have at least 64GB RAM and
more on the board
3) math.log(100000) < 12  (binary search per rule)
4) We have less than 20 rules for scheduling
5) Information about hosts is updated every 60 seconds (no updates means the
host is dead)


According to this information:
1) We can store everything in the RAM of a single server
2) We can use Python
3) Information about hosts is temporary data and shouldn't be stored in
persistent storage


Simplest architecture to cover this:
1) Single RPC service that has two methods: find_host(rules),
update_host(host, data)
2) Store information about hosts in a dict (host_name->data)
3) Create a binary tree for each rule and update it on each host update
4) Make an algorithm that will use the binary trees to find a host based on
the rules
5) Each service like compute node, volume node, or neutron will send
updates about the hosts that they manage (cross service scheduling)
6) Make an algorithm that will sync host stats in memory between different
schedulers
7) ...
8) PROFIT!

It's:
1) Simple to manage
2) Simple to understand
3) Simple to calc scalability limits
4) Simple to integrate in current OpenStack architecture
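
To make steps 1-4 above concrete, here is a rough sketch of such an in-memory
scheduler in Python, using a sorted list per rule and the bisect module in
place of an explicit binary tree (all names are illustrative, and the
per-rule index is naively rebuilt on every update, which a real
implementation would avoid):

import bisect


class InMemoryScheduler(object):
    def __init__(self, rules):
        self.rules = rules    # e.g. ['free_ram_mb', 'free_disk_gb']
        self.hosts = {}       # host_name -> stats dict
        self.indexes = dict((r, []) for r in rules)  # rule -> sorted [(value, host)]

    def update_host(self, host_name, data):
        self.hosts[host_name] = data
        for rule in self.rules:
            index = [(stats[rule], name) for name, stats in self.hosts.items()]
            index.sort()
            self.indexes[rule] = index

    def find_host(self, requirements):
        # requirements: rule -> minimum value, e.g. {'free_ram_mb': 2048}
        candidates = None
        for rule, minimum in requirements.items():
            index = self.indexes[rule]
            # binary search for the first host meeting the minimum
            start = bisect.bisect_left(index, (minimum, ''))
            matching = set(name for _, name in index[start:])
            candidates = matching if candidates is None else candidates & matching
        return sorted(candidates)[0] if candidates else None


sched = InMemoryScheduler(['free_ram_mb', 'free_disk_gb'])
sched.update_host('host-1', {'free_ram_mb': 1024, 'free_disk_gb': 80})
sched.update_host('host-2', {'free_ram_mb': 8192, 'free_disk_gb': 200})
print(sched.find_host({'free_ram_mb': 2048, 'free_disk_gb': 100}))  # host-2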


As a future bonus, we can implement scheduler-per-AZ functionality, so each
scheduler will store information
only about its AZ, and separate AZs can have their own rabbit servers, for
example, which will allow us to get
horizontal scalability in terms of AZs.


So do we really need Cassandra, Mongo, ... and other web-scale solutions for
such a simple task?


Best regards,
Boris Pavlovic

On Sat, Oct 10, 2015 at 11:19 PM, Clint Byrum  wrote:

> Excerpts from Chris Friesen's message of 2015-10-09 23:16:43 -0700:
> > On 10/09/2015 07:29 PM, Clint Byrum wrote:
> >
> > > Even if you figured out how to make the in-memory scheduler crazy fast,
> > > There's still value in concurrency for other reasons. No matter how
> > > fast you make the scheduler, you'll be slave to the response time of
> > > a single scheduling request. If you take 1ms to schedule each node
> > > (including just reading the request and pushing out your scheduling
> > > result!) you will never achieve greater than 1000/s. 1ms is way lower
> > > than it's going to take just to shove a tiny message into RabbitMQ or
> > > even 0mq. So I'm pretty sure this is o-k for small clouds, but would be
> > > a disaster for a large, busy cloud.
> > >
> > > If, however, you can have 20 schedulers that all take 10ms on average,
> > > and have the occasional lock contention for a resource counter
> resulting
> > > in 100ms, now you're at 2000/s minus the lock contention rate. This
> > > strategy would scale better with the number of compute nodes, since
> > > more nodes means more distinct locks, so you can scale out the number
> > > of running servers separate from the number of scheduling requests.
> >
> > As far as I can see, moving to an in-memory scheduler is essentially
> orthogonal
> > to allowing multiple schedulers to run concurrently.  We can do both.
> >
>
> Agreed, and I want to make sure we continue to be able to run concurrent
> schedulers.
>
> Going in memory won't reduce contention for the same resources. So it
> will definitely schedule faster, but it may also serialize with concurrent
> schedulers sooner, and thus turn into a situation where scaling out more
> nodes means the same, or even less throughput.
>
> Keep in mind, I actually think we give our users _WAY_ too much power
> over our clouds, and I actually think we should simply have flavor based
> scheduling and let compute nodes grab node reservation requests directly
> out of flavor based queues based on their own current observation of
> their ability to service it.
>
> But I understand that there are quite a few clouds now that have been
> given shiny dynamic scheduling tools and now we have to engineer for
> those.
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-11 Thread Geoff O'Callaghan
On 11/10/2015 6:25 PM, "Clint Byrum"  wrote:
>
> Excerpts from Boris Pavlovic's message of 2015-10-11 00:02:39 -0700:
> > 2Everybody,
> >
> > Just curios why we need such complexity.
> >
> >
> > Let's take a look from other side:
> > 1) Information about all hosts (even in case of 100k hosts) will be less
> > then 1 GB
> > 2) Usually servers that runs scheduler service have at least 64GB RAM
and
> > more on the board
> > 3) math.log(10) < 12  (binary search per rule)
> > 4) We have less then 20 rules for scheduling
> > 5) Information about hosts is updated every 60 seconds (no updates host
is
> > dead)

[Snip]

>
> I'm in, except I think this gets simpler with an intermediary service
> like ZK/Consul to keep track of this 1GB of data and replace the need
> for 6, and changes the implementation of 5 to "updates its record and
> signals its presence".

I have to agree, something like ZK looks like it'd make things simpler, and
they're in general well-proven technology (esp. ZK).
They handle the centralized coordination well and all the hard resiliency
is thrown in.

Geoff
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-11 Thread Boris Pavlovic
Clint,

There are many PROS and CONS in both approaches.

Reinventing the wheel (in this case it's quite a simple task) gives more
flexibility and doesn't require
the usage of ZK/Consul (which will simplify integrating it with the current
system).

Using ZK/Consul for a POC may save a lot of time, and as well we are
delegating part of the work
to other communities (which may lead to better supported/working code).

By the way, some of the parts (like sync of schedulers) are stuck in review in
the Nova project.

Basically for a POC we can use anything, and using ZK/Consul may reduce the
resources needed for development,
which is good.

Best regards,
Boris Pavlovic

On Sun, Oct 11, 2015 at 12:23 AM, Clint Byrum  wrote:

> Excerpts from Boris Pavlovic's message of 2015-10-11 00:02:39 -0700:
> > 2Everybody,
> >
> > Just curios why we need such complexity.
> >
> >
> > Let's take a look from other side:
> > 1) Information about all hosts (even in case of 100k hosts) will be less
> > then 1 GB
> > 2) Usually servers that runs scheduler service have at least 64GB RAM and
> > more on the board
> > 3) math.log(10) < 12  (binary search per rule)
> > 4) We have less then 20 rules for scheduling
> > 5) Information about hosts is updated every 60 seconds (no updates host
> is
> > dead)
> >
> >
> > According to this information:
> > 1) We can store everything in RAM of single server
> > 2) We can use Python
> > 3) Information about hosts is temporary data and shouldn't be stored in
> > persistence storage
> >
> >
> > Simplest architecture to cover this:
> > 1) Single RPC service that has two methods: find_host(rules),
> > update_host(host, data)
> > 2) Store information about hosts  like a dict (host_name->data)
> > 3) Create for each rule binary tree and update it on each host update
> > 4) Make a algorithm that will use binary trees to find host based on
> rules
> > 5) Each service like compute node, volume node, or neutron will send
> > updates about host
> >that they managed (cross service scheduling)
> > 6) Make a algorithm that will sync host stats in memory between different
> > schedulers
>
> I'm in, except I think this gets simpler with an intermediary service
> like ZK/Consul to keep track of this 1GB of data and replace the need
> for 6, and changes the implementation of 5 to "updates its record and
> signals its presence".
>
> What you've described is where I'd like to experiment, but I don't want
> to reinvent ZK or Consul or etcd when they already exist and do such a
> splendid job keeping observers informed of small changes in small data
> sets. You still end up with the same in-memory performance, and this is
> in line with some published white papers from Google around their use
> of Chubby, which is their ZK/Consul.
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-11 Thread Clint Byrum
Excerpts from Ian Wells's message of 2015-10-09 19:14:17 -0700:
> On 9 October 2015 at 18:29, Clint Byrum  wrote:
> 
> > Instead of having the scheduler do all of the compute node inspection
> > and querying though, you have the nodes push their stats into something
> > like Zookeeper or consul, and then have schedulers watch those stats
> > for changes to keep their in-memory version of the data up to date. So
> > when you bring a new one online, you don't have to query all the nodes,
> > you just scrape the data store, which all of these stores (etcd, consul,
> > ZK) are built to support atomically querying and watching at the same
> > time, so you can have a reasonable expectation of correctness.
> >
> 
> We have to be careful about our definition of 'correctness' here.  In
> practice, the data is never going to be perfect because compute hosts
> update periodically and the information is therefore always dated.  With
> ZK, it's going to be strictly consistent with regard to the updates from
> the compute hosts, but again that doesn't really matter too much because
> the scheduler is going to have to make a best effort job with a mixed bag
> of information anyway.
> 

I was actually thinking nodes would update ZK _when they are changed
themselves_. As in, the scheduler would reduce the available resources
upon allocating them, and the nodes would only update them after
reclaiming those resources or when they start fresh.

> In fact, putting ZK in the middle basically means that your compute hosts
> now synchronously update a majority of nodes in a minimum 3 node quorum -
> not the fastest form of update - and then the quorum will see to notifying
> the schedulers.  In practice this is just a store-and-fanout again. Once
> more it's not clear to me whether the store serves much use, and as for the
> fanout, I wonder if we'll need >>3 schedulers running so that this is
> reducing communication overhead.
> 

This is indeed store and fanout. Except unlike mysql+rabbitMQ, we're
using a service optimized for store and fanout. :)

All of the DLM-ish primitive things we've talked about can handle a
ton of churn in what turns out to be very small amounts of data. The
difference here is that instead of a scheduler querying for the data,
it has already received it because it was watching for changes. And
if some of it hasn't changed, there's no query, and there's no fanout,
and the local cache is just used.

So yes, if we did things the same as now, this would be terrible. But we
wouldn't. We'd let ZK or Consul do this for us, because they are better
than anything we can build to do this.

> Even if you figured out how to make the in-memory scheduler crazy fast,
> > There's still value in concurrency for other reasons. No matter how
> > fast you make the scheduler, you'll be slave to the response time of
> > a single scheduling request. If you take 1ms to schedule each node
> > (including just reading the request and pushing out your scheduling
> > result!) you will never achieve greater than 1000/s. 1ms is way lower
> > than it's going to take just to shove a tiny message into RabbitMQ or
> > even 0mq. So I'm pretty sure this is o-k for small clouds, but would be
> > a disaster for a large, busy cloud.
> >
> 
> Per before, my suggestion was that every scheduler tries to maintain a copy
> of the cloud's state in memory (in much the same way, per the previous
> example, as every router on the internet tries to make a route table out of
> what it learns from BGP).  They don't have to be perfect.  They don't have
> to be in sync.  As long as there's some variability in the decision making,
> they don't have to update when another scheduler schedules something (and
> you can make the compute node send an immediate update when a new VM is
> run, anyway).  They all stand a good chance of scheduling VMs well
> simultaneously.
> 

I'm quite in favor of eventual consistency and retries. Even if we had
a system of perfect updating of all state records everywhere, it would
break sometimes and I'd still want to not trust any record of state as
being correct for the entire distributed system. However, there is an
efficiency win gained by staying _close_ to correct. It is actually a
function of the expected entropy. The more concurrent schedulers, the
more entropy there will be to deal with.

> If, however, you can have 20 schedulers that all take 10ms on average,
> > and have the occasional lock contention for a resource counter resulting
> > in 100ms, now you're at 2000/s minus the lock contention rate. This
> > strategy would scale better with the number of compute nodes, since
> > more nodes means more distinct locks, so you can scale out the number
> > of running servers separate from the number of scheduling requests.
> >
> 
> If you have 20 schedulers that take 1ms on average, and there's absolutely
> no lock contention, then you're at 20,000/s.  (Unfair, granted, since what
> I'm suggesting is more likely to make rejected scheduling decisions, but
> they could be rare.)

Re: [openstack-dev] Scheduler proposal

2015-10-11 Thread Clint Byrum
Excerpts from Joshua Harlow's message of 2015-10-10 17:43:40 -0700:
> I'm curious is there any more detail about #1 below anywhere online?
> 
> Does cassandra use some features of the JVM that the openJDK version 
> doesn't support? Something else?
> 

This about sums it up:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StartupChecks.java#L153-L155

// There is essentially no QA done on OpenJDK builds, and
// clusters running OpenJDK have seen many heap and load issues.
logger.warn("OpenJDK is not recommended. Please upgrade to the newest 
Oracle Java release");

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-11 Thread Clint Byrum
Excerpts from Boris Pavlovic's message of 2015-10-11 00:02:39 -0700:
> 2Everybody,
> 
> Just curios why we need such complexity.
> 
> 
> Let's take a look from other side:
> 1) Information about all hosts (even in case of 100k hosts) will be less
> then 1 GB
> 2) Usually servers that runs scheduler service have at least 64GB RAM and
> more on the board
> 3) math.log(10) < 12  (binary search per rule)
> 4) We have less then 20 rules for scheduling
> 5) Information about hosts is updated every 60 seconds (no updates host is
> dead)
> 
> 
> According to this information:
> 1) We can store everything in RAM of single server
> 2) We can use Python
> 3) Information about hosts is temporary data and shouldn't be stored in
> persistence storage
> 
> 
> Simplest architecture to cover this:
> 1) Single RPC service that has two methods: find_host(rules),
> update_host(host, data)
> 2) Store information about hosts  like a dict (host_name->data)
> 3) Create for each rule binary tree and update it on each host update
> 4) Make a algorithm that will use binary trees to find host based on rules
> 5) Each service like compute node, volume node, or neutron will send
> updates about host
>that they managed (cross service scheduling)
> 6) Make a algorithm that will sync host stats in memory between different
> schedulers

I'm in, except I think this gets simpler with an intermediary service
like ZK/Consul to keep track of this 1GB of data and replace the need
for 6, and changes the implementation of 5 to "updates its record and
signals its presence".

What you've described is where I'd like to experiment, but I don't want
to reinvent ZK or Consul or etcd when they already exist and do such a
splendid job keeping observers informed of small changes in small data
sets. You still end up with the same in-memory performance, and this is
in line with some published white papers from Google around their use
of Chubby, which is their ZK/Consul.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-11 Thread Clint Byrum
Excerpts from Chris Friesen's message of 2015-10-09 23:16:43 -0700:
> On 10/09/2015 07:29 PM, Clint Byrum wrote:
> 
> > Even if you figured out how to make the in-memory scheduler crazy fast,
> > There's still value in concurrency for other reasons. No matter how
> > fast you make the scheduler, you'll be slave to the response time of
> > a single scheduling request. If you take 1ms to schedule each node
> > (including just reading the request and pushing out your scheduling
> > result!) you will never achieve greater than 1000/s. 1ms is way lower
> > than it's going to take just to shove a tiny message into RabbitMQ or
> > even 0mq. So I'm pretty sure this is o-k for small clouds, but would be
> > a disaster for a large, busy cloud.
> >
> > If, however, you can have 20 schedulers that all take 10ms on average,
> > and have the occasional lock contention for a resource counter resulting
> > in 100ms, now you're at 2000/s minus the lock contention rate. This
> > strategy would scale better with the number of compute nodes, since
> > more nodes means more distinct locks, so you can scale out the number
> > of running servers separate from the number of scheduling requests.
> 
> As far as I can see, moving to an in-memory scheduler is essentially 
> orthogonal 
> to allowing multiple schedulers to run concurrently.  We can do both.
> 

Agreed, and I want to make sure we continue to be able to run concurrent
schedulers.

Going in memory won't reduce contention for the same resources. So it
will definitely schedule faster, but it may also serialize with concurrent
schedulers sooner, and thus turn into a situation where scaling out more
nodes means the same, or even less throughput.

Keep in mind, I actually think we give our users _WAY_ too much power
over our clouds, and I actually think we should simply have flavor based
scheduling and let compute nodes grab node reservation requests directly
out of flavor based queues based on their own current observation of
their ability to service it.

But I understand that there are quite a few clouds now that have been
given shiny dynamic scheduling tools and now we have to engineer for
those.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-11 Thread Adam Lawson
I have a quick question: how is Amazon doing this? When choosing a next
path forward that reliably scales, it would be interesting to know how this is
already being done.
On Oct 9, 2015 10:12 AM, "Zane Bitter"  wrote:

> On 08/10/15 21:32, Ian Wells wrote:
>
>>
>> > 2. if many hosts suit the 5 VMs then this is *very* unlucky,because
>> we should be choosing a host at random from the set of
>> suitable hosts and that's a huge coincidence - so this is a tiny
>> corner case that we shouldn't be designing around
>>
>> Here is where we differ in our understanding. With the current
>> system of filters and weighers, 5 schedulers getting requests for
>> identical VMs and having identical information are *expected* to
>> select the same host. It is not a tiny corner case; it is the most
>> likely result for the current system design. By catching this
>> situation early (in the scheduling process) we can avoid multiple
>> RPC round-trips to handle the fail/retry mechanism.
>>
>>
>> And so maybe this would be a different fix - choose, at random, one of
>> the hosts above a weighting threshold, not choose the top host every
>> time? Technically, any host passing the filter is adequate to the task
>> from the perspective of an API user (and they can't prove if they got
>> the highest weighting or not), so if we assume weighting an operator
>> preference, and just weaken it slightly, we'd have a few more options.
>>
>
> The optimal way to do this would be a weighted random selection, where the
> probability of any given host being selected is proportional to its
> weighting. (Obviously this is limited by the accuracy of the weighting
> function in expressing your actual preferences - and it's at least
> conceivable that this could vary with the number of schedulers running.)
>
> In fact, the choice of the name 'weighting' would normally imply that it's
> done this way; hearing that the 'weighting' is actually used as a 'score'
> with the highest one always winning is quite surprising.
>
> cheers,
> Zane.
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
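
For reference, the weighted random selection Zane describes above fits in a
few lines of Python; this is only an illustrative sketch, not existing
scheduler code:

import random

def pick_host(weighted_hosts):
    # weighted_hosts: list of (host_name, weight), weight >= 0; each host's
    # chance of winning is proportional to its weight, rather than the
    # highest score always winning.
    total = sum(weight for _, weight in weighted_hosts)
    point = random.uniform(0, total)
    for host, weight in weighted_hosts:
        point -= weight
        if point <= 0:
            return host
    return weighted_hosts[-1][0]   # guard against floating point drift

# host-a wins roughly 60% of the time, host-b 30%, host-c 10%
print(pick_host([('host-a', 6.0), ('host-b', 3.0), ('host-c', 1.0)]))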
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-11 Thread Davanum Srinivas
Thanks Clint!

On Sat, Oct 10, 2015 at 11:53 PM, Clint Byrum  wrote:

> Excerpts from Joshua Harlow's message of 2015-10-10 17:43:40 -0700:
> > I'm curious is there any more detail about #1 below anywhere online?
> >
> > Does cassandra use some features of the JVM that the openJDK version
> > doesn't support? Something else?
> >
>
> This about sums it up:
>
>
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StartupChecks.java#L153-L155
>
> // There is essentially no QA done on OpenJDK builds, and
> // clusters running OpenJDK have seen many heap and load issues.
> logger.warn("OpenJDK is not recommended. Please upgrade to the newest
> Oracle Java release");
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



-- 
Davanum Srinivas :: https://twitter.com/dims
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-11 Thread Joshua Harlow

Clint Byrum wrote:

Excerpts from Boris Pavlovic's message of 2015-10-11 00:02:39 -0700:

2Everybody,

Just curios why we need such complexity.


Let's take a look from other side:
1) Information about all hosts (even in case of 100k hosts) will be less
then 1 GB
2) Usually servers that runs scheduler service have at least 64GB RAM and
more on the board
3) math.log(10)<  12  (binary search per rule)
4) We have less then 20 rules for scheduling
5) Information about hosts is updated every 60 seconds (no updates host is
dead)


According to this information:
1) We can store everything in RAM of single server
2) We can use Python
3) Information about hosts is temporary data and shouldn't be stored in
persistence storage


Simplest architecture to cover this:
1) Single RPC service that has two methods: find_host(rules),
update_host(host, data)
2) Store information about hosts  like a dict (host_name->data)
3) Create for each rule binary tree and update it on each host update
4) Make a algorithm that will use binary trees to find host based on rules
5) Each service like compute node, volume node, or neutron will send
updates about host
that they managed (cross service scheduling)
6) Make a algorithm that will sync host stats in memory between different
schedulers


I'm in, except I think this gets simpler with an intermediary service
like ZK/Consul to keep track of this 1GB of data and replace the need
for 6, and changes the implementation of 5 to "updates its record and
signals its presence".

What you've described is where I'd like to experiment, but I don't want
to reinvent ZK or Consul or etcd when they already exist and do such a
splendid job keeping observers informed of small changes in small data
sets. You still end up with the same in-memory performance, and this is
in line with some published white papers from Google around their use
of Chubby, which is their ZK/Consul.



+1 let's not recreate this; the code @ paste.openstack.org/show/475941/ 
basically does 1-6 within about ~100 lines. It doesn't optimize 
things into a binary tree, but that's easily doable... for all I care put 
the information received into N trees (perhaps even using 
http://docs.openstack.org/developer/taskflow/types.html#module-taskflow.types.tree) 
and do searches across those as desired (and this is where u can get 
into considering something like numpy to help).



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-11 Thread Amrith Kumar
Dims,

Not that I know of; I believe that Cassandra works fine with OpenJDK. See [1] 
and [2].

From time to time, there have been questions about the supported JDK for 
Cassandra; a recent one (just one that I happen to remember) tries to make the 
case that you must use the Sun/Oracle JDK. This is not a requirement by any 
means. See [3].

To the best of my knowledge, OpenJDK is sufficient.

-amrith

[1] http://wiki.apache.org/cassandra/GettingStarted
[2] http://docs.datastax.com/en/cassandra/2.2/cassandra/install/installDeb.html
[3] 
http://stackoverflow.com/questions/21487354/does-latest-cassandra-support-openjdk


From: Davanum Srinivas [mailto:dava...@gmail.com]
Sent: Saturday, October 10, 2015 8:54 PM
To: OpenStack Development Mailing List (not for usage questions) 
<openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] Scheduler proposal

Not implying cassandra is the right option. Just curious about the assertion.

-- Dims

On Sat, Oct 10, 2015 at 5:53 PM, Davanum Srinivas <dava...@gmail.com> wrote:
Thomas,

i am curious as well. AFAIK, cassandra works well with OpenJDK. Can you please 
elaborate what you concerns are for #1?

Thanks,
Dims

On Sat, Oct 10, 2015 at 5:43 PM, Joshua Harlow <harlo...@fastmail.com> wrote:
I'm curious is there any more detail about #1 below anywhere online?

Does cassandra use some features of the JVM that the openJDK version doesn't 
support? Something else?

-Josh

Thomas Goirand wrote:
On 10/07/2015 07:36 PM, Ed Leafe wrote:
Several months ago I proposed an experiment [0] to see if switching
the data model for the Nova scheduler to use Cassandra as the backend
would be a significant improvement as opposed to the current design

This is probably right. I don't know, I'm not an expert in Nova, or its
scheduler. However, to make it possible for us (ie: downstream
distributions and/or OpenStack users) to use Cassandra, you have to
solve one of the below issues:

1/ Cassandra developers upstream should start caring about OpenJDK, and
make sure that it is also a good platform for it. They should stop
caring only about the Oracle JVM.

... or ...

2/ Oracle should make its JVM free software.

As there is no hope for any of the above, Cassandra is a no-go for
downstream distributions.

So, by all means, propose a new back-end, implement it, profit. But that
back-end cannot be Cassandra the way it is now.

Cheers,

Thomas Goirand (zigo)


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: 
openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: 
openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



--
Davanum Srinivas :: https://twitter.com/dims



--
Davanum Srinivas :: https://twitter.com/dims
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-10 Thread Joshua Harlow

I'm curious is there any more detail about #1 below anywhere online?

Does cassandra use some features of the JVM that the openJDK version 
doesn't support? Something else?


-Josh

Thomas Goirand wrote:

On 10/07/2015 07:36 PM, Ed Leafe wrote:

Several months ago I proposed an experiment [0] to see if switching
the data model for the Nova scheduler to use Cassandra as the backend
would be a significant improvement as opposed to the current design


This is probably right. I don't know, I'm not an expert in Nova, or its
scheduler. However, to make it possible for us (ie: downstream
distributions and/or OpenStack users) to use Cassandra, you have to
solve one of the below issues:

1/ Cassandra developers upstream should start caring about OpenJDK, and
make sure that it is also a good platform for it. They should stop
caring only about the Oracle JVM.

... or ...

2/ Oracle should make its JVM free software.

As there is no hope for any of the above, Cassandra is a no-go for
downstream distributions.

So, by all means, propose a new back-end, implement it, profit. But that
back-end cannot be Cassandra the way it is now.

Cheers,

Thomas Goirand (zigo)


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-10 Thread Thomas Goirand
On 10/07/2015 07:36 PM, Ed Leafe wrote:
> Several months ago I proposed an experiment [0] to see if switching
> the data model for the Nova scheduler to use Cassandra as the backend
> would be a significant improvement as opposed to the current design

This is probably right. I don't know, I'm not an expert in Nova, or its
scheduler. However, to make it possible for us (ie: downstream
distributions and/or OpenStack users) to use Cassandra, you have to
solve one of the below issues:

1/ Cassandra developers upstream should start caring about OpenJDK, and
make sure that it is also a good platform for it. They should stop
caring only about the Oracle JVM.

... or ...

2/ Oracle should make its JVM free software.

As there is no hope for any of the above, Cassandra is a no-go for
downstream distributions.

So, by all means, propose a new back-end, implement it, profit. But that
back-end cannot be Cassandra the way it is now.

Cheers,

Thomas Goirand (zigo)


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-10 Thread Davanum Srinivas
Thomas,

I am curious as well. AFAIK, cassandra works well with OpenJDK. Can you
please elaborate what your concerns are for #1?

Thanks,
Dims

On Sat, Oct 10, 2015 at 5:43 PM, Joshua Harlow 
wrote:

> I'm curious is there any more detail about #1 below anywhere online?
>
> Does cassandra use some features of the JVM that the openJDK version
> doesn't support? Something else?
>
> -Josh
>
> Thomas Goirand wrote:
>
>> On 10/07/2015 07:36 PM, Ed Leafe wrote:
>>
>>> Several months ago I proposed an experiment [0] to see if switching
>>> the data model for the Nova scheduler to use Cassandra as the backend
>>> would be a significant improvement as opposed to the current design
>>>
>>
>> This is probably right. I don't know, I'm not an expert in Nova, or its
>> scheduler. However, to make it possible for us (ie: downstream
>> distributions and/or OpenStack users) to use Cassandra, you have to
>> solve one of the below issues:
>>
>> 1/ Cassandra developers upstream should start caring about OpenJDK, and
>> make sure that it is also a good platform for it. They should stop
>> caring only about the Oracle JVM.
>>
>> ... or ...
>>
>> 2/ Oracle should make its JVM free software.
>>
>> As there is no hope for any of the above, Cassandra is a no-go for
>> downstream distributions.
>>
>> So, by all means, propose a new back-end, implement it, profit. But that
>> back-end cannot be Cassandra the way it is now.
>>
>> Cheers,
>>
>> Thomas Goirand (zigo)
>>
>>
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe:
>> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



-- 
Davanum Srinivas :: https://twitter.com/dims
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-10 Thread Davanum Srinivas
Not implying cassandra is the right option. Just curious about the
assertion.

-- Dims

On Sat, Oct 10, 2015 at 5:53 PM, Davanum Srinivas  wrote:

> Thomas,
>
> i am curious as well. AFAIK, cassandra works well with OpenJDK. Can you
> please elaborate what you concerns are for #1?
>
> Thanks,
> Dims
>
> On Sat, Oct 10, 2015 at 5:43 PM, Joshua Harlow 
> wrote:
>
>> I'm curious is there any more detail about #1 below anywhere online?
>>
>> Does cassandra use some features of the JVM that the openJDK version
>> doesn't support? Something else?
>>
>> -Josh
>>
>> Thomas Goirand wrote:
>>
>>> On 10/07/2015 07:36 PM, Ed Leafe wrote:
>>>
 Several months ago I proposed an experiment [0] to see if switching
 the data model for the Nova scheduler to use Cassandra as the backend
 would be a significant improvement as opposed to the current design

>>>
>>> This is probably right. I don't know, I'm not an expert in Nova, or its
>>> scheduler. However, to make it possible for us (ie: downstream
>>> distributions and/or OpenStack users) to use Cassandra, you have to
>>> solve one of the below issues:
>>>
>>> 1/ Cassandra developers upstream should start caring about OpenJDK, and
>>> make sure that it is also a good platform for it. They should stop
>>> caring only about the Oracle JVM.
>>>
>>> ... or ...
>>>
>>> 2/ Oracle should make its JVM free software.
>>>
>>> As there is no hope for any of the above, Cassandra is a no-go for
>>> downstream distributions.
>>>
>>> So, by all means, propose a new back-end, implement it, profit. But that
>>> back-end cannot be Cassandra the way it is now.
>>>
>>> Cheers,
>>>
>>> Thomas Goirand (zigo)
>>>
>>>
>>>
>>> __
>>> OpenStack Development Mailing List (not for usage questions)
>>> Unsubscribe:
>>> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>
>>
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe:
>> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
>
>
> --
> Davanum Srinivas :: https://twitter.com/dims
>



-- 
Davanum Srinivas :: https://twitter.com/dims
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-10 Thread Chris Friesen

On 10/09/2015 07:29 PM, Clint Byrum wrote:


Even if you figured out how to make the in-memory scheduler crazy fast,
There's still value in concurrency for other reasons. No matter how
fast you make the scheduler, you'll be slave to the response time of
a single scheduling request. If you take 1ms to schedule each node
(including just reading the request and pushing out your scheduling
result!) you will never achieve greater than 1000/s. 1ms is way lower
than it's going to take just to shove a tiny message into RabbitMQ or
even 0mq. So I'm pretty sure this is o-k for small clouds, but would be
a disaster for a large, busy cloud.

If, however, you can have 20 schedulers that all take 10ms on average,
and have the occasional lock contention for a resource counter resulting
in 100ms, now you're at 2000/s minus the lock contention rate. This
strategy would scale better with the number of compute nodes, since
more nodes means more distinct locks, so you can scale out the number
of running servers separate from the number of scheduling requests.


As far as I can see, moving to an in-memory scheduler is essentially orthogonal 
to allowing multiple schedulers to run concurrently.  We can do both.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Chris Friesen

On 10/09/2015 03:36 PM, Ian Wells wrote:

On 9 October 2015 at 12:50, Chris Friesen wrote:

Has anybody looked at why 1 instance is too slow and what it would take to
make 1 scheduler instance work fast enough? This does not preclude the use of
concurrency for finer grain tasks in the background.


Currently we pull data on all (!) of the compute nodes out of the database
via a series of RPC calls, then evaluate the various filters in python code.


I'll say again: the database seems to me to be the problem here.  Not to
mention, you've just explained that they are in practice holding all the data in
memory in order to do the work so the benefit we're getting here is really a
N-to-1-to-M pattern with a DB in the middle (the store-to-DB is rather
secondary, in fact), and that without incremental updates to the receivers.


I don't see any reason why you couldn't have an in-memory scheduler.

Currently the database serves as the persistent storage for the resource usage, 
so if we take it out of the picture I imagine you'd want to have some way of 
querying the compute nodes for their current state when the scheduler first 
starts up.


I think the current code uses the fact that objects are remotable via the 
conductor, so changing that to do explicit posts to a known scheduler topic 
would take some work.


Chris


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Ian Wells
On 9 October 2015 at 18:29, Clint Byrum  wrote:

> Instead of having the scheduler do all of the compute node inspection
> and querying though, you have the nodes push their stats into something
> like Zookeeper or consul, and then have schedulers watch those stats
> for changes to keep their in-memory version of the data up to date. So
> when you bring a new one online, you don't have to query all the nodes,
> you just scrape the data store, which all of these stores (etcd, consul,
> ZK) are built to support atomically querying and watching at the same
> time, so you can have a reasonable expectation of correctness.
>

We have to be careful about our definition of 'correctness' here.  In
practice, the data is never going to be perfect because compute hosts
update periodically and the information is therefore always dated.  With
ZK, it's going to be strictly consistent with regard to the updates from
the compute hosts, but again that doesn't really matter too much because
the scheduler is going to have to make a best effort job with a mixed bag
of information anyway.

In fact, putting ZK in the middle basically means that your compute hosts
now synchronously update a majority of nodes in a minimum 3 node quorum -
not the fastest form of update - and then the quorum will see to notifying
the schedulers.  In practice this is just a store-and-fanout again. Once
more it's not clear to me whether the store serves much use, and as for the
fanout, I wonder if we'll need >>3 schedulers running so that this is
reducing communication overhead.

Even if you figured out how to make the in-memory scheduler crazy fast,
> There's still value in concurrency for other reasons. No matter how
> fast you make the scheduler, you'll be slave to the response time of
> a single scheduling request. If you take 1ms to schedule each node
> (including just reading the request and pushing out your scheduling
> result!) you will never achieve greater than 1000/s. 1ms is way lower
> than it's going to take just to shove a tiny message into RabbitMQ or
> even 0mq. So I'm pretty sure this is o-k for small clouds, but would be
> a disaster for a large, busy cloud.
>

Per before, my suggestion was that every scheduler tries to maintain a copy
of the cloud's state in memory (in much the same way, per the previous
example, as every router on the internet tries to make a route table out of
what it learns from BGP).  They don't have to be perfect.  They don't have
to be in sync.  As long as there's some variability in the decision making,
they don't have to update when another scheduler schedules something (and
you can make the compute node send an immediate update when a new VM is
run, anyway).  They all stand a good chance of scheduling VMs well
simultaneously.

If, however, you can have 20 schedulers that all take 10ms on average,
> and have the occasional lock contention for a resource counter resulting
> in 100ms, now you're at 2000/s minus the lock contention rate. This
> strategy would scale better with the number of compute nodes, since
> more nodes means more distinct locks, so you can scale out the number
> of running servers separate from the number of scheduling requests.
>

If you have 20 schedulers that take 1ms on average, and there's absolutely
no lock contention, then you're at 20,000/s.  (Unfair, granted, since what
I'm suggesting is more likely to make rejected scheduling decisions, but
they could be rare.)

But to be fair, we're throwing made up numbers around at this point.  Maybe
it's time to work out how to test this for scale in a harness - which is
the bit of work we all really need to do this properly, or there's no proof
we've actually helped - and leave people to code their ideas up?
-- 
Ian.
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Joshua Harlow

Gregory Haynes wrote:

Excerpts from Joshua Harlow's message of 2015-10-08 15:24:18 +:

On this point, and just thinking out loud. If we consider saving
compute_node information into say a node in said DLM backend (for
example a znode in zookeeper[1]); this information would be updated
periodically by that compute_node *itself* (it would say contain
information about what VMs are running on it, what their utilization is
and so-on).

For example the following layout could be used:

/nova/compute_nodes/

  data could be:

{
 vms: [],
 memory_free: XYZ,
 cpu_usage: ABC,
 memory_used: MNO,
 ...
}

Now if we imagine each/all schedulers having watches
on /nova/compute_nodes/ ([2] consul and etc.d have equivalent concepts
afaik) then when a compute_node updates that information a push
notification (the watch being triggered) will be sent to the
scheduler(s) and the scheduler(s) could then update a local in-memory
cache of the data about all the hypervisors that can be selected from
for scheduling. This avoids any reading of a large set of data in the
first place (besides an initial read-once on startup to read the
initial list + setup the watches); in a way its similar to push
notifications. Then when scheduling a VM ->  hypervisor there isn't any
need to query anything but the local in-memory representation that the
scheduler is maintaining (and updating as watches are triggered)...

So this is why I was wondering about what capabilities of cassandra are
being used here; because the above I think are unique capababilties of
DLM like systems (zookeeper, consul, etcd) that could be advantageous
here...

[1]
https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#sc_zkDataModel_znodes

[2]
https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkWatches


I wonder if we would even need to make something so specialized to get
this kind of local caching. I dont know what the current ZK tools are
but the original Chubby paper described that clients always have a
write-through cache for nodes which they set up subscriptions for in
order to break the cache.


Perhaps not; make it as simple as we want as long as people agree that 
the concept is useful. My idea is it would look something like:


(simplified obviously):

http://paste.openstack.org/show/475938/

Then resources (in this example compute_nodes) would register themselves 
via a call like:


>>> from kazoo import client
>>> import json
>>> c = client.KazooClient()
>>> c.start()
>>> n = "/node/compute_nodes"
>>> c.ensure_path(n)
>>> c.create("%s/h1.hypervisor.yahoo.com" % n, json.dumps({}))

^^^ the dictionary above would be whatever data to then put into the 
receivers caches...


Then, in the pasted program (running in a different shell/computer/...),
the cache would get updated, and a user of that cache can use it to find
resources to schedule things to.
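
For reference, a rough sketch of what that watcher side could look like (this
is illustrative only, not the actual pasted program; the cache structure and
the /node/compute_nodes path are just assumptions carried over from the
registration snippet above):

# Sketch of a scheduler-side watcher keeping a local in-memory cache.
import json

from kazoo import client

cache = {}  # hypervisor name -> latest reported data

c = client.KazooClient()
c.start()
path = "/node/compute_nodes"
c.ensure_path(path)


def watch_data(name):
    node_path = "%s/%s" % (path, name)

    def on_data(data, stat):
        # Re-fires every time the compute node rewrites its znode data.
        cache[name] = json.loads(data) if data else {}

    c.DataWatch(node_path, on_data)


def on_children(children):
    # Fired when compute nodes register or disappear; kazoo re-arms it.
    for name in children:
        if name not in cache:
            cache[name] = {}
            watch_data(name)
    for name in list(cache):
        if name not in children:
            del cache[name]


c.ChildrenWatch(path, on_children)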


The example should work; just get zookeeper set up:

http://packages.ubuntu.com/precise/zookeeperd should do all of that, and 
then try it out...




Also, re: etcd - The last time I checked their subscription API was
woefully inadequate for performing this type of thing without herding
issues.


Any idea on the consul watch capabilities?

Similar API(s) appear to exist (but I don't know how they work, if they 
do at all); https://www.consul.io/docs/agent/watches.html




__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Joshua Harlow

And one last reply with more code:

http://paste.openstack.org/show/475941/ (a creator of services that 
dynamically creates services, and destroys them after a set amount of 
time is included in here, along with the prior resource watcher).


Works locally; it should work for you as well.

Output from example run of 'creator process'

http://paste.openstack.org/show/475942/

Output from example run of 'watcher process'

http://paste.openstack.org/show/475943/

Enjoy!

-josh

Joshua Harlow wrote:

Further example stuff,

Get kazoo installed (http://kazoo.readthedocs.org/)

Output from my local run (with no data)

$ python test.py
Kazoo client has changed to state: CONNECTED
Got data: '' for new resource /node/compute_nodes/h1.hypervisor.yahoo.com
Idling (ran for 0.00s).
Known resources:
- h1.hypervisor.yahoo.com => {}
Idling (ran for 1.00s).
Known resources:
- h1.hypervisor.yahoo.com => {}
Idling (ran for 2.00s).
Known resources:
- h1.hypervisor.yahoo.com => {}
Idling (ran for 3.00s).
Known resources:
- h1.hypervisor.yahoo.com => {}
Idling (ran for 4.00s).
Known resources:
- h1.hypervisor.yahoo.com => {}
Idling (ran for 5.00s).
Kazoo client has changed to state: LOST
Traceback (most recent call last):
File "test.py", line 72, in 
time.sleep(1.0)
KeyboardInterrupt

Joshua Harlow wrote:

Gregory Haynes wrote:

Excerpts from Joshua Harlow's message of 2015-10-08 15:24:18 +:

On this point, and just thinking out loud. If we consider saving
compute_node information into say a node in said DLM backend (for
example a znode in zookeeper[1]); this information would be updated
periodically by that compute_node *itself* (it would, say, contain
information about what VMs are running on it, what their utilization is,
and so on).

For example the following layout could be used:

/nova/compute_nodes/

 data could be:

{
vms: [],
memory_free: XYZ,
cpu_usage: ABC,
memory_used: MNO,
...
}

Now if we imagine each/all schedulers having watches
on /nova/compute_nodes/ ([2] consul and etcd have equivalent concepts
afaik) then when a compute_node updates that information a push
notification (the watch being triggered) will be sent to the
scheduler(s) and the scheduler(s) could then update a local in-memory
cache of the data about all the hypervisors that can be selected from
for scheduling. This avoids any reading of a large set of data in the
first place (besides an initial read-once on startup to read the
initial list + setup the watches); in a way it's similar to push
notifications. Then when scheduling a VM -> hypervisor there isn't any
need to query anything but the local in-memory representation that the
scheduler is maintaining (and updating as watches are triggered)...

So this is why I was wondering about what capabilities of cassandra are
being used here; because the above I think are unique capabilities of
DLM like systems (zookeeper, consul, etcd) that could be advantageous
here...

[1]
https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#sc_zkDataModel_znodes



[2]
https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkWatches




I wonder if we would even need to make something so specialized to get
this kind of local caching. I don't know what the current ZK tools are
but the original Chubby paper described that clients always have a
write-through cache for nodes which they set up subscriptions for in
order to break the cache.


Perhaps not; make it as simple as we want, as long as people agree that
the concept is useful. My idea is it would look something like:

(simplified obviously):

http://paste.openstack.org/show/475938/

Then resources (in this example compute_nodes) would register themselves
via a call like:

>>> from kazoo import client
>>> import json
>>> c = client.KazooClient()
>>> c.start()
>>> n = "/node/compute_nodes"
>>> c.ensure_path(n)
>>> c.create("%s/h1.hypervisor.yahoo.com" % n, json.dumps({}))

^^^ the dictionary above would be whatever data should then be put into the
receivers' caches...

Then, in the pasted program (running in a different shell/computer/...),
the cache would get updated, and a user of that cache can use it to find
resources to schedule things to.

The example should work; just get zookeeper set up:

http://packages.ubuntu.com/precise/zookeeperd should do all of that, and
then try it out...



Also, re: etcd - The last time I checked their subscription API was
woefully inadequate for performing this type of thing without herding
issues.


Any idea on the consul watch capabilities?

Similar API(s) appear to exist (but I don't know how they work, if they
do at all); https://www.consul.io/docs/agent/watches.html



__


OpenStack Development Mailing List (not for usage questions)
Unsubscribe:
openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Neil Jerram
FWIW - and somewhat ironically given what you said just before - I couldn't 
parse your last sentence below... You might like to follow up with a corrected 
version.

(On the broad point, BTW, I really agree with you. So much OpenStack discussion 
is rendered difficult to get into by use of wrong or imprecise language.)

Regards,
 Neil


  Original Message
From: Clint Byrum
Sent: Friday, 9 October 2015 19:08
To: openstack-dev
Reply To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] Scheduler proposal


Excerpts from Chris Friesen's message of 2015-10-09 10:54:36 -0700:
> On 10/09/2015 11:09 AM, Zane Bitter wrote:
>
> > The optimal way to do this would be a weighted random selection, where the
> > probability of any given host being selected is proportional to its 
> > weighting.
> > (Obviously this is limited by the accuracy of the weighting function in
> > expressing your actual preferences - and it's at least conceivable that this
> > could vary with the number of schedulers running.)
> >
> > In fact, the choice of the name 'weighting' would normally imply that it's 
> > done
> > this way; hearing that the 'weighting' is actually used as a 'score' with 
> > the
> > highest one always winning is quite surprising.
>
> If you've only got one scheduler, there's no need to get fancy, you just pick
> the "best" host based on your weighing function.
>
> It's only when you've got parallel schedulers that things get tricky.
>

Note that I think you mean _concurrent_ not _parallel_ schedulers.

Parallel schedulers would be trying to solve the same unit of work by
breaking it up into smaller components and doing them at the same time.

Concurrent means they're just doing different things at the same time.

I know this is nit-picky, but we use the wrong word _A LOT_ and the
problem space is actually vastly different, as parallelizable problems
have a whole set of optimizations and advantages that generic concurrent
problems (especially those involving mutating state!) have a whole set
of race conditions that must be managed.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Joshua Harlow

Further example stuff,

Get kazoo installed (http://kazoo.readthedocs.org/)

Output from my local run (with no data)

$ python test.py
Kazoo client has changed to state: CONNECTED
Got data: '' for new resource /node/compute_nodes/h1.hypervisor.yahoo.com
Idling (ran for 0.00s).
Known resources:
 - h1.hypervisor.yahoo.com => {}
Idling (ran for 1.00s).
Known resources:
 - h1.hypervisor.yahoo.com => {}
Idling (ran for 2.00s).
Known resources:
 - h1.hypervisor.yahoo.com => {}
Idling (ran for 3.00s).
Known resources:
 - h1.hypervisor.yahoo.com => {}
Idling (ran for 4.00s).
Known resources:
 - h1.hypervisor.yahoo.com => {}
Idling (ran for 5.00s).
Kazoo client has changed to state: LOST
Traceback (most recent call last):
  File "test.py", line 72, in 
time.sleep(1.0)
KeyboardInterrupt

Joshua Harlow wrote:

Gregory Haynes wrote:

Excerpts from Joshua Harlow's message of 2015-10-08 15:24:18 +:

On this point, and just thinking out loud. If we consider saving
compute_node information into say a node in said DLM backend (for
example a znode in zookeeper[1]); this information would be updated
periodically by that compute_node *itself* (it would, say, contain
information about what VMs are running on it, what their utilization is,
and so on).

For example the following layout could be used:

/nova/compute_nodes/

 data could be:

{
vms: [],
memory_free: XYZ,
cpu_usage: ABC,
memory_used: MNO,
...
}

Now if we imagine each/all schedulers having watches
on /nova/compute_nodes/ ([2] consul and etcd have equivalent concepts
afaik) then when a compute_node updates that information a push
notification (the watch being triggered) will be sent to the
scheduler(s) and the scheduler(s) could then update a local in-memory
cache of the data about all the hypervisors that can be selected from
for scheduling. This avoids any reading of a large set of data in the
first place (besides an initial read-once on startup to read the
initial list + setup the watches); in a way it's similar to push
notifications. Then when scheduling a VM -> hypervisor there isn't any
need to query anything but the local in-memory representation that the
scheduler is maintaining (and updating as watches are triggered)...

So this is why I was wondering about what capabilities of cassandra are
being used here; because the above I think are unique capabilities of
DLM like systems (zookeeper, consul, etcd) that could be advantageous
here...

[1]
https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#sc_zkDataModel_znodes


[2]
https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkWatches



I wonder if we would even need to make something so specialized to get
this kind of local caching. I don't know what the current ZK tools are
but the original Chubby paper described that clients always have a
write-through cache for nodes which they set up subscriptions for in
order to break the cache.


Perhaps not; make it as simple as we want, as long as people agree that
the concept is useful. My idea is it would look something like:

(simplified obviously):

http://paste.openstack.org/show/475938/

Then resources (in this example compute_nodes) would register themselves
via a call like:

 >>> from kazoo import client
 >>> import json
 >>> c = client.KazooClient()
 >>> c.start()
 >>> n = "/node/compute_nodes"
 >>> c.ensure_path(n)
 >>> c.create("%s/h1.hypervisor.yahoo.com" % n, json.dumps({}))

^^^ the dictionary above would be whatever data should then be put into the
receivers' caches...

Then, in the pasted program (running in a different shell/computer/...),
the cache would get updated, and a user of that cache can use it to find
resources to schedule things to.

The example should work; just get zookeeper set up:

http://packages.ubuntu.com/precise/zookeeperd should do all of that, and
then try it out...



Also, re: etcd - The last time I checked their subscription API was
woefully inadequate for performing this type of thing without herding
issues.


Any idea on the consul watch capabilities?

Similar API(s) appear to exist (but I don't know how they work, if they
do at all); https://www.consul.io/docs/agent/watches.html



__

OpenStack Development Mailing List (not for usage questions)
Unsubscribe:
openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Gregory Haynes
Excerpts from Chris Friesen's message of 2015-10-09 19:36:03 +:
> On 10/09/2015 12:55 PM, Gregory Haynes wrote:
> 
> > There is a more generalized version of this algorithm for concurrent
> > scheduling I've seen a few times - Pick N options at random, apply
> > heuristic over that N to pick the best, attempt to schedule at your
> > choice, retry on failure. As long as you have a fast heuristic and your
> > N is sufficiently smaller than the total number of options then the
> > retries are rare-ish and cheap. It also can scale out extremely well.
> 
> If you're looking for a resource that is relatively rare (say you want a 
> particular hardware accelerator, or a very large number of CPUs, or even to 
> be 
> scheduled "near" to a specific other instance) then you may have to retry 
> quite 
> a lot.
> 
> Chris
> 

Yep. You can either be fast or correct. There is no solution which will
both scale easily and allow you to schedule to a very precise node
efficiently or this would be a solved problem.

There is a not too bad middle ground here though - you can definitely do
some filtering beforehand efficiently (especially if you have some kind
of local cache similar to what Josh mentioned with ZK) and then this is
less of an issue. This is definitely a big step in complexity though...

Cheers,
Greg

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Ian Wells
On 9 October 2015 at 12:50, Chris Friesen wrote:

> Has anybody looked at why 1 instance is too slow and what it would take to
>
>> make 1 scheduler instance work fast enough? This does not preclude the
>> use of
>> concurrency for finer grain tasks in the background.
>>
>
> Currently we pull data on all (!) of the compute nodes out of the database
> via a series of RPC calls, then evaluate the various filters in python code.
>

I'll say again: the database seems to me to be the problem here.  Not to
mention, you've just explained that they are in practice holding all the
data in memory in order to do the work so the benefit we're getting here is
really an N-to-1-to-M pattern with a DB in the middle (the store-to-DB is
rather secondary, in fact), and that without incremental updates to the
receivers.

> I suspect it'd be a lot quicker if each filter was a DB query.
>

That's certainly one solution, but again, unless you can tell me *why* this
information will not all fit in memory per process (when it does right
now), I'm still not clear why a database is required at all, let alone a
central one.  Even if it doesn't fit, then a local DB might be reasonable
compared to a centralised one.  The schedulers don't need to work off of
precisely the same state, they just need to make different choices to each
other, which doesn't require a that's-mine-hands-off approach; and they
aren't going to have a perfect view of the state of a distributed system
anyway, so retries are inevitable.

On a different topic, on the weighted choice: it's not 'optimal', given
this is a packing problem, so there isn't a perfect solution.  In fact,
given we're trying to balance the choice of a preferable host with the
chance that multiple schedulers make different choices, it's likely worse
than even weighting.  (Technically I suspect we'd want to rethink whether
the weighting mechanism is actually getting us a benefit.)
-- 
Ian.
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Alec Hothan (ahothan)





On 10/9/15, 6:29 PM, "Clint Byrum"  wrote:

>Excerpts from Chris Friesen's message of 2015-10-09 17:33:38 -0700:
>> On 10/09/2015 03:36 PM, Ian Wells wrote:
>> > On 9 October 2015 at 12:50, Chris Friesen wrote:
>> >
>> > Has anybody looked at why 1 instance is too slow and what it would 
>> > take to
>> >
>> > make 1 scheduler instance work fast enough? This does not preclude 
>> > the
>> > use of
>> > concurrency for finer grain tasks in the background.
>> >
>> >
>> > Currently we pull data on all (!) of the compute nodes out of the 
>> > database
>> > via a series of RPC calls, then evaluate the various filters in python 
>> > code.
>> >
>> >
>> > I'll say again: the database seems to me to be the problem here.  Not to
>> > mention, you've just explained that they are in practice holding all the 
>> > data in
>> > memory in order to do the work so the benefit we're getting here is really 
>> > a
>> > N-to-1-to-M pattern with a DB in the middle (the store-to-DB is rather
>> > secondary, in fact), and that without incremental updates to the receivers.
>> 
>> I don't see any reason why you couldn't have an in-memory scheduler.
>> 
>> Currently the database serves as the persistant storage for the resource 
>> usage, 
>> so if we take it out of the picture I imagine you'd want to have some way of 
>> querying the compute nodes for their current state when the scheduler first 
>> starts up.
>> 
>> I think the current code uses the fact that objects are remotable via the 
>> conductor, so changing that to do explicit posts to a known scheduler topic 
>> would take some work.
>> 
>
>Funny enough, I think thats exactly what Josh's "just use Zookeeper"
>message is about. Except in memory, it is "in an observable storage
>location".
>
>Instead of having the scheduler do all of the compute node inspection
>and querying though, you have the nodes push their stats into something
>like Zookeeper or consul, and then have schedulers watch those stats
>for changes to keep their in-memory version of the data up to date. So
>when you bring a new one online, you don't have to query all the nodes,
>you just scrape the data store, which all of these stores (etcd, consul,
>ZK) are built to support atomically querying and watching at the same
>time, so you can have a reasonable expectation of correctness.
>
>Even if you figured out how to make the in-memory scheduler crazy fast,
>There's still value in concurrency for other reasons. No matter how
>fast you make the scheduler, you'll be slave to the response time of
>a single scheduling request. If you take 1ms to schedule each node
>(including just reading the request and pushing out your scheduling
>result!) you will never achieve greater than 1000/s. 1ms is way lower
>than it's going to take just to shove a tiny message into RabbitMQ or
>even 0mq.

That is not what I have seen; measurements that I did or that were done by others show
between 5000 and 1 send *per sec* (depending on mirroring, up to 1KB msg 
size) using oslo messaging/kombu over rabbitMQ.
And this is unmodified/highly unoptimized oslo messaging code.
If you remove the oslo messaging layer, you get 25000 to 45000 msg/sec with 
kombu/rabbitMQ (which shows how inefficient the oslo messaging layer itself is)


> So I'm pretty sure this is o-k for small clouds, but would be
>a disaster for a large, busy cloud.

It all depends on how many sched/sec for the "large busy cloud"...

>
>If, however, you can have 20 schedulers that all take 10ms on average,
>and have the occasional lock contention for a resource counter resulting
>in 100ms, now you're at 2000/s minus the lock contention rate. This
>strategy would scale better with the number of compute nodes, since
>more nodes means more distinct locks, so you can scale out the number
>of running servers separate from the number of scheduling requests.

How many compute nodes are we talking about, max? How many schedulings per
second is the requirement? And where are we today with the latest nova scheduler?
My point is that without these numbers we could end up under-shooting,
over-shooting, or over-engineering, along with the cost of maintaining that extra
complexity over the lifetime of openstack.

I'll just make up some numbers for the sake of this discussion:

nova scheduler latest can do only 100 sched/sec for 1 instance (I guess the 
10ms average you bring out may not be that unrealistic)
the requirement is a sustained 500 sched/sec worst case with 10K nodes (that is 
5% of 10K and today we can barely launch 100VM/sec sustained)

Are we going to achieve 5x with just 3 instances, which is what most people
deploy? Not likely.
Is using more elaborate distributed infra/DLM like consul/zk/etcd going to
get us to that 500 mark with 3 instances? Maybe, but it will be at the expense
of the added complexity of the overall solution.
Can we instead optimize 

Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Gregory Haynes
Excerpts from Joshua Harlow's message of 2015-10-08 15:24:18 +:
> On this point, and just thinking out loud. If we consider saving
> compute_node information into say a node in said DLM backend (for
> example a znode in zookeeper[1]); this information would be updated
> periodically by that compute_node *itself* (it would, say, contain
> information about what VMs are running on it, what their utilization is,
> and so on).
> 
> For example the following layout could be used:
> 
> /nova/compute_nodes/
> 
>  data could be:
> 
> {
> vms: [],
> memory_free: XYZ,
> cpu_usage: ABC,
> memory_used: MNO,
> ...
> }
> 
> Now if we imagine each/all schedulers having watches
> on /nova/compute_nodes/ ([2] consul and etcd have equivalent concepts
> afaik) then when a compute_node updates that information a push
> notification (the watch being triggered) will be sent to the
> scheduler(s) and the scheduler(s) could then update a local in-memory
> cache of the data about all the hypervisors that can be selected from
> for scheduling. This avoids any reading of a large set of data in the
> first place (besides an initial read-once on startup to read the
> initial list + setup the watches); in a way it's similar to push
> notifications. Then when scheduling a VM -> hypervisor there isn't any
> need to query anything but the local in-memory representation that the
> scheduler is maintaining (and updating as watches are triggered)...
> 
> So this is why I was wondering about what capabilities of cassandra are
> being used here; because the above I think are unique capabilities of
> DLM like systems (zookeeper, consul, etcd) that could be advantageous
> here...
> 
> [1]
> https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#sc_zkDataModel_znodes
> 
> [2]
> https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkWatches

I wonder if we would even need to make something so specialized to get
this kind of local caching. I don't know what the current ZK tools are
but the original Chubby paper described that clients always have a
write-through cache for nodes which they set up subscriptions for in
order to break the cache.

Also, re: etcd - The last time I checked their subscription API was
woefully inadequate for performing this type of thing without herding
issues.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Alec Hothan (ahothan)

There are several ways to make python code that deals with a lot of data 
faster, especially when it comes to operating on DB fields from SQL tables (and 
that is not limited to the nova scheduler).
Pulling data from large SQL tables and operating on them through regular python 
code (using python loops) is extremely inefficient due to the nature of the 
python interpreter. If this is what nova scheduler code is doing today, the 
good thing is there is potentially huge room for improvement.


The approach to scale out, in practice, means a few instances (3 instances is
common), meaning the gain would be on the order of 3x (well under an order of
magnitude), but with sharply increased complexity to deal with concurrent
schedulers and potentially conflicting results (with the use of tools like ZK
or Consul...). But in essence we're basically just running the same unoptimized
code concurrently to achieve better throughput.
On the other hand optimizing something that is not very optimized to start with 
can yield a much better return than 3x, with the advantage of simplicity (one 
active scheduler, which could be backed by a standby for HA).

Python is actually one of the better languages to do *fast* in-memory big data 
processing using open source python scientific and data analysis libraries as 
they can provide native speed through cythonized libraries and powerful high 
level abstraction to do complex filters and vectorized operations. Not only
is it fast, but it also yields much smaller code.

I have used libraries such as numpy and pandas to operate on very large data 
sets (the equivalent of SQL tables with hundreds of thousands of rows) and 
there are easily 2 orders of magnitude of difference when operating on this data
in memory between plain python code with loops and python code using these 
libraries (that is without any DB access).
Ordering the filters to get the kind of reduction that you describe below
certainly helps, but it becomes second order when you use pandas filters
because they are extremely fast even for very large datasets.
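
As a purely illustrative sketch (the column names and sizes below are made up,
this is not nova code), filtering an in-memory table of 10K hosts becomes one
vectorized expression rather than a per-host python loop:

import numpy as np
import pandas as pd

n = 10000
hosts = pd.DataFrame({
    "host": ["compute-%05d" % i for i in range(n)],
    "free_ram_mb": np.random.randint(0, 256 * 1024, n),
    "vcpus_free": np.random.randint(0, 64, n),
    "aggregate": np.random.choice(["rack1", "rack2", "rack3"], n),
})

# Roughly the ram/core/aggregate filters over every host, but evaluated in
# native code by pandas/numpy instead of a python loop of filter calls.
candidates = hosts[(hosts.free_ram_mb >= 2048) &
                   (hosts.vcpus_free >= 2) &
                   (hosts.aggregate == "rack1")]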

I'm curious to know why this path was not explored more before embarking full
speed on concurrency/scale-out options, which is a very complex and treacherous
path, as we see in this discussion. It is clearly very attractive intellectually
to work with all these complex distributed frameworks, but the cost of
complexity is often overlooked.

Is there any data showing the performance of the current nova scheduler? How
many schedulings can nova do per second at scale with worst case filters?
When you think about it, 10,000 nodes and their associated properties is not 
such a big number if you use the right libraries.




On 10/9/15, 1:10 PM, "Joshua Harlow"  wrote:

>And also we should probably deprecate/not recommend:
>
>http://docs.openstack.org/developer/nova/api/nova.scheduler.filters.json_filter.html#nova.scheduler.filters.json_filter.JsonFilter
>
>That filter IMHO basically disallows optimizations like forming SQL 
>statements for each filter (and then letting the DB do the heavy 
>lifting) or say having each filter say 'oh my logic can be performed by 
>a prepared statement ABC and you should just use that instead' (and then
>letting the DB do the heavy lifting).
>
>Chris Friesen wrote:
>> On 10/09/2015 12:25 PM, Alec Hothan (ahothan) wrote:
>>>
>>> Still the point from Chris is valid. I guess the main reason openstack is
>>> going with multiple concurrent schedulers is to scale out by
>>> distributing the
>>> load between multiple instances of schedulers because 1 instance is too
>>> slow. This discussion is about coordinating the many instances of
>>> schedulers
>>> in a way that works and this is actually a difficult problem and will get
>>> worse as the number of variables for instance placement increases (for
>>> example NFV is going to require a lot more than just cpu pinning, huge
>>> pages
>>> and numa).
>>>
>>> Has anybody looked at why 1 instance is too slow and what it would
>>> take to
>>> make 1 scheduler instance work fast enough? This does not preclude the
>>> use of
>>> concurrency for finer grain tasks in the background.
>>
>> Currently we pull data on all (!) of the compute nodes out of the
>> database via a series of RPC calls, then evaluate the various filters in
>> python code.
>>
>> I suspect it'd be a lot quicker if each filter was a DB query.
>>
>> Also, ideally we'd want to query for the most "strict" criteria first,
>> to reduce the total number of comparisons. For example, if you want to
>> implement the "affinity" server group policy, you only need to test a
>> single host. If you're matching against host aggregate metadata, you
>> only need to test against hosts in matching aggregates.
>>
>> Chris
>>
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe

Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Clint Byrum
Excerpts from Chris Friesen's message of 2015-10-09 10:54:36 -0700:
> On 10/09/2015 11:09 AM, Zane Bitter wrote:
> 
> > The optimal way to do this would be a weighted random selection, where the
> > probability of any given host being selected is proportional to its 
> > weighting.
> > (Obviously this is limited by the accuracy of the weighting function in
> > expressing your actual preferences - and it's at least conceivable that this
> > could vary with the number of schedulers running.)
> >
> > In fact, the choice of the name 'weighting' would normally imply that it's 
> > done
> > this way; hearing that the 'weighting' is actually used as a 'score' with 
> > the
> > highest one always winning is quite surprising.
> 
> If you've only got one scheduler, there's no need to get fancy, you just pick 
> the "best" host based on your weighing function.
> 
> It's only when you've got parallel schedulers that things get tricky.
> 

Note that I think you mean _concurrent_ not _parallel_ schedulers.

Parallel schedulers would be trying to solve the same unit of work by
breaking it up into smaller components and doing them at the same time.

Concurrent means they're just doing different things at the same time.

I know this is nit-picky, but we use the wrong word _A LOT_ and the
problem space is actually vastly different, as parallelizable problems
have a whole set of optimizations and advantages that generic concurrent
problems (especially those involving mutating state!) have a whole set
of race conditions that must be managed.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Alec Hothan (ahothan)

Still the point from Chris is valid.
I guess the main reason openstack is going with multiple concurrent schedulers 
is to scale out by distributing the load between multiple instances of 
schedulers because 1 instance is too slow.
This discussion is about coordinating the many instances of schedulers in a way 
that works and this is actually a difficult problem and will get worse as the
number of variables for instance placement increases (for example NFV is going 
to require a lot more than just cpu pinning, huge pages and numa).

Has anybody looked at why 1 instance is too slow and what it would take to make 
1 scheduler instance work fast enough? This does not preclude the use of 
concurrency for finer grain tasks in the background.




On 10/9/15, 11:05 AM, "Clint Byrum"  wrote:

>Excerpts from Chris Friesen's message of 2015-10-09 10:54:36 -0700:
>> On 10/09/2015 11:09 AM, Zane Bitter wrote:
>> 
>> > The optimal way to do this would be a weighted random selection, where the
>> > probability of any given host being selected is proportional to its 
>> > weighting.
>> > (Obviously this is limited by the accuracy of the weighting function in
>> > expressing your actual preferences - and it's at least conceivable that 
>> > this
>> > could vary with the number of schedulers running.)
>> >
>> > In fact, the choice of the name 'weighting' would normally imply that it's 
>> > done
>> > this way; hearing that the 'weighting' is actually used as a 'score' with 
>> > the
>> > highest one always winning is quite surprising.
>> 
>> If you've only got one scheduler, there's no need to get fancy, you just 
>> pick 
>> the "best" host based on your weighing function.
>> 
>> It's only when you've got parallel schedulers that things get tricky.
>> 
>
>Note that I think you mean _concurrent_ not _parallel_ schedulers.
>
>Parallel schedulers would be trying to solve the same unit of work by
>breaking it up into smaller components and doing them at the same time.
>
>Concurrent means they're just doing different things at the same time.
>
>I know this is nit-picky, but we use the wrong word _A LOT_ and the
>problem space is actually vastly different, as parallelizable problems
>have a whole set of optimizations and advantages that generic concurrent
>problems (especially those involving mutating state!) have a whole set
>of race conditions that must be managed.
>
>__
>OpenStack Development Mailing List (not for usage questions)
>Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Chris Friesen

On 10/09/2015 11:09 AM, Zane Bitter wrote:


The optimal way to do this would be a weighted random selection, where the
probability of any given host being selected is proportional to its weighting.
(Obviously this is limited by the accuracy of the weighting function in
expressing your actual preferences - and it's at least conceivable that this
could vary with the number of schedulers running.)

In fact, the choice of the name 'weighting' would normally imply that it's done
this way; hearing that the 'weighting' is actually used as a 'score' with the
highest one always winning is quite surprising.
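
(For what it's worth, a weighted random pick along those lines is only a few
lines of python. A rough sketch, with the (host, weight) pairs standing in for
whatever the existing weighers actually produce, and assuming non-negative
weights:)

import random

def pick_host(weighed_hosts):
    # weighed_hosts: list of (host, weight) tuples, weights assumed >= 0
    total = sum(weight for _host, weight in weighed_hosts)
    if total <= 0:
        return random.choice(weighed_hosts)[0]
    r = random.uniform(0, total)
    running = 0.0
    for host, weight in weighed_hosts:
        running += weight
        if running >= r:
            return host
    return weighed_hosts[-1][0]  # guard against floating point rounding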


If you've only got one scheduler, there's no need to get fancy, you just pick 
the "best" host based on your weighing function.


It's only when you've got parallel schedulers that things get tricky.

Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Gregory Haynes
Excerpts from Zane Bitter's message of 2015-10-09 17:09:46 +:
> On 08/10/15 21:32, Ian Wells wrote:
> >
> > > 2. if many hosts suit the 5 VMs then this is *very* unlucky, because
> > we should be choosing a host at random from the set of
> > suitable hosts and that's a huge coincidence - so this is a tiny
> > corner case that we shouldn't be designing around
> >
> > Here is where we differ in our understanding. With the current
> > system of filters and weighers, 5 schedulers getting requests for
> > identical VMs and having identical information are *expected* to
> > select the same host. It is not a tiny corner case; it is the most
> > likely result for the current system design. By catching this
> > situation early (in the scheduling process) we can avoid multiple
> > RPC round-trips to handle the fail/retry mechanism.
> >
> >
> > And so maybe this would be a different fix - choose, at random, one of
> > the hosts above a weighting threshold, not choose the top host every
> > time? Technically, any host passing the filter is adequate to the task
> > from the perspective of an API user (and they can't prove if they got
> > the highest weighting or not), so if we assume weighting an operator
> > preference, and just weaken it slightly, we'd have a few more options.
> 
> The optimal way to do this would be a weighted random selection, where 
> the probability of any given host being selected is proportional to its 
> weighting. (Obviously this is limited by the accuracy of the weighting 
> function in expressing your actual preferences - and it's at least 
> conceivable that this could vary with the number of schedulers running.)
> 
> In fact, the choice of the name 'weighting' would normally imply that 
> it's done this way; hearing that the 'weighting' is actually used as a 
> 'score' with the highest one always winning is quite surprising.
> 
> cheers,
> Zane.
> 

There is a more generalized version of this algorithm for concurrent
scheduling I've seen a few times - Pick N options at random, apply
heuristic over that N to pick the best, attempt to schedule at your
choice, retry on failure. As long as you have a fast heuristic and your
N is sufficiently smaller than the total number of options then the
retries are rare-ish and cheap. It also can scale out extremely well.
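
(A rough sketch of that loop, with the filter/weigh/claim steps left as
placeholders for whatever a deployment actually uses - illustrative only:)

import random

def schedule(all_hosts, passes_filters, weigh, try_claim,
             sample_size=20, max_retries=5):
    for _attempt in range(max_retries):
        sample = random.sample(all_hosts, min(sample_size, len(all_hosts)))
        candidates = [h for h in sample if passes_filters(h)]
        if not candidates:
            continue
        best = max(candidates, key=weigh)
        # The claim can fail if another scheduler picked the same host first;
        # with a small sample out of many hosts that should be rare-ish.
        if try_claim(best):
            return best
    raise RuntimeError("no host claimed after %d attempts" % max_retries)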

Obviously you lose some of the ability to micro-manage where things are
placed with a scheduling setup like that, but if scaling up is the
concern I really hope that isnt a problem...

Cheers,
Greg

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Chris Friesen

On 10/09/2015 12:55 PM, Gregory Haynes wrote:


There is a more generalized version of this algorithm for concurrent
scheduling I've seen a few times - Pick N options at random, apply
heuristic over that N to pick the best, attempt to schedule at your
choice, retry on failure. As long as you have a fast heuristic and your
N is sufficiently smaller than the total number of options then the
retries are rare-ish and cheap. It also can scale out extremely well.


If you're looking for a resource that is relatively rare (say you want a 
particular hardware accelerator, or a very large number of CPUs, or even to be 
scheduled "near" to a specific other instance) then you may have to retry quite 
a lot.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Chris Friesen

On 10/09/2015 12:25 PM, Alec Hothan (ahothan) wrote:


Still the point from Chris is valid. I guess the main reason openstack is
going with multiple concurrent schedulers is to scale out by distributing the
load between multiple instances of schedulers because 1 instance is too
slow. This discussion is about coordinating the many instances of schedulers
in a way that works and this is actually a difficult problem and will get
worse as the number of variables for instance placement increases (for
example NFV is going to require a lot more than just cpu pinning, huge pages
and numa).

Has anybody looked at why 1 instance is too slow and what it would take to
make 1 scheduler instance work fast enough? This does not preclude the use of
concurrency for finer grain tasks in the background.


Currently we pull data on all (!) of the compute nodes out of the database via a 
series of RPC calls, then evaluate the various filters in python code.


I suspect it'd be a lot quicker if each filter was a DB query.

Also, ideally we'd want to query for the most "strict" criteria first, to reduce 
the total number of comparisons.  For example, if you want to implement the 
"affinity" server group policy, you only need to test a single host.  If you're 
matching against host aggregate metadata, you only need to test against hosts in 
matching aggregates.
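
(To illustrate - not actual nova code, and the table/column names are
invented - a couple of filters expressed as one query might look like this
with SQLAlchemy, letting the database use an index on the most selective
column rather than comparing every host in python:)

import sqlalchemy as sa

metadata = sa.MetaData()
compute_nodes = sa.Table(
    "compute_nodes", metadata,
    sa.Column("hypervisor_hostname", sa.String),
    sa.Column("free_ram_mb", sa.Integer),
    sa.Column("vcpus_free", sa.Integer),
    sa.Column("aggregate_id", sa.Integer),
)

# Roughly: aggregate filter + ram filter + core filter in a single round trip.
query = (sa.select([compute_nodes.c.hypervisor_hostname])
         .where(compute_nodes.c.aggregate_id == 42)
         .where(compute_nodes.c.free_ram_mb >= 2048)
         .where(compute_nodes.c.vcpus_free >= 2))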


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-09 Thread Joshua Harlow

And also we should probably deprecate/not recommend:

http://docs.openstack.org/developer/nova/api/nova.scheduler.filters.json_filter.html#nova.scheduler.filters.json_filter.JsonFilter

That filter IMHO basically disallows optimizations like forming SQL 
statements for each filter (and then letting the DB do the heavy 
lifting) or say having each filter say 'oh my logic can be performed by 
a prepared statement ABC and you should just use that instead' (and then
letting the DB do the heavy lifting).


Chris Friesen wrote:

On 10/09/2015 12:25 PM, Alec Hothan (ahothan) wrote:


Still the point from Chris is valid. I guess the main reason openstack is
going with multiple concurrent schedulers is to scale out by
distributing the
load between multiple instances of schedulers because 1 instance is too
slow. This discussion is about coordinating the many instances of
schedulers
in a way that works and this is actually a difficult problem and will get
worse as the number of variables for instance placement increases (for
example NFV is going to require a lot more than just cpu pinning, huge
pages
and numa).

Has anybody looked at why 1 instance is too slow and what it would
take to
make 1 scheduler instance work fast enough? This does not preclude the
use of
concurrency for finer grain tasks in the background.


Currently we pull data on all (!) of the compute nodes out of the
database via a series of RPC calls, then evaluate the various filters in
python code.

I suspect it'd be a lot quicker if each filter was a DB query.

Also, ideally we'd want to query for the most "strict" criteria first,
to reduce the total number of comparisons. For example, if you want to
implement the "affinity" server group policy, you only need to test a
single host. If you're matching against host aggregate metadata, you
only need to test against hosts in matching aggregates.

Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Maish Saidel-Keesing

Forgive the top-post.

Cross-posting to openstack-operators for their feedback as well.

Ed, the work seems very promising, and I am interested to see how this
evolves.


With my operator hat on I have one piece of feedback.

By adding in a new database solution (Cassandra) we are now up to three
different database solutions in use in OpenStack:


MySQL (practically everything)
MongoDB (Ceilometer)
Cassandra.

Not to mention two different message queues:
Kafka (Monasca)
RabbitMQ (everything else)

Operational overhead has a cost - maintaining 3 different database 
tools, backing them up, providing HA, etc. has operational cost.


This is not to say that this cannot be overseen, but it should be taken 
into consideration.


And *if* they can be consolidated into an agreed solution across the 
whole of OpenStack - that would be highly beneficial (IMHO).



--
Best Regards,
Maish Saidel-Keesing


On 10/08/15 03:24, Ed Leafe wrote:

On Oct 7, 2015, at 2:28 PM, Zane Bitter  wrote:


It seems to me (disclaimer: not a Nova dev) that which database to use is 
completely irrelevant to your proposal,

Well, not entirely. What Cassandra offers that separates it from other DBs is
exactly the feature that we need. The solution to the scheduler isn't to simply
"use a database".


which is really about moving the scheduling from a distributed collection of 
Python processes with ad-hoc (or sometimes completely missing) synchronisation 
into the database to take advantage of its well-defined semantics. But you've 
framed it in such a way as to guarantee that this never gets discussed, because 
everyone will be too busy arguing about whether or not Cassandra is better than 
Galera.

Understood - all one has to do is review the original thread from back in July 
to see this happening. But the reason that I framed it then as an experiment in 
which we would come up with measures of success we could all agree on up-front 
was so that if someone else thought that Product Foo would be even better, we 
could set up a similar test bed and try it out. IOW, instead of bikeshedding, 
if you want a different color, you build another shed and we can all have a 
look.


-- Ed Leafe




__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Thierry Carrez
Maish Saidel-Keesing wrote:
> Operational overhead has a cost - maintaining 3 different database
> tools, backing them up, providing HA, etc. has operational cost.
> 
> This is not to say that this cannot be overseen, but it should be taken
> into consideration.
> 
> And *if* they can be consolidated into an agreed solution across the
> whole of OpenStack - that would be highly beneficial (IMHO).

Agreed, and that ties into the similar discussion we recently had about
picking a common DLM. Ideally we'd only add *one* general dependency and
use it for locks / leader election / syncing status around.

-- 
Thierry Carrez (ttx)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Joshua Harlow
On Thu, 8 Oct 2015 10:43:01 -0400
Monty Taylor  wrote:

> On 10/08/2015 09:01 AM, Thierry Carrez wrote:
> > Maish Saidel-Keesing wrote:
> >> Operational overhead has a cost - maintaining 3 different database
> >> tools, backing them up, providing HA, etc. has operational cost.
> >>
> >> This is not to say that this cannot be overseen, but it should be
> >> taken into consideration.
> >>
> >> And *if* they can be consolidated into an agreed solution across
> >> the whole of OpenStack - that would be highly beneficial (IMHO).
> >
> > Agreed, and that ties into the similar discussion we recently had
> > about picking a common DLM. Ideally we'd only add *one* general
> > dependency and use it for locks / leader election / syncing status
> > around.
> >
> 
> ++
> 
> All of the proposed DLM tools can fill this space successfully. There
> is definitely not a need for multiple.

On this point, and just thinking out loud. If we consider saving
compute_node information into say a node in said DLM backend (for
example a znode in zookeeper[1]); this information would be updated
periodically by that compute_node *itself* (it would, say, contain
information about what VMs are running on it, what their utilization is,
and so on).

For example the following layout could be used:

/nova/compute_nodes/

 data could be:

{
vms: [],
memory_free: XYZ,
cpu_usage: ABC,
memory_used: MNO,
...
}

Now if we imagine each/all schedulers having watches
on /nova/compute_nodes/ ([2] consul and etcd have equivalent concepts
afaik) then when a compute_node updates that information a push
notification (the watch being triggered) will be sent to the
scheduler(s) and the scheduler(s) could then update a local in-memory
cache of the data about all the hypervisors that can be selected from
for scheduling. This avoids any reading of a large set of data in the
first place (besides an initial read-once on startup to read the
initial list + setup the watches); in a way it's similar to push
notifications. Then when scheduling a VM -> hypervisor there isn't any
need to query anything but the local in-memory representation that the
scheduler is maintaining (and updating as watches are triggered)...

So this is why I was wondering about what capabilities of cassandra are
being used here; because the above I think are unique capabilities of
DLM like systems (zookeeper, consul, etcd) that could be advantageous
here...

[1]
https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#sc_zkDataModel_znodes

[2]
https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkWatches


> 
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe:
> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Joshua Harlow

Joshua Harlow wrote:

On Thu, 8 Oct 2015 10:43:01 -0400
Monty Taylor  wrote:


On 10/08/2015 09:01 AM, Thierry Carrez wrote:

Maish Saidel-Keesing wrote:

Operational overhead has a cost - maintaining 3 different database
tools, backing them up, providing HA, etc. has operational cost.

This is not to say that this cannot be overseen, but it should be
taken into consideration.

And *if* they can be consolidated into an agreed solution across
the whole of OpenStack - that would be highly beneficial (IMHO).

Agreed, and that ties into the similar discussion we recently had
about picking a common DLM. Ideally we'd only add *one* general
dependency and use it for locks / leader election / syncing status
around.


++

All of the proposed DLM tools can fill this space successfully. There
is definitely not a need for multiple.


On this point, and just thinking out loud. If we consider saving
compute_node information into say a node in said DLM backend (for
example a znode in zookeeper[1]); this information would be updated
periodically by that compute_node *itself* (it would, say, contain
information about what VMs are running on it, what their utilization is,
and so on).

For example the following layout could be used:

/nova/compute_nodes/

  data could be:

{
 vms: [],
 memory_free: XYZ,
 cpu_usage: ABC,
 memory_used: MNO,
 ...
}

Now if we imagine each/all schedulers having watches
on /nova/compute_nodes/ ([2] consul and etcd have equivalent concepts
afaik) then when a compute_node updates that information a push
notification (the watch being triggered) will be sent to the
scheduler(s) and the scheduler(s) could then update a local in-memory
cache of the data about all the hypervisors that can be selected from
for scheduling. This avoids any reading of a large set of data in the
first place (besides an initial read-once on startup to read the
initial list + setup the watches); in a way it's similar to push
notifications. Then when scheduling a VM ->  hypervisor there isn't any
need to query anything but the local in-memory representation that the
scheduler is maintaining (and updating as watches are triggered)...

So this is why I was wondering about what capabilities of cassandra are
being used here; because the above I think are unique capabilities of
DLM like systems (zookeeper, consul, etcd) that could be advantageous
here...

[1]
https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#sc_zkDataModel_znodes

[2]
https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkWatches




And here's a final super-awesomeness,

Use the same existence of that znode + information (perhaps using 
ephemeral znodes or equivalent) to determine if a hypervisor is 'alive' 
or 'dead', thus removing the need to do queries and periodic writes to 
the nova database to determine if a hypervisor's nova-compute service is
alive or dead (with reads via 
https://github.com/openstack/nova/blob/master/nova/servicegroup/drivers/db.py#L33 
and other similar code scattered in nova)...
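
(A rough sketch of that with kazoo - illustrative only, reusing the
/nova/compute_nodes layout from earlier in the thread; the data payload here
is made up:)

import json
import socket

from kazoo import client

c = client.KazooClient()
c.start()

# An ephemeral znode per compute node doubles as its liveness record: if the
# nova-compute process dies or its session expires, zookeeper deletes the
# znode and any watcher sees the hypervisor disappear.
me = "/nova/compute_nodes/%s" % socket.gethostname()
c.create(me, json.dumps({"memory_free": 2048}), ephemeral=True, makepath=True)

# A scheduler or servicegroup driver can then treat "znode exists" as "alive":
alive = c.exists(me) is not None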



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe:
openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Clint Byrum
Excerpts from Joshua Harlow's message of 2015-10-08 08:38:57 -0700:
> Joshua Harlow wrote:
> > On Thu, 8 Oct 2015 10:43:01 -0400
> > Monty Taylor  wrote:
> >
> >> On 10/08/2015 09:01 AM, Thierry Carrez wrote:
> >>> Maish Saidel-Keesing wrote:
>  Operational overhead has a cost - maintaining 3 different database
>  tools, backing them up, providing HA, etc. has operational cost.
> 
>  This is not to say that this cannot be overseen, but it should be
>  taken into consideration.
> 
>  And *if* they can be consolidated into an agreed solution across
>  the whole of OpenStack - that would be highly beneficial (IMHO).
> >>> Agreed, and that ties into the similar discussion we recently had
> >>> about picking a common DLM. Ideally we'd only add *one* general
> >>> dependency and use it for locks / leader election / syncing status
> >>> around.
> >>>
> >> ++
> >>
> >> All of the proposed DLM tools can fill this space successfully. There
> >> is definitely not a need for multiple.
> >
> > On this point, and just thinking out loud. If we consider saving
> > compute_node information into say a node in said DLM backend (for
> > example a znode in zookeeper[1]); this information would be updated
> > periodically by that compute_node *itself* (it would, say, contain
> > information about what VMs are running on it, what their utilization is,
> > and so on).
> >
> > For example the following layout could be used:
> >
> > /nova/compute_nodes/
> >
> >   data could be:
> >
> > {
> >  vms: [],
> >  memory_free: XYZ,
> >  cpu_usage: ABC,
> >  memory_used: MNO,
> >  ...
> > }
> >
> > Now if we imagine each/all schedulers having watches
> > on /nova/compute_nodes/ ([2] consul and etcd have equivalent concepts
> > afaik) then when a compute_node updates that information a push
> > notification (the watch being triggered) will be sent to the
> > scheduler(s) and the scheduler(s) could then update a local in-memory
> > cache of the data about all the hypervisors that can be selected from
> > for scheduling. This avoids any reading of a large set of data in the
> > first place (besides an initial read-once on startup to read the
> > initial list + setup the watches); in a way it's similar to push
> > notifications. Then when scheduling a VM ->  hypervisor there isn't any
> > need to query anything but the local in-memory representation that the
> > scheduler is maintaining (and updating as watches are triggered)...
> >
> > So this is why I was wondering about what capabilities of cassandra are
> > being used here; because the above I think are unique capabilities of
> > DLM like systems (zookeeper, consul, etcd) that could be advantageous
> > here...
> >
> > [1]
> > https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#sc_zkDataModel_znodes
> >
> > [2]
> > https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkWatches
> >
> >
> 
> And here's a final super-awesomeness,
> 
> Use the same existence of that znode + information (perhaps using 
> ephemeral znodes or equivalent) to determine if a hypervisor is 'alive' 
> or 'dead', thus removing the need to do queries and periodic writes to 
> the nova database to determine if a hypervisor's nova-compute service is
> alive or dead (with reads via 
> https://github.com/openstack/nova/blob/master/nova/servicegroup/drivers/db.py#L33
>  
> and other similar code scattered in nova)...
> 

^^ THIS is the kind of architectural thinking I'd like to see us do more
of.

This isn't "hey I have a better database" it is "I have a way to reduce
the most common operations to O(1) complexity".

Ed, for all of the promise of your experiment, I'd actually rather see
time spent on Josh's idea above. In fact, I might spend time on Josh's
idea above. :)
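
For what it's worth, here's a rough sketch of what that watch-driven local
cache could look like with kazoo (the Python ZooKeeper client). The paths and
record fields are just Josh's hypothetical layout above, not anything that
exists in nova today:

    import json

    from kazoo.client import KazooClient

    # In-memory view of every hypervisor, keyed by compute node name. It is
    # updated only when ZooKeeper pushes a change; never re-read at
    # schedule time.
    host_cache = {}
    watched = set()

    zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')
    zk.start()

    def watch_node(name):
        path = '/nova/compute_nodes/' + name

        @zk.DataWatch(path)
        def _updated(data, stat):
            if data is None:
                host_cache.pop(name, None)   # znode gone -> treat host as gone
            else:
                # e.g. {"vms": [...], "memory_free": ..., "cpu_usage": ...}
                host_cache[name] = json.loads(data)

    @zk.ChildrenWatch('/nova/compute_nodes')
    def _membership(children):
        # One read per node at startup (or when a node first appears); pure
        # push notifications after that.
        for name in children:
            if name not in watched:
                watched.add(name)
                watch_node(name)

    def acceptable_hosts(ram_mb):
        # A scheduling request only touches the local copy -- no DB or ZK
        # round trip in the critical path.
        return [name for name, info in host_cache.items()
                if info.get('memory_free', 0) >= ram_mb]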



Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Monty Taylor

On 10/08/2015 09:01 AM, Thierry Carrez wrote:

Maish Saidel-Keesing wrote:

Operational overhead has a cost - maintaining 3 different database
tools, backing them up, providing HA, etc. has operational cost.

This is not to say that this cannot be overseen, but it should be taken
into consideration.

And *if* they can be consolidated into an agreed solution across the
whole of OpenStack - that would be highly beneficial (IMHO).


Agreed, and that ties into the similar discussion we recently had about
picking a common DLM. Ideally we'd only add *one* general dependency and
use it for locks / leader election / syncing status around.



++

All of the proposed DLM tools can fill this space successfully. There is 
definitely not a need for multiple.




Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Kevin L. Mitchell
On Wed, 2015-10-07 at 23:17 -0600, Chris Friesen wrote:
> Why is it inevitable?

Well, I would say that this is probably a consequence of the CAP[1]
theorem.

> Theoretically if the DB knew about what resources were originally available 
> and 
> what resources have been consumed, then it should be able to allocate 
> resources 
> race-free (possibly with some retries involved if racing against other 
> schedulers updating the DB, but that would be internal to the scheduler 
> itself).

The problem is, it can't.  The scheduler may be making the decision at
the same time that an update from a compute node is in flight, meaning
that the scheduler is missing (at least) one piece of information.  When
you include a database, that just makes the possibility of missing an
in-flight update worse, because you also have to factor in the latency
of the database update as well.  Also, we have to factor in the
possibility that there are multiple schedulers in play, which further
worsens the possibility of in-flight information critical to the
scheduling decision.  If you employ some sort of locking to try to
mitigate all this, you've just effectively thrown away the scalability
that deploying multiple schedulers was supposed to buy you.

[1] https://en.wikipedia.org/wiki/CAP_theorem
-- 
Kevin L. Mitchell 
Rackspace




Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Ed Leafe
On Oct 8, 2015, at 8:01 AM, Thierry Carrez  wrote:

>> Operational overhead has a cost - maintaining 3 different database
>> tools, backing them up, providing HA, etc. has operational cost.
>> 
>> This is not to say that this cannot be overseen, but it should be taken
>> into consideration.
>> 
>> And *if* they can be consolidated into an agreed solution across the
>> whole of OpenStack - that would be highly beneficial (IMHO).
> 
> Agreed, and that ties into the similar discussion we recently had about
> picking a common DLM. Ideally we'd only add *one* general dependency and
> use it for locks / leader election / syncing status around.

Oh, yes, sorry, I left that out of this particular post, as it had been 
discussed at length back in July. But yes, introducing a new dependency has a 
high cost, and needs to be justified before anyone would ever consider taking 
on that added cost. That was in my original email [0] back in July:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
At this point I'm sure that most of you are filled with thoughts on
how this won't work, or how much trouble it will be to switch, or how
much more of a pain it will be, or how you hate non-relational DBs, or
any of a zillion other negative thoughts. FWIW, I have them too. But
instead of ranting, I would ask that we acknowledge for now that:

a) it will be disruptive and painful to switch something like this at
this point in Nova's development
b) it would have to provide *significant* improvement to make such a
change worthwhile

So what I'm asking from all of you is to help define the second part:
what we would want improved, and how to measure those benefits. In
other words, what results would you have to see in order to make you
reconsider your initial "nah, this'll never work" reaction, and start
to think that this will be a worthwhile change to make to Nova.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Whether we make this type of change, or some other type of change, or keep 
things the way they are, having the data to justify that decision is always 
important.

-- Ed Leafe

[0] http://lists.openstack.org/pipermail/openstack-dev/2015-July/069593.html





Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Ed Leafe
On Oct 8, 2015, at 10:24 AM, Joshua Harlow  wrote:



> Now if we imagine each/all schedulers having watches
> on /nova/compute_nodes/ ([2] consul and etcd have equivalent concepts
> afaik) then when a compute_node updates that information a push
> notification (the watch being triggered) will be sent to the
> scheduler(s) and the scheduler(s) could then update a local in-memory
> cache of the data about all the hypervisors that can be selected from
> for scheduling. This avoids any reading of a large set of data in the
> first place (besides an initial read-once on startup to read the
> initial list + setup the watches); in a way its similar to push
> notifications. Then when scheduling a VM -> hypervisor there isn't any
> need to query anything but the local in-memory representation that the
> scheduler is maintaining (and updating as watches are triggered)...

You've hit upon the problem with the current design: multiple, and potentially 
out-of-sync copies of the data. What you're proposing doesn't really sound all 
that different than the current design, which has the compute nodes send the 
updates in their state to the scheduler both on a scheduled task, and in 
response to changes. The impetus for the Cassandra proposal was to eliminate 
this duplication, and have the resources being scheduled and the scheduler all 
working with the same data.

-- Ed Leafe









Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Ed Leafe
On Oct 8, 2015, at 11:03 AM, Kevin L. Mitchell  
wrote:

>> Theoretically if the DB knew about what resources were originally available 
>> and
>> what resources have been consumed, then it should be able to allocate 
>> resources
>> race-free (possibly with some retries involved if racing against other
>> schedulers updating the DB, but that would be internal to the scheduler 
>> itself).
> 
> The problem is, it can't.  The scheduler may be making the decision at
> the same time that an update from a compute node is in flight, meaning
> that the scheduler is missing (at least) one piece of information.  When
> you include a database, that just makes the possibility of missing an
> in-flight update worse, because you also have to factor in the latency
> of the database update as well.  Also, we have to factor in the
> possibility that there are multiple schedulers in play, which further
> worsens the possibility of in-flight information critical to the
> scheduling decision.  If you employ some sort of locking to try to
> mitigate all this, you've just effectively thrown away the scalability
> that deploying multiple schedulers was supposed to buy you.

Yes, the multiple scheduler part is very problematic. Not only could an update 
from the compute node not be received yet, there could also be updates from 
other schedulers that aren't caught. One of the most problematic use cases is 
requests for several similar VMs being received in a short period of time, and 
all scheduling processes handling them picking the same host. In the Cassandra 
scenario, the first would "win", and others would fail their attempt to update 
the resource with the claim, forcing them to select a different host without 
having to first go through the fail/retry cycle of the current design.
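
To make that concrete, the claim in Cassandra would be a conditional update (a
lightweight transaction). Sketch only -- the table and columns are made up,
and I'm assuming the DataStax Python driver:

    from cassandra.cluster import Cluster

    # Illustrative schema: one row per host, keyed on (resource_type, host).
    session = Cluster(['cass1', 'cass2']).connect('scheduler')

    # Two schedulers both read free_ram_mb = 63488 and try to claim 2048 MB.
    # The IF clause turns the write into a compare-and-set: Cassandra applies
    # exactly one of them and tells the loser it lost.
    result = session.execute(
        "UPDATE resources SET free_ram_mb = %s "
        "WHERE resource_type = 'compute' AND host = %s IF free_ram_mb = %s",
        [63488 - 2048, 'compute-17', 63488])

    if not result.was_applied:
        # Another scheduler won the race; move straight on to the next host
        # in our ordered list instead of bouncing the request off the compute
        # node and retrying from scratch.
        pass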

-- Ed Leafe









Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Ed Leafe
On Oct 8, 2015, at 10:54 AM, Clint Byrum  wrote:

> ^^ THIS is the kind of architectural thinking I'd like to see us do more
> of.

Agreed. If nothing else, I'm glad that I was able to get people thinking about 
new approaches.

> This isn't "hey I have a better database" it is "I have a way to reduce
> the most common operations to O(1) complexity".
> 
> Ed, for all of the promise of your experiment, I'd actually rather see
> time spent on Josh's idea above. In fact, I might spend time on Josh's
> idea above. :)

Cool! I don't really care if my particular ideas are selected; I just want to 
make OpenStack better.


-- Ed Leafe









Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Joshua Harlow

Clint Byrum wrote:


^^ THIS is the kind of architectural thinking I'd like to see us do more
of.

This isn't "hey I have a better database" it is "I have a way to reduce
the most common operations to O(1) complexity".

Ed, for all of the promise of your experiment, I'd actually rather see
time spent on Josh's idea above. In fact, I might spend time on Josh's
idea above. :)


Go for it!

We (at yahoo) are also brainstorming this idea (or something like it), 
and as we hit more performance issues pushing the 1000+ hypervisors in a 
single cluster (no cell/s) (one of our many cluster/s) we will start 
adjusting (and hopefully more blogging, upstreaming and all that) what 
needs to be fixed/tweaked/altered to continue to push these boundaries.


Collab. and all that is welcome too, of course :)

P.S.

The DLM spec @ https://review.openstack.org/#/c/209661/ (rendered nicely 
at 
http://docs-draft.openstack.org/61/209661/29/check/gate-openstack-specs-docs/2ff62fa//doc/build/html/specs/chronicles-of-a-dlm.html) 
mentions 'Such a consensus being built will also influence the future 
functionality and capabilities of OpenStack at large so we need to be 
especially careful, thoughtful, and explicit here.'


This statement was really targeted at cases like this, when we (as a 
community) choose a DLM solution we affect the larger capabilities of 
openstack, not just for locking but for scheduling (and likely for other 
functionality I can't even think of/predict...)
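
As a rough illustration of the compute-node half of that idea (again kazoo,
and the path/fields are made up): each nova-compute registers itself as an
ephemeral znode and refreshes its own record, so the "is this hypervisor
alive?" question needs no periodic DB writes or reads at all:

    import json
    import socket
    import time

    from kazoo.client import KazooClient

    zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')
    zk.start()

    path = '/nova/compute_nodes/' + socket.gethostname()

    def snapshot():
        # Placeholder numbers; a real node would pull these from the
        # hypervisor driver / resource tracker.
        return json.dumps({'vms': [], 'memory_free': 63488,
                           'cpu_usage': 0.12}).encode('utf-8')

    # Ephemeral: the znode disappears automatically if this process dies or
    # loses its ZooKeeper session, which is exactly the liveness signal the
    # servicegroup DB driver currently derives from periodic DB updates.
    zk.create(path, snapshot(), ephemeral=True, makepath=True)

    while True:
        time.sleep(60)
        zk.set(path, snapshot())   # fires the data watches in every scheduler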





Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Ed Leafe
On Oct 8, 2015, at 1:38 PM, Ian Wells  wrote:

>> You've hit upon the problem with the current design: multiple, and 
>> potentially out-of-sync copies of the data.
> 
> Arguably, this is the *intent* of the current design, not a problem with it.

It may have been the intent, but that doesn't mean that we are where we need to 
be.

> The data can never be perfect (ever) so go with 'good enough' and run with 
> it, and deal with the corner cases.

It is defining what counts as "good enough" that is problematic.

> Truth be told, storing that data in MySQL is secondary to the correct 
> functioning of the scheduler.

I have no problem with MySQL (well, I do, but that's not relevant to this 
discussion). My issue is that the current system poorly replicates its data 
from MySQL to the places where it is needed.

> The one thing it helps with is when the scheduler restarts - it stands a 
> chance of making sensible decisions before it gets its full picture back.  
> (This is all very like route distribution protocols, you know: make the best 
> decision on the information you have to hand, assuming the rest of the system 
> will deal with your mistakes.  And hold times, and graceful restart, and…)

Yes, this is all well and good. My focus is on improving the information in 
hand when making that best decision.

> Is there any reason why the duplication (given it's not a huge amount of data 
> - megabytes, not gigabytes) is a problem?  Is there any reason why 
> inconsistency is a problem?

I'm sure that many of the larger deployments may have issues with the amount of 
data that must be managed in-memory by so many different parts of the system. 
Inconsistency is a problem, but one that has workarounds. The primary issue is 
scalability: with the current design, increasing the number of scheduler 
processes increases the raciness of the system.

> I do sympathise with your point in the following email where you have 5 VMs 
> scheduled by 5 schedulers to the same host, but consider:
> 
> 1. if only one host suits the 5 VMs this results in the same behaviour: 1 VM 
> runs, the rest don't.  There's more work to discover that but arguably less 
> work than maintaining a consistent database.

True, but in a large scale deployment this is an extremely rare case.

> 2. if many hosts suit the 5 VMs then this is *very* unlucky, because we 
> should be choosing a host at random from the set of suitable hosts and that's 
> a huge coincidence - so this is a tiny corner case that we shouldn't be 
> designing around

Here is where we differ in our understanding. With the current system of 
filters and weighers, 5 schedulers getting requests for identical VMs and 
having identical information are *expected* to select the same host. It is not 
a tiny corner case; it is the most likely result for the current system design. 
By catching this situation early (in the scheduling process) we can avoid 
multiple RPC round-trips to handle the fail/retry mechanism.

> The worst case, is, however
> 
> 3. we attempt to pick the optimal host, and the optimal host for all 5 VMs is 
> the same despite there being other less perfect choices out there.  That 
> would get you a stampeding herd and a bunch of retries.
> 
> I admit that the current system does not solve well for (3).

IMO, this is identical to (2).


-- Ed Leafe









Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Ian Wells
On 8 October 2015 at 13:28, Ed Leafe  wrote:

> On Oct 8, 2015, at 1:38 PM, Ian Wells  wrote:
> > Truth be told, storing that data in MySQL is secondary to the correct
> functioning of the scheduler.
>
> I have no problem with MySQL (well, I do, but that's not relevant to this
> discussion). My issue is that the current system poorly replicates its data
> from MySQL to the places where it is needed.
>

Well, the issue is that the data shouldn't be replicated from the database
at all.  There doesn't need to be One True Copy of data here (though I
think the point further down is why we're differing on that).


> > Is there any reason why the duplication (given it's not a huge amount of
> data - megabytes, not gigabytes) is a problem?  Is there any reason why
> inconsistency is a problem?
>
> I'm sure that many of the larger deployments may have issues with the
> amount of data that must be managed in-memory by so many different parts of
> the system.
>

I wonder about that.  If I have a scheduler making a scheduling decision I
don't want it calling out to a database and the database calling out to
offline storage just to find the information, at least not if I can
possibly avoid it.  It's a critical path element in every boot call.

Given that what we're talking about is generally a bunch of resource values
for each host, I'm not sure how big this gets, even in the 100k host range,
but do you have a particularly sizeable structure in mind?


> Inconsistency is a problem, but one that has workarounds. The primary
> issue is scalability: with the current design, increasing the number of
> scheduler processes increases the raciness of the system.
>

And again, given your point below I see where you're coming from here, but
I think the key here is to make two schedulers considerably *less* likely
to make the same choice on the same information.

> I do sympathise with your point in the following email where you have 5
> VMs scheduled by 5 schedulers to the same host, but consider:
> >
> > 1. if only one host suits the 5 VMs this results in the same behaviour:
> 1 VM runs, the rest don't.  There's more work to discover that but arguably
> less work than maintaining a consistent database.
>
> True, but in a large scale deployment this is an extremely rare case.
>

Indeed; I'm trying to get that one out of the way.

> 2. if many hosts suit the 5 VMs then this is *very* unlucky, because we
> should be choosing a host at random from the set of suitable hosts and
> that's a huge coincidence - so this is a tiny corner case that we shouldn't
> be designing around
>
> Here is where we differ in our understanding. With the current system of
> filters and weighers, 5 schedulers getting requests for identical VMs and
> having identical information are *expected* to select the same host. It is
> not a tiny corner case; it is the most likely result for the current system
> design. By catching this situation early (in the scheduling process) we can
> avoid multiple RPC round-trips to handle the fail/retry mechanism.
>

And so maybe this would be a different fix - choose, at random, one of the
hosts above a weighting threshold, not choose the top host every time?
Technically, any host passing the filter is adequate to the task from the
perspective of an API user (and they can't prove if they got the highest
weighting or not), so if we assume weighting is an operator preference, and
just weaken it slightly, we'd have a few more options.

Again, we want to avoid overscheduling to a host, which will eventually
cause a decline and a reschedule.  But something that on balance probably
won't overschedule is adequate; overscheduling sucks but is not in fact the
end of the world as long as it's not every single time.
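
Something as simple as the following would already spread the herd (a sketch,
not nova code, though if memory serves nova's filter scheduler has a knob
along these lines in scheduler_host_subset_size):

    import random

    def choose_host(weighed_hosts, subset_size=5):
        # weighed_hosts: (host, weight) pairs that already passed the
        # filters, so any of them is acceptable to the API user. Picking
        # randomly among the top few keeps N schedulers from all converging
        # on the same node while still mostly respecting the operator's
        # weighting preference.
        ranked = sorted(weighed_hosts, key=lambda hw: hw[1], reverse=True)
        return random.choice(ranked[:subset_size])[0]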

I'm not averse to the central database if we need the central database, but
I'm not sure how much we do at this point, and a central database will
become a point of contention, I would think, beyond the cost of the above
idea.
 --
Ian.


Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Ian Wells
On 7 October 2015 at 22:17, Chris Friesen 
wrote:

> On 10/07/2015 07:23 PM, Ian Wells wrote:
>
>>
>> The whole process is inherently racy (and this is inevitable, and
>> correct),
>>
>>
> Why is it inevitable?
>

It's inevitable because everything takes time, and some things are
unpredictable.

The amount of free RAM on a machine - as we do it today - is, literally,
what the kernel reports to be free.  That's known by the host,
unpredictable, occasionally reported to the scheduler (which takes time),
and if you stored it in a database (which takes time) and recovered it from
a database (which takes time) the number you got would not be guaranteed to
be current.

Other things - like CPUs - can theoretically be centrally tracked, but the
whole thing is distributed at the moment - compute nodes are the source of
truth, not the database - which makes some sense when you consider that a
compute node knows best what VMs are running and what VMs have died at any
given moment.  In truth, if the central service is in any way wrong (for
instance, processes outside of OpenStack are using a lot of CPU, which you
can't predict, again) then it makes sense for the compute node to be the
final arbiter, so (occasional, infrequent) reschedules are probably
appropriate anyway.
-- 
Ian.


Re: [openstack-dev] Scheduler proposal

2015-10-08 Thread Ian Wells
On 8 October 2015 at 09:10, Ed Leafe  wrote:

> You've hit upon the problem with the current design: multiple, and
> potentially out-of-sync copies of the data.


Arguably, this is the *intent* of the current design, not a problem with
it.  The data can never be perfect (ever) so go with 'good enough' and run
with it, and deal with the corner cases.  Truth be told, storing that data
in MySQL is secondary to the correct functioning of the scheduler.  The one
thing it helps with is when the scheduler restarts - it stands a chance of
making sensible decisions before it gets its full picture back.  (This is
all very like route distribution protocols, you know: make the best
decision on the information you have to hand, assuming the rest of the
system will deal with your mistakes.  And hold times, and graceful restart,
and...)


> What you're proposing doesn't really sound all that different than the
> current design, which has the compute nodes send the updates in their state
> to the scheduler both on a scheduled task, and in response to changes. The
> impetus for the Cassandra proposal was to eliminate this duplication, and
> have the resources being scheduled and the scheduler all working with the
> same data.


Is there any reason why the duplication (given it's not a huge amount of
data - megabytes, not gigabytes) is a problem?  Is there any reason why
inconsistency is a problem?

What you propose is a change in behaviour.  The scheduler today is intended
to make the best decision based on the available information, without
locks, and on the assumption that other things might be scheduling at the
same time.  Your proposal comes across as making all schedulers work on one
accurate copy of information that they keep updated (not, I think, entirely
synchronously, so they can still be working on outdated information, but
rather closer to it).  But when you have hundreds of hosts willing to take
a machine then there's typically no one answer to a scheduling decision and
we can tolerate really quite a lot of variability.

I do sympathise with your point in the following email where you have 5 VMs
scheduled by 5 schedulers to the same host, but consider:

1. if only one host suits the 5 VMs this results in the same behaviour: 1
VM runs, the rest don't.  There's more work to discover that but arguably
less work than maintaining a consistent database.
2. if many hosts suit the 5 VMs then this is *very* unlucky, because we
should be choosing a host at random from the set of suitable hosts and
that's a huge coincidence - so this is a tiny corner case that we shouldn't
be designing around

The worst case, is, however

3. we attempt to pick the optimal host, and the optimal host for all 5 VMs
is the same despite there being other less perfect choices out there.  That
would get you a stampeding herd and a bunch of retries.

I admit that the current system does not solve well for (3).
-- 
Ian.


Re: [openstack-dev] Scheduler proposal

2015-10-07 Thread Clint Byrum
Excerpts from Zane Bitter's message of 2015-10-07 12:28:36 -0700:
> On 07/10/15 13:36, Ed Leafe wrote:
> > Several months ago I proposed an experiment [0] to see if switching the 
> > data model for the Nova scheduler to use Cassandra as the backend would be 
> > a significant improvement as opposed to the current design using multiple 
> > copies of the same data (compute_node in MySQL DB, HostState in memory in 
> > the scheduler, ResourceTracker in memory in the compute node) and trying to 
> > keep them all in sync via passing messages.
> 
> It seems to me (disclaimer: not a Nova dev) that which database to use 
> is completely irrelevant to your proposal, which is really about moving 
> the scheduling from a distributed collection of Python processes with 
> ad-hoc (or sometimes completely missing) synchronisation into the 
> database to take advantage of its well-defined semantics. But you've 
> framed it in such a way as to guarantee that this never gets discussed, 
> because everyone will be too busy arguing about whether or not Cassandra 
> is better than Galera.
> 

Your point is valid Zane, that the idea is more about having a
synchronized view of the scheduling state, and not about Cassandra.

I think Cassandra makes the proposal more realistic and easier to think
about, though, as Cassandra is focused on problems of the scale that this
represents. Galera won't do this well at any kind of scale, without
the added complexity and inefficiency of cells. So whatever write churn a
single Galera node can handle for a truly synchronized scheduler would be
the maximum capacity of one cell.

I like the concrete nature of this proposal, and suggest people review
it as a whole, and not try to reduce it to its components without an
extremely strong reason to do so.



Re: [openstack-dev] Scheduler proposal

2015-10-07 Thread Chris Friesen

On 10/07/2015 11:36 AM, Ed Leafe wrote:


I've finally gotten around to finishing writing up that proposal [1], and I'd
like to hope that it would be the basis for future discussions about
addressing some of the underlying issues that exist in OpenStack for
historical reasons, and how we might rethink these choices today. I'd prefer
comments and discussion here on the dev list, so that all can see your ideas,
but I will be in Tokyo for the summit, and would also welcome some informal
discussion there, too.

-- Ed Leafe

 [1] http://blog.leafe.com/reimagining_scheduler/


I've wondered for a while (ever since I looked at the scheduler code, really) 
why we couldn't implement more of the scheduler as database transactions.


I haven't used Cassandra, so maybe you can clarify something about updates 
across a distributed DB.  I just read up on lightweight transactions, and it 
says that they're restricted to a single partition.  Is that an acceptable 
limitation for this usage?


Some points that might warrant further discussion:

1) Some resources (RAM) only require tracking amounts.  Other resources (CPUs, 
PCI devices) require tracking allocation of specific individual host resources 
(for CPU pinning, PCI device allocation, etc.).  Presumably for the latter we 
would have to actually do the allocation of resources at the time of the 
scheduling operation in order to update the database with the claimed resources 
in a race-free way.


2) Are you suggesting that all of nova switch to Cassandra, or just the 
scheduler and resource tracking portions?  If the latter, how would we handle 
things like pinned CPUs and PCI devices that are currently associated with 
specific instances in the nova DB?


3) The concept of the compute node updating the DB when things change is really 
orthogonal to the new scheduling model.  The current scheduling model would 
benefit from that as well.


4) It seems to me that to avoid races we need to do one of the following.  Which 
are you proposing?
a) Serialize the entire scheduling operation so that only one instance can 
schedule at once.
b) Make the evaluation of filters and claiming of resources a single atomic DB 
transaction.
c) Do a loop where we evaluate the filters, pick a destination, try to claim the 
resources in the DB, and retry the whole thing if the resources have already 
been claimed.


Chris



Re: [openstack-dev] Scheduler proposal

2015-10-07 Thread Ed Leafe
On Oct 7, 2015, at 6:00 PM, Chris Friesen  wrote:

> I've wondered for a while (ever since I looked at the scheduler code, really) 
> why we couldn't implement more of the scheduler as database transactions.
> 
> I haven't used Cassandra, so maybe you can clarify something about updates 
> across a distributed DB.  I just read up on lightweight transactions, and it 
> says that they're restricted to a single partition.  Is that an acceptable 
> limitation for this usage?

An implementation detail. A partition is defined by the partition key, not by 
any physical arrangement of nodes. The partition key would have to depend on 
the resource type, and whatever other columns would make such a query unique.

> Some points that might warrant further discussion:
> 
> 1) Some resources (RAM) only require tracking amounts.  Other resources 
> (CPUs, PCI devices) require tracking allocation of specific individual host 
> resources (for CPU pinning, PCI device allocation, etc.).  Presumably for the 
> latter we would have to actually do the allocation of resources at the time 
> of the scheduling operation in order to update the database with the claimed 
> resources in a race-free way.

Yes, that's correct. A lot of thought would have to be put into how to best 
represent these different types of resources, and that's something that I have 
ideas about, but would feel a whole lot better defining only after talking 
these concepts over with others who understand the underlying concepts better 
than I do.

> 2) Are you suggesting that all of nova switch to Cassandra, or just the 
> scheduler and resource tracking portions?  If the latter, how would we handle 
> things like pinned CPUs and PCI devices that are currently associated with 
> specific instances in the nova DB?

I am only thinking of the scheduler as a separate service. Perhaps Nova as a 
whole might benefit from switching to Cassandra for its database needs, but I 
haven't really thought about that at all.

> 3) The concept of the compute node updating the DB when things change is 
> really orthogonal to the new scheduling model.  The current scheduling model 
> would benefit from that as well.

Actually, it isn't that different. Compute nodes send updates to the scheduler 
when instances are created/deleted/resized/etc., so this isn't much of a 
stretch.

> 4) It seems to me that to avoid races we need to do one of the following.  
> Which are you proposing?
> a) Serialize the entire scheduling operation so that only one instance can 
> schedule at once.
> b) Make the evaluation of filters and claiming of resources a single atomic 
> DB transaction.
> c) Do a loop where we evaluate the filters, pick a destination, try to claim 
> the resources in the DB, and retry the whole thing if the resources have 
> already been claimed.

Probably a combination of b) and c). Filters would, for lack of a better term, 
add CQL WHERE clauses to the query, which would return a set of acceptable 
hosts. Weighers would order these hosts in terms of desirability, and then the 
claim would be attempted. If the claim failed because the host had changed, the 
next acceptable host would be selected, etc. I don't imagine that "retrying the 
whole thing" would be an efficient option, unless there were no other 
acceptable hosts returned from the original filtering query.

Put another way: if we are in a racy situation, and two scheduler processes are 
trying to place a similar instance, both processes would most likely come up 
with the same set of hosts ordered in the same way. One of those processes 
would "win", and claim the first choice. The other would fail the transaction, 
and would then claim the second choice on the list. IMO, this is how you best 
deal with race conditions.
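
A sketch of what I have in mind (the schema, query and weigher are all
illustrative -- a real layout would be designed so the common filters hit a
partition or an index rather than leaning on ALLOW FILTERING):

    from cassandra.cluster import Cluster

    session = Cluster(['cass1', 'cass2']).connect('scheduler')

    def schedule(ram_mb):
        # "Filters" become predicates on the query itself.
        rows = session.execute(
            "SELECT host, free_ram_mb FROM resources "
            "WHERE resource_type = 'compute' AND free_ram_mb >= %s "
            "ALLOW FILTERING", [ram_mb])

        # "Weighers" order the acceptable hosts; here, most free RAM first.
        for row in sorted(rows, key=lambda r: r.free_ram_mb, reverse=True):
            claimed = session.execute(
                "UPDATE resources SET free_ram_mb = %s "
                "WHERE resource_type = 'compute' AND host = %s "
                "IF free_ram_mb = %s",
                [row.free_ram_mb - ram_mb, row.host, row.free_ram_mb])
            if claimed.was_applied:
                return row.host     # our claim won
            # Otherwise another scheduler beat us to this host; fall through
            # and try the next acceptable host rather than re-running the
            # whole query.
        raise Exception('NoValidHost')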


-- Ed Leafe









Re: [openstack-dev] Scheduler proposal

2015-10-07 Thread Chris Friesen

On 10/07/2015 07:23 PM, Ian Wells wrote:

On 7 October 2015 at 16:00, Chris Friesen > wrote:

1) Some resources (RAM) only require tracking amounts.  Other resources
(CPUs, PCI devices) require tracking allocation of specific individual host
resources (for CPU pinning, PCI device allocation, etc.).  Presumably for
the latter we would have to actually do the allocation of resources at the
time of the scheduling operation in order to update the database with the
claimed resources in a race-free way.


The whole process is inherently racy (and this is inevitable, and correct),
which is why the scheduler works the way it does:

- scheduler guesses at a host based on (guaranteed - hello distributed systems!)
outdated information
- VM is scheduled to a host that looks like it might work, and host attempts to
run it
- VM run may fail (because the information was outdated or has become outdated),
in which case we retry the schedule


Why is it inevitable?

Theoretically if the DB knew about what resources were originally available and 
what resources have been consumed, then it should be able to allocate resources 
race-free (possibly with some retries involved if racing against other 
schedulers updating the DB, but that would be internal to the scheduler itself).


Or does that just not scale enough and we need to use inherently racy models?

Chris




Re: [openstack-dev] Scheduler proposal

2015-10-07 Thread Ian Wells
On 7 October 2015 at 16:00, Chris Friesen 
wrote:

> 1) Some resources (RAM) only require tracking amounts.  Other resources
> (CPUs, PCI devices) require tracking allocation of specific individual host
> resources (for CPU pinning, PCI device allocation, etc.).  Presumably for
> the latter we would have to actually do the allocation of resources at the
> time of the scheduling operation in order to update the database with the
> claimed resources in a race-free way.
>

The whole process is inherently racy (and this is inevitable, and correct),
which is why the scheduler works the way it does:

- scheduler guesses at a host based on (guaranteed - hello distributed
systems!) outdated information
- VM is scheduled to a host that looks like it might work, and host
attempts to run it
- VM run may fail (because the information was outdated or has become
outdated), in which case we retry the schedule

In fact, with PCI devices the code has been written rather carefully to
make sure that they fit into this model.  There is central per-device
tracking (which, fwiw, I argued against back in the day) but that's not how
allocation works (or, considering how long it is since I looked, worked).

PCI devices are actually allocated from pools of equivalent devices, and
allocation works in the same manner as other scheduling: you work out from
the nova boot call what constraints a host must satisfy (in this case, in
number of PCI devices in specific pools), you check your best guess at
global host state against those constraints, and you pick one of the hosts
that meets the constraints to schedule on.

So: yes, there is a central registry of devices, which we try to keep up to
date - but this is for admins to refer to, it's not a necessity of
scheduling.  The scheduler input is the pool counts, which work largely the
same way as the available memory works as regards scheduling and updating.
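
In other words the PCI check is just a counting filter over pool state,
something like this toy version (the data shapes are made up; IIRC the real
pools also key on NUMA node and device tags):

    def pci_filter(hosts, requested):
        # hosts: {'compute-1': {'pci_pools': {('8086', '10fb'): 4, ...}}, ...}
        # requested: pool key -> device count, e.g. {('8086', '10fb'): 2}
        acceptable = []
        for name, state in hosts.items():
            pools = state.get('pci_pools', {})
            if all(pools.get(key, 0) >= count
                   for key, count in requested.items()):
                acceptable.append(name)
        return acceptable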

No idea on CPUs, sorry, but again I'm not sure why the behaviour would be
any different: compare suspected host state against needs, schedule if it
fits, hope you got it right and tolerate if you didn't.

That being the case, it's worth noting that the database can be eventually
consistent and doesn't need to be transactional.  It's also worth
considering that the database can have multiple (mutually inconsistent)
copies.  There's no need to use a central datastore if you don't want to -
one theoretical example is to run multiple schedulers and let each
scheduler attempt to collate cloud state from unreliable messages from the
compute hosts.  This is not quite what happens today, because messages we
send over Rabbit are reliable and therefore costly.
-- 
Ian.


Re: [openstack-dev] Scheduler proposal

2015-10-07 Thread Ed Leafe
On Oct 7, 2015, at 2:28 PM, Zane Bitter  wrote:

> It seems to me (disclaimer: not a Nova dev) that which database to use is 
> completely irrelevant to your proposal,

Well, not entirely. The difference is that what Cassandra offers that separates 
it from other DBs is exactly the feature that we need. The solution to the 
scheduler isn't to simply "use a database".

> which is really about moving the scheduling from a distributed collection of 
> Python processes with ad-hoc (or sometimes completely missing) 
> synchronisation into the database to take advantage of its well-defined 
> semantics. But you've framed it in such a way as to guarantee that this never 
> gets discussed, because everyone will be too busy arguing about whether or 
> not Cassandra is better than Galera.

Understood - all one has to do is review the original thread from back in July 
to see this happening. But the reason that I framed it then as an experiment in 
which we would come up with measures of success we could all agree on up-front 
was so that if someone else thought that Product Foo would be even better, we 
could set up a similar test bed and try it out. IOW, instead of bikeshedding, 
if you want a different color, you build another shed and we can all have a 
look.


-- Ed Leafe









Re: [openstack-dev] Scheduler proposal

2015-10-07 Thread Fox, Kevin M
I think if you went ahead and did the experiment, and had good results from it, 
the discussion would start to progress whether or not folks were fond of 
Cassandra or ...

Thanks,
Kevin

From: Ed Leafe [e...@leafe.com]
Sent: Wednesday, October 07, 2015 5:24 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] Scheduler proposal

On Oct 7, 2015, at 2:28 PM, Zane Bitter <zbit...@redhat.com> wrote:

> It seems to me (disclaimer: not a Nova dev) that which database to use is 
> completely irrelevant to your proposal,

Well, not entirely. The difference is that what Cassandra offers that separates 
it from other DBs is exactly the feature that we need. The solution to the 
scheduler isn't to simply "use a database".

> which is really about moving the scheduling from a distributed collection of 
> Python processes with ad-hoc (or sometimes completely missing) 
> synchronisation into the database to take advantage of its well-defined 
> semantics. But you've framed it in such a way as to guarantee that this never 
> gets discussed, because everyone will be too busy arguing about whether or 
> not Cassandra is better than Galera.

Understood - all one has to do is review the original thread from back in July 
to see this happening. But the reason that I framed it then as an experiment in 
which we would come up with measures of success we could all agree on up-front 
was so that if someone else thought that Product Foo would be even better, we 
could set up a similar test bed and try it out. IOW, instead of bikeshedding, 
if you want a different color, you build another shed and we can all have a 
look.


-- Ed Leafe








Re: [openstack-dev] Scheduler proposal

2015-10-07 Thread Zane Bitter

On 07/10/15 13:36, Ed Leafe wrote:

Several months ago I proposed an experiment [0] to see if switching the data 
model for the Nova scheduler to use Cassandra as the backend would be a 
significant improvement as opposed to the current design using multiple copies 
of the same data (compute_node in MySQL DB, HostState in memory in the 
scheduler, ResourceTracker in memory in the compute node) and trying to keep 
them all in sync via passing messages.


It seems to me (disclaimer: not a Nova dev) that which database to use 
is completely irrelevant to your proposal, which is really about moving 
the scheduling from a distributed collection of Python processes with 
ad-hoc (or sometimes completely missing) synchronisation into the 
database to take advantage of its well-defined semantics. But you've 
framed it in such a way as to guarantee that this never gets discussed, 
because everyone will be too busy arguing about whether or not Cassandra 
is better than Galera.


cheers,
Zane.


