Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-18 Thread Sylvain Bauza
2014-03-18 14:07 GMT+01:00 Russell Bryant :

>
> I think it's great to see discussion of better ways to approach these
> things, but it would have to be Juno work.
>
>
+1. There are various blueprints about the scheduler in progress, related
to either splitting it out or scaling it, and IMHO this concurrency problem
should be discussed during the Juno summit in order to make sure there
won't be duplicate efforts.

-Sylvain

--
> Russell Bryant
>


Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-18 Thread Russell Bryant
On 03/17/2014 01:54 PM, John Garbutt wrote:
> On 15 March 2014 18:39, Chris Friesen  wrote:
>> Hi,
>>
>> I'm curious why the specified git commit chose to fix the anti-affinity race
>> condition by aborting the boot and triggering a reschedule.
>>
>> It seems to me that it would have been more elegant for the scheduler to do
>> a database transaction that would atomically check that the chosen host was
>> not already part of the group, and then add the instance (with the chosen
>> host) to the group.  If the check fails then the scheduler could update the
>> group_hosts list and reschedule.  This would prevent the race condition in
>> the first place rather than detecting it later and trying to work around it.
>>
>> This would require setting the "host" field in the instance at the time of
>> scheduling rather than the time of instance creation, but that seems like it
>> should work okay.  Maybe I'm missing something though...
> 
> We deal with memory races in the same way as this today, when they
> race against the scheduler.
> 
> Given the scheduler split, writing that value into the nova db from
> the scheduler would be a step backwards, and it probably breaks lots
> of code that assumes the host is not set until much later.

This is exactly the reason I did it this way.  It fits the existing
pattern with how we deal with host scheduling races today.  We do the
final claiming and validation on the compute node itself and kick back
to the scheduler if something doesn't work out.  Alternatives are *way*
too risky to be doing in feature freeze, IMO.
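
In rough outline, the pattern looks something like this (illustrative names
and structure only, not the actual Nova code paths):

class RescheduleNeeded(Exception):
    """Raised when the late policy check on the compute node fails."""


def check_anti_affinity(chosen_host, hosts_in_group):
    # Final validation done on the compute node itself, just before spawn:
    # the scheduler's earlier decision may have raced with another boot.
    if chosen_host in hosts_in_group:
        raise RescheduleNeeded("host %s already used by this group" % chosen_host)


def boot_on_host(chosen_host, hosts_in_group, other_hosts):
    try:
        check_anti_affinity(chosen_host, hosts_in_group)
        print("spawning on %s" % chosen_host)
    except RescheduleNeeded:
        # Abort the boot and hand the request back to the scheduler, which
        # now knows about the conflicting host and can pick another one.
        candidates = [h for h in other_hosts if h not in hosts_in_group]
        if not candidates:
            raise
        print("rescheduling to %s" % candidates[0])


if __name__ == "__main__":
    boot_on_host("node1", {"node1"}, ["node2", "node3"])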

I think it's great to see discussion of better ways to approach these
things, but it would have to be Juno work.

-- 
Russell Bryant



Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-18 Thread Sylvain Bauza
Hi Chris,


2014-03-18 0:36 GMT+01:00 Chris Friesen :

> On 03/17/2014 05:01 PM, Sylvain Bauza wrote:
>
>
>> There are 2 distinct cases :
>> 1. there are multiple schedulers involved in the decision
>> 2. there is one single scheduler but there is a race condition on it
>>
>
>
>  About 1., I agree we need to see how the scheduler (and later on Gantt)
>> could address decision-making based on distributed engines. At least, I
>> consider the no-db scheduler blueprint responsible for using memcache
>> instead of a relational DB could help some of these issues, as memcached
>> can be distributed efficiently.
>>
>
> With a central database we could do a single atomic transaction that looks
> something like "select the first host A from list of hosts L that is not in
> the list of hosts used by servers in group G and then set the host field
> for server S to A".  In that context simultaneous updates can't happen
> because they're serialized by the central database.
>
> How would one handle the above for simultaneous scheduling operations
> without a centralized data store?  (I've never played with memcached, so
> I'm not really familiar with what it can do.)
>
>
See the rationale for the memcached-based scheduler here:
https://blueprints.launchpad.net/nova/+spec/no-db-scheduler
The idea is to leverage the capabilities of distributed memcached servers,
with synchronization, so that the decision-making scales. As said in the
blueprint, another way would be to make use of RPC fanouts, but that's
something OpenStack in general tries to avoid.
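
To make that a bit more concrete, here is a hand-wavy sketch (not the
blueprint's actual design) of schedulers coordinating through memcached; it
assumes the python-memcached library and a memcached server on localhost:

import memcache

mc = memcache.Client(['127.0.0.1:11211'])


def try_claim_host(group_id, host, ttl=60):
    # add() is atomic on the memcached server: it fails if the key already
    # exists, so only one scheduler wins the claim for this group/host pair.
    key = 'antiaffinity:%s:%s' % (group_id, host)
    return bool(mc.add(key, 'claimed', time=ttl))


def pick_host(group_id, candidate_hosts):
    for host in candidate_hosts:
        if try_claim_host(group_id, host):
            return host
    return None  # no host satisfies the policy right now


if __name__ == '__main__':
    print(pick_host('group-1', ['node1', 'node2', 'node3']))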



>
>  About 2., that's a concurrency issue which can be addressed thanks to
>> common practices for synchronizing actions. IMHO, a local lock can be
>> enough for ensuring isolation
>>
>
> It's not that simple though.  Currently the scheduler makes a decision,
> but the results of that decision aren't actually kept in the scheduler or
> written back to the db until much later when the instance is actually
> spawned on the compute node.  So when the next scheduler request comes in
> we violate the scheduling policy.  Local locking wouldn't help this.
>
>
>
Uh, you're right, I missed that crucial point. That said, we should treat
this as a classical placement problem with a deferred action. One
possibility would be to consider the host as locked to this group at
scheduling decision time, even if the first instance hasn't yet booted.
Think of it as a "cache" entry with a TTL, if you wish. That implies the
scheduler would need feedback from the compute node saying that the
instance really booted. If no ACK comes from the compute node before the
TTL expires, the lock is freed.
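
Roughly, something like the following sketch (made-up names, and a plain
in-memory dict standing in for whatever shared store the scheduler would
really use):

import time


class GroupHostClaims(object):
    def __init__(self, ttl=120):
        self.ttl = ttl
        self._claims = {}  # (group_id, host) -> (expires_at, confirmed)

    def claim(self, group_id, host):
        """Tentatively reserve host for group at scheduling-decision time."""
        now = time.time()
        key = (group_id, host)
        expires_at, confirmed = self._claims.get(key, (0, False))
        if confirmed or expires_at > now:
            return False  # host already used (or claimed) by this group
        self._claims[key] = (now + self.ttl, False)
        return True

    def ack(self, group_id, host):
        """Compute node reports the instance really booted: keep the lock."""
        self._claims[(group_id, host)] = (float('inf'), True)

    # If no ack() arrives, the claim simply expires once the TTL has passed,
    # and claim() will succeed again for that host.


if __name__ == '__main__':
    claims = GroupHostClaims(ttl=2)
    print(claims.claim('g1', 'node1'))   # True: first claim wins
    print(claims.claim('g1', 'node1'))   # False: still within the TTL
    time.sleep(2.1)
    print(claims.claim('g1', 'node1'))   # True again: no ACK, TTL expired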

-Sylvain



> Chris
>
>
>
>


Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-17 Thread Chris Friesen

On 03/17/2014 05:01 PM, Sylvain Bauza wrote:



There are 2 distinct cases :
1. there are multiple schedulers involved in the decision
2. there is one single scheduler but there is a race condition on it




About 1., I agree we need to see how the scheduler (and later on Gantt)
could address decision-making based on distributed engines. At least, I
consider the no-db scheduler blueprint responsible for using memcache
instead of a relational DB could help some of these issues, as memcached
can be distributed efficiently.


With a central database we could do a single atomic transaction that 
looks something like "select the first host A from list of hosts L that 
is not in the list of hosts used by servers in group G and then set the 
host field for server S to A".  In that context simultaneous updates 
can't happen because they're serialized by the central database.
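
For illustration, a toy version of that kind of transaction with SQLAlchemy
against a deliberately simplified schema (not Nova's real schema or code;
SQLite is used only so the snippet runs standalone, and on MySQL/PostgreSQL
the SELECT would carry FOR UPDATE so concurrent schedulers really are
serialized by the database):

from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")


def pick_and_assign_host(conn, server_id, group_id, candidate_hosts):
    # One transaction: read the hosts already used by the group, pick the
    # first free candidate, and record it as the server's host.
    with conn.begin():
        rows = conn.execute(
            text("SELECT host FROM instances "
                 "WHERE group_id = :gid AND host IS NOT NULL"),
            {"gid": group_id})
        used = {row[0] for row in rows}
        for host in candidate_hosts:
            if host not in used:
                conn.execute(
                    text("UPDATE instances SET host = :h WHERE id = :sid"),
                    {"h": host, "sid": server_id})
                return host
        return None  # caller updates its group_hosts view and reschedules


if __name__ == "__main__":
    with engine.connect() as conn:
        with conn.begin():
            conn.execute(text(
                "CREATE TABLE instances (id INTEGER, group_id TEXT, host TEXT)"))
            conn.execute(text(
                "INSERT INTO instances VALUES (1, 'g1', 'node1'), (2, 'g1', NULL)"))
        print(pick_and_assign_host(conn, 2, "g1", ["node1", "node2"]))  # node2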


How would one handle the above for simultaneous scheduling operations 
without a centralized data store?  (I've never played with memcached, so 
I'm not really familiar with what it can do.)



About 2., that's a concurrency issue which can be addressed thanks to
common practices for synchronizing actions. IMHO, a local lock can be
enough for ensuring isolation


It's not that simple though.  Currently the scheduler makes a decision, 
but the results of that decision aren't actually kept in the scheduler 
or written back to the db until much later when the instance is actually 
spawned on the compute node.  So when the next scheduler request comes 
in we violate the scheduling policy.  Local locking wouldn't help this.


Chris






Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-17 Thread Sylvain Bauza
Hi Chris,




2014-03-17 23:08 GMT+01:00 Chris Friesen :

> On 03/17/2014 02:30 PM, Sylvain Bauza wrote:
>
>> There is a global concern here about how an holistic scheduler can
>> perform decisions, and from which key metrics.
>> The current effort is leading to having the Gantt DB updated thanks to
>> resource tracker for scheduling appropriately the hosts.
>>
>> If we consider these metrics as not enough, ie. that Gantt should
>> perform an active check to another project, that's something which needs
>> to be considered carefully. IMHO, on that case, Gantt should only access
>> metrics thanks to the project REST API (and python client) in order to
>> make sure that rolling upgrades could happen.
>> tl;dr: If Gantt requires accessing Nova data, it should request Nova
>> REST API, and not perform database access directly (even thru the
>> conductor)
>>
>
> Consider the case in point.
>
> 1) We create a server group with anti-affinity policy.  (So no two
> instances in the group should run on the same compute node.)
> 2) We boot a server in this group.
> 3) Either simultaneously (on a different scheduler) or immediately after
> (on the same scheduler) we boot another server in the same group.
>
> Ideally the scheduler should enforce the policy without any races.
> However, in the current code we don't update the instance entry in the
> database with the chosen host until we actually try and create it on the
> host.  Because of this we can end up putting both of them on the same
> compute node.
>
>
There are 2 distinct cases:
1. there are multiple schedulers involved in the decision
2. there is one single scheduler but there is a race condition on it

About 1., I agree we need to see how the scheduler (and later on Gantt)
could address decision-making based on distributed engines. At least, I
think the no-db scheduler blueprint, which proposes using memcached
instead of a relational DB, could help with some of these issues, as
memcached can be distributed efficiently.

About 2., that's a concurrency issue which can be addressed with common
synchronization practices. IMHO, a local lock could be enough to ensure
isolation.
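
By "local lock" I mean something as simple as the sketch below (illustrative
only; in Nova this would more likely go through the oslo lockutils helpers):

import collections
import threading

# Serializes placement decisions for a group within a single scheduler
# process.  It does not help across multiple schedulers, and it does not
# record the decision anywhere for later requests.
_group_locks = collections.defaultdict(threading.Lock)


def schedule_in_group(group_id, decide):
    # Run the placement decision for this group under a per-group lock so
    # two local threads can't pick hosts for the same group concurrently.
    with _group_locks[group_id]:
        return decide()


if __name__ == "__main__":
    print(schedule_in_group("group-1", lambda: "node1"))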



> Currently we only detect the problem when we go to actually boot the
> instance on the compute node because we have a special-case check to
> validate the policy.  Personally I think this is sort of a hack and it
> would be better to detect the problem within the scheduler itself.
>
> This is something that the scheduler should reasonably consider.  I see it
> as effectively consuming resources, except that in this case the resource
> is "the set of compute nodes not used by servers in the server group".
>
>
>
Agree. IMHO, the scheduler should take decisions based on its inputs and
guarantee the result. That said, at the moment we need to address the
issue at the compute manager level, because of the point above.

-Sylvain


> Chris
>


Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-17 Thread Joshua Harlow
This begins to sound like a hierarchical reservation system to me. Are
databases even capable of doing this correctly?

If I were going to do something like this in, say, ZooKeeper, it would appear
that this is just an atomic write to paths for resources (using the concept
of a ZooKeeper txn to ensure the write happens atomically). With Gantt, will
there be read-db slaves, or just 1 database? Will there be some required
hierarchical locking scheme (always lock in the same order) like ZooKeeper
would require (to avoid deadlock)? If more than 1 db (master-master,
master-slave?), how will this work? Forgive me for my limited DB knowledge,
but I thought RDBMSs used MVCC, which means that a read could see different
data than what is being written (so the write would fail?). What about using
something like Raft, ZooKeeper, …
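
For what it's worth, a hand-wavy sketch of that "atomic write to paths for
resources" idea using kazoo (the paths, addresses and layout here are made up
for illustration, not a proposed Gantt design):

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()


def claim_host_for_group(group_id, host, instance_id):
    """Atomically record that `host` is consumed by `group_id`.

    The create() fails with NodeExistsError if another scheduler already
    claimed this host for the group, so the check and the write are a single
    atomic step on the ZooKeeper side.
    """
    zk.ensure_path("/scheduler/groups/%s/hosts" % group_id)
    path = "/scheduler/groups/%s/hosts/%s" % (group_id, host)
    try:
        zk.create(path, instance_id.encode("utf-8"))
        return True
    except NodeExistsError:
        return False


if __name__ == "__main__":
    print(claim_host_for_group("group-1", "node1", "instance-abc"))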

-Josh

From: Chris Friesen <chris.frie...@windriver.com>
Reply-To: "OpenStack Development Mailing List (not for usage questions)"
<openstack-dev@lists.openstack.org>
Date: Monday, March 17, 2014 at 2:08 PM
To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity
race condition on boot"

On 03/17/2014 02:30 PM, Sylvain Bauza wrote:
There is a global concern here about how an holistic scheduler can
perform decisions, and from which key metrics.
The current effort is leading to having the Gantt DB updated thanks to
resource tracker for scheduling appropriately the hosts.

If we consider these metrics as not enough, ie. that Gantt should
perform an active check to another project, that's something which needs
to be considered carefully. IMHO, on that case, Gantt should only access
metrics thanks to the project REST API (and python client) in order to
make sure that rolling upgrades could happen.
tl;dr: If Gantt requires accessing Nova data, it should request Nova
REST API, and not perform database access directly (even thru the conductor)

Consider the case in point.

1) We create a server group with anti-affinity policy.  (So no two
instances in the group should run on the same compute node.)
2) We boot a server in this group.
3) Either simultaneously (on a different scheduler) or immediately after
(on the same scheduler) we boot another server in the same group.

Ideally the scheduler should enforce the policy without any races.
However, in the current code we don't update the instance entry in the
database with the chosen host until we actually try and create it on the
host.  Because of this we can end up putting both of them on the same
compute node.

Currently we only detect the problem when we go to actually boot the
instance on the compute node because we have a special-case check to
validate the policy.  Personally I think this is sort of a hack and it
would be better to detect the problem within the scheduler itself.

This is something that the scheduler should reasonably consider.  I see
it as effectively consuming resources, except that in this case the
resource is "the set of compute nodes not used by servers in the server
group".

Chris



Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-17 Thread Joe Gordon
On Mon, Mar 17, 2014 at 12:52 PM, Jay Pipes  wrote:

> On Mon, 2014-03-17 at 12:39 -0700, Joe Gordon wrote:
> > On Mon, Mar 17, 2014 at 12:29 PM, Andrew Laski
> >  wrote:
> > On 03/17/14 at 01:11pm, Chris Friesen wrote:
> > On 03/17/2014 11:59 AM, John Garbutt wrote:
> > On 17 March 2014 17:54, John Garbutt
> >  wrote:
> >
> > Given the scheduler split, writing
> > that value into the nova db from
> > the scheduler would be a step
> > backwards, and it probably breaks lots
> > of code that assumes the host is not
> > set until much later.
> >
> > Why would that be a step backwards?  The scheduler has
> > picked a host for the instance, so it seems reasonable
> > to record that information in the instance itself as
> > early as possible (to be incorporated into other
> > decision-making) rather than have it be implicit in
> > the destination of the next RPC message.
> >
> > Now I could believe that we have code that assumes
> > that having "instance.host" set implies that it's
> > already running on that host, but that's a different
> > issue.
> >
> > I forgot to mention, I am starting to be a fan
> > of a two-phase commit
> > approach, which could deal with these kinds of
> > things in a more
> > explicit way, before starting the main boot
> > process.
> >
> > It's not as elegant as a database transaction,
> > but that doesn't seem
> > possible in the long run, but there could well
> > be something I am
> > missing here too.
> >
> > I'm not an expert in this area, so I'm curious why you
> > think that database transactions wouldn't be possible
> > in the long run.
> >
> >
> > There has been some effort around splitting the scheduler out
> > of Nova and into its own project.  So down the road the
> > scheduler may not have direct access to the Nova db.
> >
> >
> > If we do pull out the nova scheduler it can have its own DB, so I
> > don't think this should be an issue.
>
> Just playing devil's advocate here, but even if Gantt had its own
> database, would that necessarily mean that there would be only a single
> database across the entire deployment? I'm thinking specifically in the
> case of cells, where presumably, scheduling requests would jump through
> multiple layers of Gantt services, would a single database transaction
> really be possible to effectively fence the entire scheduling request?
>


So that opens the whole can of Gantt-and-cells worms.  I would rather
evaluate design decisions more around what exists today and less on what we
think will exist in the future (although we definitely don't want to design
ourselves into a corner).  I'm just not very keen on the answer 'we
shouldn't do x because of this thing we talked about but haven't done.'

That being said, this debate gets more complicated when you factor in the
overhead of SQLAlchemy; if we can drop that overhead we solve a lot of
problems all at once (the db is used all over the place, both directly and
through the conductor, and SQLAlchemy can have a 10x+ overhead).

For historical reasons we have spent a lot of time trying to decouple the
SQL DB from the rest of the codebase, because we left the door open for
alternate DB backends (re: noSQL).  I think the ship has sailed on that one
and we shouldn't worry about designing things around possibly adding
in a noSQL backend in the future.



>
> Best,
> -jay
>
>
>


Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-17 Thread Chris Friesen

On 03/17/2014 02:30 PM, Sylvain Bauza wrote:

There is a global concern here about how an holistic scheduler can
perform decisions, and from which key metrics.
The current effort is leading to having the Gantt DB updated thanks to
resource tracker for scheduling appropriately the hosts.

If we consider these metrics as not enough, ie. that Gantt should
perform an active check to another project, that's something which needs
to be considered carefully. IMHO, on that case, Gantt should only access
metrics thanks to the project REST API (and python client) in order to
make sure that rolling upgrades could happen.
tl;dr: If Gantt requires accessing Nova data, it should request Nova
REST API, and not perform database access directly (even thru the conductor)


Consider the case in point.

1) We create a server group with anti-affinity policy.  (So no two 
instances in the group should run on the same compute node.)

2) We boot a server in this group.
3) Either simultaneously (on a different scheduler) or immediately after 
(on the same scheduler) we boot another server in the same group.


Ideally the scheduler should enforce the policy without any races. 
However, in the current code we don't update the instance entry in the 
database with the chosen host until we actually try and create it on the 
host.  Because of this we can end up putting both of them on the same 
compute node.


Currently we only detect the problem when we go to actually boot the 
instance on the compute node because we have a special-case check to 
validate the policy.  Personally I think this is sort of a hack and it 
would be better to detect the problem within the scheduler itself.


This is something that the scheduler should reasonably consider.  I see 
it as effectively consuming resources, except that in this case the 
resource is "the set of compute nodes not used by servers in the server 
group".


Chris



Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-17 Thread Sylvain Bauza
There is a global concern here about how a holistic scheduler can make
decisions, and from which key metrics.
The current effort is leading to having the Gantt DB updated by the
resource tracker so that hosts can be scheduled appropriately.

If we consider these metrics as not enough, i.e. that Gantt should perform
an active check against another project, that's something which needs to be
considered carefully. IMHO, in that case, Gantt should only access metrics
through the project's REST API (and python client) in order to make sure
that rolling upgrades can happen.
tl;dr: If Gantt requires access to Nova data, it should go through the Nova
REST API, and not perform database access directly (even thru the conductor).

-Sylvain


2014-03-17 21:10 GMT+01:00 Chris Friesen :

> On 03/17/2014 01:29 PM, Andrew Laski wrote:
>
>> On 03/17/14 at 01:11pm, Chris Friesen wrote:
>>
>>> On 03/17/2014 11:59 AM, John Garbutt wrote:
>>>
 On 17 March 2014 17:54, John Garbutt  wrote:

>>>
>>>  Given the scheduler split, writing that value into the nova db from
> the scheduler would be a step backwards, and it probably breaks lots
> of code that assumes the host is not set until much later.
>

>>> Why would that be a step backwards?  The scheduler has picked a host
>>> for the instance, so it seems reasonable to record that information in
>>> the instance itself as early as possible (to be incorporated into
>>> other decision-making) rather than have it be implicit in the
>>> destination of the next RPC message.
>>>
>>> Now I could believe that we have code that assumes that having
>>> "instance.host" set implies that it's already running on that host,
>>> but that's a different issue.
>>>
>>>  I forgot to mention, I am starting to be a fan of a two-phase commit
 approach, which could deal with these kinds of things in a more
 explicit way, before starting the main boot process.

 It's not as elegant as a database transaction, but that doesn't seem
 possible in the long run, but there could well be something I am
 missing here too.

>>>
>>> I'm not an expert in this area, so I'm curious why you think that
>>> database transactions wouldn't be possible in the long run.
>>>
>>
>> There has been some effort around splitting the scheduler out of Nova
>> and into its own project.  So down the road the scheduler may not have
>> direct access to the Nova db.
>>
>
>
> Even if the scheduler itself doesn't have access to the nova DB, at some
> point we need to return back from the scheduler into a nova service
> (presumably nova-conductor) at which point we could update the nova db with
> the scheduler's decision and at that point we could check for conflicts and
> reschedule if necessary.
>
>
> Chris
>


Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-17 Thread Chris Friesen

On 03/17/2014 01:29 PM, Andrew Laski wrote:

On 03/17/14 at 01:11pm, Chris Friesen wrote:

On 03/17/2014 11:59 AM, John Garbutt wrote:

On 17 March 2014 17:54, John Garbutt  wrote:



Given the scheduler split, writing that value into the nova db from
the scheduler would be a step backwards, and it probably breaks lots
of code that assumes the host is not set until much later.


Why would that be a step backwards?  The scheduler has picked a host
for the instance, so it seems reasonable to record that information in
the instance itself as early as possible (to be incorporated into
other decision-making) rather than have it be implicit in the
destination of the next RPC message.

Now I could believe that we have code that assumes that having
"instance.host" set implies that it's already running on that host,
but that's a different issue.


I forgot to mention, I am starting to be a fan of a two-phase commit
approach, which could deal with these kinds of things in a more
explicit way, before starting the main boot process.

It's not as elegant as a database transaction, but that doesn't seem
possible in the long run, but there could well be something I am
missing here too.


I'm not an expert in this area, so I'm curious why you think that
database transactions wouldn't be possible in the long run.


There has been some effort around splitting the scheduler out of Nova
and into its own project.  So down the road the scheduler may not have
direct access to the Nova db.



Even if the scheduler itself doesn't have access to the nova DB, at some
point we need to return from the scheduler into a nova service
(presumably nova-conductor), at which point we could update the nova db
with the scheduler's decision, check for conflicts, and reschedule if
necessary.


Chris



Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-17 Thread Jay Pipes
On Mon, 2014-03-17 at 12:39 -0700, Joe Gordon wrote:
> On Mon, Mar 17, 2014 at 12:29 PM, Andrew Laski
>  wrote:
> On 03/17/14 at 01:11pm, Chris Friesen wrote:
> On 03/17/2014 11:59 AM, John Garbutt wrote:
> On 17 March 2014 17:54, John Garbutt
>  wrote:
> 
> Given the scheduler split, writing
> that value into the nova db from
> the scheduler would be a step
> backwards, and it probably breaks lots
> of code that assumes the host is not
> set until much later.
> 
> Why would that be a step backwards?  The scheduler has
> picked a host for the instance, so it seems reasonable
> to record that information in the instance itself as
> early as possible (to be incorporated into other
> decision-making) rather than have it be implicit in
> the destination of the next RPC message.
> 
> Now I could believe that we have code that assumes
> that having "instance.host" set implies that it's
> already running on that host, but that's a different
> issue.
> 
> I forgot to mention, I am starting to be a fan
> of a two-phase commit
> approach, which could deal with these kinds of
> things in a more
> explicit way, before starting the main boot
> process.
> 
> It's not as elegant as a database transaction,
> but that doesn't seem
> possible in the long run, but there could well
> be something I am
> missing here too.
> 
> I'm not an expert in this area, so I'm curious why you
> think that database transactions wouldn't be possible
> in the long run.
> 
> 
> There has been some effort around splitting the scheduler out
> of Nova and into its own project.  So down the road the
> scheduler may not have direct access to the Nova db.
> 
> 
> If we do pull out the nova scheduler it can have its own DB, so I
> don't think this should be an issue.

Just playing devil's advocate here, but even if Gantt had its own
database, would that necessarily mean that there would be only a single
database across the entire deployment? I'm thinking specifically of the
case of cells, where presumably scheduling requests would jump through
multiple layers of Gantt services; would a single database transaction
really be able to effectively fence the entire scheduling request?

Best,
-jay





Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-17 Thread Joe Gordon
On Mon, Mar 17, 2014 at 12:29 PM, Andrew Laski
wrote:

> On 03/17/14 at 01:11pm, Chris Friesen wrote:
>
>> On 03/17/2014 11:59 AM, John Garbutt wrote:
>>
>>> On 17 March 2014 17:54, John Garbutt  wrote:
>>>
>>
>>  Given the scheduler split, writing that value into the nova db from
 the scheduler would be a step backwards, and it probably breaks lots
 of code that assumes the host is not set until much later.

>>>
>> Why would that be a step backwards?  The scheduler has picked a host for
>> the instance, so it seems reasonable to record that information in the
>> instance itself as early as possible (to be incorporated into other
>> decision-making) rather than have it be implicit in the destination of the
>> next RPC message.
>>
>> Now I could believe that we have code that assumes that having
>> "instance.host" set implies that it's already running on that host, but
>> that's a different issue.
>>
>>  I forgot to mention, I am starting to be a fan of a two-phase commit
>>> approach, which could deal with these kinds of things in a more
>>> explicit way, before starting the main boot process.
>>>
>>> It's not as elegant as a database transaction, but that doesn't seem
>>> possible in the long run, but there could well be something I am
>>> missing here too.
>>>
>>
>> I'm not an expert in this area, so I'm curious why you think that
>> database transactions wouldn't be possible in the long run.
>>
>
> There has been some effort around splitting the scheduler out of Nova and
> into its own project.  So down the road the scheduler may not have direct
> access to the Nova db.


If we do pull out the nova scheduler it can have its own DB, so I don't
think this should be an issue.


>
>
>
>> Given that the database is one of the few services that isn't prone to
>> races, it seems reasonable to me to implement decision-making as
>> transactions within the database.
>>
>> Where possible it seems to make a lot more sense to have the database do
>> an atomic transaction than to scan the database, extract a bunch of
>> (potentially unnecessary) data and transfer it over the network, do logic
>> in python, send the result back over the network and update the database
>> with the result.
>>
>> Chris
>>


Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-17 Thread Andrew Laski

On 03/17/14 at 01:11pm, Chris Friesen wrote:

On 03/17/2014 11:59 AM, John Garbutt wrote:

On 17 March 2014 17:54, John Garbutt  wrote:



Given the scheduler split, writing that value into the nova db from
the scheduler would be a step backwards, and it probably breaks lots
of code that assumes the host is not set until much later.


Why would that be a step backwards?  The scheduler has picked a host 
for the instance, so it seems reasonable to record that information 
in the instance itself as early as possible (to be incorporated into 
other decision-making) rather than have it be implicit in the 
destination of the next RPC message.


Now I could believe that we have code that assumes that having 
"instance.host" set implies that it's already running on that host, 
but that's a different issue.



I forgot to mention, I am starting to be a fan of a two-phase commit
approach, which could deal with these kinds of things in a more
explicit way, before starting the main boot process.

It's not as elegant as a database transaction, but that doesn't seem
possible in the long run, but there could well be something I am
missing here too.


I'm not an expert in this area, so I'm curious why you think that 
database transactions wouldn't be possible in the long run.


There has been some effort around splitting the scheduler out of Nova 
and into its own project.  So down the road the scheduler may not have 
direct access to the Nova db.




Given that the database is one of the few services that isn't prone 
to races, it seems reasonable to me to implement decision-making as 
transactions within the database.


Where possible it seems to make a lot more sense to have the database 
do an atomic transaction than to scan the database, extract a bunch 
of (potentially unnecessary) data and transfer it over the network, 
do logic in python, send the result back over the network and update 
the database with the result.


Chris



Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-17 Thread Chris Friesen

On 03/17/2014 11:59 AM, John Garbutt wrote:

On 17 March 2014 17:54, John Garbutt  wrote:



Given the scheduler split, writing that value into the nova db from
the scheduler would be a step backwards, and it probably breaks lots
of code that assumes the host is not set until much later.


Why would that be a step backwards?  The scheduler has picked a host for 
the instance, so it seems reasonable to record that information in the 
instance itself as early as possible (to be incorporated into other 
decision-making) rather than have it be implicit in the destination of 
the next RPC message.


Now I could believe that we have code that assumes that having 
"instance.host" set implies that it's already running on that host, but 
that's a different issue.



I forgot to mention, I am starting to be a fan of a two-phase commit
approach, which could deal with these kinds of things in a more
explicit way, before starting the main boot process.

It's not as elegant as a database transaction, but that doesn't seem
possible in the long run, but there could well be something I am
missing here too.


I'm not an expert in this area, so I'm curious why you think that 
database transactions wouldn't be possible in the long run.


Given that the database is one of the few services that isn't prone to 
races, it seems reasonable to me to implement decision-making as 
transactions within the database.


Where possible it seems to make a lot more sense to have the database do 
an atomic transaction than to scan the database, extract a bunch of 
(potentially unnecessary) data and transfer it over the network, do 
logic in python, send the result back over the network and update the 
database with the result.


Chris



Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-17 Thread John Garbutt
On 17 March 2014 17:54, John Garbutt  wrote:
> On 15 March 2014 18:39, Chris Friesen  wrote:
>> Hi,
>>
>> I'm curious why the specified git commit chose to fix the anti-affinity race
>> condition by aborting the boot and triggering a reschedule.
>>
>> It seems to me that it would have been more elegant for the scheduler to do
>> a database transaction that would atomically check that the chosen host was
>> not already part of the group, and then add the instance (with the chosen
>> host) to the group.  If the check fails then the scheduler could update the
>> group_hosts list and reschedule.  This would prevent the race condition in
>> the first place rather than detecting it later and trying to work around it.
>>
>> This would require setting the "host" field in the instance at the time of
>> scheduling rather than the time of instance creation, but that seems like it
>> should work okay.  Maybe I'm missing something though...
>
> We deal with memory races in the same way as this today, when they
> race against the scheduler.
>
> Given the scheduler split, writing that value into the nova db from
> the scheduler would be a step backwards, and it probably breaks lots
> of code that assumes the host is not set until much later.

I forgot to mention, I am starting to be a fan of a two-phase commit
approach, which could deal with these kinds of things in a more
explicit way, before starting the main boot process.

It's not as elegant as a database transaction, but that doesn't seem
possible in the long run, but there could well be something I am
missing here too.
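
As a toy illustration of what that two-phase style could look like (entirely
made up, none of this mirrors actual Nova interfaces):

class ComputeNode(object):
    def __init__(self, name):
        self.name = name
        self.claims = set()

    def prepare(self, instance_id, group_already_here):
        # Phase 1: check the anti-affinity policy and tentatively reserve
        # a slot for the instance.
        if group_already_here:
            return False
        self.claims.add(instance_id)
        return True

    def commit(self, instance_id):
        # Phase 2: the claim is confirmed, so the real boot can start.
        assert instance_id in self.claims
        print("booting %s on %s" % (instance_id, self.name))

    def abort(self, instance_id):
        # Roll back a claim that was never confirmed.
        self.claims.discard(instance_id)


def boot_with_two_phases(instance_id, nodes, group_hosts):
    for node in nodes:
        if node.prepare(instance_id, node.name in group_hosts):
            node.commit(instance_id)
            return node.name
    return None  # nothing claimable: reschedule or fail


if __name__ == "__main__":
    nodes = [ComputeNode("node1"), ComputeNode("node2")]
    print(boot_with_two_phases("inst-1", nodes, group_hosts={"node1"}))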

John



Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-17 Thread John Garbutt
On 15 March 2014 18:39, Chris Friesen  wrote:
> Hi,
>
> I'm curious why the specified git commit chose to fix the anti-affinity race
> condition by aborting the boot and triggering a reschedule.
>
> It seems to me that it would have been more elegant for the scheduler to do
> a database transaction that would atomically check that the chosen host was
> not already part of the group, and then add the instance (with the chosen
> host) to the group.  If the check fails then the scheduler could update the
> group_hosts list and reschedule.  This would prevent the race condition in
> the first place rather than detecting it later and trying to work around it.
>
> This would require setting the "host" field in the instance at the time of
> scheduling rather than the time of instance creation, but that seems like it
> should work okay.  Maybe I'm missing something though...

We deal with memory races in the same way as this today, when they
race against the scheduler.

Given the scheduler split, writing that value into the nova db from
the scheduler would be a step backwards, and it probably breaks lots
of code that assumes the host is not set until much later.

John



[openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

2014-03-15 Thread Chris Friesen

Hi,

I'm curious why the specified git commit chose to fix the anti-affinity 
race condition by aborting the boot and triggering a reschedule.


It seems to me that it would have been more elegant for the scheduler to 
do a database transaction that would atomically check that the chosen 
host was not already part of the group, and then add the instance (with 
the chosen host) to the group.  If the check fails then the scheduler 
could update the group_hosts list and reschedule.  This would prevent 
the race condition in the first place rather than detecting it later and 
trying to work around it.


This would require setting the "host" field in the instance at the time 
of scheduling rather than the time of instance creation, but that seems 
like it should work okay.  Maybe I'm missing something though...


Thanks,
Chris
