Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
2014-03-18 14:07 GMT+01:00 Russell Bryant:

> I think it's great to see discussion of better ways to approach these
> things, but it would have to be Juno work.

+1. There are various blueprints about the scheduler in progress, related to either splitting it out or scaling it, and IMHO this concurrency problem should be discussed at the Juno summit in order to make sure there won't be duplicated effort.

-Sylvain

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
On 03/17/2014 01:54 PM, John Garbutt wrote:

> On 15 March 2014 18:39, Chris Friesen wrote:
>> Hi,
>>
>> I'm curious why the specified git commit chose to fix the anti-affinity race condition by aborting the boot and triggering a reschedule.
>>
>> It seems to me that it would have been more elegant for the scheduler to do a database transaction that would atomically check that the chosen host was not already part of the group, and then add the instance (with the chosen host) to the group. If the check fails then the scheduler could update the group_hosts list and reschedule. This would prevent the race condition in the first place rather than detecting it later and trying to work around it.
>>
>> This would require setting the "host" field in the instance at the time of scheduling rather than the time of instance creation, but that seems like it should work okay. Maybe I'm missing something though...
>
> We deal with memory races in the same way as this today, when they race against the scheduler.
>
> Given the scheduler split, writing that value into the nova db from the scheduler would be a step backwards, and it probably breaks lots of code that assumes the host is not set until much later.

This is exactly the reason I did it this way. It fits the existing pattern for how we deal with host scheduling races today. We do the final claiming and validation on the compute node itself and kick back to the scheduler if something doesn't work out. Alternatives are *way* too risky to be doing in feature freeze, IMO.

I think it's great to see discussion of better ways to approach these things, but it would have to be Juno work.

--
Russell Bryant
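[Editor's note] The "claim late, reschedule on conflict" pattern Russell describes can be sketched as a toy. This is purely illustrative, not Nova's actual code: `claim_on_compute`, the in-memory `group_hosts` set, and the retry loop are hypothetical stand-ins for the compute manager's late policy check and the scheduler's retry filter.

```python
# Toy sketch of the e41fb84 approach: the scheduler picks a host
# optimistically, the compute node validates the anti-affinity policy at
# claim time, and a violation kicks the request back for a reschedule.

class AntiAffinityViolation(Exception):
    pass

group_hosts = set()              # hosts already used by members of the group

def claim_on_compute(host):
    """Late check on the 'compute node', as in the e41fb84 fix."""
    if host in group_hosts:
        raise AntiAffinityViolation(host)
    group_hosts.add(host)

def schedule(candidates, max_retries=3):
    excluded = set()
    for _ in range(max_retries):
        host = next(h for h in candidates if h not in excluded)
        try:
            claim_on_compute(host)
            return host
        except AntiAffinityViolation:
            excluded.add(host)   # reschedule, skipping the losing host
    raise RuntimeError("no valid host found")

# Two boots in the same anti-affinity group land on different hosts, even
# though both initially pick node1:
first = schedule(["node1", "node2", "node3"])
second = schedule(["node1", "node2", "node3"])
```

The cost of this pattern, as the thread notes, is a wasted round trip to the compute node whenever the race fires.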
Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
Hi Chris,

2014-03-18 0:36 GMT+01:00 Chris Friesen:

> On 03/17/2014 05:01 PM, Sylvain Bauza wrote:
>
>> There are 2 distinct cases :
>> 1. there are multiple schedulers involved in the decision
>> 2. there is one single scheduler but there is a race condition on it
>>
>> About 1., I agree we need to see how the scheduler (and later on Gantt) could address decision-making based on distributed engines. At least, I consider the no-db scheduler blueprint, which proposes using memcached instead of a relational DB, could help with some of these issues, as memcached can be distributed efficiently.
>
> With a central database we could do a single atomic transaction that looks something like "select the first host A from list of hosts L that is not in the list of hosts used by servers in group G and then set the host field for server S to A". In that context simultaneous updates can't happen because they're serialized by the central database.
>
> How would one handle the above for simultaneous scheduling operations without a centralized data store? (I've never played with memcached, so I'm not really familiar with what it can do.)

See the rationale for the memcached-based scheduler here:
https://blueprints.launchpad.net/nova/+spec/no-db-scheduler

The idea is to leverage the capabilities of distributed memcached servers with synchronization so that the decision would be scalable. As said in the blueprint, another way would be to make use of RPC fanouts, but that's something OpenStack in general tries to avoid.

>> About 2., that's a concurrency issue which can be addressed thanks to common practices for synchronizing actions. IMHO, a local lock can be enough for ensuring isolation.
>
> It's not that simple though. Currently the scheduler makes a decision, but the results of that decision aren't actually kept in the scheduler or written back to the db until much later when the instance is actually spawned on the compute node. So when the next scheduler request comes in we violate the scheduling policy. Local locking wouldn't help this.

Uh, you're right, I missed that crucial point. That said, we should consider it a classical problem of placement with a deferred action. One possibility would be to consider that the host is locked to this group at scheduling-decision time, even if the first instance hasn't yet booted. Consider it a "cache" with a TTL if you wish. That implies the scheduler would need a feedback value from the compute node saying that the instance really booted. If no ACK comes from the compute node, once the TTL expires, the lock is freed.

-Sylvain

> Chris
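[Editor's note] Sylvain's "lock with TTL, freed unless the compute node ACKs" idea can be sketched in a few lines. This is a toy, single-process illustration; `HostReservations`, `reserve`, and `ack` are hypothetical names, not an existing Nova or Gantt interface.

```python
# A host is locked to a group at decision time; the lock either becomes
# permanent when the compute node ACKs the boot, or silently expires
# after the TTL if no ACK ever arrives.
import time

class HostReservations:
    def __init__(self, ttl=5.0):
        self.ttl = ttl
        self._held = {}                    # host -> (group, expiry or None)

    def _expired(self, entry):
        _group, expiry = entry
        return expiry is not None and expiry < time.monotonic()

    def reserve(self, host, group):
        entry = self._held.get(host)
        if entry and not self._expired(entry):
            return False                   # host already locked to a group
        self._held[host] = (group, time.monotonic() + self.ttl)
        return True

    def ack(self, host):
        group, _ = self._held[host]
        self._held[host] = (group, None)   # instance booted: lock is permanent

res = HostReservations(ttl=0.05)
assert res.reserve("node1", "groupA")      # decision time: lock the host
assert not res.reserve("node1", "groupA")  # concurrent request is refused
time.sleep(0.1)                            # no ACK arrived...
assert res.reserve("node1", "groupA")      # ...so the TTL freed the lock
res.ack("node1")
time.sleep(0.1)
assert not res.reserve("node1", "groupB")  # an ACKed lock never expires
```

Picking the TTL is the hard part: too short and a slow boot loses its reservation; too long and a crashed compute node pins the host needlessly.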
Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
On 03/17/2014 05:01 PM, Sylvain Bauza wrote:

> There are 2 distinct cases :
> 1. there are multiple schedulers involved in the decision
> 2. there is one single scheduler but there is a race condition on it
>
> About 1., I agree we need to see how the scheduler (and later on Gantt) could address decision-making based on distributed engines. At least, I consider the no-db scheduler blueprint, which proposes using memcached instead of a relational DB, could help with some of these issues, as memcached can be distributed efficiently.

With a central database we could do a single atomic transaction that looks something like "select the first host A from list of hosts L that is not in the list of hosts used by servers in group G and then set the host field for server S to A". In that context simultaneous updates can't happen because they're serialized by the central database.

How would one handle the above for simultaneous scheduling operations without a centralized data store? (I've never played with memcached, so I'm not really familiar with what it can do.)

> About 2., that's a concurrency issue which can be addressed thanks to common practices for synchronizing actions. IMHO, a local lock can be enough for ensuring isolation.

It's not that simple though. Currently the scheduler makes a decision, but the results of that decision aren't actually kept in the scheduler or written back to the db until much later when the instance is actually spawned on the compute node. So when the next scheduler request comes in we violate the scheduling policy. Local locking wouldn't help this.

Chris
Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
Hi Chris,

2014-03-17 23:08 GMT+01:00 Chris Friesen:

> On 03/17/2014 02:30 PM, Sylvain Bauza wrote:
>
>> There is a global concern here about how a holistic scheduler can perform decisions, and from which key metrics. The current effort is leading to having the Gantt DB updated thanks to the resource tracker for scheduling the hosts appropriately.
>>
>> If we consider these metrics as not enough, ie. that Gantt should perform an active check against another project, that's something which needs to be considered carefully. IMHO, in that case, Gantt should only access metrics via the project REST API (and python client) in order to make sure that rolling upgrades could happen.
>>
>> tl;dr: If Gantt requires access to Nova data, it should use the Nova REST API, and not perform database access directly (even through the conductor).
>
> Consider the case in point.
>
> 1) We create a server group with anti-affinity policy. (So no two instances in the group should run on the same compute node.)
> 2) We boot a server in this group.
> 3) Either simultaneously (on a different scheduler) or immediately after (on the same scheduler) we boot another server in the same group.
>
> Ideally the scheduler should enforce the policy without any races. However, in the current code we don't update the instance entry in the database with the chosen host until we actually try to create it on the host. Because of this we can end up putting both of them on the same compute node.

There are 2 distinct cases :
1. there are multiple schedulers involved in the decision
2. there is one single scheduler but there is a race condition on it

About 1., I agree we need to see how the scheduler (and later on Gantt) could address decision-making based on distributed engines. At least, I consider the no-db scheduler blueprint, which proposes using memcached instead of a relational DB, could help with some of these issues, as memcached can be distributed efficiently.

About 2., that's a concurrency issue which can be addressed thanks to common practices for synchronizing actions. IMHO, a local lock can be enough for ensuring isolation.

> Currently we only detect the problem when we go to actually boot the instance on the compute node because we have a special-case check to validate the policy. Personally I think this is sort of a hack and it would be better to detect the problem within the scheduler itself.
>
> This is something that the scheduler should reasonably consider. I see it as effectively consuming resources, except that in this case the resource is "the set of compute nodes not used by servers in the server group".

Agree. IMHO, the scheduler could take decisions based on inputs and should guarantee the result. That said, at the moment, we need to address the issue at the compute manager level, because of the point above.

-Sylvain

> Chris
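[Editor's note] For case 2 (a single scheduler racing against itself), a local lock only helps if the scheduler also remembers its own in-flight decisions, since the chosen host is not written back to the DB until much later. A toy sketch of that pairing, with invented names (`LocalScheduler`, `pick_host`):

```python
import threading

class LocalScheduler:
    def __init__(self, hosts):
        self._lock = threading.Lock()
        self._hosts = list(hosts)
        self._pending = {}                 # group -> hosts already claimed

    def pick_host(self, group):
        with self._lock:                   # serialize concurrent decisions
            used = self._pending.setdefault(group, set())
            host = next(h for h in self._hosts if h not in used)
            used.add(host)                 # record *before* the boot happens
            return host

sched = LocalScheduler(["node1", "node2"])
a = sched.pick_host("g1")
b = sched.pick_host("g1")
assert {a, b} == {"node1", "node2"}        # anti-affinity holds across requests
```

The lock gives isolation; the `_pending` record is what closes the window Chris points out. Without it, back-to-back requests would both see an empty group and pick the same host.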
Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
This begins to sound like a hierarchical reservation system to me. Are databases even capable of doing this correctly? If I was going to do something like this in, say, ZooKeeper, it would appear to be just an atomic write to paths to resources (using the concept of a ZooKeeper txn to ensure the write happens atomically).

With gantt, will there be read-db slaves, or just 1 database? Will there be some required hierarchical locking scheme (always lock in the same order), like ZooKeeper would require (to avoid deadlock)? If more than 1 db (master-master, master-slave?), how will this work? Forgive me for my limited DB knowledge, but I thought RDBMSs used MVCC, which means that the read could be different data than what is written (so the write will fail?). What about using something like raft, zookeeper, ...

-Josh

From: Chris Friesen <chris.frie...@windriver.com>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev@lists.openstack.org>
Date: Monday, March 17, 2014 at 2:08 PM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"

On 03/17/2014 02:30 PM, Sylvain Bauza wrote:

> There is a global concern here about how a holistic scheduler can perform decisions, and from which key metrics. The current effort is leading to having the Gantt DB updated thanks to the resource tracker for scheduling the hosts appropriately.
>
> If we consider these metrics as not enough, ie. that Gantt should perform an active check against another project, that's something which needs to be considered carefully. IMHO, in that case, Gantt should only access metrics via the project REST API (and python client) in order to make sure that rolling upgrades could happen.
>
> tl;dr: If Gantt requires access to Nova data, it should use the Nova REST API, and not perform database access directly (even through the conductor).

Consider the case in point.

1) We create a server group with anti-affinity policy. (So no two instances in the group should run on the same compute node.)
2) We boot a server in this group.
3) Either simultaneously (on a different scheduler) or immediately after (on the same scheduler) we boot another server in the same group.

Ideally the scheduler should enforce the policy without any races. However, in the current code we don't update the instance entry in the database with the chosen host until we actually try to create it on the host. Because of this we can end up putting both of them on the same compute node.

Currently we only detect the problem when we go to actually boot the instance on the compute node because we have a special-case check to validate the policy. Personally I think this is sort of a hack and it would be better to detect the problem within the scheduler itself.

This is something that the scheduler should reasonably consider. I see it as effectively consuming resources, except that in this case the resource is "the set of compute nodes not used by servers in the server group".

Chris
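[Editor's note] Josh's "always lock in the same order" requirement can be shown without ZooKeeper at all: two workers that each need the same pair of resources acquire the locks in a global (here, sorted) order, which rules out the classic AB/BA deadlock. The resource names and `with_resources` helper are invented for the sketch.

```python
import threading

locks = {name: threading.Lock() for name in ("hostA", "hostB")}

def with_resources(names, fn):
    ordered = sorted(names)                # the single global locking order
    for n in ordered:
        locks[n].acquire()
    try:
        return fn()
    finally:
        for n in reversed(ordered):
            locks[n].release()

results = []
# The two callers *request* the locks in opposite orders, which would
# deadlock if each acquired them in request order.
t1 = threading.Thread(target=with_resources,
                      args=(["hostA", "hostB"], lambda: results.append("t1")))
t2 = threading.Thread(target=with_resources,
                      args=(["hostB", "hostA"], lambda: results.append("t2")))
t1.start(); t2.start(); t1.join(); t2.join()
assert sorted(results) == ["t1", "t2"]     # both finish; no deadlock
```

A coordination service like ZooKeeper enforces the same discipline with lock paths; the ordering rule is what matters, not the technology.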
Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
On Mon, Mar 17, 2014 at 12:52 PM, Jay Pipes wrote:

> On Mon, 2014-03-17 at 12:39 -0700, Joe Gordon wrote:
>> On Mon, Mar 17, 2014 at 12:29 PM, Andrew Laski wrote:
>>> On 03/17/14 at 01:11pm, Chris Friesen wrote:
>>>> On 03/17/2014 11:59 AM, John Garbutt wrote:
>>>>> On 17 March 2014 17:54, John Garbutt wrote:
>>>>>> Given the scheduler split, writing that value into the nova db from the scheduler would be a step backwards, and it probably breaks lots of code that assumes the host is not set until much later.
>>>>
>>>> Why would that be a step backwards? The scheduler has picked a host for the instance, so it seems reasonable to record that information in the instance itself as early as possible (to be incorporated into other decision-making) rather than have it be implicit in the destination of the next RPC message.
>>>>
>>>> Now I could believe that we have code that assumes that having "instance.host" set implies that it's already running on that host, but that's a different issue.
>>>>
>>>>> I forgot to mention, I am starting to be a fan of a two-phase commit approach, which could deal with these kinds of things in a more explicit way, before starting the main boot process.
>>>>>
>>>>> It's not as elegant as a database transaction, but that doesn't seem possible in the long run, but there could well be something I am missing here too.
>>>>
>>>> I'm not an expert in this area, so I'm curious why you think that database transactions wouldn't be possible in the long run.
>>>
>>> There has been some effort around splitting the scheduler out of Nova and into its own project. So down the road the scheduler may not have direct access to the Nova db.
>>
>> If we do pull out the nova scheduler it can have its own DB, so I don't think this should be an issue.
>
> Just playing devil's advocate here, but even if Gantt had its own database, would that necessarily mean that there would be only a single database across the entire deployment? I'm thinking specifically of the case of cells, where presumably scheduling requests would jump through multiple layers of Gantt services; would a single database transaction really be possible to effectively fence the entire scheduling request?

That opens the whole can of gantt-and-cells worms. I would rather evaluate design decisions more around what exists today and less on what we think will exist in the future (although we definitely don't want to design ourselves into a corner). I'm just not very keen on the answer 'we shouldn't do x because of this thing we talked about but haven't done.'

That being said, this debate gets more complicated when you factor in the overhead of sqlalchemy; if we can drop that overhead we solve a lot of problems all at once (the db is used all over the place, both directly and through the conductor, and sqlalchemy can have a 10x+ overhead).

For historical reasons we have spent a lot of time trying to decouple the SQL DB from the rest of the codebase, because we left the door open for alternate DB backends (re: NoSQL). I think the ship has sailed on that one and we shouldn't worry about designing things around possibly adding a NoSQL backend in the future.

> Best,
> -jay
Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
On 03/17/2014 02:30 PM, Sylvain Bauza wrote:

> There is a global concern here about how a holistic scheduler can perform decisions, and from which key metrics. The current effort is leading to having the Gantt DB updated thanks to the resource tracker for scheduling the hosts appropriately.
>
> If we consider these metrics as not enough, ie. that Gantt should perform an active check against another project, that's something which needs to be considered carefully. IMHO, in that case, Gantt should only access metrics via the project REST API (and python client) in order to make sure that rolling upgrades could happen.
>
> tl;dr: If Gantt requires access to Nova data, it should use the Nova REST API, and not perform database access directly (even through the conductor).

Consider the case in point.

1) We create a server group with anti-affinity policy. (So no two instances in the group should run on the same compute node.)
2) We boot a server in this group.
3) Either simultaneously (on a different scheduler) or immediately after (on the same scheduler) we boot another server in the same group.

Ideally the scheduler should enforce the policy without any races. However, in the current code we don't update the instance entry in the database with the chosen host until we actually try to create it on the host. Because of this we can end up putting both of them on the same compute node.

Currently we only detect the problem when we go to actually boot the instance on the compute node because we have a special-case check to validate the policy. Personally I think this is sort of a hack and it would be better to detect the problem within the scheduler itself.

This is something that the scheduler should reasonably consider. I see it as effectively consuming resources, except that in this case the resource is "the set of compute nodes not used by servers in the server group".

Chris
Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
There is a global concern here about how a holistic scheduler can perform decisions, and from which key metrics. The current effort is leading to having the Gantt DB updated thanks to the resource tracker for scheduling the hosts appropriately.

If we consider these metrics as not enough, ie. that Gantt should perform an active check against another project, that's something which needs to be considered carefully. IMHO, in that case, Gantt should only access metrics via the project REST API (and python client) in order to make sure that rolling upgrades could happen.

tl;dr: If Gantt requires access to Nova data, it should use the Nova REST API, and not perform database access directly (even through the conductor).

-Sylvain

2014-03-17 21:10 GMT+01:00 Chris Friesen:

> On 03/17/2014 01:29 PM, Andrew Laski wrote:
>> On 03/17/14 at 01:11pm, Chris Friesen wrote:
>>> On 03/17/2014 11:59 AM, John Garbutt wrote:
>>>> On 17 March 2014 17:54, John Garbutt wrote:
>>>>> Given the scheduler split, writing that value into the nova db from the scheduler would be a step backwards, and it probably breaks lots of code that assumes the host is not set until much later.
>>>
>>> Why would that be a step backwards? The scheduler has picked a host for the instance, so it seems reasonable to record that information in the instance itself as early as possible (to be incorporated into other decision-making) rather than have it be implicit in the destination of the next RPC message.
>>>
>>> Now I could believe that we have code that assumes that having "instance.host" set implies that it's already running on that host, but that's a different issue.
>>>
>>>> I forgot to mention, I am starting to be a fan of a two-phase commit approach, which could deal with these kinds of things in a more explicit way, before starting the main boot process.
>>>>
>>>> It's not as elegant as a database transaction, but that doesn't seem possible in the long run, but there could well be something I am missing here too.
>>>
>>> I'm not an expert in this area, so I'm curious why you think that database transactions wouldn't be possible in the long run.
>>
>> There has been some effort around splitting the scheduler out of Nova and into its own project. So down the road the scheduler may not have direct access to the Nova db.
>
> Even if the scheduler itself doesn't have access to the nova DB, at some point we need to return back from the scheduler into a nova service (presumably nova-conductor), at which point we could update the nova db with the scheduler's decision, and at that point we could check for conflicts and reschedule if necessary.
>
> Chris
Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
On 03/17/2014 01:29 PM, Andrew Laski wrote:

> On 03/17/14 at 01:11pm, Chris Friesen wrote:
>> On 03/17/2014 11:59 AM, John Garbutt wrote:
>>> On 17 March 2014 17:54, John Garbutt wrote:
>>>> Given the scheduler split, writing that value into the nova db from the scheduler would be a step backwards, and it probably breaks lots of code that assumes the host is not set until much later.
>>
>> Why would that be a step backwards? The scheduler has picked a host for the instance, so it seems reasonable to record that information in the instance itself as early as possible (to be incorporated into other decision-making) rather than have it be implicit in the destination of the next RPC message.
>>
>> Now I could believe that we have code that assumes that having "instance.host" set implies that it's already running on that host, but that's a different issue.
>>
>>> I forgot to mention, I am starting to be a fan of a two-phase commit approach, which could deal with these kinds of things in a more explicit way, before starting the main boot process.
>>>
>>> It's not as elegant as a database transaction, but that doesn't seem possible in the long run, but there could well be something I am missing here too.
>>
>> I'm not an expert in this area, so I'm curious why you think that database transactions wouldn't be possible in the long run.
>
> There has been some effort around splitting the scheduler out of Nova and into its own project. So down the road the scheduler may not have direct access to the Nova db.

Even if the scheduler itself doesn't have access to the nova DB, at some point we need to return back from the scheduler into a nova service (presumably nova-conductor), at which point we could update the nova db with the scheduler's decision, and at that point we could check for conflicts and reschedule if necessary.

Chris
Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
On Mon, 2014-03-17 at 12:39 -0700, Joe Gordon wrote:

> On Mon, Mar 17, 2014 at 12:29 PM, Andrew Laski wrote:
>> On 03/17/14 at 01:11pm, Chris Friesen wrote:
>>> On 03/17/2014 11:59 AM, John Garbutt wrote:
>>>> On 17 March 2014 17:54, John Garbutt wrote:
>>>>> Given the scheduler split, writing that value into the nova db from the scheduler would be a step backwards, and it probably breaks lots of code that assumes the host is not set until much later.
>>>
>>> Why would that be a step backwards? The scheduler has picked a host for the instance, so it seems reasonable to record that information in the instance itself as early as possible (to be incorporated into other decision-making) rather than have it be implicit in the destination of the next RPC message.
>>>
>>> Now I could believe that we have code that assumes that having "instance.host" set implies that it's already running on that host, but that's a different issue.
>>>
>>>> I forgot to mention, I am starting to be a fan of a two-phase commit approach, which could deal with these kinds of things in a more explicit way, before starting the main boot process.
>>>>
>>>> It's not as elegant as a database transaction, but that doesn't seem possible in the long run, but there could well be something I am missing here too.
>>>
>>> I'm not an expert in this area, so I'm curious why you think that database transactions wouldn't be possible in the long run.
>>
>> There has been some effort around splitting the scheduler out of Nova and into its own project. So down the road the scheduler may not have direct access to the Nova db.
>
> If we do pull out the nova scheduler it can have its own DB, so I don't think this should be an issue.

Just playing devil's advocate here, but even if Gantt had its own database, would that necessarily mean that there would be only a single database across the entire deployment? I'm thinking specifically of the case of cells, where presumably scheduling requests would jump through multiple layers of Gantt services; would a single database transaction really be possible to effectively fence the entire scheduling request?

Best,
-jay
Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
On Mon, Mar 17, 2014 at 12:29 PM, Andrew Laski wrote:

> On 03/17/14 at 01:11pm, Chris Friesen wrote:
>> On 03/17/2014 11:59 AM, John Garbutt wrote:
>>> On 17 March 2014 17:54, John Garbutt wrote:
>>>> Given the scheduler split, writing that value into the nova db from the scheduler would be a step backwards, and it probably breaks lots of code that assumes the host is not set until much later.
>>
>> Why would that be a step backwards? The scheduler has picked a host for the instance, so it seems reasonable to record that information in the instance itself as early as possible (to be incorporated into other decision-making) rather than have it be implicit in the destination of the next RPC message.
>>
>> Now I could believe that we have code that assumes that having "instance.host" set implies that it's already running on that host, but that's a different issue.
>>
>>> I forgot to mention, I am starting to be a fan of a two-phase commit approach, which could deal with these kinds of things in a more explicit way, before starting the main boot process.
>>>
>>> It's not as elegant as a database transaction, but that doesn't seem possible in the long run, but there could well be something I am missing here too.
>>
>> I'm not an expert in this area, so I'm curious why you think that database transactions wouldn't be possible in the long run.
>
> There has been some effort around splitting the scheduler out of Nova and into its own project. So down the road the scheduler may not have direct access to the Nova db.

If we do pull out the nova scheduler it can have its own DB, so I don't think this should be an issue.

>> Given that the database is one of the few services that isn't prone to races, it seems reasonable to me to implement decision-making as transactions within the database.
>>
>> Where possible it seems to make a lot more sense to have the database do an atomic transaction than to scan the database, extract a bunch of (potentially unnecessary) data and transfer it over the network, do the logic in python, send the result back over the network and update the database with the result.
>>
>> Chris
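[Editor's note] The contrast Chris draws (logic pushed into the database versus data pulled out, decided in Python, and written back) can be shown in miniature. SQLite stands in for the real DB and the `slots` table is invented; the point is that the round-trip version has a window between its SELECT and its UPDATE where another scheduler can act on the same stale data.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE slots (host TEXT PRIMARY KEY, free INTEGER)")
db.execute("INSERT INTO slots VALUES ('node1', 1)")

# In-database form: check and consume in one atomic statement.
claimed = db.execute(
    "UPDATE slots SET free = free - 1 WHERE host = 'node1' AND free > 0"
).rowcount == 1

# Round-trip form: read, decide in Python, write back. Between the SELECT
# and the UPDATE another scheduler could have consumed the slot.
free = db.execute("SELECT free FROM slots WHERE host = 'node1'").fetchone()[0]
if free > 0:
    db.execute("UPDATE slots SET free = free - 1 WHERE host = 'node1'")

assert claimed                 # the atomic claim won
assert free == 0               # the round-trip reader saw the slot gone
```

The round-trip form can be made safe with `SELECT ... FOR UPDATE` or a serializable transaction, but then the database is doing the fencing anyway, which is Chris's point.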
Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
On 03/17/14 at 01:11pm, Chris Friesen wrote: On 03/17/2014 11:59 AM, John Garbutt wrote: On 17 March 2014 17:54, John Garbutt wrote: Given the scheduler split, writing that value into the nova db from the scheduler would be a step backwards, and it probably breaks lots of code that assumes the host is not set until much later. Why would that be a step backwards? The scheduler has picked a host for the instance, so it seems reasonable to record that information in the instance itself as early as possible (to be incorporated into other decision-making) rather than have it be implicit in the destination of the next RPC message. Now I could believe that we have code that assumes that having "instance.host" set implies that it's already running on that host, but that's a different issue. I forgot to mention, I am starting to be a fan of a two-phase commit approach, which could deal with these kinds of things in a more explicit way, before starting the main boot process. Its not as elegant as a database transaction, but that doesn't seems possible in the log run, but there could well be something I am missing here too. I'm not an expert in this area, so I'm curious why you think that database transactions wouldn't be possible in the long run. There has been some effort around splitting the scheduler out of Nova and into its own project. So down the road the scheduler may not have direct access to the Nova db. Given that the database is one of the few services that isn't prone to races, it seems reasonable to me to implement decision-making as transactions within the database. Where possible it seems to make a lot more sense to have the database do an atomic transaction than to scan the database, extract a bunch of (potentially unnecessary) data and transfer it over the network, do logic in python, send the result back over the network and update the database with the result. 
Chris
Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
On 03/17/2014 11:59 AM, John Garbutt wrote:
> On 17 March 2014 17:54, John Garbutt wrote:
>> Given the scheduler split, writing that value into the nova db from
>> the scheduler would be a step backwards, and it probably breaks lots
>> of code that assumes the host is not set until much later.

Why would that be a step backwards? The scheduler has picked a host for
the instance, so it seems reasonable to record that information in the
instance itself as early as possible (to be incorporated into other
decision-making) rather than have it be implicit in the destination of the
next RPC message.

Now I could believe that we have code that assumes that having
"instance.host" set implies that it's already running on that host, but
that's a different issue.

> I forgot to mention, I am starting to be a fan of a two-phase commit
> approach, which could deal with these kinds of things in a more explicit
> way, before starting the main boot process.
>
> It's not as elegant as a database transaction, but that doesn't seem
> possible in the long run, but there could well be something I am missing
> here too.

I'm not an expert in this area, so I'm curious why you think that
database transactions wouldn't be possible in the long run.

Given that the database is one of the few services that isn't prone to
races, it seems reasonable to me to implement decision-making as
transactions within the database.

Where possible it seems to make a lot more sense to have the database do
an atomic transaction than to scan the database, extract a bunch of
(potentially unnecessary) data and transfer it over the network, do logic
in python, send the result back over the network and update the database
with the result.

Chris
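[Editor's note: a minimal sketch of the database-side atomicity Chris is arguing for, using stdlib sqlite3 and a made-up quota-style table (not Nova's actual schema). The check and the write happen in one UPDATE statement, so no other writer can slip in between them, unlike the scan/compute-in-python/write-back pattern.]

```python
import sqlite3

# Hypothetical quota table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quotas (project TEXT, in_use INT, hard_limit INT)")
conn.execute("INSERT INTO quotas VALUES ('demo', 9, 10)")

def consume_quota(conn, project):
    # The WHERE clause performs the check inside the database, so the
    # check-and-increment is atomic; rowcount tells us whether it
    # actually happened.
    with conn:  # one transaction, committed or rolled back as a unit
        cur = conn.execute(
            "UPDATE quotas SET in_use = in_use + 1 "
            "WHERE project = ? AND in_use < hard_limit",
            (project,))
        return cur.rowcount == 1
```

The first call succeeds (9 of 10 in use); a second call finds the limit reached and returns False without any Python-side read-modify-write.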
Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
On 17 March 2014 17:54, John Garbutt wrote:
> On 15 March 2014 18:39, Chris Friesen wrote:
>> Hi,
>>
>> I'm curious why the specified git commit chose to fix the anti-affinity
>> race condition by aborting the boot and triggering a reschedule.
>>
>> It seems to me that it would have been more elegant for the scheduler to
>> do a database transaction that would atomically check that the chosen
>> host was not already part of the group, and then add the instance (with
>> the chosen host) to the group. If the check fails then the scheduler
>> could update the group_hosts list and reschedule. This would prevent the
>> race condition in the first place rather than detecting it later and
>> trying to work around it.
>>
>> This would require setting the "host" field in the instance at the time
>> of scheduling rather than the time of instance creation, but that seems
>> like it should work okay. Maybe I'm missing something though...
>
> We deal with memory races in the same way as this today, when they race
> against the scheduler.
>
> Given the scheduler split, writing that value into the nova db from the
> scheduler would be a step backwards, and it probably breaks lots of code
> that assumes the host is not set until much later.

I forgot to mention, I am starting to be a fan of a two-phase commit
approach, which could deal with these kinds of things in a more explicit
way, before starting the main boot process.

It's not as elegant as a database transaction, but that doesn't seem
possible in the long run, but there could well be something I am missing
here too.

John
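[Editor's note: John doesn't spell the two-phase idea out, so here is one possible in-memory sketch under assumed semantics: the scheduler tentatively reserves a (group, host) pair before the main boot process (prepare), then either confirms it (commit) or releases it (abort). All names are illustrative; a real implementation would need durable, distributed state rather than a Python object.]

```python
class TwoPhaseClaim:
    """Hypothetical two-phase anti-affinity claim, not Nova code."""

    def __init__(self):
        self._reserved = set()   # tentative (group, host) claims
        self._committed = set()  # confirmed (group, host) claims

    def prepare(self, group, host):
        # Phase 1: anti-affinity check and tentative reservation in
        # one step; a conflicting request fails fast, before boot starts.
        if (group, host) in self._reserved or (group, host) in self._committed:
            return False
        self._reserved.add((group, host))
        return True

    def commit(self, group, host):
        # Phase 2: promote the tentative claim once boot is going ahead.
        self._reserved.discard((group, host))
        self._committed.add((group, host))

    def abort(self, group, host):
        # Roll back so another request can claim the host for this group.
        self._reserved.discard((group, host))
```

The explicit abort path is what makes this more visible than a bare database transaction: a failed boot releases the reservation instead of leaving the group state to be cleaned up later.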
Re: [openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
On 15 March 2014 18:39, Chris Friesen wrote:
> Hi,
>
> I'm curious why the specified git commit chose to fix the anti-affinity
> race condition by aborting the boot and triggering a reschedule.
>
> It seems to me that it would have been more elegant for the scheduler to
> do a database transaction that would atomically check that the chosen
> host was not already part of the group, and then add the instance (with
> the chosen host) to the group. If the check fails then the scheduler
> could update the group_hosts list and reschedule. This would prevent the
> race condition in the first place rather than detecting it later and
> trying to work around it.
>
> This would require setting the "host" field in the instance at the time
> of scheduling rather than the time of instance creation, but that seems
> like it should work okay. Maybe I'm missing something though...

We deal with memory races in the same way as this today, when they race
against the scheduler.

Given the scheduler split, writing that value into the nova db from the
scheduler would be a step backwards, and it probably breaks lots of code
that assumes the host is not set until much later.

John
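[Editor's note: a toy sketch of the pattern John refers to for memory races: the scheduler's view may be stale, so the compute node re-validates the claim at boot time and kicks the request back for rescheduling if it no longer holds. Names and the exception are illustrative, not Nova's actual API.]

```python
class RescheduleNeeded(Exception):
    """Raised when a late claim check fails on the compute node."""

def boot_on_host(free_mb, requested_mb):
    # Final validation happens where the resource lives, not in the
    # scheduler: if the claim no longer fits, abort this boot and let
    # the scheduler pick another host.
    if requested_mb > free_mb:
        raise RescheduleNeeded("claim failed, kick back to the scheduler")
    return free_mb - requested_mb  # memory remaining after the claim
```

This is the detect-late-and-retry approach the thread contrasts with preventing the race up front in the scheduler.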
[openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
Hi,

I'm curious why the specified git commit chose to fix the anti-affinity
race condition by aborting the boot and triggering a reschedule.

It seems to me that it would have been more elegant for the scheduler to
do a database transaction that would atomically check that the chosen
host was not already part of the group, and then add the instance (with
the chosen host) to the group. If the check fails then the scheduler
could update the group_hosts list and reschedule. This would prevent the
race condition in the first place rather than detecting it later and
trying to work around it.

This would require setting the "host" field in the instance at the time
of scheduling rather than the time of instance creation, but that seems
like it should work okay. Maybe I'm missing something though...

Thanks,
Chris
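[Editor's note: a minimal sketch of the atomic "check the host isn't in the group, then add it" transaction Chris proposes, using stdlib sqlite3 with a hypothetical schema (not Nova's). The existence check and the insert run as a single statement inside one transaction, so two concurrent schedulers cannot both claim the same host for a group.]

```python
import sqlite3

# Hypothetical group_hosts table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE group_hosts (group_id TEXT, host TEXT)")

def claim_host(conn, group_id, host):
    # INSERT ... SELECT ... WHERE NOT EXISTS is atomic: if another
    # request already added this host to the group, no row is inserted
    # and the caller should pick a different host and reschedule.
    with conn:  # one transaction
        cur = conn.execute(
            "INSERT INTO group_hosts (group_id, host) "
            "SELECT ?, ? WHERE NOT EXISTS ("
            "  SELECT 1 FROM group_hosts WHERE group_id = ? AND host = ?)",
            (group_id, host, group_id, host))
        return cur.rowcount == 1
```

A False return here corresponds to the "check fails, update group_hosts and reschedule" branch in the proposal, but the race itself can no longer slip through between the check and the write.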