LGTM to the patch; it is a step in the right direction. And I agree with postponing a more complete fix until we have the new locking.
Thanks,
Michele

On Fri, Nov 29, 2013 at 8:17 PM, Petr Pudlák <[email protected]> wrote:
> Hi Dimitris,
>
> yes, the patch isn't meant to fix the original problem, only the one Thomas
> brought up with LUInstanceCreate.
>
> We discussed the main issue, and it seems there isn't a good solution for
> this. One is what you suggest (thank you), but the drawback is that we need
> to lock really all instances, which could be a big bottleneck (although we
> could later release the instances not belonging to the group). I guess many
> users would rather risk one job failing in this manner than require all
> instances to be blocked.
>
> Another view is just saying that optimistic locking is allowed to fail. This
> is currently what many operations do. Perhaps we could just issue a better
> error message, although for this we'd also have to touch the core LU
> framework.
>
> We're currently working on moving the locking part into a separate daemon,
> and after that there will be more room for improvement. In the future, we
> could implement automatic restarts of jobs that fail because of unsatisfied
> optimistic locks, which would solve issues like this.
>
> With best regards,
> Petr
>
>
> On Fri, Nov 29, 2013 at 3:51 PM, Dimitris Aragiorgis <[email protected]>
> wrote:
>>
>> Hi,
>>
>> * Thomas Thrainer <[email protected]> [2013-11-29 10:40:59 +0100]:
>>
>> > LGTM, thanks!
>> > But as I was involved quite a bit in the bug / discussion, maybe it
>> > might make sense that somebody else takes a look too...
>> >
>>
>> I think this patch does not actually fix issue 621 (explained in #3).
>>
>> The reported corner case was when LUInstanceRemove() and
>> LUNetworkDisconnect() were running concurrently.
>>
>> A workaround that we have tried is, upon LUNetworkDisconnect() and
>> LUNetworkConnect(), to try to acquire all of the cluster's instances in
>> shared mode. By that, _LS_ACQUIRE_ALL mode is set instead of
>> _LS_ACQUIRE_EXACT, and thus the deleted lock does not cause any problem.
>>
>> Still, I don't think this is the proper way to go. Maybe exporting the
>> locking mode up to the LU level, just like the opportunistic/shared
>> locking option (as Chris suggested during an internal discussion), is a
>> rational decision.
>>
>> What do you think?
>>
>> Besides that, thanks for the patch,
>> dimara
>>
>>
>> > Thomas
>> >
>> >
>> > On Thu, Nov 28, 2013 at 3:48 PM, Petr Pudlak <[email protected]> wrote:
>> >
>> > > This is required to prevent race conditions such as removing a network
>> > > from a group and adding an instance at the same time. (See issue
>> > > 621#2.)
>> > >
>> > > Signed-off-by: Petr Pudlak <[email protected]>
>> > > ---
>> > >  lib/cmdlib/instance.py | 34 ++++++++++++++++++++++++++++++++++
>> > >  1 file changed, 34 insertions(+)
>> > >
>> > > diff --git a/lib/cmdlib/instance.py b/lib/cmdlib/instance.py
>> > > index 12340e8..5d58544 100644
>> > > --- a/lib/cmdlib/instance.py
>> > > +++ b/lib/cmdlib/instance.py
>> > > @@ -559,6 +559,22 @@ class LUInstanceCreate(LogicalUnit):
>> > >        self.needed_locks[locking.LEVEL_NODE_RES] = \
>> > >          CopyLockList(self.needed_locks[locking.LEVEL_NODE])
>> > >
>> > > +    # Optimistically acquire shared group locks (we're reading the
>> > > +    # configuration).  We can't just call GetInstanceNodeGroups,
>> > > +    # because the instance doesn't exist yet.  Therefore we lock all
>> > > +    # node groups of all nodes we have.
>> > > +    if self.needed_locks[locking.LEVEL_NODE] == locking.ALL_SET:
>> > > +      # In the case we lock all nodes for opportunistic allocation,
>> > > +      # we have no choice than to lock all groups, because they're
>> > > +      # allocated before nodes.  This is sad, but true.  At least we
>> > > +      # release all those we don't need in CheckPrereq later.
>> > > +      self.needed_locks[locking.LEVEL_NODEGROUP] = locking.ALL_SET
>> > > +    else:
>> > > +      self.needed_locks[locking.LEVEL_NODEGROUP] = \
>> > > +        list(self.cfg.GetNodeGroupsFromNodes(
>> > > +          self.needed_locks[locking.LEVEL_NODE]))
>> > > +    self.share_locks[locking.LEVEL_NODEGROUP] = 1
>> > > +
>> > >    def DeclareLocks(self, level):
>> > >      if level == locking.LEVEL_NODE_RES and \
>> > >        self.opportunistic_locks[locking.LEVEL_NODE]:
>> > > @@ -846,6 +862,21 @@ class LUInstanceCreate(LogicalUnit):
>> > >      """Check prerequisites.
>> > >
>> > >      """
>> > > +    # Check that the optimistically acquired groups are correct wrt
>> > > +    # the acquired nodes
>> > > +    owned_groups = frozenset(self.owned_locks(locking.LEVEL_NODEGROUP))
>> > > +    owned_nodes = frozenset(self.owned_locks(locking.LEVEL_NODE))
>> > > +    cur_groups = list(self.cfg.GetNodeGroupsFromNodes(owned_nodes))
>> > > +    if not owned_groups.issuperset(cur_groups):
>> > > +      raise errors.OpPrereqError("New instance %s's node groups changed"
>> > > +                                 " since locks were acquired, current"
>> > > +                                 " groups are '%s', owning groups '%s';"
>> > > +                                 " retry the operation" %
>> > > +                                 (self.op.instance_name,
>> > > +                                  utils.CommaJoin(cur_groups),
>> > > +                                  utils.CommaJoin(owned_groups)),
>> > > +                                 errors.ECODE_STATE)
>> > > +
>> > >      self._CalculateFileStorageDir()
>> > >
>> > >      if self.op.mode == constants.INSTANCE_IMPORT:
>> > > @@ -957,6 +988,9 @@ class LUInstanceCreate(LogicalUnit):
>> > >      ReleaseLocks(self, locking.LEVEL_NODE, keep=keep_locks)
>> > >      ReleaseLocks(self, locking.LEVEL_NODE_RES, keep=keep_locks)
>> > >      ReleaseLocks(self, locking.LEVEL_NODE_ALLOC)
>> > > +    # Release all unneeded group locks
>> > > +    ReleaseLocks(self, locking.LEVEL_NODEGROUP,
>> > > +                 keep=self.cfg.GetNodeGroupsFromNodes(keep_locks))
>> > >
>> > >    assert (self.owned_locks(locking.LEVEL_NODE) ==
>> > >            self.owned_locks(locking.LEVEL_NODE_RES)), \
>> > > --
>> > > 1.8.4.1
>> > >
>> >
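[Editor's note: the check in CheckPrereq above can be illustrated in isolation. The following is a hypothetical, self-contained sketch of that validation step: after optimistically taking shared group locks, verify that the groups derived from the currently owned nodes are still covered by the owned group locks, and raise so the operation can be retried otherwise. The names `check_owned_groups` and `node_to_group` are illustrative, not Ganeti's real API; `node_to_group` stands in for `cfg.GetNodeGroupsFromNodes`.]

```python
def check_owned_groups(owned_groups, owned_nodes, node_to_group):
  """Validates optimistically acquired group locks.

  owned_groups: group locks we hold; owned_nodes: node locks we hold;
  node_to_group: mapping from node name to its group (stand-in for the
  cluster configuration).  Returns the current groups, or raises ValueError
  if a node moved to a group we did not lock (stale optimistic locks).
  """
  cur_groups = frozenset(node_to_group[n] for n in owned_nodes)
  if not frozenset(owned_groups).issuperset(cur_groups):
    raise ValueError("node groups changed since locks were acquired;"
                     " retry the operation")
  return cur_groups
```

The superset check (rather than equality) mirrors the patch: holding more group locks than strictly needed is harmless, since the extras are released later; only a group we do *not* hold is a problem.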
>> >
>> >
>> > --
>> > Thomas Thrainer | Software Engineer | [email protected] |
>> >
>> > Google Germany GmbH
>> > Dienerstr. 12
>> > 80331 München
>> >
>> > Registergericht und -nummer: Hamburg, HRB 86891
>> > Sitz der Gesellschaft: Hamburg
>> > Geschäftsführer: Graham Law, Christine Elizabeth Flores
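[Editor's note: the thread's idea that "optimistic locking is allowed to fail", combined with the suggested automatic restart of jobs that hit unsatisfied optimistic locks, can be sketched as a generic acquire/work/retry loop. This is a minimal illustration under assumed names (`OptimisticLockError`, `run_job`, `get_state`, `do_work`), not Ganeti's job-queue implementation.]

```python
class OptimisticLockError(Exception):
  """Raised when the state changed after locks were optimistically taken."""


def run_job(get_state, do_work, max_retries=3):
  """Runs do_work under optimistic validation, retrying on conflicts.

  get_state snapshots the state we lock against (e.g. the node groups of
  the nodes we hold); do_work performs the operation and raises
  OptimisticLockError if its own re-validation detects a change.  Instead
  of surfacing the failure to the user, the job is simply restarted.
  """
  for _ in range(max_retries):
    snapshot = get_state()
    try:
      return do_work(snapshot)
    except OptimisticLockError:
      continue  # state changed under us; take a fresh snapshot and retry
  raise RuntimeError("gave up after %d optimistic retries" % max_retries)
```

The trade-off discussed above is visible here: no global lock is held, so concurrent modifications stay cheap, at the price of occasionally re-running a job whose snapshot went stale.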
