LGTM to the patch, that's a step in the right direction.
And I agree to postpone a more complete fix until we have the new locking.

Thanks,
Michele

On Fri, Nov 29, 2013 at 8:17 PM, Petr Pudlák <[email protected]> wrote:
>  Hi Dimitris,
>
> yes, the patch isn't meant to fix the original problem, only the one Thomas
> brought up with LUInstanceCreate.
>
> We discussed the main issue, and it seems there isn't a good solution for
> this. One is what you suggest (thank you), but the drawback is that we'd need
> to lock really everything, which could be a big bottleneck (although we could
> later release the instances not belonging to the group). I guess many users
> would rather risk one job failing in this manner than require all instances
> to be locked.
>
> Another view is just saying that optimistic locking is allowed to fail. This
> is currently what many operations do. Perhaps we could just issue a better
> error message, although for this we'd also have to touch the core LU
> framework.
>
> We're currently working on moving the locking part into a separate daemon,
> and after that there will be more room for improvement. In the future, we could
> implement automatic restarts of jobs that fail because of unsatisfied
> optimistic locks, which would solve issues like this.
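The restart-on-unsatisfied-optimistic-lock idea above can be sketched with a toy model (my own illustration with hypothetical names — `OptimisticLockError`, `LockSet` and `run_with_retry` here are not Ganeti's actual locking API):

```python
import threading

class OptimisticLockError(Exception):
  """Raised when a requested lock has been deleted or cannot be taken."""

class LockSet:
  """Toy lock set: named locks that may disappear concurrently."""

  def __init__(self, names):
    self._locks = {name: threading.Lock() for name in names}

  def acquire_exact(self, names):
    """Optimistically take exactly the named locks; fail fast if any is gone."""
    taken = []
    for name in names:
      lock = self._locks.get(name)
      if lock is None or not lock.acquire(blocking=False):
        for held in taken:  # roll back the partial acquisition
          held.release()
        raise OptimisticLockError(name)
      taken.append(lock)
    return taken

def run_with_retry(job, attempts=3):
  """Automatically restart a job that lost its optimistic locks."""
  for _ in range(attempts):
    try:
      return job()
    except OptimisticLockError:
      continue  # the lock set changed under us; just resubmit the job
  raise RuntimeError("job kept losing its optimistic locks")
```

With a wrapper of this shape, a job that raises on an unsatisfied optimistic lock would simply be resubmitted instead of surfacing the failure to the user.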
>
>   With best regards,
>   Petr
>
>
> On Fri, Nov 29, 2013 at 3:51 PM, Dimitris Aragiorgis <[email protected]>
> wrote:
>>
>> Hi,
>>
>> * Thomas Thrainer <[email protected]> [2013-11-29 10:40:59 +0100]:
>>
>> > LGTM, thanks!
>> > But as I was involved quite a bit in the bug/discussion, it might make
>> > sense for somebody else to take a look too...
>> >
>>
>> I think this patch does not actually fix issue 621 (explained in
>> #3).
>>
>> The reported corner case was when LUInstanceRemove() and
>> LUNetworkDisconnect() were running concurrently.
>>
>> A workaround that we have tried is to have LUNetworkDisconnect() and
>> LUNetworkConnect() acquire all of the cluster's instances in shared mode.
>> That way the _LS_ACQUIRE_ALL mode is set instead of _LS_ACQUIRE_EXACT, and
>> thus the deleted lock does not cause any problem.
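The difference between the two modes can be shown with a simplified model (my own illustration; `MiniLockSet` is not Ganeti's real LockSet): an exact acquire must find every requested lock by name, so a lock deleted by a concurrent LUInstanceRemove makes it fail, while an acquire-all just takes whatever members currently exist:

```python
class MiniLockSet:
  """Simplified model of a lock set with 'exact' and 'all' acquire modes."""

  def __init__(self, names):
    self._members = set(names)

  def remove(self, name):
    """Simulate a lock deleted concurrently (e.g. an instance removal)."""
    self._members.discard(name)

  def acquire(self, names=None):
    if names is None:
      # _LS_ACQUIRE_ALL-style: take whatever exists at this moment
      return sorted(self._members)
    # _LS_ACQUIRE_EXACT-style: every requested name must still exist
    missing = set(names) - self._members
    if missing:
      raise KeyError("locks deleted concurrently: %s" % sorted(missing))
    return sorted(names)
```

In this model an acquire with `names=None` never observes the deleted member, which is why the shared acquire-all workaround sidesteps the race.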
>>
>> Still, I don't think this is the proper way to go. Maybe exporting the
>> locking mode up to the LU level, just like the opportunistic/shared locking
>> option (as Chris suggested during an internal discussion), is a rational
>> decision.
>>
>> What do you think?
>>
>> Besides that thanks for the patch,
>> dimara
>>
>>
>> > Thomas
>> >
>> >
>> > On Thu, Nov 28, 2013 at 3:48 PM, Petr Pudlak <[email protected]> wrote:
>> >
>> > > This is required to prevent race conditions such as removing a network
>> > > from a group and adding an instance at the same time. (See issue
>> > > 621#2.)
>> > >
>> > > Signed-off-by: Petr Pudlak <[email protected]>
>> > > ---
>> > >  lib/cmdlib/instance.py | 34 ++++++++++++++++++++++++++++++++++
>> > >  1 file changed, 34 insertions(+)
>> > >
>> > > diff --git a/lib/cmdlib/instance.py b/lib/cmdlib/instance.py
>> > > index 12340e8..5d58544 100644
>> > > --- a/lib/cmdlib/instance.py
>> > > +++ b/lib/cmdlib/instance.py
>> > > @@ -559,6 +559,22 @@ class LUInstanceCreate(LogicalUnit):
>> > >      self.needed_locks[locking.LEVEL_NODE_RES] = \
>> > >        CopyLockList(self.needed_locks[locking.LEVEL_NODE])
>> > >
>> > > +    # Optimistically acquire shared group locks (we're reading the
>> > > +    # configuration).  We can't just call GetInstanceNodeGroups, because the
>> > > +    # instance doesn't exist yet. Therefore we lock all node groups of all
>> > > +    # nodes we have.
>> > > +    if self.needed_locks[locking.LEVEL_NODE] == locking.ALL_SET:
>> > > +      # In the case we lock all nodes for opportunistic allocation, we have no
>> > > +      # choice but to lock all groups, because they're allocated before nodes.
>> > > +      # This is sad, but true. At least we release all those we don't need in
>> > > +      # CheckPrereq later.
>> > > +      self.needed_locks[locking.LEVEL_NODEGROUP] = locking.ALL_SET
>> > > +    else:
>> > > +      self.needed_locks[locking.LEVEL_NODEGROUP] = \
>> > > +        list(self.cfg.GetNodeGroupsFromNodes(
>> > > +          self.needed_locks[locking.LEVEL_NODE]))
>> > > +    self.share_locks[locking.LEVEL_NODEGROUP] = 1
>> > > +
>> > >    def DeclareLocks(self, level):
>> > >      if level == locking.LEVEL_NODE_RES and \
>> > >        self.opportunistic_locks[locking.LEVEL_NODE]:
>> > > @@ -846,6 +862,21 @@ class LUInstanceCreate(LogicalUnit):
>> > >      """Check prerequisites.
>> > >
>> > >      """
>> > > +    # Check that the optimistically acquired groups are correct wrt the
>> > > +    # acquired nodes
>> > > +    owned_groups = frozenset(self.owned_locks(locking.LEVEL_NODEGROUP))
>> > > +    owned_nodes = frozenset(self.owned_locks(locking.LEVEL_NODE))
>> > > +    cur_groups = list(self.cfg.GetNodeGroupsFromNodes(owned_nodes))
>> > > +    if not owned_groups.issuperset(cur_groups):
>> > > +      raise errors.OpPrereqError("New instance %s's node groups changed since"
>> > > +                                 " locks were acquired, current groups"
>> > > +                                 " are '%s', owning groups '%s'; retry the"
>> > > +                                 " operation" %
>> > > +                                 (self.op.instance_name,
>> > > +                                  utils.CommaJoin(cur_groups),
>> > > +                                  utils.CommaJoin(owned_groups)),
>> > > +                                 errors.ECODE_STATE)
>> > > +
>> > >      self._CalculateFileStorageDir()
>> > >
>> > >      if self.op.mode == constants.INSTANCE_IMPORT:
>> > > @@ -957,6 +988,9 @@ class LUInstanceCreate(LogicalUnit):
>> > >      ReleaseLocks(self, locking.LEVEL_NODE, keep=keep_locks)
>> > >      ReleaseLocks(self, locking.LEVEL_NODE_RES, keep=keep_locks)
>> > >      ReleaseLocks(self, locking.LEVEL_NODE_ALLOC)
>> > > +    # Release all unneeded group locks
>> > > +    ReleaseLocks(self, locking.LEVEL_NODEGROUP,
>> > > +                 keep=self.cfg.GetNodeGroupsFromNodes(keep_locks))
>> > >
>> > >      assert (self.owned_locks(locking.LEVEL_NODE) ==
>> > >              self.owned_locks(locking.LEVEL_NODE_RES)), \
>> > > --
>> > > 1.8.4.1
>> > >
>> > >
>> >
>> >
>> > --
>> > Thomas Thrainer | Software Engineer | [email protected] |
>> >
>> > Google Germany GmbH
>> > Dienerstr. 12
>> > 80331 München
>> >
>> > Registergericht und -nummer: Hamburg, HRB 86891
>> > Sitz der Gesellschaft: Hamburg
>> > Geschäftsführer: Graham Law, Christine Elizabeth Flores
>>
>>
>



