On Thu, Mar 28, 2019 at 2:37 PM Xavi Hernandez <jaher...@redhat.com> wrote:
> On Thu, Mar 28, 2019 at 3:05 AM Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>>
>> On Wed, Mar 27, 2019 at 8:38 PM Xavi Hernandez <jaher...@redhat.com> wrote:
>>>
>>> On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
>>>>
>>>> On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez <jaher...@redhat.com> wrote:
>>>>>
>>>>> On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
>>>>>>
>>>>>> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez <jaher...@redhat.com> wrote:
>>>>>>>
>>>>>>> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>>>>>>>>
>>>>>>>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez <jaher...@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Raghavendra,
>>>>>>>>>
>>>>>>>>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>> All,
>>>>>>>>>>
>>>>>>>>>> Glusterfs cleans up POSIX locks held on an fd when the client/mount through which those locks are held disconnects from the bricks/server. This helps Glusterfs avoid running into a stale-lock problem later (for example, if the application unlocks while the connection is still down). However, it means the lock is no longer exclusive, as other applications/clients can acquire the same lock. To communicate that locks are no longer valid, we are planning to mark the fd (which has POSIX locks) bad on a disconnect, so that any future operations on that fd will fail, forcing the application to re-open the fd and re-acquire the locks it needs [1].
>>>>>>>>>
>>>>>>>>> Wouldn't it be better to retake the locks when the brick is reconnected, if the lock is still in use?
>>>>>>>>
>>>>>>>> There is also a possibility that clients may never reconnect. That's the primary reason why bricks assume the worst (the client will not reconnect) and clean up the locks.
>>>>>>>
>>>>>>> True, so it's fine to clean up the locks. I'm not saying that locks shouldn't be released on disconnect. The assumption is that if the client has really died, it will also disconnect from the other bricks, which will release the locks. So, eventually, another client will have enough quorum to attempt a lock that will succeed. In other words, if a client gets disconnected from too many bricks simultaneously (loses quorum), then that client can be considered bad and can return errors to the application. This should also cause the locks on the remaining connected bricks to be released.
>>>>>>>
>>>>>>> On the other hand, if the disconnection is very short and the client has not died, it will keep enough locked files (it has quorum) to prevent other clients from successfully acquiring a lock. In this case, if the brick is reconnected, all existing locks should be reacquired to recover the original state before the disconnection.
>>>>>>>
>>>>>>>>> BTW, the referenced bug is not public. Should we open another bug to track this?
>>>>>>>>
>>>>>>>> I've just opened up the comment to give enough context. I'll open a bug upstream too.
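To make the proposal concrete for application developers: this is roughly the recovery an application would need to implement once fds start being marked bad. A minimal sketch only, assuming the failure surfaces as EBADF/EBADFD on the old fd; the whole-file write lock is just an example:

    #include <fcntl.h>
    #include <unistd.h>

    /* Sketch: called after an I/O on old_fd has failed with EBADF/EBADFD
     * because the mount marked the fd bad on a brick disconnect.
     * Re-opens the file and re-acquires the POSIX lock.
     * Returns the new fd, or -1 on failure. */
    static int
    reacquire_after_disconnect(const char *path, int old_fd)
    {
        struct flock fl = {
            .l_type   = F_WRLCK,   /* example: whole-file write lock */
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 0,         /* 0 means "to end of file" */
        };
        int fd;

        close(old_fd);             /* the old fd is unusable, drop it */

        fd = open(path, O_RDWR);
        if (fd < 0)
            return -1;

        /* The lock is NOT guaranteed to still be "ours": another client
         * may have locked and modified the file in the meantime, so the
         * application must revalidate its state once this succeeds. */
        if (fcntl(fd, F_SETLKW, &fl) < 0) {
            close(fd);
            return -1;
        }

        return fd;
    }

The comment in the middle is the crux of the discussion below: even when this succeeds, the application cannot assume the lock was held continuously, which is why retrofitting this into existing applications is hard.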
>>>>>>>>>>
>>>>>>>>>> Note that with AFR/replicate in the picture, we can prevent errors to the application as long as a quorum number of children have "never ever" lost their connection with the bricks after the locks were acquired. I am using the term "never ever" because locks are not healed back after reconnection, so the first disconnect would have marked the fd bad, and the fd remains bad even after reconnection happens. So, it's not just a quorum number of children "currently online", but a quorum number of children "never having disconnected from the bricks after the locks were acquired".
>>>>>>>>>
>>>>>>>>> I think this requirement is not feasible. In a distributed file system, sooner or later all bricks will be disconnected. It could be because of failures or because an upgrade is done, but it will happen.
>>>>>>>>>
>>>>>>>>> The difference here is how long fds are kept open. If applications open and close files frequently enough (i.e. an fd is not kept open longer than it takes for more than a quorum of bricks to disconnect), then there's no problem. The problem can only appear in applications that keep files open for a long time and also use POSIX locks. In this case, the only good solution I see is to retake the locks on brick reconnection.
>>>>>>>>
>>>>>>>> Agree. But lock healing should be done only by HA layers like AFR/EC, as only they know whether there were enough online bricks to have prevented any conflicting lock. Protocol/client itself doesn't have enough information to do that. If it's a plain distribute volume, I don't see a way to heal locks without losing the property of exclusivity of locks.
>>>>>>>
>>>>>>> Lock healing of locks acquired while a brick was disconnected needs to be handled by AFR/EC. However, locks already present at the moment of disconnection could be recovered by the client xlator itself, as long as the file has not been closed (which the client xlator already knows).
>>>>>>
>>>>>> What if another client (say mount-2) took locks at the time of the disconnect from mount-1, modified the file, and unlocked? The client xlator doing the heal may not be a good idea.
>>>>>
>>>>> To avoid that, we should ensure that any lock/unlock is sent to the client xlator, even if we know it's disconnected, so that it can track them. The alternative is to duplicate and maintain code in both AFR and EC (and I'm not sure if even in DHT, depending on how we want to handle some cases).
>>>>
>>>> Didn't understand the solution. I wanted to highlight that the client xlator by itself can't make a decision about healing locks because it doesn't know what happened on the other replicas. Suppose we have a replica-3 volume and all 3 client xlators get disconnected from their respective bricks. Now another mount process can take a lock on that file, modify it, and unlock. Upon reconnection, the old mount process which held the locks would think it always had the lock if the client xlator independently tries to heal its own locks, because the file has not been closed on it so far. But that is wrong. Let me know if it makes sense....
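For reference, the client-side tracking being discussed here would look roughly like the sketch below. This is a hypothetical illustration, not actual client xlator code: the structure and function names are invented (the real xlator would hang such state off its fd context), and the comment on replay_locks() restates the objection above:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/types.h>

    /* Hypothetical per-fd record of granted POSIX locks, kept on the
     * client side so they could be replayed after a reconnect. */
    struct saved_lock {
        struct saved_lock *next;
        short              l_type;   /* F_RDLCK or F_WRLCK */
        off_t              l_start;
        off_t              l_len;
    };

    /* Record a lock once the brick has granted it; an unlock would
     * remove the matching range (omitted for brevity). */
    static void
    track_lock(struct saved_lock **head, short type, off_t start, off_t len)
    {
        struct saved_lock *l = calloc(1, sizeof(*l));

        if (!l)
            return;
        l->l_type  = type;
        l->l_start = start;
        l->l_len   = len;
        l->next    = *head;
        *head      = l;
    }

    /* Replay every tracked lock on a freshly reconnected fd. This is
     * only safe if the HA layer (AFR/EC) guarantees quorum was never
     * lost in the meantime; otherwise another client may have locked
     * and modified the file, exactly as described above. */
    static int
    replay_locks(int fd, const struct saved_lock *head)
    {
        for (; head; head = head->next) {
            struct flock fl = {
                .l_type   = head->l_type,
                .l_whence = SEEK_SET,
                .l_start  = head->l_start,
                .l_len    = head->l_len,
            };

            /* fcntl() stands in for the LK fop the xlator would resend. */
            if (fcntl(fd, F_SETLK, &fl) < 0)
                return -1;   /* replay failed: the fd must stay bad */
        }
        return 0;
    }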
>>>
>>> My point of view is that any configuration with these requirements will have an appropriate quorum value, so that it's impossible to have two or more partitions of the nodes working at the same time. So, under this assumption, mount-1 can be in two situations:
>>>
>>> 1. It has lost a single brick and is still operational. The other bricks will remain locked and everything should work fine from the point of view of the application. Any other application trying to get a lock will fail due to lack of quorum. When the lost brick comes back and is reconnected, the client xlator will still have the fd reference and the locks taken (unless the application has released the lock or closed the fd, in which case the client xlator should get notified and clear that information), so it should be able to recover the previous state.
>>>
>>> 2. It has lost 2 or 3 bricks. In this case mount-1 has lost quorum and any operation going to that file should fail with EIO. AFR should send a special request to the client xlator so that it forgets any fds and locks for that file. If bricks reconnect after that, no fd reopen or lock recovery will happen. Eventually the application should close the fd and retry later. This may succeed or not, depending on whether mount-2 has already taken the lock.
>>>
>>> So, it's true that the client xlator doesn't know the state of the other bricks, but it doesn't need to, as long as AFR/EC strictly enforces quorum and updates the client xlator when quorum is lost.
>>
>> Just curious. Is there any reason why you think delegating the actual responsibility of re-opening fds or forgetting the locks to protocol/client is better than AFR/EC doing the actual work of re-opening files and reacquiring locks? Asking this because, in the case of plain distribute, DHT will also have to indicate quorum loss on every disconnect (as the quorum consists of just 1 brick).
>
> The basic reason is that doing that in AFR and EC requires code duplication. The code is not expected to be simple either, so it can contain bugs or it could require improvements eventually. Every time we want to make a change, we would have to fix both AFR and EC, and this has not happened in many cases in the past for features that are already duplicated in AFR and EC, so it's quite unlikely that it will happen in the future.

That's a good reason. +1.

> Regarding the requirement of sending a quorum-loss notification from DHT, I agree it's a new thing, but it's way simpler to do than the fd and lock heal logic.
>
> Xavi
>
>> From what I understand, the design is, in essence, the same one that Pranith, Anoop, Vijay and I had discussed, but it varies in implementation details.
>>
>>> I haven't worked out all the details of this approach, but I think it should work, and it's simpler to maintain than trying to do the same in both AFR and EC.
>>>
>>> Xavi
>>>
>>>>> A similar thing could be done for open fds, since the current solution duplicates code in AFR and EC, but that is another topic...
>>>>>>>
>>>>>>> Xavi
>>>>>>>
>>>>>>>> What I proposed is a short-term solution. The mid-to-long-term solution should be a lock-healing feature implemented in AFR/EC. In fact, I had this conversation with +Karampuri, Pranith <pkara...@redhat.com> before posting this msg to the ML.
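Spelling out the AFR/EC side of the two scenarios above: the reaction to a child going down would reduce to something like the sketch below. Everything in it is invented for illustration (the types, the quorum bookkeeping, and the two client_* notifications have no real counterpart today), but it shows how little logic the HA layer needs once the client xlator owns the fd and lock state:

    /* Stand-in types; these do not match the real AFR structures. */
    struct client_xlator;

    struct afr_volume {
        int                    child_count;
        int                    quorum_count;
        int                   *child_up;   /* 1 if brick i is connected */
        struct client_xlator **children;
    };

    /* Hypothetical hooks into the client xlator. */
    void client_mark_reopen_pending(struct client_xlator *c);
    void client_forget_locks(struct client_xlator *c);

    /* Reaction to a child (brick) going down. */
    static void
    on_child_down(struct afr_volume *vol, int child)
    {
        int i, up = 0;

        vol->child_up[child] = 0;
        for (i = 0; i < vol->child_count; i++)
            up += vol->child_up[i];

        if (up >= vol->quorum_count) {
            /* Scenario 1: quorum kept. The client xlator keeps its fd
             * and lock records for the lost child and replays them
             * when the brick returns. */
            client_mark_reopen_pending(vol->children[child]);
        } else {
            /* Scenario 2: quorum lost. AFR fails application I/O with
             * EIO and tells every child to forget fds and locks, so
             * nothing stale is replayed on a later reconnect. */
            for (i = 0; i < vol->child_count; i++)
                client_forget_locks(vol->children[i]);
        }
    }

DHT on a plain distribute volume would be the degenerate case of this: quorum_count == child_count == 1, so every disconnect is a quorum loss.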
>>>>>>>>>>
>>>>>>>>>> However, this use case is not affected if the application doesn't acquire any POSIX locks. So, I am interested in knowing:
>>>>>>>>>> * whether your use cases use POSIX locks?
>>>>>>>>>> * whether it is feasible for your application to re-open fds and re-acquire locks on seeing EBADFD errors?
>>>>>>>>>
>>>>>>>>> I think that many applications are not prepared to handle that.
>>>>>>>>
>>>>>>>> I too suspected that and, in fact, am not too happy with the solution. But I went ahead with this mail as I heard that implementing lock healing in AFR will take time, and hence there are no alternative short-term solutions.
>>>>>>>>
>>>>>>>>> Xavi
>>>>>>>>>
>>>>>>>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7
>>>>>>>>>>
>>>>>>>>>> regards,
>>>>>>>>>> Raghavendra
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>> gluster-us...@gluster.org
>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>
>>>>>> --
>>>>>> Pranith
>>>>
>>>> --
>>>> Pranith
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel