On Wed, 27 Mar 2019, 18:26 Pranith Kumar Karampuri, <pkara...@redhat.com> wrote:
> On Wed, Mar 27, 2019 at 8:38 PM Xavi Hernandez <jaher...@redhat.com> wrote:
>
>> On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
>>
>>> On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez <jaher...@redhat.com> wrote:
>>>
>>>> On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
>>>>
>>>>> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez <jaher...@redhat.com> wrote:
>>>>>
>>>>>> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>>>>>>
>>>>>>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez <jaher...@redhat.com> wrote:
>>>>>>>
>>>>>>>> Hi Raghavendra,
>>>>>>>>
>>>>>>>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>>>>>>>>
>>>>>>>>> All,
>>>>>>>>>
>>>>>>>>> Glusterfs cleans up POSIX locks held on an fd when the client/mount through which those locks are held disconnects from the bricks/server. This helps Glusterfs avoid a stale lock problem later (e.g., if the application unlocks while the connection is still down). However, this means the lock is no longer exclusive, as other applications/clients can acquire the same lock. To communicate that the locks are no longer valid, we are planning to mark the fd (which has POSIX locks) bad on a disconnect, so that any future operations on that fd will fail, forcing the application to re-open the fd and re-acquire the locks it needs [1].
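For reference, this is roughly what that re-open/re-acquire cycle would look like from the application side. A minimal sketch, assuming the bad fd surfaces as EBADF or EBADFD on the next syscall (the exact errno is one of the open questions here):

/*
 * Illustrative only -- not from any Gluster API. On failure with
 * EBADF/EBADFD (fd marked bad after a disconnect), re-open the file,
 * re-acquire the lock and retry the operation once.
 */
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int open_and_lock(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;    /* exclusive write lock           */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;           /* length 0 = lock the whole file */

    if (fcntl(fd, F_SETLKW, &fl) < 0) {  /* block until granted */
        close(fd);
        return -1;
    }
    return fd;
}

static ssize_t locked_pwrite(int *fdp, const char *path,
                             const void *buf, size_t len, off_t off)
{
    ssize_t ret = pwrite(*fdp, buf, len, off);
    if (ret < 0 && (errno == EBADF || errno == EBADFD)) {
        /* fd went bad after a disconnect: re-open and re-lock */
        close(*fdp);
        *fdp = open_and_lock(path);
        if (*fdp < 0)
            return -1;
        ret = pwrite(*fdp, buf, len, off);  /* single retry */
    }
    return ret;
}

Most applications don't implement anything like this today, which is part of the concern raised further down in the thread.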
>>>>>>>> Wouldn't it be better to retake the locks when the brick is reconnected, if the lock is still in use?
>>>>>>>
>>>>>>> There is also the possibility that clients may never reconnect. That's the primary reason why bricks assume the worst (the client will not reconnect) and clean up the locks.
>>>>>>
>>>>>> True, so it's fine to clean up the locks. I'm not saying that locks shouldn't be released on disconnect. The assumption is that if the client has really died, it will also disconnect from the other bricks, which will release the locks. So, eventually, another client will have enough quorum to attempt a lock that will succeed. In other words, if a client gets disconnected from too many bricks simultaneously (loses quorum), then that client can be considered bad and can return errors to the application. This should also cause the locks on the remaining connected bricks to be released.
>>>>>>
>>>>>> On the other hand, if the disconnection is very short and the client has not died, it will keep enough locked files (it has quorum) to prevent other clients from successfully acquiring a lock. In this case, if the brick is reconnected, all existing locks should be reacquired to recover the original state before the disconnection.
>>>>>>
>>>>>>>> BTW, the referenced bug is not public. Should we open another bug to track this?
>>>>>>>
>>>>>>> I've just opened up the comment to give enough context. I'll open a bug upstream too.
>>>>>>>
>>>>>>>>> Note that with AFR/replicate in the picture we can prevent errors to the application as long as a quorum number of children have "never ever" lost their connection to the bricks after the locks were acquired. I am using the term "never ever" because locks are not healed back after re-connection, so the first disconnect would have marked the fd bad, and the fd remains bad even after re-connection happens. So it's not just a quorum number of children "currently online", but a quorum number of children "never having disconnected from the bricks after the locks were acquired".
>>>>>>>>
>>>>>>>> I think this requirement is not feasible. In a distributed file system, sooner or later all bricks will be disconnected. It could be because of failures or because an upgrade is done, but it will happen.
>>>>>>>>
>>>>>>>> The difference here is how long fds are kept open. If applications open and close files frequently enough (i.e. the fd is not kept open longer than it takes for more than a quorum of bricks to disconnect), then there's no problem. The problem can only appear in applications that keep files open for a long time and also use POSIX locks. In that case, the only good solution I see is to retake the locks on brick reconnection.
>>>>>>>
>>>>>>> Agree. But lock-healing should be done only by HA layers like AFR/EC, as only they know whether there are enough online bricks to have prevented any conflicting lock. Protocol/client itself doesn't have enough information to do that. If it's a plain distribute volume, I don't see a way to heal locks without losing the property of exclusivity of locks.
>>>>>>
>>>>>> Lock-healing of locks acquired while a brick was disconnected needs to be handled by AFR/EC. However, locks already present at the moment of disconnection could be recovered by the client xlator itself, as long as the file has not been closed (which the client xlator already knows).
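For what it's worth, a rough sketch of the state the client xlator would need to keep for that. All names here are hypothetical, not the actual protocol/client structures; the point is just that one record per granted lock, kept until the fd is closed, is enough to replay on reconnect:

/*
 * Hypothetical sketch only -- these are NOT the real protocol/client
 * structures. The idea: remember every granted lock per fd and replay
 * the list on reconnect, unless AFR/EC has told us to forget the fd
 * because quorum was lost in between.
 */
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

struct saved_lock {
    struct saved_lock *next;
    int32_t cmd;        /* F_SETLK / F_SETLKW as originally issued */
    short l_type;       /* F_RDLCK / F_WRLCK                       */
    off_t l_start;
    off_t l_len;
    pid_t owner;        /* lock owner (lk-owner, in gluster terms) */
};

struct fd_lock_state {
    uint64_t fd_id;             /* remote fd handle                    */
    bool fd_bad;                /* set when AFR/EC reports quorum loss */
    struct saved_lock *locks;   /* granted locks to replay             */
};

/* On reconnect: after re-opening the fd, re-issue every saved lock.
 * If any replay fails, the fd has to be marked bad instead. */
static bool replay_locks(struct fd_lock_state *st,
                         int (*send_lk)(uint64_t fd_id,
                                        const struct saved_lock *l))
{
    if (st->fd_bad)
        return false;   /* AFR/EC already gave up on this fd */

    for (const struct saved_lock *l = st->locks; l; l = l->next) {
        if (send_lk(st->fd_id, l) != 0) {
            st->fd_bad = true;   /* partial replay: give up */
            return false;
        }
    }
    return true;
}

The fd_bad flag is the important bit: only AFR/EC can set it, because only they know whether quorum was kept while the brick was away.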
>>>>> What if another client (say mount-2) took locks at the time of the disconnect from mount-1, modified the file and unlocked? The client xlator doing the heal may not be a good idea.
>>>>
>>>> To avoid that, we should ensure that any lock/unlocks are sent to the client, even if we know it's disconnected, so that the client xlator can track them. The alternative is to duplicate and maintain code in both AFR and EC (and not sure if even in DHT, depending on how we want to handle some cases).
>>>
>>> Didn't understand the solution. I wanted to highlight that the client xlator by itself can't make a decision about healing locks because it doesn't know what happened on the other replicas. Say we have a replica-3 volume and the mount gets disconnected from all 3 bricks. Now another mount process can take a lock on that file, modify it and unlock. Upon reconnection, the old mount process which had the locks would think it always had the lock if the client xlator independently tries to heal its own locks, because the file has not been closed on it so far. But that is wrong. Let me know if it makes sense....
>>
>> My point of view is that any configuration with these requirements will have an appropriate quorum value so that it's impossible to have two or more partitions of the nodes working at the same time. So, under these assumptions, mount-1 can be in two situations:
>>
>> 1. It has lost a single brick and it's still operational. The other bricks will remain locked and everything should work fine from the point of view of the application. Any other application trying to get a lock will fail due to lack of quorum. When the lost brick comes back and is reconnected, the client xlator will still have the fd reference and the locks taken (unless the application has released the lock or closed the fd, in which case the client xlator should get notified and clear that information), so it should be able to recover the previous state.
>
> The application could be in a blocked state as well, if it tries to get a blocking lock. So as soon as a disconnect happens, the lock will be granted on that brick to one of the blocked locks, while on the other two bricks it would still be blocked. Healing that will require a new operation that is not already present in the locks code, which should be able to tell the client to either change the lock state back to blocked on that brick or to retry the lock operation.

Yes, but this problem exists even if the lock-heal is done by AFR/EC. This is something that needs to be solved anyway, but it's independent of who does the lock-heal.
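To make the blocked-lock case concrete, here is a standalone demo of plain POSIX lock semantics (no Gluster involved): a waiter blocked in F_SETLKW is granted the lock as soon as the holder goes away, and there is no mechanism to push it back into the blocked state afterwards.

/*
 * The child takes a write lock and exits (standing in for the brick
 * cleaning up a disconnected client's locks); the parent, blocked in
 * F_SETLKW, is granted the lock the moment that happens.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/lk-demo", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;   /* start 0, len 0 = whole file */

    pid_t pid = fork();
    if (pid == 0) {                   /* child: the "lock holder"    */
        int cfd = open("/tmp/lk-demo", O_RDWR);
        fcntl(cfd, F_SETLK, &fl);     /* grab the lock               */
        sleep(2);                     /* hold it for a while         */
        _exit(0);                     /* "disconnect": lock released */
    }

    sleep(1);                         /* let the child lock first    */
    printf("parent: blocking in F_SETLKW...\n");
    fcntl(fd, F_SETLKW, &fl);         /* blocks, then gets granted   */
    printf("parent: lock granted after the holder went away\n");

    waitpid(pid, NULL, 0);
    return 0;
}

On a replicated volume, that waiter would now hold the lock on the brick that saw the disconnect while still being queued on the other two, which is exactly the inconsistent state described above.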
>> 2. It has lost 2 or 3 bricks. In this case mount-1 has lost quorum and any operation going to that file should fail with EIO. AFR should send a special request to the client xlator so that it forgets any fds and locks for that file. If bricks reconnect after that, no fd reopen or lock recovery will happen. Eventually the application should close the fd and retry later. This may succeed or not, depending on whether mount-2 has already taken the lock.
>>
>> So, it's true that the client xlator doesn't know the state of the other bricks, but it doesn't need to, as long as AFR/EC strictly enforces quorum and updates the client xlator when quorum is lost.
>
> This part seems good.
>
>> I haven't worked out all the details of this approach, but I think it should work, and it's simpler to maintain than trying to do the same in both AFR and EC.
>
> Let us spend some time on this on #gluster-dev when you get some time tomorrow, to figure out the complete solution which handles the corner cases too.
>
>> Xavi
>>
>>>> A similar thing could be done for open fds, since the current solution duplicates code in AFR and EC, but this is another topic...
>>>>
>>>>>> Xavi
>>>>>>
>>>>>>> What I proposed is a short-term solution. The mid-to-long-term solution should be a lock-healing feature implemented in AFR/EC. In fact, I had this conversation with +Karampuri, Pranith <pkara...@redhat.com> before posting this msg to the ML.
>>>>>>>
>>>>>>>>> However, this use case is not affected if the application doesn't acquire any POSIX locks. So, I am interested in knowing:
>>>>>>>>> * whether your use cases use POSIX locks?
>>>>>>>>> * Is it feasible for your application to re-open fds and re-acquire locks on seeing EBADFD errors?
>>>>>>>>
>>>>>>>> I think that many applications are not prepared to handle that.
>>>>>>>
>>>>>>> I too suspected that, and in fact I'm not too happy with the solution. But I went ahead with this mail as I heard that implementing lock-heal in AFR will take time, and hence there are no alternative short-term solutions.
>>>>>>>
>>>>>>>> Xavi
>>>>>>>
>>>>>>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7
>>>>>>>>>
>>>>>>>>> regards,
>>>>>>>>> Raghavendra
>>>>>
>>>>> --
>>>>> Pranith
>>>
>>> --
>>> Pranith
>
> --
> Pranith
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users