On Wed, Mar 27, 2019 at 8:38 PM Xavi Hernandez <jaher...@redhat.com> wrote:
> On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
>
>> On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez <jaher...@redhat.com> wrote:
>>
>>> On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
>>>
>>>> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez <jaher...@redhat.com> wrote:
>>>>
>>>>> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>>>>>
>>>>>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez <jaher...@redhat.com> wrote:
>>>>>>
>>>>>>> Hi Raghavendra,
>>>>>>>
>>>>>>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>>>>>>>
>>>>>>>> All,
>>>>>>>>
>>>>>>>> Glusterfs cleans up POSIX locks held on an fd when the client/mount through which those locks were taken disconnects from the bricks/server. This helps Glusterfs avoid a stale-lock problem later (for example, if the application unlocks while the connection is still down). However, it also means the lock is no longer exclusive, as other applications/clients can acquire the same lock. To communicate that the locks are no longer valid, we are planning to mark the fd (which has POSIX locks) bad on a disconnect, so that any future operation on that fd will fail, forcing the application to re-open the fd and re-acquire the locks it needs [1].
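
To make the proposed application contract concrete, below is a minimal sketch of what re-opening and re-locking on a bad fd could look like from the application side, using plain fcntl(2) POSIX locks. This is illustrative only: the helper names are made up, and I use EBADF as the error the application would observe.

/* Minimal illustrative sketch (not Gluster code): an application that
 * reacts to a bad fd after a disconnect by re-opening the file and
 * re-acquiring its POSIX lock before retrying the write. */
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Open the file and take a blocking exclusive lock on all of it. */
static int
open_and_lock(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;   /* exclusive write lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;          /* 0 means "to end of file" */

    if (fcntl(fd, F_SETLKW, &fl) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Write, and on a bad-fd error re-open, re-lock and retry once. */
static ssize_t
write_with_recovery(int *fdp, const char *path, const void *buf, size_t len)
{
    ssize_t ret = write(*fdp, buf, len);
    if (ret < 0 && errno == EBADF) {
        /* The fd was marked bad after a disconnect. Any state that the
         * old lock protected must be re-validated after re-locking,
         * because another client may have held the lock in between. */
        close(*fdp);
        *fdp = open_and_lock(path);
        if (*fdp < 0)
            return -1;
        ret = write(*fdp, buf, len);
    }
    return ret;
}

Even when the retry succeeds, another client may have held the lock in between, so any state the lock protected has to be re-validated; that is exactly why failing the fd loudly is safer than leaving it silently usable.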
>>>>>>> Wouldn't it be better to retake the locks when the brick is reconnected, if the lock is still in use?
>>>>>>
>>>>>> There is also the possibility that clients may never reconnect. That's the primary reason why bricks assume the worst (the client will not reconnect) and clean up the locks.
>>>>>
>>>>> True, so it's fine to clean up the locks. I'm not saying that locks shouldn't be released on disconnect. The assumption is that if the client has really died, it will also disconnect from the other bricks, which will release the locks. So, eventually, another client will have enough quorum to attempt a lock that will succeed. In other words, if a client gets disconnected from too many bricks simultaneously (loses quorum), then that client can be considered bad and can return errors to the application. This should also cause the locks on the remaining connected bricks to be released.
>>>>>
>>>>> On the other hand, if the disconnection is very short and the client has not died, it will keep enough locked files (it has quorum) to prevent other clients from successfully acquiring a lock. In this case, if the brick is reconnected, all existing locks should be reacquired to recover the original state from before the disconnection.
>>>>>
>>>>>>> BTW, the referenced bug is not public. Should we open another bug to track this?
>>>>>>
>>>>>> I've just opened up the comment to give enough context. I'll open a bug upstream too.
>>>>>>
>>>>>>>> Note that with AFR/replicate in the picture, we can prevent errors to the application as long as a quorum number of children "never ever" lost their connection to the bricks after the locks were acquired. I am using the term "never ever" because locks are not healed back after reconnection, so the first disconnect would have marked the fd bad and it remains bad even after reconnection happens. So, it's not just a quorum number of children "currently online", but a quorum number of children "never having disconnected from the bricks after the locks were acquired".
>>>>>>>
>>>>>>> I think this requisite is not feasible. In a distributed file system, sooner or later all bricks will be disconnected. It could be because of failures or because an upgrade is done, but it will happen.
>>>>>>>
>>>>>>> The difference here is how long fds are kept open. If applications open and close files frequently enough (i.e. an fd is not kept open longer than it takes for more than a quorum of bricks to disconnect), then there's no problem. The problem can only appear in applications that keep files open for a long time and also use POSIX locks. In this case, the only good solution I see is to retake the locks on brick reconnection.
>>>>>>
>>>>>> Agree. But lock-healing should be done only by HA layers like AFR/EC, as only they know whether there are enough online bricks to have prevented any conflicting lock. Protocol/client itself doesn't have enough information to do that. If it's a plain distribute volume, I don't see a way to heal locks without losing the exclusivity property of the locks.
>>>>>
>>>>> Lock-healing of locks acquired while a brick was disconnected needs to be handled by AFR/EC. However, locks already present at the moment of disconnection could be recovered by the client xlator itself, as long as the file has not been closed (which the client xlator already knows).
>>>>
>>>> What if another client (say mount-2) took locks at the time of the disconnect from mount-1, modified the file, and unlocked? The client xlator doing the heal may not be a good idea.
>>>
>>> To avoid that, we should ensure that any lock/unlock is sent to the client xlator, even if we know it's disconnected, so that it can track them. The alternative is to duplicate and maintain that code in both AFR and EC (and possibly in DHT too, depending on how we want to handle some cases).
>>
>> Didn't understand the solution. I wanted to highlight that the client xlator by itself can't make a decision about healing locks because it doesn't know what happened on the other replicas. Say we have a replica-3 volume and all 3 connections to the respective bricks are lost. Another mount process can then take a lock on a file, modify it and unlock. Upon reconnection, the old mount process, which had the locks, would think it always held the lock if the client xlator independently tried to heal its own locks, because the file has not been closed on it so far. But that is wrong. Let me know if it makes sense...
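
As an illustration of the tracking being discussed, here is a rough sketch of per-fd lock bookkeeping inside the client xlator, with the replay gated on a quorum flag that an HA layer (AFR/EC) would have to provide, as elaborated just below. All names and structures here are hypothetical; this is not the actual protocol/client code.

/* Hypothetical sketch of per-fd lock bookkeeping in the client xlator.
 * Every granted lock/unlock updates this state, so that on reconnect
 * the locks can be replayed -- but only when the HA layer says it is
 * safe. None of these names exist in the real code. */
#include <stdlib.h>
#include <sys/types.h>

struct saved_lock {
    struct saved_lock *next;
    short l_type;   /* F_RDLCK or F_WRLCK */
    off_t l_start;
    off_t l_len;
};

struct tracked_fd {
    int remote_fd;            /* fd as known to the brick */
    int is_open;              /* cleared when the app closes the fd */
    struct saved_lock *locks; /* locks currently granted on this fd */
};

/* Record a granted lock so it can be replayed after a reconnect. */
static int
track_lock(struct tracked_fd *tfd, short type, off_t start, off_t len)
{
    struct saved_lock *sl = calloc(1, sizeof(*sl));
    if (!sl)
        return -1;
    sl->l_type = type;
    sl->l_start = start;
    sl->l_len = len;
    sl->next = tfd->locks;
    tfd->locks = sl;
    return 0;
}

/* On reconnect: replay only if the fd is still open AND the HA layer
 * (AFR/EC) confirms quorum was never lost. Otherwise the lock may have
 * been legitimately taken by another client while we were gone, and
 * the fd must be marked bad instead. */
static int
replay_locks(struct tracked_fd *tfd, int quorum_held)
{
    if (!tfd->is_open || !quorum_held)
        return -1;
    for (struct saved_lock *sl = tfd->locks; sl != NULL; sl = sl->next) {
        /* send_lk_request(tfd->remote_fd, sl); -- hypothetical RPC */
    }
    return 0;
}

The quorum_held gate is the crux: without it, the replay in the replica-3 scenario above would wrongly resurrect a lock that mount-2 may have legitimately held in the meantime.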
> My point of view is that any configuration with these requirements will have an appropriate quorum value, so that it's impossible to have two or more partitions of the nodes working at the same time. Under this assumption, mount-1 can be in two situations:
>
> 1. It has lost a single brick and is still operational. The other bricks will remain locked and everything should work fine from the point of view of the application. Any other application trying to get a lock will fail due to lack of quorum. When the lost brick comes back and is reconnected, the client xlator will still have the fd reference and the locks taken (unless the application has released the lock or closed the fd, in which case the client xlator should get notified and clear that information), so it should be able to recover the previous state.
>
> 2. It has lost 2 or 3 bricks. In this case mount-1 has lost quorum and any operation going to that file should fail with EIO. AFR should send a special request to the client xlator so that it forgets any fds and locks for that file. If bricks reconnect after that, no fd reopen or lock recovery will happen. Eventually the application should close the fd and retry later. This may or may not succeed, depending on whether mount-2 has already taken the lock.
>
> So, it's true that the client xlator doesn't know the state of the other bricks, but it doesn't need to, as long as AFR/EC strictly enforces quorum and updates the client xlator when quorum is lost.
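
A compressed sketch of that two-case decision, purely for illustration (the state names and function are invented; the real AFR logic is more involved):

/* Hypothetical sketch of the two cases above: count children that are
 * not down and compare against quorum. Illustrative only. */
enum child_state { CHILD_UP_SINCE_LOCK, CHILD_RECONNECTED, CHILD_DOWN };

/* Called when a child's connection state changes; n is the number of
 * children (e.g. 3 for replica-3). Returns 0 if the fd survives. */
static int
handle_child_event(const enum child_state *children, int n, int quorum)
{
    int usable = 0;

    for (int i = 0; i < n; i++)
        if (children[i] != CHILD_DOWN)
            usable++;

    if (usable >= quorum) {
        /* Case 1: quorum held. A CHILD_RECONNECTED member can have its
         * fd re-opened and its locks replayed by the client xlator. */
        return 0;
    }

    /* Case 2: quorum lost. Instruct the client xlator to forget the
     * fds and locks for this file so nothing is replayed on a later
     * reconnect; the application sees EIO until it closes and retries. */
    return -1;
}

The important property is that the forget in case 2 is irreversible for that fd: even if all bricks come back later, no reopen or lock recovery happens.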
Just curious: is there any reason why you think delegating the actual responsibility of re-opening fds or forgetting the locks to protocol/client is better than AFR/EC doing the actual work of re-opening files and reacquiring locks? Asking because, in the case of plain distribute, DHT would also have to indicate quorum loss on every disconnect (as the quorum consists of just 1 brick).

From what I understand, the design is in essence the same one that Pranith, Anoop, Vijay and I had discussed, but it varies in the implementation details.

> I haven't worked out all the details of this approach, but I think it should work, and it's simpler to maintain than trying to do the same in both AFR and EC.
>
> Xavi
>
>>> A similar thing could be done for open fds, since the current solution duplicates code in AFR and EC, but this is another topic...
>>>
>>>>> Xavi
>>>>>
>>>>>> What I proposed is a short-term solution. The mid-to-long-term solution should be a lock-healing feature implemented in AFR/EC. In fact, I had this conversation with +Karampuri, Pranith <pkara...@redhat.com> before posting this msg to the ML.
>>>>>>
>>>>>>>> However, this use case is not affected if the application doesn't acquire any POSIX locks. So, I am interested in knowing:
>>>>>>>> * whether your use cases use POSIX locks?
>>>>>>>> * whether it is feasible for your application to re-open fds and re-acquire locks on seeing EBADFD errors?
>>>>>>>
>>>>>>> I think that many applications are not prepared to handle that.
>>>>>>
>>>>>> I too suspected that, and in fact I'm not too happy with the solution. But I went ahead with this mail as I heard that implementing lock-heal in AFR will take time, and hence there are no alternative short-term solutions.
>>>>>>
>>>>>>> Xavi
>>>>>>>
>>>>>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7
>>>>>>>>
>>>>>>>> regards,
>>>>>>>> Raghavendra
>>>>
>>>> --
>>>> Pranith
>>
>> --
>> Pranith

_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users