On Mon, Apr 1, 2019 at 10:15 AM Soumya Koduri <skod...@redhat.com> wrote:
> > > On 4/1/19 10:02 AM, Pranith Kumar Karampuri wrote: > > > > > > On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri <skod...@redhat.com > > <mailto:skod...@redhat.com>> wrote: > > > > > > > > On 3/29/19 11:55 PM, Xavi Hernandez wrote: > > > Hi all, > > > > > > there is one potential problem with posix locks when used in a > > > replicated or dispersed volume. > > > > > > Some background: > > > > > > Posix locks allow any process to lock a region of a file multiple > > times, > > > but a single unlock on a given region will release all previous > > locks. > > > Locked regions can be different for each lock request and they can > > > overlap. The resulting lock will cover the union of all locked > > regions. > > > A single unlock (the region doesn't necessarily need to match any > > of the > > > ranges used for locking) will create a "hole" in the currently > > locked > > > region, independently of how many times a lock request covered > > that region. > > > > > > For this reason, the locks xlator simply combines the locked > regions > > > that are requested, but it doesn't track each individual lock > range. > > > > > > Under normal circumstances this works fine. But there are some > cases > > > where this behavior is not sufficient. For example, suppose we > > have a > > > replica 3 volume with quorum = 2. Given the special nature of > posix > > > locks, AFR sends the lock request sequentially to each one of the > > > bricks, to avoid that conflicting lock requests from other > > clients could > > > require to unlock an already locked region on the client that has > > not > > > got enough successful locks (i.e. quorum). An unlock here not > > only would > > > cancel the current lock request. It would also cancel any > previously > > > acquired lock. > > > > > > > I may not have fully understood, please correct me. AFAIU, lk xlator > > merges locks only if both the lk-owner and the client opaque matches. > > > > In the case which you have mentioned above, considering clientA > > acquired > > locks on majority of quorum (say nodeA and nodeB) and clientB on > nodeC > > alone- clientB now has to unlock/cancel the lock it acquired on > nodeC. > > > > You are saying the it could pose a problem if there were already > > successful locks taken by clientB for the same region which would get > > unlocked by this particular unlock request..right? > > > > Assuming the previous locks acquired by clientB are shared (otherwise > > clientA wouldn't have got granted lock for the same region on nodeA & > > nodeB), they would still hold true on nodeA & nodeB as the unlock > > request was sent to only nodeC. Since the majority of quorum nodes > > still > > hold the locks by clientB, this isn't serious issue IMO. > > > > I haven't looked into heal part but would like to understand if this > is > > really an issue in normal scenarios as well. > > > > > > This is how I understood the code. Consider the following case: > > Nodes A, B, C have locks with start and end offsets: 5-15 from mount-1 > > and lock-range 2-3 from mount-2. > > If mount-1 requests nonblocking lock with lock-range 1-7 and in parallel > > lets say mount-2 issued unlock of 2-3 as well. > > > > nodeA got unlock from mount-2 with range 2-3 then lock from mount-1 with > > range 1-7, so the lock is granted and merged to give 1-15 > > nodeB got lock from mount-1 with range 1-7 before unlock of 2-3 which > > leads to EAGAIN which will trigger unlocks on granted lock in mount-1 > > which will end up doing unlock of 1-7 on nodeA leading to lock-range > > 8-15 instead of the original 5-15 on nodeA. Whereas nodeB and nodeC will > > have range 5-15. > > > > Let me know if my understanding is wrong. > > Both of us mentioned the same points. So in the example you gave , > mount-1 lost its previous lock on nodeA but majority of the quorum > (nodeB and nodeC) still have the previous lock (range: 5-15) intact. So > this shouldn't ideally lead to any issues as other conflicting locks are > blocked or failed by majority of the nodes (provided there are no brick > dis/re-connects). > But brick disconnects will happen (upgrades, disk failures, server maintenance, ...). Anyway, even without brick disconnects, in the previous example we have nodeA with range 8-15, and nodes B and C with range 5-15. If another lock from mount-2 comes for range 5-7, it will succeed on nodeA, but it will block on nodeB. At this point, mount-1 could attempt a lock on same range. It will block on nodeA, so we have a deadlock. In general, having discrepancies between bricks is not good because sooner or later it will cause some bad inconsistency. > Wrt to brick disconnects/re-connects, if we can get in general lock > healing (not getting into implementation details atm) support, that > should take care of correcting lock range on nodeA as well right? > The problem we have seen is that to be able to correctly heal currently acquired locks on brick reconnect, there are cases where we need to release a lock that has already been granted (because the current owner doesn't have enough quorum and a just recovered connection tries to claim/heal it). In this case we need to deal with locks that have already been merged, but without interfering with other existing locks that already have quorum. > That said I am not suggesting that we should stick to existing behavior, > just trying to get clarification to check if we can avoid any > overhead/side-effects with maintaining multiple locks. > Right now is the only way we have found to provide a correct solution both for some cases of concurrent lock/unlock requests, and lock healing. Regards, Xavi > Thanks, > Soumya > > > > > > > > Thanks, > > Soumya > > > > > However, when something goes wrong (a brick dies during a lock > > request, > > > or there's a network partition or some other weird situation), it > > could > > > happen that even using sequential locking, only one brick > > succeeds the > > > lock request. In this case, AFR should cancel the previous lock > > (and it > > > does), but this also cancels any previously acquired lock on that > > > region, which is not good. > > > > > > A similar thing can happen if we try to recover (heal) posix > > locks that > > > were active after a brick has been disconnected (for any reason) > and > > > then reconnected. > > > > > > To fix all these situations we need to change the way posix locks > > are > > > managed by locks xlator. One possibility would be to embed the > lock > > > request inside an inode transaction using inodelk. Since inodelks > > do not > > > suffer this problem, the follwing posix lock could be sent safely. > > > However this implies an additional network request, which could > > cause > > > some performance impact. Eager-locking could minimize the impact > > in some > > > cases. However this approach won't work for lock recovery after a > > > disconnect. > > > > > > Another possibility is to send a special partial posix lock > request > > > which won't be immediately merged with already existing locks once > > > granted. An additional confirmation request of the partial posix > > lock > > > will be required to fully grant the current lock and merge it > > with the > > > existing ones. This requires a new network request, which will add > > > latency, and makes everything more complex since there would be > more > > > combinations of states in which something could fail. > > > > > > So I think one possible solution would be the following: > > > > > > 1. Keep each posix lock as an independent object in locks xlator. > > This > > > will make it possible to "invalidate" any already granted lock > > without > > > affecting already established locks. > > > > > > 2. Additionally, we'll keep a sorted list of non-overlapping > > segments of > > > locked regions. And we'll count, for each region, how many locks > are > > > referencing it. One lock can reference multiple segments, and each > > > segment can be referenced by multiple locks. > > > > > > 3. An additional lock request that overlaps with an existing > > segment, > > > can cause this segment to be split to satisfy the non-overlapping > > property. > > > > > > 4. When an unlock request is received, all segments intersecting > > with > > > the region are eliminated (it may require some segment splits on > the > > > edges), and the unlocked region is subtracted from each lock > > associated > > > to the segment. If a lock gets an empty region, it's removed. > > > > > > 5. We'll create a special "remove lock" request that doesn't > > unlock a > > > region but removes an already granted lock. This will decrease the > > > number of references to each of the segments this lock was > > covering. If > > > some segment reaches 0, it's removed. Otherwise it remains there. > > This > > > special request will only be used internally to cancel already > > acquired > > > locks that cannot be fully granted due to quorum issues or any > other > > > problem. > > > > > > In some weird cases, the list of segments can be huge (many locks > > > overlapping only on a single byte, so each segment represents > > only one > > > byte). We can try to find some smarter structure that minimizes > this > > > problem or limit the number of segments (for example returning > > ENOLCK > > > when there are too many). > > > > > > What do you think ? > > > > > > Xavi > > > > > > _______________________________________________ > > > Gluster-devel mailing list > > > Gluster-devel@gluster.org <mailto:Gluster-devel@gluster.org> > > > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > > > > > > > > -- > > Pranith >
_______________________________________________ Gluster-devel mailing list Gluster-devel@gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel