On Thu, Mar 6, 2014 at 11:19 AM, Krishnan Parthasarathi <kpart...@redhat.com> wrote:
>
> ----- Original Message -----
> > On Thu, Mar 6, 2014 at 12:21 AM, Vijay Bellur <vbel...@redhat.com> wrote:
> >
> > > Adding gluster-devel.
> > >
> > > On 03/06/2014 01:15 PM, Krishnan Parthasarathi wrote:
> > >
> > >> All,
> > >>
> > >> In recent discussions around design (and implementation) of the barrier
> > >> feature, couple of things came to light.
> > >>
> > >> 1) changelog xlator needs barrier xlator to block unlink and rename FOPs
> > >> in the call path. This is apart from the current list of FOPs that
> > >> are blocked in their call back path.
> > >> This is to make sure that the changelog has a bounded queue of unlink
> > >> and rename FOPs, from the time barriering is enabled, to be drained,
> > >> committed to changelog file and published.
> > >>
> >
> > Why is this necessary?
>
> The only consumer of changelog today, georeplication, can't tolerate missing
> unlink/rename entries from changelog, even with the initial xsync based
> crawl, until changelog entries are available for the master volume.
> So, changelog xlator needs to ensure that the last rotated (publishable)
> changelog should have entries for all the unlink(s)/rename(s) that made
> it to the snapshot. For this, changelog needs barrier xlator to block
> unlink/rename FOPs in the call path too. Hope that helps.

This sounds like a very changelog specific requirement. This is best
addressed in the changelog translator itself. If unlink/rmdir/renames should
not be "in progress" during a snapshot, then we need to hold off new ops in
the call path, trigger a log rotation and the rotation should wait for
completion of ongoing fops anyways.

> > >> 2) It is possible in a pure distribute volume that the following sequence
> > >> of FOPs could result in snapshots of bricks disagreeing on inode type
> > >> for a file or directory.
> > >>
> > >> t1: snap b1
> > >> t2: unlink /a
> > >> t3: mkdir /a
> > >> t4: snap b2
> > >>
> > >> where, b1 and b2 are bricks of a pure distribute volume V.
> > >>
> > >> The above sequence can happen with the current barrier xlator design,
> > >> since we allow unlink FOPs to go through to the disk and only block
> > >> their acknowledgement to the application. This implies a concurrent
> > >> mkdir on the same name could succeed, since DHT doesn't serialize
> > >> unlink and mkdir FOPs, unlike AFR.
> > >>
> > >> Avati,
> > >>
> > >> I hear that you have a solution for problem 2). Could you please start
> > >> the discussion on this thread?
> > >> It would help us to decide how to go about with the barrier xlator
> > >> implementation.
> > >>
> >
> > The solution is really a long pending implementation of dentry
> > serialization in the resolver of protocol server. Today we allow multiple
> > FOPs to happen in parallel which modify the same dentry. This results in
> > hairy races (including non atomicity of rename) and has been kept open for
> > a while now. Implementing the dentry serialization in the resolver will
> > "solve" 2 as a side effect. Hence that is a better approach than making
> > changes in the barrier translator.
> >
>
> I am not sure I understood how this works from the brief introduction above.
> Could you explain a bit?

By dentry serialization, I mean we should have only one operation modifying a
<pargfid>/bname at a given time. This needs changes in the resolver of
protocol server and possibly some changes in the inode table. This is really
for solving rare races, and I think is something we need to work on
independent of the snapshot requirements.

Avati
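
For illustration only, here is a rough Python sketch of the "one operation modifying a <pargfid>/bname at a time" idea. It is not GlusterFS code and not a proposed implementation: the class DentrySerializer, the method modifying(), the demo() helper and the string "pargfid-root" are all hypothetical names made up for this example. It assumes a simple per-dentry lock keyed on (pargfid, bname) and shows only that a mkdir on a name has to wait while an unlink on the same name is still outstanding.

```python
import threading
from contextlib import contextmanager


class DentrySerializer:
    """Hypothetical helper: one modifying op per (pargfid, bname) at a time."""

    def __init__(self):
        self._guard = threading.Lock()   # protects the table below
        self._locks = {}                 # (pargfid, bname) -> [lock, refcount]

    @contextmanager
    def modifying(self, pargfid, bname):
        key = (pargfid, bname)
        with self._guard:
            entry = self._locks.setdefault(key, [threading.Lock(), 0])
            entry[1] += 1                # count holders and waiters
        entry[0].acquire()               # wait for any earlier op on this dentry
        try:
            yield
        finally:
            entry[0].release()
            with self._guard:
                entry[1] -= 1
                if entry[1] == 0:        # nobody else is interested, drop entry
                    del self._locks[key]


def demo():
    """Run an unlink and a mkdir on the same name and record the ordering."""
    ser = DentrySerializer()
    events = []
    unlink_started = threading.Event()
    release_unlink = threading.Event()

    def unlink():
        with ser.modifying("pargfid-root", "a"):
            events.append("unlink start")
            unlink_started.set()
            release_unlink.wait()        # stands in for an op still outstanding
            events.append("unlink done")

    def mkdir():
        unlink_started.wait()            # make sure unlink got there first
        with ser.modifying("pargfid-root", "a"):
            events.append("mkdir start")
            events.append("mkdir done")

    threads = [threading.Thread(target=unlink), threading.Thread(target=mkdir)]
    for t in threads:
        t.start()
    unlink_started.wait()
    release_unlink.set()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    print(demo())
```

In this sketch operations on different names never contend, since each (pargfid, bname) pair gets its own lock. Where such a lock would live in the real resolver and inode table is left open here, as it is in the discussion above.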
_______________________________________________
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel