On Wed, Mar 15, 2017 at 11:31 PM, Soumya Koduri <skod...@redhat.com> wrote:
> Hi Rafi,
>
> I haven't thoroughly gone through the design, but I have a few
> comments/queries which I have posted inline for now.
>
> On 02/28/2017 01:11 PM, Mohammed Rafi K C wrote:
>
>> Thanks for the reply, comments are inline.
>>
>> On 02/28/2017 12:50 PM, Niels de Vos wrote:
>>
>>> On Tue, Feb 28, 2017 at 11:21:55AM +0530, Mohammed Rafi K C wrote:
>>>
>>>> Hi All,
>>>>
>>>> We discussed the problem $subject in the mail thread [1]. Based on
>>>> the comments and suggestions I will summarize the design (made as
>>>> points for simplicity).
>>>>
>>>> 1) As part of each fop, the top layer will generate a timestamp and
>>>> pass it down along with the other params.
>>>>
>>>>     1.1) This will bring a dependency on NTP-synced clients along
>>>>     with the servers.
>>>
>>> What do you mean with "top layer"? Is this on the Gluster client, or
>>> does the time get inserted on the bricks?
>>
>> It is the top layer (master xlator) in the client graph, like fuse,
>> gfapi or nfs. My mistake, I should have mentioned that; sorry.
>
> These clients shouldn't include internal client processes like the
> rebalance and self-heal daemons, right? IIUC from [1], we should avoid
> changing times during rebalance and self-heals.
>
> Also, what about fops generated from the underlying layers
> (getxattr/setxattr) which may modify these time attributes?
>
>>> I think we should not require a hard dependency on NTP, but have it
>>> strongly suggested. Having a synced time in a clustered environment
>>> is always helpful for reading and matching logs.
>>
>> Agreed, but if we go with option 1 where we generate the time from
>> the client, then the times will not be in sync unless NTP is used.
>>
>>>> 1.2) There can be a difference in time if the fop is stuck in an
>>>> xlator for various reasons, for example because of locks.
>>>
>>> Or just slow networks? Blocking (mandatory?) locks should be handled
>>> correctly. The time a FOP is blocked can be long.
>> True, the question is whether this can be included in the timestamp
>> value, because if it is generated from, say, fuse, then by the time
>> it reaches the brick the clock may have moved ahead. What do you
>> think about it?
>>
>>>> 2) On the server, the posix layer stores the value in memory
>>>> (inode ctx) and will sync the data periodically to disk as an
>>>> extended attr.
>
> Will you use a timer thread for the asynchronous update?
>
>>>> 2.1) Of course a sync call will also force it. And if a fop comes
>>>> for an inode which is not linked, we do the sync immediately.
>>>
>>> Does it need to be in the posix layer?
>>
>> You mean storing the time attr? Then it need not be; protocol/server
>> is another candidate, but I feel posix is ahead in the race ;)
>
> I agree with Shyam and Niels that the posix layer doesn't seem right.
> Since this support comes with a performance cost, how about a
> separate xlator (which shall be optional)?
>
>>>> 3) Each time inodes are created or initialized, we read the data
>>>> from disk and store it.
>>>>
>>>> 4) Before setting it in the inode_ctx we compare the stored
>>>> timestamp and the received timestamp, and only store if the stored
>>>> value is less than the received value.
>
> If we choose not to set this attribute for self-heal/rebalance
> daemons (as stated above), we would need special handling for the
> requests sent by them (i.e. to heal this time attribute as well on
> the destination file/dir).
>
>>>> 5) So in the best case the data will be stored in and retrieved
>>>> from memory. We replace the values in the iatt with the values in
>>>> the inode_ctx.
>>>>
>>>> 6) File ops that change the parent directory attr times need to be
>>>> consistent across the distributed directories on all subvolumes.
>>>> (For example, a create call will change the ctime and mtime of the
>>>> parent dir.)
>>>>
>>>> 6.1) This has to be handled separately because we only send the
>>>> fop to the hashed subvolume.
>>>> 6.2) We can asynchronously send the time-update setattr fop to
>>>> the other subvolumes and change the values for the parent
>>>> directory if the file fop is successful on the hashed subvolume.
>
> The same needs to be handled even during DHT directory healing,
> right?
>
>>>> 6.3) This will have a window where the times are inconsistent
>>>> across DHT subvolumes. (Please provide your suggestions.)
>>>
>>> Isn't this the same problem for 'normal' AFR volumes? I guess
>>> self-heal needs to know how to pick the right value for the
>>> [cm]time xattr.
>>
>> Yes, and it needs to be healed, by both self-heal and DHT. But until
>> then there can be a difference in values.
>
> Is this design targeting synchronizing only ctime/mtime? If 'atime'
> is also considered, then as the read/stat done by AFR modifies atime
> only on the first subvol, even the AFR xlator needs to take care of
> updating the other subvols. The same goes for EC as well.

atime is updated on open, which is sent to all subvols in AFR/EC.

> Thanks,
> Soumya

--
Pranith
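The compare-before-store rule from point 4 (only accept a timestamp newer than the cached one) also answers the concern about fops delayed by locks or slow networks: a stale stamp simply loses the comparison. A rough sketch, again with hypothetical names rather than actual Gluster structures:

```c
/* Illustrative stand-in for the timestamp cached in the inode ctx;
 * names are hypothetical, not actual Gluster structures. */
struct mdata_ts {
    long long sec;
    long long nsec;
};

/* Point 4: only move the stored time forward. A fop that was delayed
 * in transit (locks, slow network) carries an older stamp and is
 * ignored, so times never jump backwards regardless of arrival
 * order. */
void mdata_ts_update(struct mdata_ts *stored, const struct mdata_ts *incoming)
{
    if (incoming->sec > stored->sec ||
        (incoming->sec == stored->sec && incoming->nsec > stored->nsec))
        *stored = *incoming;
}
```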
_______________________________________________ Gluster-devel mailing list Gluster-devel@gluster.org http://lists.gluster.org/mailman/listinfo/gluster-devel
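On the heal question raised in the thread (how self-heal should pick the right [cm]time value across AFR/DHT subvolumes), one possible consequence of a forward-only update rule is that the consistent choice reduces to the newest value seen on any subvolume. A hypothetical sketch, not taken from any actual heal code:

```c
/* Illustrative stand-in for the per-subvolume timestamps that heal
 * would read back; names are hypothetical. */
struct mdata_ts {
    long long sec;
    long long nsec;
};

/* Picking the [cm]time value during AFR/DHT heal: if updates only
 * ever move time forward, the consistent choice is the newest value
 * observed on any subvolume. */
struct mdata_ts mdata_ts_pick(const struct mdata_ts *vals, int n)
{
    struct mdata_ts best = vals[0];
    int i;

    for (i = 1; i < n; i++) {
        if (vals[i].sec > best.sec ||
            (vals[i].sec == best.sec && vals[i].nsec > best.nsec))
            best = vals[i];
    }
    return best;
}
```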