On Tue, Mar 1, 2016 at 12:37 PM, Prasanna Kumar Kalever
<pkale...@redhat.com> wrote:
> Hello Gluster,
>
>
> Introducing a new file-based snapshot feature in Gluster, based on the
> reflink feature that will be available in XFS in a couple of months
> (downstream).
>
>
> What is a reflink?
>
> You have surely used softlinks and hardlinks every day!
>
> A reflink supports transparent copy-on-write, unlike soft/hard links, which
> makes it useful for snapshotting. A reflink points to the same data blocks
> used by the actual file (the blocks are shared between the real file and the
> reflink file, hence it is space efficient), but it has its own inode number,
> so it can carry different permissions for the same data blocks. Reflinks may
> look similar to hardlinks, but they are more space efficient and support all
> operations that can be performed on a regular file, unlike hardlinks, which
> are limited to unlink().
>
> Which filesystems support reflinks?
> I think Btrfs was the first to implement them; XFS is now working hard to
> make them available, and in the future we may see them in ext4 as well.
>
> You can get a feel for reflinks by following this tutorial:
> https://pkalever.wordpress.com/2016/01/22/xfs-reflinks-tutorial/
>
>
> POC in gluster: https://asciinema.org/a/be50ukifcwk8tqhvo0ndtdqdd?speed=2
>
>
> How are we doing it?
> Currently there is no dedicated system call that gives a handle on reflinks,
> so I decided to go with an ioctl() call using the XFS_IOC_CLONE command.
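>
> As a rough illustration of what the posix layer would issue underneath, here
> is a minimal sketch that makes "dst" a copy-on-write clone of "src". It
> assumes a header that defines FICLONE; the XFS_IOC_CLONE command mentioned
> above is the XFS-side name for the same operation, and the fallback define
> below is only an assumption about its request code.
>
>   /* clone.c: reflink "src" into "dst" via the clone ioctl. */
>   #include <fcntl.h>
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <sys/ioctl.h>
>   #include <unistd.h>
>   #include <linux/fs.h>                 /* FICLONE on recent kernels */
>
>   #ifndef FICLONE
>   #define FICLONE _IOW(0x94, 9, int)    /* assumed to match XFS_IOC_CLONE /
>                                            BTRFS_IOC_CLONE */
>   #endif
>
>   int main(int argc, char **argv)
>   {
>       if (argc != 3) {
>           fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
>           return EXIT_FAILURE;
>       }
>
>       int src = open(argv[1], O_RDONLY);
>       int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0400);
>       if (src < 0 || dst < 0) {
>           perror("open");
>           return EXIT_FAILURE;
>       }
>
>       /* Share src's data blocks with dst; blocks are copied only when
>        * either file is written to later (copy-on-write). */
>       if (ioctl(dst, FICLONE, src) < 0) {
>           perror("ioctl(FICLONE)");
>           return EXIT_FAILURE;
>       }
>
>       close(src);
>       close(dst);
>       return EXIT_SUCCESS;
>   }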
>
> In the POC I have used setxattr/getxattr to create/delete/list snapshots.
> The restore feature will use setxattr as well.
>
> We could have a dedicated fop, but since FUSE doesn't understand it, we will
> manage with a setxattr at the FUSE mount point; from the client side it
> travels as a fop down to the posix xlator and is then issued as an ioctl to
> the underlying filesystem. Planning to expose APIs for create, delete, list
> and restore.
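>
> For illustration, the client-side trigger at the mount point could look like
> the sketch below. The xattr key names ("trusted.glusterfs.fsnap.*") and the
> path are hypothetical placeholders; the actual interface is not finalized.
>
>   /* snap_trigger.c: create and list file snapshots through xattrs.
>    * The key names are placeholders, not the final interface. */
>   #include <stdio.h>
>   #include <string.h>
>   #include <sys/xattr.h>
>
>   int main(void)
>   {
>       const char *path = "/mnt/glustervol/vm1.img";   /* placeholder */
>
>       /* Create a snapshot named "snap1" (hypothetical key). */
>       if (setxattr(path, "trusted.glusterfs.fsnap.create",
>                    "snap1", strlen("snap1"), 0) < 0) {
>           perror("setxattr(create)");
>           return 1;
>       }
>
>       /* List existing snapshots (hypothetical key). */
>       char buf[4096];
>       ssize_t len = getxattr(path, "trusted.glusterfs.fsnap.list",
>                              buf, sizeof(buf) - 1);
>       if (len < 0) {
>           perror("getxattr(list)");
>           return 1;
>       }
>       buf[len] = '\0';
>       printf("snapshots: %s\n", buf);
>       return 0;
>   }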
>
> Are these snapshots internal or external?
> We will have a separate file each time we create a snapshot; the snapshot
> file will obviously have a different inode number and will be read-only. All
> these files are kept in a ".fsnap/" directory maintained under the parent
> directory where the snapshotted/actual file resides, so they will not be
> visible to the user (even with the ls -a option, just like USS).
>
> *** We can always restore to any snapshot available in the list, and the
> best part is that we can delete any snapshot between snapshot1 and snapshotN,
> because all of them are independent ***
>
> It is the application's duty to ensure the consistency of the file before it
> requests a snapshot; in the case of a VM image file, for example, it is the
> hypervisor that should freeze the I/O and then request the snapshot.
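>
> As a minimal sketch of that duty for an application that manages its own
> file: quiesce writes, flush, then request the snapshot. The snapshot-request
> xattr key is again a hypothetical placeholder.
>
>   /* Flush pending writes so the snapshot captures a consistent state. */
>   #include <fcntl.h>
>   #include <stdio.h>
>   #include <string.h>
>   #include <sys/xattr.h>
>   #include <unistd.h>
>
>   int snapshot_consistently(const char *path, const char *snapname)
>   {
>       int fd = open(path, O_RDWR);
>       if (fd < 0) {
>           perror("open");
>           return -1;
>       }
>
>       /* Application-level quiesce: stop issuing writes, then flush. */
>       if (fsync(fd) < 0) {
>           perror("fsync");
>           close(fd);
>           return -1;
>       }
>
>       /* Request the snapshot (hypothetical xattr key, as in the POC). */
>       if (fsetxattr(fd, "trusted.glusterfs.fsnap.create",
>                     snapname, strlen(snapname), 0) < 0) {
>           perror("fsetxattr(create)");
>           close(fd);
>           return -1;
>       }
>
>       close(fd);
>       return 0;
>   }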
>
>
>
> Integration with Gluster: (initial state, needs more investigation)
>
> Quota:
> Since the snapshot files reside in the ".fsnap/" directory maintained under
> the same directory where the actual file exists, they fall under the same
> user's quota :)
>
> DHT:
> As mentioned, the snapshot files will reside in the same directory as the
> actual file, possibly inside a ".fsnap/" directory.
>
> Re-balancing:
> The simplest solution could be to copy the actual file in full, then for the
> snapshot files rsync only the deltas and recreate the snapshot history by
> repeating the snapshot sequence after each snapshot-file rsync.
>
> AFR:
> Mostly the same as a write fop (inodelk's and quorum). There would be no way
> to recover or recreate a snapshot on a node (brick, to be precise) that was
> down while the snapshot was taken and comes back later.
>
> Disperse:
> Mostly, taking the inodelk and snapshotting the file on each of the bricks
> should work.
>
> Sharding:
> Assume we have a file split into 4 shards. If the fop to take a snapshot is
> sent to all the subvols holding the shards, that would be sufficient: each
> shard will then have a snapshot of its own state.
> The list-snapshots fop should be sent only to the main subvol where shard 0
> resides.
> Deleting a snap should be similar to creating one.
> Restore would be a little more difficult because the file's metadata needs
> to be updated in the shard xlator.
> <Needs more investigation>
> Also, in the case of sharding, the bricks have a gfid-based flat filesystem.
> Hence the snaps created will also live in the shard directory, so quota is
> not straightforward and needs additional work in this case.
>
>
> How can we make it better?
> Discussion page: http://pad.engineering.redhat.com/kclYd9TPjr

This link is not accessible externally. Could you move the contents to
a public location?

>
>
> Thanks to "Pranith Kumar Karampuri", "Raghavendra Talur", "Rajesh Joseph", 
> "Poornima Gurusiddaiah" and "Kotresh Hiremath Ravishankar"
> for all initial discussions.
>
>
> -Prasanna
>
>
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
