Hello Gluster,

Introducing a new file-based snapshot feature in Gluster. It is based on the reflink feature, which is expected to be out from XFS in a couple of months (downstream).

What is a reflink?

You have surely used softlinks and hardlinks every day! Unlike soft/hardlinks, a reflink supports transparent copy-on-write, which is what makes it useful for snapshotting. Basically, a reflink points to the same data blocks that are used by the actual file (the blocks are common to the real file and the reflink file, hence space efficient). The two files have different inode numbers, so they can have different permissions for accessing the same data blocks. Reflinks may look similar to hardlinks, but they are more space efficient and can handle all operations that can be performed on a regular file, unlike hardlinks, which are limited to unlink().

Which filesystems support reflinks?

I think Btrfs was the first to offer them. XFS is now working hard to make them available, and in the future we may see them in ext4 as well.

You can get a feel for reflinks by following this tutorial:
https://pkalever.wordpress.com/2016/01/22/xfs-reflinks-tutorial/

POC in Gluster: https://asciinema.org/a/be50ukifcwk8tqhvo0ndtdqdd?speed=2

How are we doing it?

Currently there is no specific system call that gives a handle to reflinks, so I decided to go with an ioctl call using the XFS_IOC_CLONE command.

In the POC I have used setxattr/getxattr to create/delete/list snapshots. The restore feature will use setxattr as well.

We can have a fop, although FUSE doesn't understand it. We will manage with a setxattr at the FUSE mount point; from the client side it will again be a fop down to the posix xlator, and from there an ioctl to the underlying filesystem. I am planning to expose APIs for create, delete, list and restore.

Are these snapshots internal or external?

We will have a separate file each time we create a snapshot. The snapshot file will obviously have a different inode number and will be read-only. All these files are kept in the ".fsnap/" directory, which is maintained by the parent directory where the snapshotted/actual file resides; therefore they will not be visible to the user (even with the ls -a option, just like USS).

*** We can always restore to any snapshot available in the list, and the best part is that we can delete any snapshot between snapshot1 and snapshotN, because all of them are independent. ***

It is the application's duty to ensure the consistency of the file before it tries to create a snapshot. For example, in the case of a VM file snapshot, it is the hypervisor that should freeze the I/O and then request the snapshot.

Integration with Gluster: (initial state, needs more investigation)

Quota:
Since the snapshot files reside in the ".fsnap/" directory, which is maintained by the same directory where the actual file exists, they fall under the same user's quota :)

DHT:
As said, the snapshot files will reside in the same directory where the actual file resides, possibly in a ".fsnap/" directory.

Re-balancing:
The simplest solution could be to copy the actual file as a whole copy, then rsync only the deltas for the snapshot files, and recreate the snapshot history by repeating the snapshot sequence after each snapshot-file rsync.

AFR:
Mostly the same as the write fop (inodelks and quorum). There may be no way to recover or recreate a snapshot on a node (brick, to be precise) that was down while the snapshot was taken and comes back later.

Disperse:
Mostly, taking the inodelk and snapshotting the file on each of the bricks should work.

Sharding:
Assume we have a file split into 4 shards. If the fop to take a snapshot is sent to all the subvols holding the shards, that would be sufficient. Each shard will then have a snapshot of its own state. The fop to list snaps should be sent only to the main subvol where shard 0 resides.
Delete of a snap should be similar to create. Restore would be a little more difficult, because the metadata of the file needs to be updated in the shard xlator. <Needs more investigation>
Also, in the case of sharding, the bricks have a gfid-based flat filesystem. Hence the snaps created will also be in the shard directory, so quota is not straightforward and needs additional work in this case.

How can we make it better?

Discussion page: http://pad.engineering.redhat.com/kclYd9TPjr

Thanks to "Pranith Kumar Karampuri", "Raghavendra Talur", "Rajesh Joseph", "Poornima Gurusiddaiah" and "Kotresh Hiremath Ravishankar" for all the initial discussions.

-Prasanna
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
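
P.S. For anyone who wants to try the clone ioctl mentioned earlier in this mail outside of Gluster, here is a minimal sketch in Python. It is not the POC code. It assumes a Linux system and uses the generic FICLONE request number, which to my knowledge has the same value as XFS_IOC_CLONE and BTRFS_IOC_CLONE. The helper name clone_or_copy is made up for illustration, and the fallback to a full copy is only there so the script still runs on filesystems without reflink support.

```python
import fcntl
import os
import shutil
import tempfile

# Assumption: FICLONE is _IOW(0x94, 9, int) on Linux, the same request
# number that XFS_IOC_CLONE / BTRFS_IOC_CLONE use.
FICLONE = 0x40049409


def clone_or_copy(src_path, dst_path):
    """Hypothetical helper: reflink src into dst, else fall back to a copy.

    Returns "reflink" if the clone ioctl succeeded, "copy" otherwise.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # ioctl(dest_fd, FICLONE, src_fd): dst now shares src's blocks.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "reflink"
        except OSError:
            # Filesystem (or platform) does not support reflinks.
            pass
    shutil.copyfile(src_path, dst_path)
    return "copy"


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    actual = os.path.join(workdir, "vm.img")
    snap = os.path.join(workdir, "vm.img.snap1")

    with open(actual, "wb") as f:
        f.write(b"state at snapshot time")

    how = clone_or_copy(actual, snap)

    # Later writes to the actual file must not show up in the snapshot.
    with open(actual, "ab") as f:
        f.write(b" plus later writes")

    with open(snap, "rb") as f:
        print(how, f.read())
```

On a reflink-capable filesystem the snapshot file shares blocks with the actual file until one of them is written to; on anything else the script silently degrades to a plain copy, which is of course not space efficient.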