Hello Gluster,

Introducing a new file-based snapshot feature in Gluster, built on the reflink
feature that is expected to land in XFS in a couple of months (downstream).


What is a reflink?

You have surely used softlinks and hardlinks every day!

Reflinks support transparent copy-on-write, which makes them useful for
snapshotting, unlike soft/hard links. A reflink points to the same data blocks
as the actual file (the blocks are shared between the real file and the reflink
file, hence space efficient), but it uses a different inode number, so it can
have different permissions while accessing the same data blocks. Reflinks may
look similar to hardlinks, but they are more space efficient and can handle all
operations that can be performed on a regular file, unlike hardlinks, which are
limited to unlink().
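
If you want to see these semantics in code, here is a minimal sketch (not part
of the POC) that clones a file and compares inode numbers. It assumes a
reflink-capable kernel and filesystem; it uses the generic FICLONE ioctl,
which, as far as I know, shares its request number and semantics with the
XFS_IOC_CLONE command mentioned later in this mail.

/* reflink-demo.c: make a reflink of <src> at <dst> and compare inodes.
 * Minimal sketch, not part of the POC; assumes a reflink-capable kernel
 * and filesystem. The fallback define uses the same request number as
 * BTRFS_IOC_CLONE / XFS_IOC_CLONE. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/fs.h>

#ifndef FICLONE
#define FICLONE _IOW(0x94, 9, int)
#endif

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return EXIT_FAILURE;
    }

    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* dst now shares src's data blocks; a write to either file triggers
     * copy-on-write of only the touched blocks. */
    if (ioctl(dst, FICLONE, src) < 0) {
        perror("ioctl(FICLONE)");
        return EXIT_FAILURE;
    }

    struct stat s, d;
    fstat(src, &s);
    fstat(dst, &d);
    printf("src inode: %llu, reflink inode: %llu (data blocks shared)\n",
           (unsigned long long)s.st_ino, (unsigned long long)d.st_ino);

    close(src);
    close(dst);
    return EXIT_SUCCESS;
}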

Which filesystems support reflinks?
I think Btrfs was the first to implement them; now XFS is working hard to make
them available, and in the future we may see them in ext4 as well.

You can get a feel for reflinks by following this tutorial:
https://pkalever.wordpress.com/2016/01/22/xfs-reflinks-tutorial/


POC in Gluster: https://asciinema.org/a/be50ukifcwk8tqhvo0ndtdqdd?speed=2


How are we doing it?
Currently we don't have a dedicated system call that gives a handle to
reflinks, so I decided to go with an ioctl() call using the XFS_IOC_CLONE
command.

In the POC I have used setxattr/getxattr to create/delete/list snapshots. The
restore feature will use setxattr as well.

We can have a dedicated fop, but since FUSE doesn't understand it, we will
manage with a setxattr at the FUSE mount point; from the client side it will
travel as a fop down to the posix xlator and then as an ioctl to the underlying
filesystem. Planning to expose APIs for create, delete, list and restore.
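
To make that flow concrete, here is a minimal client-side sketch of how an
application could drive these operations over the FUSE mount with plain
setxattr/getxattr calls. The xattr key names ("glusterfs.fsnap.*") and value
formats are hypothetical placeholders I made up for illustration; the POC may
use different ones.

/* fsnap-client.c: hypothetical client-side use of the proposed interface.
 * The "glusterfs.fsnap.*" keys are placeholders, not the POC's real keys. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/xattr.h>

static int fsnap_create(const char *path, const char *snapname)
{
    /* The posix xlator would turn this into an ioctl(XFS_IOC_CLONE) on the
     * brick, placing the new read-only clone under ".fsnap/". */
    return setxattr(path, "glusterfs.fsnap.create", snapname,
                    strlen(snapname), 0);
}

static int fsnap_list(const char *path, char *buf, size_t len)
{
    /* Hypothetical key returning the list of snapshots of this file. */
    ssize_t ret = getxattr(path, "glusterfs.fsnap.list", buf, len - 1);
    if (ret < 0)
        return -1;
    buf[ret] = '\0';
    return 0;
}

static int fsnap_restore(const char *path, const char *snapname)
{
    /* Hypothetical key: restore the file's contents from a named snapshot. */
    return setxattr(path, "glusterfs.fsnap.restore", snapname,
                    strlen(snapname), 0);
}

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file-on-fuse-mount> <snapname>\n", argv[0]);
        return EXIT_FAILURE;
    }

    char list[4096];

    if (fsnap_create(argv[1], argv[2]) < 0)
        perror("fsnap create");

    if (fsnap_list(argv[1], list, sizeof(list)) == 0)
        printf("snapshots of %s: %s\n", argv[1], list);

    if (fsnap_restore(argv[1], argv[2]) < 0)
        perror("fsnap restore");

    return EXIT_SUCCESS;
}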

Are these snapshots internal or external?
We will have a separate file each time we create a snapshot; obviously the
snapshot file will have a different inode number and will be read-only. All
these files are kept in a ".fsnap/" directory under the parent directory where
the snapshotted/actual file resides, so they will not be visible to the user
(even with the ls -a option, just like USS).

*** We can always restore to any snapshot available in the list, and the best
part is that we can delete any snapshot between snapshot1 and snapshotN,
because all of them are independent ***
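
Just to illustrate the layout (the file and snapshot names here are
hypothetical; the exact naming scheme inside ".fsnap/" is still open):

vmstore/
|-- vm1.img                <- actual file, writable
`-- .fsnap/                <- hidden from listings, just like USS
    |-- vm1.img.snap1      <- read-only reflink, own inode, shared blocks
    |-- vm1.img.snap2
    `-- vm1.img.snapN      <- any of these can be deleted independently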

It is the application's duty to ensure the consistency of the file before it
tries to create a snapshot; for example, in case of a VM file snapshot it is
the hypervisor that should freeze the IO and then request the snapshot.



Integration with Gluster (initial state, needs more investigation):

Quota:
Since the snapshot files reside in the ".fsnap/" directory, which lives under
the same directory as the actual file, they fall under the same user's
quota :)

DHT:
As said, the snapshot files will reside in the same directory as the actual
file, possibly inside a ".fsnap/" directory.

Re-balancing:
The simplest solution could be to copy the actual file as a whole, then for
the snapshot files rsync only the deltas and recreate the snapshot history by
repeating the snapshot sequence after each snapshot-file rsync.

AFR:
Mostly this will be the same as a write fop (inodelk's and quorum). There may
be no way to recover or recreate a snapshot on a node (brick, to be precise)
that was down while the snapshot was taken and comes back later in time.
 
Disperse:
Mostly, taking the inodelk and snapshotting the file on each of the bricks
should work.
 
Sharding:
Assume we have a file split into 4 shards. If the fop to take a snapshot is
sent to all the subvols holding the shards, that should be sufficient; each
shard will then have a snapshot of its state.
The list-snapshots fop should be sent only to the main subvol where shard 0
resides.
Deleting a snapshot should be similar to creating one.
Restore would be a little more difficult because the metadata of the file
needs to be updated in the shard xlator.
<Needs more investigation>
Also, in case of sharding, the bricks have a gfid-based flat filesystem, so
the snapshots created will also end up in the shard directory; quota is
therefore not straightforward and needs additional work in this case.


How can we make it better?
Discussion page: http://pad.engineering.redhat.com/kclYd9TPjr


Thanks to "Pranith Kumar Karampuri", "Raghavendra Talur", "Rajesh Joseph", 
"Poornima Gurusiddaiah" and "Kotresh Hiremath Ravishankar"
for all initial discussions.


-Prasanna


_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
