Attached is a basic write-up of the user-serviceable snapshot feature design (Avati's). Please take a look and let us know if you have questions of any sort...

We have a basic implementation up now; reviews and the upstream commit should 
follow soon, over the next week.

Cheers,
Anand
User-serviceable Snapshots
==========================

Credits
=======
This brilliant design is Anand Avati's brainchild. The meta xlator is also to 
blame to some extent.


Terminology
===========

* gluster volume - a GlusterFS volume created by the "gluster volume create" 
command
* snapshot volume - a volume created by the "gluster snapshot create" command; 
it is based on the LVM2 thin-LV backend and is itself a thin-LV; a snapshot 
thin-LV is accessible as yet another GlusterFS volume in its own right

1. Introduction
===============

User-serviceable snapshots (USS) are a quick and easy way to access data 
stored in earlier snapshotted volumes. This feature builds on the core 
snapshot feature introduced in GlusterFS earlier. The key point here is that 
USS allows end users to access their older data without any admin 
intervention. To that extent this feature is about ease of use and ease of 
access to one's past data in snapshot volumes (which, today in the gluster 
world, are based on LVM2 thin-LVs as the backend).

This is not a replacement for bulk data access from an earlier snapshot 
volume; for that, the recommendation remains to mount the snapshot volume as 
a regular GlusterFS volume and access it via the native FUSE client. Rather, 
USS is targeted at typical home-directory scenarios where individual users 
can, at arbitrary points in time, access files and directories in their own 
home directories without admin intervention of any sort. The home directory 
use case is only an example; there are several other use cases, including 
other kinds of applications, that could benefit from this feature.

2. Use-case
===========

Consider a user John with Unix id john and $HOME of /home/john. Suppose John 
wants to access a file /home/john/Important/file_john.txt which existed in 
his home directory in November 2013 but was deleted in December 2013. Prior 
to the introduction of the user-serviceable snapshot feature, John's only 
option was to send a note to the admin asking that the gluster snapshot 
volume from Nov 2013 be made available (activated and mounted). The admin 
would then notify John of the availability of the snapshot volume, after 
which John could traverse his older home directory and copy over the file.

With USS, the need for admin intervention goes away. John is now free to 
execute the following steps and access the desired file whenever he needs to:

$pwd
/home/john

$ls
dir1/   dir2/   dir3/   file1   file2   Important/

$cd Important/

$ls

(No files present - this being his current view)

$cd .snaps

$pwd
/home/john/Important/.snaps

$ls
snapshot_jan2014/       snapshot_dec2013/       snapshot_nov2013/       
snapshot_oct2013/       snapshot_sep2013/

$cd snapshot_nov2013/

$ls
file_john.txt   file_john_1.txt

$cp -p file_john.txt $HOME

As the above steps indicate, it is fairly easy to recover lost files or even 
older versions of files or directories using USS.


3. Design
=========

A new server-side xlator (snapview-server) and a client-side xlator 
(snapview-client) are introduced. On the client side, the snapview-client 
xlator sits above the DHT xlator in the graph and redirects fops either to 
the DHT xlator or to the protocol-client xlator (both of which are children 
of snapview-client). On the server side, the protocol-server xlator and the 
snapview-server xlator form a graph hosted inside a separate daemon, snapd 
(a glusterfsd process). One such daemon process is spawned for each gluster 
volume.

We rely on the fact that a file's gfid is unique and remains the same across 
all snapshotted volumes. Given a snapshot volume, we can therefore access a 
file using its gfid without knowing the filename. We accomplish this by 
taking the existing data filesystem namespace and overlaying a virtual gfid 
namespace on top.
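
To make the gfid-based access concrete, here is a minimal, illustrative gfapi 
sketch (written as a standalone gfapi client, not the actual snapview-server 
code): given an initialized glfs_t for a snapshot volume and a 16-byte gfid, 
the handle-based API resolves the object without any pathname. The volume 
name "snapvol", the volfile server and the include paths are assumptions made 
only for the example.

/* Illustrative sketch only; "snapvol", the volfile server and the include
 * paths are assumptions, not part of the design. Build against libgfapi. */
#include <stdio.h>
#include <sys/stat.h>
#include <glusterfs/api/glfs.h>
#include <glusterfs/api/glfs-handles.h>

int
main (void)
{
        unsigned char       gfid[16] = {0};  /* fill in a real gfid here */
        struct stat         st;
        struct glfs_object *obj = NULL;

        struct glfs *fs = glfs_new ("snapvol");
        if (!fs)
                return 1;
        glfs_set_volfile_server (fs, "tcp", "localhost", 24007);
        if (glfs_init (fs) != 0)
                return 1;

        /* No pathname needed: the gfid alone identifies the file, and the
         * same gfid is valid in every snapshot of the volume. */
        obj = glfs_h_create_from_handle (fs, gfid, sizeof (gfid), &st);
        if (obj) {
                printf ("resolved by gfid, size=%lld\n",
                        (long long) st.st_size);
                glfs_h_close (obj);
        }

        glfs_fini (fs);
        return 0;
}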

All files and directories remain accessible as they are (in the current state 
of the gluster volume). But in every directory we create a "virtual 
directory" called ".snaps" on the fly. This ".snaps" directory provides a 
list of all the available snapshots for the given volume and acts as a 
wormhole into all the available snapshots of that volume, i.e. into the past.

When the .snaps directory is looked up, the client xlator's instrumented 
lookup() detects that it is a reference to the virtual directory. It 
redirects the request to the snapd daemon, and in turn to the snapview-server 
xlator, which generates a random gfid, fills in a pseudo stat structure with 
the necessary info and returns via STACK_UNWIND. Information about the 
directory is maintained in the server xlator's inode context, where inodes 
are classified as VIRTUAL, REAL or the special "DOT_SNAPS_INODE", so that 
this info can be used in subsequent lookups. On the client xlator side too, 
this virtual-type info is maintained in the inode_ctx.
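
As a rough illustration of the client-side decision described above, the 
following sketch (plain C with simplified, made-up names; the real 
snapview-client winds the fop to one of its two children via STACK_WIND 
rather than returning a value) shows how a lookup could be routed based on 
the basename being looked up and the type recorded for the parent inode:

/* Schematic routing sketch; the enum mirrors the VIRTUAL/REAL/
 * DOT_SNAPS_INODE classification kept in the inode_ctx. */
#include <stdio.h>
#include <string.h>

typedef enum {
        INODE_TYPE_REAL,        /* normal namespace, served by DHT        */
        INODE_TYPE_VIRTUAL,     /* inside a snapshot, served via snapd    */
        INODE_TYPE_DOT_SNAPS    /* the virtual ".snaps" directory itself  */
} inode_type_t;

typedef enum { ROUTE_TO_DHT, ROUTE_TO_SNAPD } route_t;

/* Decide which child xlator a lookup should go to, based on the basename
 * being looked up and the type recorded for the parent inode. */
static route_t
route_lookup (const char *basename, inode_type_t parent_type)
{
        if (strcmp (basename, ".snaps") == 0)
                return ROUTE_TO_SNAPD;  /* entry point into the snapshots */

        if (parent_type == INODE_TYPE_VIRTUAL ||
            parent_type == INODE_TYPE_DOT_SNAPS)
                return ROUTE_TO_SNAPD;  /* already inside the .snaps world */

        return ROUTE_TO_DHT;            /* regular volume namespace */
}

int
main (void)
{
        printf ("%d %d\n",
                route_lookup (".snaps", INODE_TYPE_REAL),
                route_lookup ("file1", INODE_TYPE_REAL));
        return 0;
}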

The user would typically do an "ls", which results in an opendir and a 
readdirp() on the inode returned. The server xlator queries the list of 
snapshots present in the system and presents each one as a dirent entry in 
the directory. We also need to encode enough info in each of the 
corresponding inodes so that when a subsequent call arrives on one of them, 
we can figure out where that inode fits in the big picture: whether it 
belongs to a snapshot volume, which one, and so on. Once a user does an ls 
inside one of the specific snapshot directories, we have to determine the 
gfid of the original directory, pick the graph corresponding to that snapshot 
directory (hourly.0 etc.) and perform the access on that graph using that 
gfid. The inode information on the server xlator side is mapped to the gfapi 
world via the handle-based libgfapi APIs, which were introduced for the 
nfs-ganesha integration. These handle-based APIs allow a gfapi operation to 
be performed on a "gfid" handle, i.e. a glfs-object that encodes the gfid and 
the inode returned from the gfapi world. In this case, once the server xlator 
allocates an inode, we need to track it and map it to the corresponding 
glfs-object in the handle-based gfapi world, so that any glfs_h_XXX operation 
can be performed on it.
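
To illustrate that tracking, here is a schematic of the per-inode state such 
a server-side xlator could keep. The struct and field names are mine, not 
taken from the implementation; the real xlator stores equivalent state via 
the inode_ctx mechanism and gfapi glfs-objects.

/* Schematic per-inode context for snapview-server (illustrative names). */
#include <glusterfs/api/glfs.h>          /* include path may vary */
#include <glusterfs/api/glfs-handles.h>

struct snap_inode_ctx {
        int                  type;          /* REAL / VIRTUAL / DOT_SNAPS    */
        char                 snapname[256]; /* which snapshot graph to use   */
        unsigned char        gfid[16];      /* same gfid across snapshots    */
        struct glfs         *fs;            /* gfapi graph for that snapshot */
        struct glfs_object  *object;        /* handle for glfs_h_* calls     */
};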

For example, on the server xlator side, the stat fop implementation would 
typically need to check the type of inode stored in the inode_ctx. If it is 
the ".snaps" inode, the iatt structure is filled in directly. If the call is 
on a virtual inode, we obtain the glfs_t and glfs_object info from the 
inode_ctx (where it was stored earlier). The desired stat is then easily 
obtained using the glfs_h_stat (fs, object, &stat) call.
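
A minimal sketch of that dispatch, assuming a simplified variant of the 
per-inode context sketched earlier (the real xlator fills a GlusterFS iatt 
and unwinds the result back to the client; error handling is omitted):

/* Sketch of the stat dispatch described above; illustrative names only. */
#include <string.h>
#include <sys/stat.h>
#include <glusterfs/api/glfs.h>          /* include path may vary */
#include <glusterfs/api/glfs-handles.h>

enum snap_inode_type { SNAP_REAL, SNAP_VIRTUAL, SNAP_DOT_SNAPS };

struct snap_stat_ctx {
        enum snap_inode_type  type;
        struct glfs          *fs;      /* stored at lookup/readdirp time */
        struct glfs_object   *object;
};

static int
snap_stat (struct snap_stat_ctx *ctx, struct stat *stbuf)
{
        if (ctx->type == SNAP_DOT_SNAPS) {
                /* ".snaps" is purely virtual: synthesize directory
                 * attributes instead of asking any backend. */
                memset (stbuf, 0, sizeof (*stbuf));
                stbuf->st_mode = S_IFDIR | 0555;
                return 0;
        }

        /* Virtual inode: the gfapi graph and handle were stashed in the
         * inode context earlier; ask the snapshot volume via gfapi. */
        return glfs_h_stat (ctx->fs, ctx->object, stbuf);
}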

Considerations
==============
- A global option will be available to turn off USS globally. A volume-level 
option will also be made available to enable USS per volume, since there 
could be volumes for which USS access is not desirable at all.

- Disabling this feature removes the client-side graph generation, while 
snapds can continue to exist on the server side; they will never be accessed 
without the client-side enablement. And since every access to the gfapi 
graphs etc. is dynamic, done on the fly and cleaned up afterwards, the 
expectation is that such a snapd left behind would not hog resources at all.

- Today we are allowing the listing of all available snapshots in each ".snaps" 
directory. We plan to introduce a configurable option to limit the number of 
snapshots visible under the USS feature.

- There is no impact on existing fops from this feature. When enabled, it 
adds just an extra check in the client-side xlator to decide whether the fop 
should be redirected to the server-side xlator.

- With a large number of snapshot volumes made available or visible, one 
glfs_t * hangs off snapd for each gfapi call-graph. On top of that, if a 
large number of users start simultaneously accessing files on each of the 
snapshot volumes (the max number of supported snapshots is 256 today), the 
RSS of snapd could grow high. We are trying to get numbers for this before we 
can say for sure whether this is an issue at all (say, with the OOM killer).

- The list of snapshots is refreshed each time a new snapshot is taken or 
added to the system. The snapd queries glusterd for the new list of snapshots 
and refreshes its in-memory list, appropriately cleaning up the glfs_t graphs 
of deleted snapshots and releasing any associated glfs_objects (a rough 
sketch of such a cleanup follows after this list).

- Again, this is not a performance-oriented feature. Rather, the goal is to 
allow a seamless user experience by providing easy and useful access to 
snapshotted volumes and to the individual data stored in those volumes.
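
As mentioned in the refresh item above, here is a rough sketch of that 
cleanup (names and structure are mine, not the actual snapd code): dropping a 
snapshot that glusterd no longer lists amounts to releasing any cached 
handles and tearing down the corresponding gfapi graph.

/* Rough sketch of dropping one deleted snapshot: close any cached
 * glfs_objects for it, then tear down its whole gfapi graph. */
#include <glusterfs/api/glfs.h>          /* include path may vary */
#include <glusterfs/api/glfs-handles.h>

static void
drop_snapshot_graph (struct glfs *fs,
                     struct glfs_object **cached, int n_cached)
{
        int i;

        for (i = 0; i < n_cached; i++)
                if (cached[i])
                        glfs_h_close (cached[i]);

        glfs_fini (fs);
}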




