commit e5ae87a1e6e0350a6d643835766ae6ff13e59d39
Author: Yiannis Pericleous <[EMAIL PROTECTED]>
Date: Wed Mar 28 16:55:22 2007 -0400
documentation: updated usage.txt and added odf.txt
diff --git a/Documentation/filesystems/unionfs/odf.txt b/Documentation/filesystems/unionfs/odf.txt
new file mode 100644
index 0000000..eaa920a
--- /dev/null
+++ b/Documentation/filesystems/unionfs/odf.txt
@@ -0,0 +1,544 @@
+ ODF: On Disk Format for Unionfs 2.x
+
+ March 24, 2007
+
+ http://unionfs.filesystems.org/
+
+
+Note: instructions on how to use Unionfs's ODF features (mount options) are
+found in Documentation/filesystems/unionfs/usage.txt, in the same directory
+as this file.
+
+
+* INTRODUCTION
+
+Stackable file systems, and especially Unionfs, often need to keep
+additional information persistently, so as to provide a file system that is
+fully compliant with POSIX and that interoperates with other file systems
+(especially network file systems).
+
+In addition, stackable file systems are "virtual" in the sense that they
+operate on top of one or more other file systems. This means that a lot of
+state often has to be produced and kept in memory. Such state can sometimes
+be discarded, but it costs many CPU cycles to reproduce. Therefore, it can
+be beneficial to cache such data in another place, so that it can be
+retrieved easily.
+
+Both of the above problems can be solved if Unionfs could store a certain
+amount of data persistently. This is accomplished using the On Disk Format
+(ODF) support in Unionfs. Briefly stated, Unionfs uses another small file
+system partition or loop device, and another file system (e.g., Ext2), to
+store persistent data such as whiteout information, cached directories, and
+more. This, as you will see, solves *many* of the problems that stackable
+file systems such as Unionfs have to deal with.
+
+
+* ODF SOLVES IMPORTANT ISSUES
+
+Some of the problems that ODF solves are the following:
+
+1. Whiteouts and opaques pollute the namespace.
+
+Currently Unionfs (pre-ODF) uses a file such as ".wh.F" to indicate that the
+file "F" should be whited out; Unionfs also uses a file named
+".wh.__dir_opaque" in a directory D to indicate that lookups should not
+continue past the branch where D is. Both of these special names pollute
+the namespace, and they prevent anyone from actually creating a file whose
+name starts with ".wh.". Also, writing those special files over various
+file systems, especially network file systems, introduces various races.
+
+Instead, the new ODF code stores whiteouts and opaque directories as inode
+bits in an Ext2 file system on the local machine. This avoids the namespace
+pollution and avoids races with network file systems.
+
+2. Inode numbers must be persistent indefinitely.
+
+Network file systems (e.g., NFS), as well as user tools such as /bin/tar
+and /bin/find, require that the inode numbers of files be persistent.
+Unionfs,
+however, creates dynamic inode numbers on the fly because it cannot inherit
+the actual inode numbers from the lower branches (the latter can be any
+number and there's no unique way to combine them). To keep inode numbers
+persistent, our old Unionfs 1.x supported a special option called 'imap'
+which was a simple file that recorded the persistent inode numbers per
+branch. Alas, the imap code was racy, polluted the namespace, didn't handle
+branch management, and added significant performance overheads.
+
+Instead, the new ODF code allocates inodes on a real but small Ext2 file
+system and uses those inode numbers as the persistent inode numbers for
+Unionfs to use. By using Ext2, we leverage a small and very efficient file
+system, we get persistence easily, concurrent accesses are permitted
+(further improving performance), and Unionfs can be exported across network
+file systems reliably.
+
+3. Layering multiple unions.
+
+Stacking Unionfs on top of Unionfs was not possible before. The reason is
+mainly the namespace pollution: a whiteout file could belong to any logical
+union, and it's impossible to tell which one.
+
+With the new ODF code, each union has its own ODF state, which separates
+out each union's whiteouts and opaque directories. With ODF, stacking
+unions in any arbitrary manner is simple.
+
+4. Resuming unions with complex branch configurations.
+
+Sometimes, users build complex unions with many branches, and then they
+reconfigure their unions using branch-management commands (adding, deleting,
+and modifying branches and their states). The end result is a union that
+may be hard to reproduce, or simply forgotten: it may require a long command
+line to build the same union again, so users often have to record the final
+branch configuration in some other location (a shell script, for example).
+
+With the new ODF code, each union records the branch configuration precisely
+and persistently. If you unmount a union, you can resume it easily by just
+pointing Unionfs to the correct ODF location; Unionfs will then resume the
+exact branch configuration that was there before.
+
+5. Directory reading
+
+When a user lists a directory, Unionfs has to build the contents of that
+directory by merging its contents from all of the corresponding lower
+branches. Then Unionfs has to eliminate all duplicates,
+remove any file for which there is a whiteout, and handle opaque
+directories. This process consumes memory and CPU cycles, and has to be
+repeated each time Unionfs has to re-read the same directory. Worse, if the
+merged directories are very large, the kernel could run out of core memory.
+So one solution was to build the unified directory contents incrementally,
+piece by piece, conserving memory. However, incremental building of a union
+directory's contents prevents lseek()-ing into a directory that's not
+completely constructed in memory (NFS/SMB use directory lseek-ing
+extensively).
+
+With the new ODF code, Unionfs constructs each unified directory's contents
+incrementally in memory, but stores them persistently in the ODF location.
+This has the benefit of supporting lseek() perfectly, and avoiding large
+memory consumption. Essentially ODF becomes a cache of the directory
+contents, and since directories are read more often than they are written,
+this cache also improves performance.
+
+6. Hardlinks
+
+Supporting hard links in any file system is a challenge. This is primarily
+because FFS-like file systems and Unix OSs don't provide an efficient
+mechanism to find all of the names and precise locations of a hard-linked
+file from just one inode number or one name. It is even more of a challenge
+in Unionfs because Unionfs supports arbitrary ways of creating branches,
+which could contain hard-linked files of which only some names are exposed
+within a given branch. The challenge during copy-up is how to recognize all
+of the hard-linked files and properly treat a copied-up hard-linked file as
+one file, not just a collection of arbitrary names.
+
+With the new ODF code, Unionfs incrementally builds the knowledge about
+hard-linked files across branches, as it finds them in the lower branches.
+Essentially, Unionfs tries to discover all of the names of a hard-linked
+file "lazily" so that Unionfs can properly treat a hard-linked file when it
+gets copied up.
+
+
+* DESIGN
+
+** ODF STRUCTURE
+
+ODF is just a file (or raw partition) formatted to our needs, not a
+full-blown file system developed just for Unionfs, and its namespace is
+*not* exposed to users. To make debugging easier, however, and to leverage
+existing tools (e.g., debugfs), we decided to use an existing simple file
+system, Ext2 (several other disk-based file systems could work just as
+well).
+
+The advantages of using an existing file system are:
+
+- we remain format compatible with an existing file system.
+
+- no need to hide the namespace for the ODF prototype (you can mount the
+ /odf partition and inspect it manually)
+
+- we can use existing tools: mkfs, dump/restore, debugfs, etc.
+
+- we can get ODF working quicker and into the hands of users.
+
+Of course, regardless of what we use for ODF, we still need to take care of
+locking of our objects to prevent unsafe concurrent accesses to /odf by
+multiple users.
+
+In the rest of this document, we assume that an ODF partition for a single
+union is mounted under /odf. In practice, there can be many such ODFs
+mounted on any paths the user desires. Note also that normally this mount
+is performed inside the kernel, and hence the /odf namespace is hidden from
+normal users' view.
+
+An /odf is a hierarchy of directories and files with specific purposes, as
+follows:
+
+1. /odf/sb is a plain file containing our superblock, which records branch
+ configuration, UUIDs, etc. Each branch has a unique UUID so it can be
+ identified consistently.
+
+2. /odf/ns/<PATH> is used for storing information about whiteouts, opaque
+   directories, and persistent inode numbers. We use the actual inum of the
+   created zero-length file as our persistent inum. The term 'ns' here
+   stands for Name-Space.
+
+3. /odf/ic/##/##/<INUM> is for caching readdir() results, where '##' is the
+   64-bit inum, broken up one byte at a time, for up to an 8-deep hierarchy.
+   (It might be more efficient for '##' to go MSB->LSB, to help inum
+   locality.) The term 'ic' stands for icache. (A sketch of how such a
+   path could be derived appears after this list.)
+
+4. We use /odf/ic also to store hardlink info in
+   /odf/ic/##/##/<INUM>.<UUID>, a file whose contents record a mapping of
+   inodes to hardlinked names; here <INUM> is the inode number of the source
+   hardlinked file, and <UUID> is the UUID of the source branch. (A sketch
+   of a possible record layout appears after this list.)
+
+   The first thing in the file is the UUID of the destination branch we
+   copied the hardlinked file up to, and the inum of the copied-up file we
+   created (this would allow us to hardlink to it). If we don't know these
+   yet, because a hardlinked file was just looked up one or more times
+   without resulting in a copyup, then we leave both entries zero. (In
+   practice, we don't really need to store the UUID itself, but just an
+   index to the branch UUID's location in /odf/sb.)
+
+   The rest of the contents of this file is a list of struct dirent's, each
+   of which records the name of a hardlink, but with the inum of the
+   *parent* directory (e.g., <name,dirinum>). We need to store the parent
+   inum so that the hard-linked file's name can be found by searching the
+   parent dir.
+
+   We do not store the number of links here, because that can change (a
+   user can delete one link while we're still trying to discover the rest
+   of the link names). So we always look it up from one of the known
+   hardlink names.
+
+5. /odf/reclaim/<INUM> is used for storing hierarchies that need to be
+ deleted. This is useful after an mkdir on top of an opaque dir, because
+ the entire opaque directory's hierarchy has to be deleted (recursively,
+ no less). On rmdir we may also use /odf/reclaim.
+
+ This allows us to speed up operations which need to delete items from the
+ ODF, because we just move those items to the reclaim area (which acts as
+ a "trash bin" of sorts). In addition, we have an asynchronous kernel
+ thread which wakes up every few seconds (configurable) and purges some
+   items in /odf/reclaim, to keep space usage in /odf low. Finally, if we
+   notice that space in /odf has become very short (especially if we get an
+   ENOSPC), then we immediately suspend file system activity in Unionfs,
+   and go clean up some space in /odf, until usage drops below a
+   (configurable) threshold.
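+
+To make item 3 above concrete, here is a minimal user-level C sketch (not
+the actual Unionfs code; the helper name and the 2-deep layout are
+illustrative assumptions) of how an /odf/ic path could be derived from a
+64-bit inum:
+
+    #include <stddef.h>
+    #include <stdint.h>
+    #include <stdio.h>
+
+    /* Build "/odf/ic/##/##/<INUM>" from a 64-bit inode number.
+     * Going MSB->LSB, as suggested above, may help inum locality. */
+    static void odf_ic_path(char *buf, size_t len, uint64_t inum)
+    {
+        unsigned b1 = (unsigned)((inum >> 56) & 0xff);
+        unsigned b2 = (unsigned)((inum >> 48) & 0xff);
+
+        snprintf(buf, len, "/odf/ic/%02x/%02x/%llu",
+                 b1, b2, (unsigned long long)inum);
+    }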
+
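+One possible layout for the /odf/ic/##/##/<INUM>.<UUID> hardlink file of
+item 4 above is sketched below; all field names and sizes here are
+hypothetical, not a committed format:
+
+    #include <stdint.h>
+
+    /* Header: where (if anywhere) the file has been copied up to.
+     * Both fields stay zero until a copyup actually happens. */
+    struct odf_hl_header {
+        uint8_t  dst_uuid[16];  /* UUID of destination branch */
+        uint64_t dst_inum;      /* inum of the copied-up file */
+    };
+
+    /* Followed by a list of known names of the hardlinked file, each
+     * recorded with the inum of its *parent* directory. */
+    struct odf_hl_name {
+        uint64_t dir_inum;      /* parent directory inum */
+        uint16_t name_len;      /* length of name[] in bytes */
+        char     name[];        /* the hardlink's name */
+    };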
+
+** The ODF superblock
+
+The ODF superblock (/odf/sb) stores the following:
+
+- the ODF format version, in case it changes in the future.
+
+- the UUID of each branch
+
+- the number of branches
+
+- the pathnames and options for each branch. This is so we can mount an
+  older union by its odf info, without specifying full mount options again.
+  (Optional: we can record portable information such as a file system's
+  LABEL.)
+
+- whether this odf is newly formatted or in use. (We can store a "new" bit
+  which is initialized by mkfs, or just count the number of inodes in use.)
+
+- extra info we might need to handle hardlinks (e.g., the original branch
+  pathname from the root of that branch's file system up to the
+  lower-branch dir name we stack on).
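+
+A minimal sketch of what such a superblock record might look like (all
+names and sizes are hypothetical, not the actual ODF format):
+
+    #include <stdint.h>
+
+    #define ODF_MAX_PATH 256    /* assumed limit, for illustration */
+
+    struct odf_branch {
+        uint8_t  uuid[16];            /* unique branch ID */
+        char     path[ODF_MAX_PATH];  /* branch pathname */
+        uint32_t flags;               /* branch mount options */
+    };
+
+    struct odf_sb {
+        uint32_t version;     /* ODF format version */
+        uint32_t new_flag;    /* set by mkfs, cleared on mount */
+        uint32_t nbranches;   /* number of branches */
+        struct odf_branch branches[]; /* one record per branch */
+    };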
+
+Actions:
+
+1. To prepare a new ODF, just run "mkfs" on the /odf partition, and set the
+ "new" bit.
+
+2. On mount, clear the odf "new" bit; get the number and list of branches,
+   and their UUIDs, and store them in /odf/sb.
+
+3. On remount, if we reconfigure the branches, then:
+
+(a) the "new" bit cannot be on (else BUG_ON)
+
+(b) compare the old and new UUID lists: if they are the same (a rare case),
+    then ignore.
+
+(c) if the UUID list differs, and we didn't use the "odfforce" flag, then
+    return EINVAL (or some other errno) because the odf cache may be stale.
+
+(d) otherwise, if you really want to use an existing /odf cache (with
+    possibly some stale info), then use options "remount,odfforce". (We may
+    want to not implement "odfforce" initially, and see exactly what needs
+    to be purged or kept upon odfforce. Besides, "odfforce" may only lead
+    to trouble, so why risk it?)
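+
+A sketch of the remount checks above, reusing the hypothetical struct
+odf_sb from the earlier sketch (helpers and return conventions are
+assumptions, not the actual code):
+
+    #include <errno.h>
+    #include <stdint.h>
+    #include <string.h>
+
+    /* Compare the stored branch UUID list against the one requested
+     * on remount, honoring the "odfforce" option. */
+    static int odf_check_remount(const struct odf_sb *sb,
+                                 const uint8_t (*want_uuid)[16],
+                                 uint32_t nwant, int odfforce)
+    {
+        uint32_t i;
+
+        if (sb->new_flag)
+            return -EINVAL;       /* "new" bit must be off */
+        if (sb->nbranches == nwant) {
+            for (i = 0; i < nwant; i++)
+                if (memcmp(sb->branches[i].uuid, want_uuid[i], 16))
+                    break;
+            if (i == nwant)
+                return 0;         /* same UUID list: ignore */
+        }
+        if (!odfforce)
+            return -EINVAL;       /* odf cache may be stale */
+        return 1;   /* caller purges possibly-stale entries */
+    }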
+
+
+** Whiteouts and Opaque Dir actions
+
+Whiteout (WH) semantics:
+
+- create a wh only if necessary, of course
+
+> lookup():
+
+- check for main file in each branch, from leftmost branch until the opaque
+ branch (so we have to find out the opaque limit)
+
+- upon a successful lookup of a dir, we mark in the unionfs dir inode the
+ values for start branch, end branch, and opaque branch.
+
+# XXX: do we return pointers to the first matching dentry below, or all?
+# XXX: does this affect rename()?
+
+1. if not found in any branch, return ENOENT
+ - optional (possibly deferred to cron):
+     - check if a wh is in /odf
+     - if so, it was left over from an old wh (someone may have removed the
+       main file)
+     - so clean up the wh in /odf
+2. if the file is found, then check /odf:
+   - if a wh is found in /odf, return ENOENT
+   - if no wh is in /odf, then return the dentry normally
+
+The above lookup procedure is further modified to handle hard links.
+
+# XXX: we may have to change the above to handle hardlinks (keep track of
+# which branch the wh came from)
+
+3. when looking up a directory, we need to set the link count to 2 if the
+ dir is empty, else to whatever it should be. This could be useful in
+ renaming a dir to an empty dir.
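+
+A high-level sketch of the lookup flow above (pseudo-C; the branch
+iterators and /odf helpers are assumptions, and the real code deals with
+dentries and locking):
+
+    #include <errno.h>
+
+    /* hypothetical helpers, assumed implemented elsewhere */
+    extern int branch_has(int bi, const char *name);
+    extern int odf_has_whiteout(const char *name);
+
+    /* Look up a name across branches, left to right, up to and
+     * including the opaque branch; honor whiteouts in /odf. */
+    static int union_lookup(const char *name, int start_bi, int opaque_bi)
+    {
+        int bi;
+
+        for (bi = start_bi; bi <= opaque_bi; bi++) {
+            if (!branch_has(bi, name))
+                continue;           /* not in this branch */
+            if (odf_has_whiteout(name))
+                return -ENOENT;     /* whited out in /odf */
+            return 0;               /* found: return dentry */
+        }
+        /* not found anywhere; a leftover wh in /odf, if any, can
+         * be cleaned up lazily (possibly deferred to cron) */
+        return -ENOENT;
+    }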
+
+> create(), mknod(), and symlink():
+
+- after you create the new file, delete the wh if one exists
+
+- we create only in the leftmost branch; if that fails, return an error
+  (with this, there is no need to worry about opaque dirs)
+
+> unlink():
+
+- you go L->R (left to right on branches) and delete all files with the same
+ name.
+
+- if we get *any* error (EIO, EROFS, ...), then stop trying to unlink the
+  remaining ones, and create a wh; return "OK", but we may want to printk on
+  EIO or other serious errors; if wh creation fails, return an error.
+
+> is_empty(char *dir) utility function:
+
+- checks if dir is logically empty or not
+- handles opaque dirs
+- handles whiteouts
+- returns TRUE if the dir is logically empty, FALSE otherwise
+
+Sample operation (partial):
+
+- we first need to find out the state of that directory's "emptiness"
+  before acting, so we know whether to return ENOTEMPTY.
+
+- So we go L->R (up to and including an OPAQUE dir, but not beyond).
+
+- if all those dirs have link-count==2, then the unified directory is
+  logically empty, so we can return TRUE from is_empty(); else return FALSE.
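+
+A sketch of is_empty() under these rules (the helpers are hypothetical; a
+full version would also consult whiteouts in /odf):
+
+    /* hypothetical helpers, assumed implemented elsewhere */
+    extern int branch_has(int bi, const char *dir);
+    extern int branch_dir_nlink(int bi, const char *dir);
+
+    /* A dir is logically empty if every corresponding lower dir, up
+     * to and including the opaque branch, has link count 2. */
+    static int is_empty(int start_bi, int opaque_bi, const char *dir)
+    {
+        int bi;
+
+        for (bi = start_bi; bi <= opaque_bi; bi++) {
+            if (!branch_has(bi, dir))
+                continue;
+            if (branch_dir_nlink(bi, dir) != 2)
+                return 0;   /* FALSE: something inside */
+        }
+        return 1;           /* TRUE: logically empty */
+    }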
+
+> rmdir() is similar to unlink, but:
+
+- first call is_empty(); if the dir is not logically empty, we're done, just
+  return ENOTEMPTY; else
+
+- now handle actual rmdir'ing: you go L->R and rmdir all dirs with the same
+ name.
+
+- if you get *any* error (EIO, EROFS, ...), then stop trying to rmdir the
+  remaining ones, and create a wh; return "OK", but we may want to printk on
+  EIO or other serious errors.
+
+- when we have to rmdir a non-empty physical dir, we mv it to /odf/reclaim
+  instead, to avoid in-kernel recursive removal. We can use a combination
+  of a user-level cron job or an in-kernel async thread to clean up stuff in
+  /odf/reclaim.
+
+> mkdir()
+
+- create the dir first; if that fails, return an error
+
+- delete the wh if one exists. To delete the wh from /odf, we may need to
+  effectively remove an entire dir of whiteouts. So mv the whole dir into
+  /odf/reclaim, which will get cleaned up asynchronously.
+
+- create a new dir in /odf, and mark it OPAQUE
+
+> readdir() and filldir()
+
+- we create in /odf/ic (see above) a file containing the unionfs-visible
+  contents of the given, say, "foo/" dir. We use the same duplicate
+  elimination algorithm that we have now.
+
+- we update the mtime on /odf/ic as well as on the unionfs-visible "foo/"
+  dir, so we know whether the /odf/ic readdir contents are stale; if stale,
+  unlink the old /odf/ic file and recreate it using the current branch
+  contents.
+
+- The benefit: /odf caches readdir contents for all users, until/if the
+ "foo/" dir changes in Unionfs; this survives reboots, works with all forms
+ of l/seek-ing, and is even more efficient because dir pages will get
+ cached; this also works with NFS exports (no need for cookie ugliness any
+ more).
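+
+A user-level sketch of the mtime-based staleness check above (illustration
+only; the in-kernel code would compare inode mtimes directly):
+
+    #include <sys/stat.h>
+
+    /* The cached /odf/ic file is stale if the unionfs-visible dir
+     * was modified after the cache file was (re)built. */
+    static int ic_cache_is_stale(const char *ic_path, const char *dir_path)
+    {
+        struct stat ic, dir;
+
+        if (stat(ic_path, &ic) < 0)
+            return 1;   /* no cache yet: (re)build it */
+        if (stat(dir_path, &dir) < 0)
+            return 1;
+        return dir.st_mtime > ic.st_mtime;
+    }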
+
+> permission()
+
+- just check the first one you find, same as now.
+
+
+** Copyup actions
+
+Note: Copyup actions don't really relate to ODF. However, we put them here
+because the ODF work allowed us to consider simplifying the Unionfs code
+(especially with respect to rename semantics).
+
+> write(): needs to copyup
+
+- if you write to a readonly branch or ROFS file system, then copyup
+
+- look at using SEEK_HOLE to find holes; else check each page for zeros and
+  skip it (so copied-up files stay sparse).
+
+- optional: we can also copyup on ENOSPC/EDQUOT
+
+- if copyup fails, return an error
+
+- copyup may need to create a parent hierarchy of dirs (but don't make the
+ newly created hierarchy opaque)
+
+# XXX: fix hardlinks if copyup succeeded.
+
+# XXX: fix copyup for mmap(), probably in ->writepage, commit/prepare_write
+
+- we also need to copyup on some ->setattr operations (chown, chgrp, ...)
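+
+A sketch of the per-page zero check mentioned above (illustrative; the
+in-kernel version would operate on page-cache pages):
+
+    #include <stddef.h>
+
+    /* Return nonzero if the buffer is all zeros, so copyup can skip
+     * writing it and leave a hole in the copied-up file instead. */
+    static int page_is_zero(const unsigned char *buf, size_t len)
+    {
+        size_t i;
+
+        for (i = 0; i < len; i++)
+            if (buf[i])
+                return 0;
+        return 1;
+    }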
+
+> truncate(): needs to copyup
+
+- handle same as copyup for write
+
+- useful to have a shared copyup function that can copy part/all of a file.
+
+> rename() has always been challenging in Unionfs
+
+case 1: regular file within same dir
+case 2: regular file to separate dir
+case 3: dirname within same dir
+case 4: dirname to a separate dir
+
+The older semantics of rename were that we tried hard to preserve the
+location of a file being renamed. But this was too complex to maintain
+without races, and offered few benefits to users. Some Unionfs users
+proposed that on rename, we always copyup the renamed file to the left-most
+branch. This indeed would be much simpler to implement, but that seems too
+excessive and may clutter the left-most branch too much. So our new design
+with ODF provides a trade-off between the old and proposed implementations.
+Our new, simplified rename idea is as follows:
+
+- By the time we're called with rename(src,dst), both dentries have been
+  looked up: src must exist, else our lookup would have failed; and dst may
+  or may not exist (we'll get a dentry, which could be a negative dentry).
+
+- If src is a non-empty dir that lies on a ro branch, we let userspace
+  recurse and ask us to copy up each file separately, by returning EXDEV
+  (just like in Unionfs 1.x).
+
+- If dst exists on the top branch and is a whiteout or a logically empty
+  dir, we remove it before renaming by:
+  - rmdir if it's a physically empty dir
+  - silly rename if it's a non-empty dir
+  - unlink if it's not a dir
+
+- Start renaming from the leftmost branch of src to the rightmost, or until
+  any one rename fails. We then copyup to the top branch and rename there,
+  if the leftmost branch of src was not the top branch and any of the
+  following holds:
+  - the rename at src's leftmost branch failed
+  - dst already existed at a branch to the left of src
+  - dst was in an opaque dir whose opaqueness branch is to the left of src
+
+- We create a whiteout if any of the renames failed.
+
+- The same applies to directories as well. If we need to copyup a
+  directory, then that directory is guaranteed to be empty (thanks to the
+  EXDEV rule above), so we just mkdir at the top branch.
+
+- We need to purge the readdir cache (/odf/ic) for all successful renames,
+ in both src and dst dir. Also, the creation of a whiteout should purge
+ the readdir cache of that dir (can be optimized later).
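+
+A condensed pseudo-C sketch of this rename flow (all helpers are
+hypothetical, and the dst-removal and EXDEV pre-checks from above are
+elided):
+
+    /* hypothetical helpers, assumed implemented elsewhere */
+    extern int src_leftmost(void), src_rightmost(void);
+    extern int branch_has(int bi, const char *name);
+    extern int branch_rename(int bi, const char *src, const char *dst);
+    extern int need_copyup_rename(const char *s, const char *d, int fail);
+    extern int copyup_and_rename_top(const char *src, const char *dst);
+    extern void odf_create_whiteout(const char *name);
+    extern void odf_ic_purge_parent(const char *name);
+
+    static int union_rename(const char *src, const char *dst)
+    {
+        int bi, err = 0, any_failed = 0;
+
+        /* rename in every branch holding src, left to right */
+        for (bi = src_leftmost(); bi <= src_rightmost(); bi++) {
+            if (!branch_has(bi, src))
+                continue;
+            err = branch_rename(bi, src, dst);
+            if (err) {
+                any_failed = 1;
+                break;
+            }
+        }
+        /* copyup to the top branch and rename there if needed */
+        if (need_copyup_rename(src, dst, any_failed))
+            err = copyup_and_rename_top(src, dst);
+        if (any_failed && !err)
+            odf_create_whiteout(src);   /* mask leftover src names */
+        /* purge the readdir cache of both parent dirs */
+        odf_ic_purge_parent(src);
+        odf_ic_purge_parent(dst);
+        return err;
+    }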
+
+# XXX: need to handle hardlinks and renames
+
+** Persistent Inodes actions
+
+Easy: just create the file in /odf/ns/<PATH>, using a flexible copyup()
+routine that will work for whole file copyup, truncated files, and
+zero-length files in /odf/ns/.
+
+On lookup, we need to create the hierarchy in /odf/ns, if it doesn't exist;
+if/when it does, read the inum from /odf/ns, and put it in unionfs's inode.
+Let dcache/icache algorithms figure out when to purge stuff.
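+
+A sketch of this lookup-side logic (hypothetical helper names; the real
+code would use kernel VFS calls):
+
+    #include <stdint.h>
+
+    /* hypothetical helpers, assumed implemented elsewhere */
+    extern int odf_ns_exists(const char *path);
+    extern void odf_ns_create(const char *path);  /* mkdir -p + creat */
+    extern uint64_t odf_ns_inum(const char *path);
+
+    /* Get the persistent inum for a unionfs object: create the
+     * zero-length file in /odf/ns/<PATH> if needed, then reuse the
+     * Ext2-assigned inum of that file. */
+    static uint64_t odf_persistent_inum(const char *path)
+    {
+        if (!odf_ns_exists(path))
+            odf_ns_create(path);
+        return odf_ns_inum(path);
+    }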
+
+This means we'll need lots of inodes in /odf, so mkfs it appropriately. Our
+current estimate is to create around 3000 to 3500 inodes per MB in the odf,
+using "mke2fs -i NNN".
+
+
+* Hardlinks
+
+When looking up a hardlinked file for the first time, create a new entry in
+/odf/ic/##/##/<INUM>.<UUID> as described above.
+
+If that hardlinked file is truncated or written to, we need to copy it up;
+we copyup normally because it's the first time we copyup that hardlinked
+file; then we permit the write to proceed. If the copyup created a new
+inode on the destination branch, then we update the /odf/ic entry to record
+the UUID+inum (which were all zero initially).
+
+When looking up a hardlinked file, we check /odf/ic; let's say we find it
+there. That means the hardlinked file was looked up before, and possibly
+copied up. If no one modifies the hardlinked file even on the second
+lookup, there is no need to copyup the file or its parent hierarchy, so we
+just add another entry to /odf/ic.
+
+But if a user modifies the file, then we have two cases.
+
+(1) if the file is modified for the first time, then copyup several parent
+    hierarchies, for all known names of this hardlink in /odf/ic; then link
+    the file in several places. (Copyup will have to look up the dentry and
+    then do a d_path on it to get the full hierarchy to recreate.)
+
+(2) if the file was already copied up, then it means that the entire known
+    hardlink hierarchy already exists, so just create the parent hierarchy
+    for this newly discovered link, and link the file to the already
+    copied-up version. Then we can allow the write/truncate to proceed.
+
+When we copyup for hardlinks or otherwise, also hardlink in /odf/ns, so we
+retain the persistent inodes.
+
+We need to record the actual number of links to report for a file, in case
+one of the hardlinked files was removed. We can store this in
+/odf/ns/<PATH> by stealing the gid field (or some other field we can take
+over safely).
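+
+A user-level sketch of the gid-field trick above (illustration only; in
+the kernel this would set i_gid on the /odf/ns inode directly):
+
+    #include <sys/types.h>
+    #include <sys/stat.h>
+    #include <unistd.h>
+
+    /* Stash the union-visible link count of a hardlinked file in the
+     * (otherwise unused) gid field of its /odf/ns/<PATH> file. */
+    static int odf_ns_set_nlink(const char *ns_path, unsigned nlink)
+    {
+        return chown(ns_path, (uid_t)-1, (gid_t)nlink);
+    }
+
+    static unsigned odf_ns_get_nlink(const char *ns_path)
+    {
+        struct stat st;
+
+        if (stat(ns_path, &st) < 0)
+            return 0;
+        return (unsigned)st.st_gid;
+    }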
+
+> link()
+
+If a user creates a new hardlink, then we copyup the src as described
+above, update /odf/ns as described, and then link the dst to the src. If
+someone creates a new hardlink to an already copied-up file, then we will
+see that as existing info in /odf/ns.
+
+
+* CREDITS
+
+Our group of stackable file system developers at Stony Brook had considered
+something like ODF for several years already, and for several projects (not
+just Unionfs). But it wasn't until OLS 2006 that we began seriously
+designing and developing this support. One major reason was that at OLS'06
+we met and discussed the topic with Ted Ts'o, who was quite supportive of
+this idea.
+
+The initial design for ODF was done by Yiannis Pericleous, Josef "Jeff"
+Sipek, and Erez Zadok. The bulk of the initial development was done by
+Yiannis Pericleous. Currently, all Unionfs developers help maintain the ODF
+code.
+
+For more information on Unionfs, see <http://unionfs.filesystems.org/>.
diff --git a/Documentation/filesystems/unionfs/usage.txt b/Documentation/filesystems/unionfs/usage.txt
index 13fbcea..d14b8d0 100644
--- a/Documentation/filesystems/unionfs/usage.txt
+++ b/Documentation/filesystems/unionfs/usage.txt
@@ -81,10 +81,60 @@ CACHE CONSISTENCY
If you modify any file on any of the lower branches directly, while there is
a Unionfs 2.0 mounted above any of those branches, you should tell Unionfs
to purge its caches and re-get the objects. To do that, you have to
-incremenet the generation number of the superblock using the following
+increment the generation number of the superblock using the following
command:
-# mount -t unionfs -o remount,remount,incgen none MOUNTPOINT
+# mount -t unionfs -o remount,incgen none MOUNTPOINT
+ON-DISK FORMAT (ODF)
+====================
+
+For motivation, design, and implementation details of ODF, see the odf.txt
+document in this directory. To use ODF, you first have to format a small
+file system with enough inodes. You can use a loop device, a ram disk, or
+even a real partition. For now, this must be formatted as Ext2 (ODF has not
+been tested with other file systems yet). We assume here that a single
+Unionfs is mounted and its ODF partition is mounted in /odf.
+
+To mount Unionfs with a new ODF:
+
+# mount -t unionfs -o odf=/odf,dirs=... none MOUNTPOINT
+
+To mount Unionfs with an existing ODF:
+
+# mount -t unionfs -o odf=/odf none MOUNTPOINT
+
+Each time you mount an existing ODF, if the branch configuration has
+changed, we purge the existing cache to avoid using stale information. To
+force a Unionfs mount with an older and possibly stale ODF:
+
+# mount -t unionfs -o odfforce,odf=/odf none MOUNTPOINT
+
+
+We use configfs to define three global variables for all mounted unions, to
+control cache cleaning. The following assumes that you mount configfs on
+/config. We have three variables:
+
+1. timeout: defaults to 30 seconds. Controls the interval at which the ODF
+   cleanup thread wakes up. You can use configfs to view the value:
+
+# cat /config/unionfs/timeout
+
+   and you can change the value, say reduce it to 15 seconds, as follows:
+
+# echo 15 > /config/unionfs/timeout
+
+
+2. block_thresh: defaults to 80%. The threshold of consumed disk blocks in
+ /odf, above which we begin cleaning, to bring the system below this
+   threshold. (Note: we may add hysteresis to prevent ping-pong effects.)
+
+3. inode_thresh: defaults to 80%. The threshold of consumed inodes in /odf,
+ above which we begin cleaning, to bring the system below this threshold.
+
+You can view or modify the latter two variables similarly to the 'timeout'
+variable.
+
For more information, see <http://unionfs.filesystems.org/>.