On Mon, 2010-12-06 at 09:27 -0500, Josef Bacik wrote:
> On Sat, Dec 04, 2010 at 01:58:07PM -0800, Mike Fedyk wrote:
> > On Fri, Dec 3, 2010 at 1:45 PM, Josef Bacik <jo...@redhat.com> wrote:
> > > On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > >> Hello,
> > >>
> > > >> Various people have complained about how BTRFS deals with subvolumes
> > > >> recently, specifically the fact that they all have the same inode
> > > >> number, and there's no discrete separation from one subvolume to
> > > >> another.  Christoph asked that I lay out a basic design document of
> > > >> how we want subvolumes to work so we can hash everything out now, fix
> > > >> what is broken, and then move forward with a design that everybody is
> > > >> more or less happy with.  I apologize in advance for how freaking long
> > > >> this email is going to be.  I assume that most people are generally
> > > >> familiar with how BTRFS works, so I'm not going to bother explaining
> > > >> some stuff in great detail.
> > >>
> > >> === What are subvolumes? ===
> > >>
> > > >> They are just another tree.  In BTRFS we have various b-trees to
> > > >> describe the filesystem.  A few of them are filesystem wide, such as
> > > >> the extent tree, chunk tree, root tree etc.  The trees that hold the
> > > >> actual filesystem data, that is inodes and such, are kept in their own
> > > >> b-tree.  This is how subvolumes and snapshots appear on disk: they are
> > > >> simply new b-trees with all of the file data contained within them.
> > >>
> > >> === What do subvolumes look like? ===
> > >>
> > > >> All the user sees are directories.  They act like any other directory
> > > >> acts, with a few exceptions:
> > >>
> > > >> 1) You cannot hardlink between subvolumes.  This is because subvolumes
> > > >> have their own inode numbers and such; think of them as separate
> > > >> mounts in this case.  You cannot hardlink between two mounts because
> > > >> the link needs to point to the same on-disk inode, which is impossible
> > > >> between two different filesystems.  The same is true for subvolumes:
> > > >> they have their own trees with their own inodes and inode numbers, so
> > > >> it's impossible to hardlink between them.
> > >>
> > > >> 1a) In case it wasn't clear from above, each subvolume has its own
> > > >> inode numbers, so you can have the same inode numbers used between two
> > > >> different subvolumes, since they are two different trees.
> > >>
> > > >> 2) Obviously you can't just rm -rf subvolumes.  Because they are roots
> > > >> there's extra metadata to keep track of them, so you have to use one
> > > >> of our ioctls to delete subvolumes/snapshots.
> > >>
> > > >> But for permissions and everything else they are the same.
> > >>
> > > >> There is one tricky thing.  When you create a subvolume, the directory
> > > >> inode that is created in the parent subvolume has the inode number of
> > > >> 256.  So if you have a bunch of subvolumes in the same parent
> > > >> subvolume, you are going to have a bunch of directories with the inode
> > > >> number of 256.  This is so when users cd into a subvolume we can know
> > > >> it's a subvolume and do all the normal voodoo to start looking in the
> > > >> subvolume's tree instead of the parent subvolume's tree.
> > >>
> > > >> This is where things go a bit sideways.  We had serious problems with
> > > >> NFS, but thankfully NFS gives us a bunch of hooks to get around these
> > > >> problems.  CIFS/Samba do not, so we will have problems there, not to
> > > >> mention any other userspace application that looks at inode numbers.
> > >>
> > >> === How do we want subvolumes to work from a user perspective? ===
> > >>
> > > >> 1) Users need to be able to create their own subvolumes.  The permission
> > > >> semantics will be absolutely the same as creating directories, so I
> > > >> don't think this is too tricky.  We want this because you can only take
> > > >> snapshots of subvolumes, and so it is important that users be able to
> > > >> create their own discrete snapshottable targets.
> > >>
> > > >> 2) Users need to be able to snapshot their subvolumes.  This is
> > > >> basically the same as #1, but it bears repeating.
> > >>
> > > >> 3) Subvolumes shouldn't need to be specifically mounted.  This is also
> > > >> important: we don't want users to have to go around mounting their
> > > >> subvolumes manually one-by-one.  Today users just cd into subvolumes
> > > >> and it works, just like cd'ing into a directory.
> > >>
> > >> === Quotas ===
> > >>
> > > >> This is a huge topic in and of itself, but Christoph mentioned wanting
> > > >> to have an idea of what we wanted to do with it, so I'm putting it
> > > >> here.  There are really 2 things here:
> > >>
> > > >> 1) Limiting the size of subvolumes.  This is really easy for us: just
> > > >> create a subvolume and at creation time set a maximum size it can grow
> > > >> to and not let it go farther than that.  Nice, simple and
> > > >> straightforward.
> > >>
> > > >> 2) Normal quotas, via the quota tools.  This just comes down to how we
> > > >> want to charge users: do we want to do it per subvolume, or per
> > > >> filesystem?  My vote is per filesystem.  Obviously this will make it
> > > >> tricky with snapshots, but I think if we're just charging the diffs
> > > >> between the original volume and the snapshot to the user then that will
> > > >> be the easiest for people to understand, rather than making a snapshot
> > > >> all of a sudden count as the user's currently used quota * 2.
> > >>
> > >> === What do we do? ===
> > >>
> > > >> This is where I expect to see the most discussion.  Here is what I want
> > > >> to do:
> > >>
> > > >> 1) Scrap the 256 inode number thing.  Instead we'll just put a flag in
> > > >> the inode to say "Hey, I'm a subvolume" and then we can do all of the
> > > >> appropriate magic that way.  This unfortunately will be an incompatible
> > > >> format change, but the sooner we get this addressed the easier it will
> > > >> be in the long run.  Obviously when I say format change I mean via the
> > > >> incompat bits we have, so old fs's won't be broken and such.
> > >>
> > > >> 2) Do something like NFS's referral mounts when we cd into a subvolume.
> > > >> Now we just do dentry trickery, but that doesn't make the boundary
> > > >> between subvolumes clear, so it will confuse people (and samba) when
> > > >> they walk into a subvolume and all of a sudden the inode numbers are
> > > >> the same as in the directory behind them.  With the referral mount
> > > >> approach, each subvolume appears to be its own mount, and that way
> > > >> things like NFS and samba will work properly.
> > >>
> > > >> I feel like I'm forgetting something here, hopefully somebody will
> > > >> point it out.
> > >>
> > >> === Conclusion ===
> > >>
> > > >> There are definitely some wonky things with subvolumes, but I don't
> > > >> think they are things that cannot be fixed now.  Some of these changes
> > > >> will require incompat format changes, but either we fix it now, or
> > > >> later on down the road, when BTRFS starts getting used in production,
> > > >> we really find out how many things our current scheme breaks and then
> > > >> have to do the changes then.  Thanks,
> > >>
> > >
> > > > So now that I've actually looked at everything, it looks like the
> > > > semantics are all right for subvolumes:
> > > >
> > > > 1) readdir - we return the root id in d_ino, which is unique across the fs
> > > > 2) stat - we return 256 for all subvolumes, because that is their inode
> > > > number
> > > > 3) dev_t - we set up an anon super for all volumes, so they all get their
> > > > own dev_t, which is set properly for all of their children, see below
> > >
> > > [root@test1244 btrfs-test]# stat .
> > >  File: `.'
> > >  Size: 20              Blocks: 8          IO Block: 4096   directory
> > > Device: 15h/21d Inode: 256         Links: 1
> > > Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> > > Access: 2010-12-03 15:35:41.931679393 -0500
> > > Modify: 2010-12-03 15:35:20.405679493 -0500
> > > Change: 2010-12-03 15:35:20.405679493 -0500
> > >
> > > [root@test1244 btrfs-test]# stat foo
> > >  File: `foo'
> > >  Size: 12              Blocks: 0          IO Block: 4096   directory
> > > Device: 19h/25d Inode: 256         Links: 1
> > > Access: (0700/drwx------)  Uid: (    0/    root)   Gid: (    0/    root)
> > > Access: 2010-12-03 15:35:17.501679393 -0500
> > > Modify: 2010-12-03 15:35:59.150680051 -0500
> > > Change: 2010-12-03 15:35:59.150680051 -0500
> > >
> > > [root@test1244 btrfs-test]# stat foo/foobar
> > >  File: `foo/foobar'
> > > >  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> > > Device: 19h/25d Inode: 257         Links: 1
> > > Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> > > Access: 2010-12-03 15:35:59.150680051 -0500
> > > Modify: 2010-12-03 15:35:59.150680051 -0500
> > > Change: 2010-12-03 15:35:59.150680051 -0500
> > >
> > > > So as far as the user is concerned, everything should come out right.
> > > > Obviously we had to do the NFS trickery still because as far as VFS is
> > > > concerned the subvolumes are all on the same mount.  So the question is
> > > > this (and really this is directed at Christoph and Bruce and anybody
> > > > else who may care): is this good enough, or do we want to have a
> > > > separate vfsmount for each subvolume?  Thanks,
> > >
> > 
> > What are the drawbacks of having a vfsmount for each subvolume?
> > 
> > Why (besides having to code it up) are you trying to avoid doing it that 
> > way?
> 
> It's the having to code it up that way thing, I'm nothing if not lazy.

And, anything that uses the mount table, exposed from the kernel, will
grind a system to a halt with only a few thousand mounts, not to mention
that user space utilities, like df, du ..., will become painful to use
for more than a hundred or so entries.
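
As a rough way to see how large the mount table would get, one can count
its entries.  The sketch below is only an illustration in Python: it
assumes the usual Linux /proc/self/mounts layout (device, mount point,
filesystem type, options, ... separated by whitespace), and the function
names are hypothetical.

```python
def mounts_by_fstype(mounts_text):
    """Count mount table entries per filesystem type.

    mounts_text is text in /proc/self/mounts format, where the third
    whitespace-separated field of each line is the filesystem type.
    """
    counts = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        fstype = fields[2]
        counts[fstype] = counts.get(fstype, 0) + 1
    return counts


def read_mount_table(path="/proc/self/mounts"):
    """Read the current mount table as text (Linux-specific path, assumed)."""
    with open(path) as f:
        return f.read()
```

Calling mounts_by_fstype(read_mount_table()) on a Linux box gives the
per-type counts; with one vfsmount per subvolume, the btrfs count would
grow with the number of subvolumes that have been entered.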

> 
> Josef
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

