On Fri, Dec 3, 2010 at 1:45 PM, Josef Bacik <jo...@redhat.com> wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
>> Hello,
>>
>> Various people have complained recently about how BTRFS deals with
>> subvolumes, specifically the fact that they all have the same inode
>> number and that there's no discrete separation from one subvolume to
>> another.  Christoph asked that I lay out a basic design document of how
>> we want subvolumes to work so we can hash everything out now, fix what
>> is broken, and then move forward with a design that everybody is more
>> or less happy with.  I apologize in advance for how freaking long this
>> email is going to be.  I assume that most people are generally familiar
>> with how BTRFS works, so I'm not going to explain some things in great
>> detail.
>>
>> === What are subvolumes? ===
>>
>> They are just another tree.  In BTRFS we have various b-trees to
>> describe the filesystem.  A few of them are filesystem-wide, such as
>> the extent tree, chunk tree, root tree, etc.  The trees that hold the
>> actual filesystem data, that is inodes and such, are kept in their own
>> b-tree.  This is how subvolumes and snapshots appear on disk: they are
>> simply new b-trees with all of the file data contained within them.
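A toy sketch of that layout may help (illustrative Python only, not real BTRFS structures; the tree and file names are made up, and the inode numbers come from the text below, where a subvolume's root directory is inode 256):

```python
# Toy model: each subvolume is its own tree with its own inode-number
# space, so the same inode number can appear in several subvolumes
# without ever referring to the same file.  Filesystem-wide trees
# (extent tree, chunk tree, root tree, ...) are shared and not modeled.
subvolumes = {
    "default": {256: "subvolume root dir", 257: "foo/foobar"},
    "snap-1":  {256: "subvolume root dir", 257: "foo/foobar (snapshot copy)"},
}

# Inode 256 exists in both trees, but the (subvolume, inode) pairs are
# distinct, so the two directories are distinct objects.
assert 256 in subvolumes["default"] and 256 in subvolumes["snap-1"]
assert ("default", 256) != ("snap-1", 256)
```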
>>
>> === What do subvolumes look like? ===
>>
>> All the user sees are directories.  They act like any other directory,
>> with a few exceptions:
>>
>> 1) You cannot hardlink between subvolumes.  This is because subvolumes
>> have their own inode numbers and such; think of them as separate mounts
>> in this case.  You cannot hardlink between two mounts because the link
>> needs to point to the same on-disk inode, which is impossible between
>> two different filesystems.  The same is true for subvolumes: they have
>> their own trees with their own inodes and inode numbers, so it's
>> impossible to hardlink between them.
>>
>> 1a) In case it wasn't clear from the above, each subvolume has its own
>> inode numbers, so the same inode number can be used in two different
>> subvolumes, since they are two different trees.
>>
>> 2) Obviously you can't just rm -rf subvolumes.  Because they are roots,
>> there's extra metadata to keep track of them, so you have to use one of
>> our ioctls to delete subvolumes/snapshots.
>>
>> But as far as permissions and everything else go, they are the same.
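The hardlink point above can be demonstrated on any local filesystem (this sketch is not BTRFS-specific): a hard link is just another directory entry pointing at the same on-disk inode, which only makes sense inside a single inode-number space.

```python
# A hard link and its target share one inode: same (st_dev, st_ino),
# link count of 2.  Across two filesystems -- or two subvolumes, each
# with its own inode numbers -- there is no shared inode to point at,
# so link(2) fails with EXDEV there.
import os
import tempfile

d = tempfile.mkdtemp()
src = os.path.join(d, "a")
dst = os.path.join(d, "b")
open(src, "w").close()
os.link(src, dst)  # same filesystem: allowed

sa, sb = os.stat(src), os.stat(dst)
assert (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)  # one inode, two names
assert sa.st_nlink == 2
```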
>>
>> There is one tricky thing.  When you create a subvolume, the directory
>> inode that is created in the parent subvolume has the inode number 256.
>> So if you have a bunch of subvolumes in the same parent subvolume, you
>> are going to have a bunch of directories with the inode number 256.
>> This is so that when users cd into a subvolume we can know it's a
>> subvolume and do all the normal voodoo to start looking in the
>> subvolume's tree instead of the parent subvolume's tree.
>>
>> This is where things go a bit sideways.  We had serious problems with
>> NFS, but thankfully NFS gives us a bunch of hooks to get around these
>> problems.  CIFS/Samba do not, so we will have problems there, not to
>> mention any other userspace application that looks at inode numbers.
>>
>> === How do we want subvolumes to work from a user perspective? ===
>>
>> 1) Users need to be able to create their own subvolumes.  The
>> permission semantics will be absolutely the same as creating
>> directories, so I don't think this is too tricky.  We want this because
>> you can only take snapshots of subvolumes, so it is important that
>> users be able to create their own discrete snapshottable targets.
>>
>> 2) Users need to be able to snapshot their subvolumes.  This is basically the
>> same as #1, but it bears repeating.
>>
>> 3) Subvolumes shouldn't need to be specifically mounted.  This is also
>> important; we don't want users to have to go around mounting their
>> subvolumes manually one-by-one.  Today users just cd into subvolumes
>> and it works, just like cd'ing into a directory.
>>
>> === Quotas ===
>>
>> This is a huge topic in and of itself, but Christoph mentioned wanting
>> to have an idea of what we want to do with it, so I'm putting it here.
>> There are really two things here:
>>
>> 1) Limiting the size of subvolumes.  This is really easy for us: just
>> create a subvolume, set a maximum size it can grow to at creation time,
>> and don't let it go farther than that.  Nice, simple and
>> straightforward.
>>
>> 2) Normal quotas, via the quota tools.  This just comes down to how we
>> want to charge users: per subvolume, or per filesystem.  My vote is per
>> filesystem.  Obviously this will make things tricky with snapshots, but
>> I think if we just charge the diffs between the original volume and the
>> snapshot to the user, that will be the easiest for people to
>> understand, rather than having a snapshot all of a sudden double the
>> user's currently used quota.
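A worked example of the diff-based charging idea (hypothetical numbers; this is arithmetic illustrating the proposal, not an implemented accounting scheme):

```python
# A fresh snapshot shares all of its blocks with the source subvolume,
# so it should cost the user nothing; only blocks rewritten after the
# snapshot (the diff) get charged.
original = 10 * 1024   # user data in the source subvolume, in MB
snapshot_diff = 0      # freshly taken snapshot shares everything

charged = original + snapshot_diff
assert charged == 10240          # not original * 2

snapshot_diff += 512             # user rewrites 512 MB in the snapshot
charged = original + snapshot_diff
assert charged == 10752          # only the diff was added
```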
>>
>> === What do we do? ===
>>
>> This is where I expect to see the most discussion.  Here is what I
>> want to do:
>>
>> 1) Scrap the 256 inode number thing.  Instead we'll just put a flag in
>> the inode to say "Hey, I'm a subvolume" and then we can do all of the
>> appropriate magic that way.  This unfortunately will be an incompatible
>> format change, but the sooner we get this addressed the easier it will
>> be in the long run.  Obviously when I say format change I mean via the
>> incompat bits we have, so old filesystems won't be broken and such.
>>
>> 2) Do something like NFS's referral mounts when we cd into a subvolume.
>> Right now we just do dentry trickery, but that doesn't make the
>> boundary between subvolumes clear, so it will confuse people (and
>> Samba) when they walk into a subvolume and all of a sudden the inode
>> numbers are the same as in the directory behind them.  By doing the
>> referral mount thing, each subvolume appears to be its own mount, and
>> that way things like NFS and Samba will work properly.
>>
>> I feel like I'm forgetting something here; hopefully somebody will
>> point it out.
>>
>> === Conclusion ===
>>
>> There are definitely some wonky things with subvolumes, but I don't
>> think they are things that cannot be fixed now.  Some of these changes
>> will require incompat format changes, but either we fix this now, or
>> later on down the road, when BTRFS starts getting used in production,
>> we find out the hard way how many things our current scheme breaks and
>> have to make the changes then.  Thanks,
>>
>
> So now that I've actually looked at everything, it looks like the
> semantics are all right for subvolumes:
>
> 1) readdir - we return the root id in d_ino, which is unique across the fs
> 2) stat - we return 256 for all subvolumes, because that is their inode
> number
> 3) dev_t - we set up an anon super for all subvolumes, so they all get
> their own dev_t, which is set properly for all of their children; see
> below
>
> [r...@test1244 btrfs-test]# stat .
>  File: `.'
>  Size: 20              Blocks: 8          IO Block: 4096   directory
> Device: 15h/21d Inode: 256         Links: 1
> Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2010-12-03 15:35:41.931679393 -0500
> Modify: 2010-12-03 15:35:20.405679493 -0500
> Change: 2010-12-03 15:35:20.405679493 -0500
>
> [r...@test1244 btrfs-test]# stat foo
>  File: `foo'
>  Size: 12              Blocks: 0          IO Block: 4096   directory
> Device: 19h/25d Inode: 256         Links: 1
> Access: (0700/drwx------)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2010-12-03 15:35:17.501679393 -0500
> Modify: 2010-12-03 15:35:59.150680051 -0500
> Change: 2010-12-03 15:35:59.150680051 -0500
>
> [r...@test1244 btrfs-test]# stat foo/foobar
>  File: `foo/foobar'
>  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> Device: 19h/25d Inode: 257         Links: 1
> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2010-12-03 15:35:59.150680051 -0500
> Modify: 2010-12-03 15:35:59.150680051 -0500
> Change: 2010-12-03 15:35:59.150680051 -0500
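The stat output above shows why the anon-super approach works for userspace: applications commonly identify a file by the (st_dev, st_ino) pair, and with a distinct dev_t per subvolume, two roots that both report inode 256 still differ in st_dev.  A generic sketch of the pair-as-identity idea, on a regular filesystem (the two temp directories stand in for the two subvolume roots):

```python
# Two distinct directories always differ in their (st_dev, st_ino)
# identity pair, even if one of the components happens to match --
# which is exactly what the per-subvolume anonymous dev_t guarantees
# for the inode-256 subvolume roots shown in the stat output above.
import os
import tempfile

a = tempfile.mkdtemp()
b = tempfile.mkdtemp()

sa, sb = os.stat(a), os.stat(b)
key_a = (sa.st_dev, sa.st_ino)
key_b = (sb.st_dev, sb.st_ino)
assert key_a != key_b  # distinct objects, distinct identity pairs
```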
>
> So as far as the user is concerned, everything should come out right.
> Obviously we still had to do the NFS trickery, because as far as the
> VFS is concerned the subvolumes are all on the same mount.  So the
> question is this (and really this is directed at Christoph and Bruce
> and anybody else who may care): is this good enough, or do we want to
> have a separate vfsmount for each subvolume?  Thanks,
>

What are the drawbacks of having a vfsmount for each subvolume?

Why (besides having to code it up) are you trying to avoid doing it that way?