On 9/16/16 4:28 PM, Tomasz Sterna wrote:
> Hi.
> 
> I have spotted an issue with stat(2) call on files on btrfs.
> It is giving me dev_t st_dev number that does not correspond to any
> mounted filesystem in proc's mountinfo.

That's by design.  Your particular file system may only use one device
but, internally, btrfs uses virtualized storage that may be spread
across multiple devices.  To make things more complicated, snapshots
mean that:

sled1a:/mnt # btrfs sub list .
ID 257 gen 14 top level 5 path a
ID 258 gen 14 top level 5 path b

sled1a:/mnt # ls -laRi
.:
total 16
256 drwxr-xr-x 1 root root   4 Sep 20 09:08 .
256 drwxr-xr-x 1 root root 220 Sep 16 09:49 ..
256 drwxr-xr-x 1 root root   8 Sep 14 10:24 a
256 drwxr-xr-x 1 root root   8 Sep 14 10:24 b

./a:
total 4112
256 drwxr-xr-x 1 root root       8 Sep 14 10:24 .
256 drwxr-xr-x 1 root root       4 Sep 20 09:08 ..
257 -rw-r--r-- 1 root root 4194304 Sep 14 10:24 file

./b:
total 4112
256 drwxr-xr-x 1 root root       8 Sep 14 10:24 .
256 drwxr-xr-x 1 root root       4 Sep 20 09:08 ..
257 -rw-r--r-- 1 root root 4194304 Sep 14 10:24 file

Under normal circumstances those are two files with the same st_dev and
the same inode number.  That would normally correspond to a hard link,
but the files do not (necessarily) correspond to the same file.

... but because we use anonymous device numbers for each subvolume, we
have different device numbers for each one.

sled1a:/mnt # stat --format "%n st_dev=%d" {a,b}/file
a/file st_dev=69
b/file st_dev=70

It's a pretty big usability wart that we don't consistently report the
device number.  We do it correctly in stat() but there are other places
in the code that assume that inode->i_sb->s_dev will work.  In the SUSE
kernels, we have patches that add a super_operation to report the
correct device number everywhere, but even that is a hack.

> I already attempted a illinformed-patch in fs/btrfs/super.c:
> 
> @@ -1127,6 +1127,7 @@ static int btrfs_fill_super(struct super_block *sb,
>               goto fail_close;
>       }
>  
> +     sb->s_dev = inode->i_sb->s_dev;
>       sb->s_root = d_make_root(inode);
>       if (!sb->s_root) {
>               err = -ENOMEM;
> 
> but it didn't help.

It wouldn't.  That is assigning a variable to itself.

> I would like to dig deeper and fix it, but first I have to ask:
> - Which number is wrong?
>   The one returned by stat() or the one in mountinfo?

The one in mountinfo, but then that means that the user only sees the
anonymous devices in mount(8), which isn't what we want either.

I'm afraid the correct fix is very involved and requires non-trivial
changes in the VFS layer as well.  It's on my long-term TODO list.  I
currently have some patches that do the magic with vfsmounts but it's
far from being usable.

-Jeff

-- 
Jeff Mahoney
SUSE Labs

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to