Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

Konstantin Sun, 07 Dec 2014 16:06:07 -0800

Anand Jain wrote on 02.12.2014 at 12:54:
>
>
>
> On 02/12/2014 19:14, Goffredo Baroncelli wrote:
>> I further investigate this issue.
>>
>> MegaBrutal, reported the following issue: doing a lvm snapshot of the
>> device of a
>> mounted btrfs fs, the new snapshot device name replaces the name of
>> the original
>> device in the output of /proc/mounts. This confused tools like
>> grub-probe which
>> report a wrong root device.
>
> very good test case indeed thanks.
>
> Actual IO would still go to the original device, until FS is remounted.
This seems to be correct at least at the beginning but I wouldn't be so
sure - why else the system is crashing in my case after a while when the
second drive is present?! So if the kernel was not using it in some way,
except the wrong /proc/mounts nothing else should happen.


>
>> It has to be pointed out that instead the link under
>> /sys/fs/btrfs/<fsid>/devices is
>> correct.
>
> In this context the above sysfs path will be out of sync with the
> reality, its just stale sysfs entry.
>
>>
>> What happens is that *even if the filesystem is mounted*, doing a
>> "btrfs dev scan" of a snapshot (of the real volume), the device name
>> of the
>> filesystem is replaced with the snapshot one.
>
> we have some fundamentally wrong stuff. My original patch tried
> to fix it. But later discovered that some external entities like
> systmed and boot process is using that bug as a feature and we had
> to revert the patch.
>
> Fundamentally scsi inquiry serial number is only number which is unique
> to the device (including the virtual device, but there could be some
> legacy virtual device which didn't follow that strictly, Anyway those
> I deem to be device side issue.) Btrfs depends on the combination of
> fsid, uuid and devid (and generation number) to identify the unique
> device volume, which is weak and easy to go wrong.
>
>
>> Anand, with b96de000b, tried to fix it; however further regression
>> appeared
>> and Chris reverted this commit (see below).
>>
>> BR
>> G.Baroncelli
>>
>> commit b96de000bc8bc9688b3a2abea4332bd57648a49f
>> Author: Anand Jain <anand.j...@oracle.com>
>> Date:   Thu Jul 3 18:22:05 2014 +0800
>>
>>      Btrfs: device_list_add() should not update list when mounted
>> [...]
>>
>>
>> commit 0f23ae74f589304bf33233f85737f4fd368549eb
>> Author: Chris Mason <c...@fb.com>
>> Date:   Thu Sep 18 07:49:05 2014 -0700
>>
>>      Revert "Btrfs: device_list_add() should not update list when
>> mounted"
>>
>>      This reverts commit b96de000bc8bc9688b3a2abea4332bd57648a49f.
>>
>>      This commit is triggering failures to mount by subvolume id in some
>>      configurations.  The main problem is how many different ways this
>>      scanning function is used, both for scanning while mounted and
>>      unmounted.  A proper cleanup is too big for late rcs.
>>
>> [...]
>>
>> On 12/02/2014 09:28 AM, MegaBrutal wrote:
>>> 2014-12-02 8:50 GMT+01:00 Goffredo Baroncelli <kreij...@inwind.it>:
>>>> On 12/02/2014 01:15 AM, MegaBrutal wrote:
>>>>> 2014-12-02 0:24 GMT+01:00 Robert White <rwh...@pobox.com>:
>>>>>> On 12/01/2014 02:10 PM, MegaBrutal wrote:
>>>>>>>
>>>>>>> Since having duplicate UUIDs on devices is not a problem for me
>>>>>>> since
>>>>>>> I can tell them apart by LVM names, the discussion is of little
>>>>>>> relevance to my use case. Of course it's interesting and I like to
>>>>>>> read it along, it is not about the actual problem at hand.
>>>>>>>
>>>>>>
>>>>>> Which is why you use the device= mount option, which would take
>>>>>> LVM names
>>>>>> and which was repeatedly discussed as solving this very problem.
>>>>>>
>>>>>> Once you decide to duplicate the UUIDs with LVM snapshots you
>>>>>> take up the
>>>>>> burden of disambiguating your storage.
>>>>>>
>>>>>> Which is part of why re-reading was suggested as this was covered
>>>>>> in some
>>>>>> depth and _is_ _exactly_ about the problem at hand.
>>>>>
>>>>> Nope.
>>>>>
>>>>> root@reproduce-1391429:~# cat /proc/cmdline
>>>>> BOOT_IMAGE=/vmlinuz-3.18.0-031800rc5-generic
>>>>> root=/dev/mapper/vg-rootlv ro
>>>>> rootflags=device=/dev/mapper/vg-rootlv,subvol=@
>>>>>
>>>>> Observe, device= mount option is added.
>>>>
>>>> device= options is needed only in a btrfs multi-volume scenario.
>>>> If you have only one disk, this is not needed
>>>>
>>>
>>> I know. I only did this as a demonstration for Robert. He insisted it
>>> will certainly solve the problem. Well, it doesn't.
>>>
>>>
>>>>>
>>>>> root@reproduce-1391429:~# ./reproduce-1391429.sh
>>>>> #!/bin/sh -v
>>>>> lvs
>>>>>    LV     VG   Attr      LSize   Pool Origin Data%  Move Log
>>>>> Copy%  Convert
>>>>>    rootlv vg   -wi-ao---   1.00g
>>>>>    swap0  vg   -wi-ao--- 256.00m
>>>>>
>>>>> grub-probe --target=device /
>>>>> /dev/mapper/vg-rootlv
>>>>>
>>>>> grep " / " /proc/mounts
>>>>> rootfs / rootfs rw 0 0
>>>>> /dev/dm-1 / btrfs rw,relatime,space_cache 0 0
>>>>>
>>>>> lvcreate --snapshot --size=128M --name z vg/rootlv
>>>>>    Logical volume "z" created
>>>>>
>>>>> lvs
>>>>>    LV     VG   Attr      LSize   Pool Origin Data%  Move Log
>>>>> Copy%  Convert
>>>>>    rootlv vg   owi-aos--   1.00g
>>>>>    swap0  vg   -wi-ao--- 256.00m
>>>>>    z      vg   swi-a-s-- 128.00m      rootlv   0.11
>>>>>
>>>>> ls -l /dev/vg/
>>>>> total 0
>>>>> lrwxrwxrwx 1 root root 7 Dec  2 00:12 rootlv -> ../dm-1
>>>>> lrwxrwxrwx 1 root root 7 Dec  2 00:12 swap0 -> ../dm-0
>>>>> lrwxrwxrwx 1 root root 7 Dec  2 00:12 z -> ../dm-2
>>>>>
>>>>> grub-probe --target=device /
>>>>> /dev/mapper/vg-z
>>>>>
>>>>> grep " / " /proc/mounts
>>>>> rootfs / rootfs rw 0 0
>>>>> /dev/dm-2 / btrfs rw,relatime,space_cache 0 0
>>>>
>>>> What /proc/self/mountinfo contains ?
>>>
>>> Before creating snapshot:
>>>
>>> 15 20 0:15 / /sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
>>> 16 20 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
>>> 17 20 0:5 / /dev rw,relatime - devtmpfs udev
>>> rw,size=241692k,nr_inodes=60423,mode=755
>>> 18 17 0:12 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts
>>> rw,gid=5,mode=620,ptmxmode=000
>>> 19 20 0:16 / /run rw,nosuid,noexec,relatime - tmpfs tmpfs
>>> rw,size=50084k,mode=755
>>> 20 0 0:17 /@ / rw,relatime - btrfs /dev/dm-1 rw,space_cache
>>> <----- THIS!
>>> 21 15 0:20 / /sys/fs/cgroup rw,relatime - tmpfs none
>>> rw,size=4k,mode=755
>>> 22 15 0:21 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
>>> 23 15 0:6 / /sys/kernel/debug rw,relatime - debugfs none rw
>>> 24 15 0:10 / /sys/kernel/security rw,relatime - securityfs none rw
>>> 25 19 0:22 / /run/lock rw,nosuid,nodev,noexec,relatime - tmpfs none
>>> rw,size=5120k
>>> 26 19 0:23 / /run/shm rw,nosuid,nodev,relatime - tmpfs none rw
>>> 27 19 0:24 / /run/user rw,nosuid,nodev,noexec,relatime - tmpfs none
>>> rw,size=102400k,mode=755
>>> 28 15 0:25 / /sys/fs/pstore rw,relatime - pstore none rw
>>> 29 20 253:1 / /boot rw,relatime - ext2 /dev/vda1 rw
>>>
>>>
>>> After creating snapshot:
>>>
>>> 15 20 0:15 / /sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
>>> 16 20 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
>>> 17 20 0:5 / /dev rw,relatime - devtmpfs udev
>>> rw,size=241692k,nr_inodes=60423,mode=755
>>> 18 17 0:12 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts
>>> rw,gid=5,mode=620,ptmxmode=000
>>> 19 20 0:16 / /run rw,nosuid,noexec,relatime - tmpfs tmpfs
>>> rw,size=50084k,mode=755
>>> 20 0 0:17 /@ / rw,relatime - btrfs /dev/dm-2 rw,space_cache
>>> <----- WTF?!
>>> 21 15 0:20 / /sys/fs/cgroup rw,relatime - tmpfs none
>>> rw,size=4k,mode=755
>>> 22 15 0:21 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
>>> 23 15 0:6 / /sys/kernel/debug rw,relatime - debugfs none rw
>>> 24 15 0:10 / /sys/kernel/security rw,relatime - securityfs none rw
>>> 25 19 0:22 / /run/lock rw,nosuid,nodev,noexec,relatime - tmpfs none
>>> rw,size=5120k
>>> 26 19 0:23 / /run/shm rw,nosuid,nodev,relatime - tmpfs none rw
>>> 27 19 0:24 / /run/user rw,nosuid,nodev,noexec,relatime - tmpfs none
>>> rw,size=102400k,mode=755
>>> 28 15 0:25 / /sys/fs/pstore rw,relatime - pstore none rw
>>> 29 20 253:1 / /boot rw,relatime - ext2 /dev/vda1 rw
>>>
>>>
>>> So it's consistent with what /proc/mounts reports.
>>>
>>>
>>>>
>>>> And more important question: it is only the value
>>>> returned by /proc/mount wrongly or also the filesystem
>>>> content is affected ?
>>>>
>>>
>>> I quote my bug report on this:
>>>
>>> "The information reported in /proc/mounts is certainly bogus, since
>>> still the origin device is being written, the kernel does not actually
>>> mix up the devices for write operations, and such, the phenomenon does
>>> not cause data corruption. (I did an entire distro release upgrade
>>> while the conditions were present, and I centainly would have suffered
>>> severe data corruption otherwise. Fortunately, the origin device had
>>> the new distro, and the snapshot device had the old one, so besides
>>> the mixup in /proc/mounts, no actual damage happened.)"
>>> -- 
>>> To unsubscribe from this list: send the line "unsubscribe
>>> linux-btrfs" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>>
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

Reply via email to