We got around to doing more testing and followed the suggestions. The
filesystem corruption could be reproduced. The run bypassed
multipathing and used a single path directly.
OUTPUT reproduce.sh (tail):
+ fstrim -v /mnt/san
/mnt/san: 120 TiB (131939249291264 bytes) trimmed
+ umount -vvv /mnt/san
umount: /mnt/san unmounted
+ sync
+ mount -vvv /dev/disk/by-path/pci-0000:02:00.0-fc-0x9999999999999999-lun-0 /mnt/san
mount: mount /dev/sde on /mnt/san failed: Structure needs cleaning
+ fail '3rd mount failed'
+ echo 'ERROR: 3rd mount failed'
ERROR: 3rd mount failed
+ exit 11
OUTPUT dmesg:
[13568.741492] XFS (sde): metadata I/O error: block 0x39fffffc70 ("xfs_trans_read_buf_map") error 74 numblks 8
[13568.891639] XFS (sde): Metadata CRC error detected at xfs_agi_read_verify+0x5e/0x110 [xfs], xfs_agi block 0x3a7ffffc68
[13569.033899] XFS (sde): Unmount and run xfs_repair
[13569.103826] XFS (sde): First 64 bytes of corrupted metadata buffer:
[13569.174222] ffff9107a8836000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[13569.314629] ffff9107a8836010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[13569.454924] ffff9107a8836020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[13569.595204] ffff9107a8836030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
----
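To see where the failing blocks sit on the device, the block numbers from the dmesg output can be converted to byte offsets. This is only a sketch: it assumes XFS reports these block numbers in 512-byte units, and `sector_to_bytes` is a hypothetical helper name, not part of any tool.

```shell
# Convert a block number (assumed to be in 512-byte units) to a byte offset.
# Hex input such as 0x39fffffc70 is accepted by shell arithmetic.
sector_to_bytes() {
    echo $(( $1 * 512 ))
}

sector_to_bytes 0x39fffffc70   # block from the "metadata I/O error" line
sector_to_bytes 0x3a7ffffc68   # block from the "Metadata CRC error" line
```

Under that assumption, both offsets lie below the 131939249291264 bytes that fstrim reported as trimmed.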
Since there is no fstab entry, mount options are determined by whatever
mechanism decides on defaults. In this case, mount reported the
following:
xfs (rw,relatime,attr2,inode64,noquota)
The kernel for this run:
$ uname -a
Linux xxxxxxx 4.11.0-041100rc8-generic #201704232131 SMP Mon Apr 24 01:32:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
** Changed in: linux (Ubuntu)
Status: Incomplete => Confirmed
--
https://bugs.launchpad.net/bugs/1686687
Title:
fstrim destroying XFS on SAN
Status in linux package in Ubuntu:
Confirmed
Bug description:
We observed severe data loss/filesystem corruption when executing
fstrim on a filesystem hosted on an Eternus DX600 S3 system.
Multipathing is set up via a Fibre Channel fabric, but the issue could
also be reproduced with multipathing disabled, using one of the block
devices directly.
It could not be reproduced when creating a multipathing device via
dmsetup with four paths pointing to four loop devices mapping the same
file.
The observed behavior is that XFS cannot read vital filesystem
metadata as the underlying storage device returns blocks of 0x00. The
blocks are discarded via UNMAP commands and since thin provisioning is
used, the SAN deallocates them and returns 0x00 on subsequent reads.
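A quick way to check whether a region of a device (or an image file) reads back as 0x00 is sketched below. `is_zeroed` is a hypothetical helper, not something from the attached script; it assumes a `cmp` that supports `-n` (GNU and BSD do), and the device path and offset in the usage line are placeholders.

```shell
# is_zeroed SOURCE OFFSET LENGTH
# Succeeds if LENGTH bytes starting at byte OFFSET in SOURCE are all 0x00.
# A short read (fewer than LENGTH bytes available) counts as failure.
is_zeroed() {
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null |
        cmp -s -n "$3" - /dev/zero
}

# Example (placeholder device and offset, run as root):
#   is_zeroed /dev/sdX 127543348355072 64 && echo "region reads back as zeros"
```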
Invoking find yields error messages like "find: ./dir_16: Structure
needs cleaning". In other tests, where more data had been written,
files were accessible but their checksums no longer matched.
As a consequence, the XFS filesystem is in an unusable state and has to
be recreated, which amounts to complete data loss. Repairing the
filesystem did not seem worth attempting, as backups were available and
trust in its integrity had already been compromised.
The problem was discovered after installing a new storage server with
Ubuntu 16.04, intended to replace the current machine running 14.04.
Every weekend, the test volumes were corrupted. Investigation pointed
towards Sunday, 06:47, which is the time `cron.weekly` is run. The job
file `/etc/cron.weekly/fstrim` seemed most likely, so `fstrim -a` was
run manually after `mkfs.xfs` and the filesystem became damaged. The
damage only became apparent after a `umount`/`mount` cycle, when all
buffers were flushed and data was re-read from the device.
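The sequence described above can be sketched roughly as follows. This is not the attached reproduce.sh, only a minimal outline of the same steps; `DEV`, `MNT`, `DRY_RUN`, `run` and `reproduce` are placeholder names, and `mkfs.xfs -f` destroys whatever is on the device. By default the commands are only printed.

```shell
# Placeholders, not values from this report. With DRY_RUN=1 (the default)
# commands are printed with a leading "+ " and never executed.
DEV="${DEV:-/dev/sdX}"
MNT="${MNT:-/mnt/san}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    run mkfs.xfs -f "$DEV" &&   # fresh filesystem (destructive!)
    run mount "$DEV" "$MNT" &&
    run fstrim -v "$MNT" &&     # discard unused blocks
    run umount "$MNT" &&        # drop cached metadata
    run mount "$DEV" "$MNT"     # on the affected system this mount fails
}
```

Calling `reproduce` prints the five steps; setting `DRY_RUN=0` on a scratch device would actually execute them.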
We could now use config management to install a cron job that (every
minute!) checks for /sbin/fstrim and renames it if present. This
would be extremely unsatisfactory, as it is a brittle workaround. So
for now, we are locked to Ubuntu 14.04. Since util-linux is one of the
most central packages, there is no way to avoid having fstrim or the
cron job on an Ubuntu system.
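For illustration only, the rename workaround could look something like the sketch below. `disable_binary` and the `.disabled` suffix are hypothetical choices, and the approach remains as brittle as described: a package upgrade can reinstall the binary between two runs.

```shell
# disable_binary PATH
# Rename PATH to PATH.disabled if PATH exists; do nothing otherwise.
disable_binary() {
    target="$1"
    if [ -e "$target" ]; then
        mv -f -- "$target" "$target.disabled"
    fi
}

# Intended to be called from a job that cron runs every minute, e.g.:
#   disable_binary /sbin/fstrim
```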
I have attached a script used to reproduce the bug reliably on our
system and its log output, as well as excerpts from syslog and md5sum.