On Thu, Nov 10, 2016 at 05:01:44PM +0100, Holger Hoffstätte wrote:
> On 11/10/16 16:37, Omar Sandoval wrote:
> > On Thu, Nov 10, 2016 at 04:11:35PM +0100, Holger Hoffstätte wrote:
> >> On 11/10/16 00:26, Omar Sandoval wrote:
> >>> From: Omar Sandoval <osan...@fb.com>
> >>>
> >>> My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> >>> errors coming from the qcow2 virtual drive in the host system. The qcow2
> >>> file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> >>> Every once in awhile, pread() or pwrite() would return EEXIST, which
> >>> makes no sense. This turned out to be a bug in btrfs_get_extent().
> >>>
> >>> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> >>> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> >>> two threads race on adding the same extent map to an inode's extent map
> >>> tree. However, if the added em is merged with an adjacent em in the
> >>> extent tree, then we'll end up with an existing extent that is not
> >>> identical to but instead encompasses the extent we tried to add. When we
> >>> call merge_extent_mapping() to find the nonoverlapping part of the new
> >>> em, the arithmetic overflows because there is no such thing. We then end
> >>> up trying to add a bogus em to the em_tree, which results in a EEXIST
> >>> that can bubble all the way up to userspace.
> >>>
> >>> Fix it by extending the identical extent map special case.
> >>>
> >>> Signed-off-by: Omar Sandoval <osan...@fb.com>
> >>> ---
> >>> Applies to 4.9-rc4.
> >>>
> >>> Here [1] is a reproducer for this bug that doesn't involve firing up a
> >>> QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible
> >>> to debug this on my laptop without compiling a custom kernel and
> >>> rebooting just to add printks [3].
> >>>
> >>> 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b
> >>
> >> I can't really make this reproducer fail. It builds and runs fine, but just
> >> exits with no messages (other than the one about drop_caches in dmesg).
> >> It creates the 1MB file and always returns 0. Ideas?
> >>
> >> -h
> >
> > It's a race condition, so it doesn't happen 100% of the time. I imagine
> > it depends on the storage speed, as well. On my laptop, which is
> > dm-crypt on top of an SSD, it works about 50% of the time. Could you
> > just try running it 100 times or something and see if it fails?
>
> $for i ($(seq 1 1000)) ./pread_eexist_repro /mnt/test/$i || echo "fail"
>
> ..couple of thousand runs without problem, only lots of fallocating and
> cache dropping.
>
> Oh well, I tried. :)
>
> -h

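
For anyone following the thread, here is a rough sketch of the overflow described in the quoted commit message. It is not the kernel code: trim_to_gap() and encompasses() are hypothetical names, the logic is a simplified stand-in for what merge_extent_mapping() does, and Python integers are masked to 64 bits to imitate u64 arithmetic. It assumes the existing extent starts at or before the new one and that no extent follows it.

```python
U64_MASK = (1 << 64) - 1  # imitate the kernel's unsigned 64-bit arithmetic


def trim_to_gap(em_start, em_len, existing_start, existing_len):
    """Hypothetical, simplified trimming step: clip the new extent
    [em_start, em_start + em_len) so it begins where the existing
    extent ends. Returns (start, len) of the trimmed extent."""
    em_end = em_start + em_len
    existing_end = existing_start + existing_len
    start = max(existing_end, em_start)
    end = em_end
    # If the existing extent already covers the new one, end < start and
    # the subtraction wraps around, just as it would for a u64.
    return start, (end - start) & U64_MASK


def encompasses(existing_start, existing_len, em_start, em_len):
    """Assumed shape of a guard in the spirit of the fix (not the actual
    patch): is the new extent already fully covered by the existing one?"""
    return (existing_start <= em_start
            and existing_start + existing_len >= em_start + em_len)
```

With an existing merged extent covering [0, 16384) and a new 4096-byte extent at offset 8192, trim_to_gap(8192, 4096, 0, 16384) yields a length of 2**64 - 4096, which is the kind of bogus em that then fails insertion with EEXIST. Checking encompasses() first and reusing the existing extent avoids the subtraction entirely.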
Just out of curiosity, what kind of disk were you trying this on? I've only
been able to trigger it on my laptop and a VM running on my laptop.

-- 
Omar