Hi,

On Mon, Aug 01, 2016 at 11:25:28AM -0600, Will Aoki wrote:
> Package: src:linux
> Version: 3.16.7-ckt25-2+deb8u3
> Severity: important
>
> I hit a nasty filesystem corruption bug while restoring some backups. I'm
> able to reliably reproduce this on every system where I tried to restore
> to XFS and on a brand-new VM created for testing (which I can share as an
> OVF, although it's pretty big).
>
> Everything's running on hardware with ECC RAM, so memory errors are
> unlikely. I've reproduced it on VMs spread across three different storage
> arrays from two different vendors. Everything has at least two virtual
> CPUs.
>
> All systems tested use XFS on LVM.
>
> In my test VM, the only packages not from Debian stable are custom stow
> and tar packages to fix bugs in the versions in the stable release, and a
> backport of burp because the version in Debian was very old. These are
> all userland tools and none should be able to cause filesystem
> corruption.
>
> I'm using xfsprogs from Debian jessie. On one affected system, I tried
> version 4.3.0+nmu1 and saw no difference in what xfs_repair found.
>
> At this point, my test case uses the burp backup software to create the
> I/O activity which triggers this bug. I have not been able to make tar
> trigger this problem.
>
> Steps to reproduce:
>
> 1: Create a new XFS filesystem & mount it on /srv/src
>
> 2: Create some directories in /srv/src & set ACLs (including default
>    ACLs) on them
>
> 3: Generate a deep tree of files in each of the directories from step #2.
>    For testing, I used a script which created random files &
>    subdirectories. Total bulk was about 2.5 gigabytes.
>
> 4: Take a backup of /srv/src with burp:
>
>    # burp -a b
>
> 5: Unmount /srv/src
>
> 6: Create a new XFS filesystem & mount it on /srv/src
>
> 7: Run the restore to /srv/src:
>
>    # burp -a r -r ^/srv/src
>
>    Do not suspend the restore process: the bug appears to require
>    sustained I/O to trigger.
>    In trials where I suspended it multiple times during a restore,
>    corruption did not surface.
>
> Expected outcome (observed when restoring to e.g. ext4):
>
> 1: Can create files (permissions notwithstanding) in every directory
>    under /srv/src
>
> 2: Default ACL on every directory is the same as the backup utility wrote
>
> 3: If the filesystem is unmounted and xfs_repair is run on it, no errors
>    will be found
>
> Actual outcome (observed when restoring to XFS):
>
> 1: Some files & directories cannot be written. The easiest way to find
>    problem directories is:
>
>    # find . -type d -exec touch {}/asdf \;
>    touch: cannot touch ‘./aaaaa/-BIz/asdf’: Cannot allocate memory
>    touch: cannot touch ‘./aaaaa/-BIz/Zp.NyvX0guz./asdf’: Cannot allocate memory
>    touch: cannot touch ‘./aaaaa/-BIz/Zp.NyvX0guz./TWDU/asdf’: Cannot allocate memory
>    [etc]
>
>    Giving VMs more RAM has no effect on this. Clearing the ACL on the
>    directory has no effect.
>
>    Affected directories are not always the same between different runs.
>
> 2: The default ACL has not been restored to problem directories.
>    Directories which I can write to have had the default ACL restored.
>
> 3: If the filesystem is unmounted and xfs_repair is run on it, many
>    errors are reported:
>
>    # xfs_repair -n /dev/mapper/xfsbugtest--vg-dst 2>&1 | head -90
>    Phase 1 - find and verify superblock...
>    Phase 2 - using internal log
>            - scan filesystem freespace and inode maps...
>            - found root inode chunk
>    Phase 3 - for each AG...
>            - scan (but don't clear) agi unlinked lists...
>            - process known inodes and perform inode discovery...
>            - agno = 0
>    Too many ACL entries, count -2010719080
>    entry contains illegal value in attribute named SGI_ACL_FILE or SGI_ACL_DEFAULT
>    bad security value for attribute entry 1 in attr block 0, inode 133
>    problem with attribute contents in inode 133
>    would clear attr fork
>    bad nblocks 2 for inode 133, would reset to 1
>    bad anextents 1 for inode 133, would reset to 0
>    Too many ACL entries, count -2010719080
>    entry contains illegal value in attribute named SGI_ACL_FILE or SGI_ACL_DEFAULT
>    bad security value for attribute entry 1 in attr block 0, inode 134
>    problem with attribute contents in inode 134
>    would clear attr fork
>    bad nblocks 2 for inode 134, would reset to 1
>    bad anextents 1 for inode 134, would reset to 0
>    [...]
>    bad nblocks 1 for inode 52741928, would reset to 0
>    bad anextents 1 for inode 52741928, would reset to 0
>            - process newly discovered inodes...
>    Phase 4 - check for duplicate blocks...
>            - setting up duplicate extent list...
>            - check for inodes claiming duplicate blocks...
>            - agno = 0
>            - agno = 1
>            - agno = 2
>            - agno = 3
>    No modify flag set, skipping phase 5
>    Phase 6 - check inode connectivity...
>            - traversing filesystem ...
>            - traversal finished ...
>            - moving disconnected inodes to lost+found ...
>    Phase 7 - verify link counts...
>    No modify flag set, skipping filesystem flush and exiting.
>
>    On my production VMs, running xfs_repair without '-n' typically left
>    many files (the highest was 148k) in /lost+found and left many
>    directories without ACLs.
> xfs_info output on a corrupted filesystem on the test VM:
>
> meta-data=/dev/mapper/xfsbugtest--vg-dst isize=256    agcount=4, agsize=655360 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=0        finobt=0
> data     =                       bsize=4096   blocks=2621440, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
> log      =internal               bsize=4096   blocks=2560, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> An xfs_metadump of the filesystem is at
> ftp://ftp.umnh.utah.edu/general-temporary/xfs/corrupted.metadump
>
> Giant (5.9 GB uncompressed) trace-cmd output is at
> ftp://ftp.umnh.utah.edu/general-temporary/xfs/trace_report.xz
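For readers who want to retry this, the quoted reproduction steps condense into roughly the following shell sketch. The device path, the directory names, the ACL user "backup", and a configured burp client are all assumptions rather than details from the report, and the dd loop merely stands in for the reporter's random-tree generator script.

```shell
#!/bin/sh
# Sketch of the reproduction steps quoted above. The device path, the
# ACL user "backup", and a working burp client config are assumptions.
set -eu

SRC=/srv/src

usage() { echo "usage: $0 /dev/mapper/<vg>-<lv>"; }

reproduce() {
    dev="$1"

    # Step 1: fresh XFS filesystem mounted on /srv/src
    mkfs.xfs -f "$dev"
    mount "$dev" "$SRC"

    # Step 2: directories carrying both access ACLs and default ACLs
    for d in aaaaa bbbbb ccccc; do
        mkdir -p "$SRC/$d"
        setfacl -m u:backup:rwx -m d:u:backup:rwx "$SRC/$d"
    done

    # Step 3: deep tree of files, ~2.5 GB total (the report used a
    # script generating random names; urandom blobs stand in here)
    for d in aaaaa bbbbb ccccc; do
        mkdir -p "$SRC/$d/deep/tree"
        dd if=/dev/urandom of="$SRC/$d/deep/tree/blob" bs=1M count=850
    done

    # Step 4: back up with burp
    burp -a b

    # Steps 5-6: unmount and recreate the filesystem
    umount "$SRC"
    mkfs.xfs -f "$dev"
    mount "$dev" "$SRC"

    # Step 7: restore; do not suspend the process, as the bug seems
    # to need sustained I/O to trigger
    burp -a r -r "^$SRC"
}

if [ -n "${1:-}" ]; then
    reproduce "$1"
else
    usage
fi
```

Run as root against a scratch logical volume only; every invocation destroys the filesystem on the given device twice.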
Is this issue reproducible with current supported Debian versions? If not,
we might want to close this bug, as Jessie, and thus v3.16.y, is EOL'ed.

Regards,
Salvatore
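For anyone re-testing on a current kernel, the two userland-visible symptoms described in the report (ENOMEM on file creation, and directories missing their default ACLs) can be probed with a short script. The tree to scan is an argument; the only non-coreutils dependency is getfacl from the acl package.

```shell
#!/bin/sh
# Probe a mounted tree for the symptoms described in the report:
# ENOMEM when creating files, and directories whose default ACLs
# were not restored.
set -eu

probe_tree() {
    dir="$1"

    # Symptom 1: touching a file in an affected directory fails with
    # "Cannot allocate memory"
    find "$dir" -type d \
        -exec sh -c 'touch "$1/.probe" && rm -f "$1/.probe"' _ {} \; 2>&1 |
        grep -i 'cannot allocate memory' ||
        echo "no ENOMEM failures under $dir"

    # Symptom 2: print each directory's default ACL so missing ones
    # stand out (affected directories lost theirs after the restore)
    find "$dir" -type d -exec getfacl -p -d {} + 2>/dev/null || true
}

# Default to a throwaway temp dir so running with no argument is safe
probe_tree "${1:-$(mktemp -d)}"
```

On a healthy filesystem this prints "no ENOMEM failures" followed by one default-ACL listing per directory; on a tree corrupted as described above, the touch errors reappear.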