On Fri, Nov 14, 2014 at 05:00:26PM -0500, Josef Bacik wrote:
> On 11/14/2014 04:51 PM, Hugo Mills wrote:
> > Chris, Josef, anyone else who's interested,
> >
> > On IRC, I've been seeing reports of two persistent unsolved
> > problems. Neither is showing up very often, but both have turned up
> > often enough to indicate that there's something specific going on
> > worthy of investigation.
> >
> > One of them is definitely a btrfs problem. The other may be btrfs,
> > or something in the block layer, or just broken hardware; it's hard to
> > tell from where I sit.
> >
> > Problem 1: ENOSPC on balance
> >
> > This has been going on since about March this year. I can
> > reasonably certainly recall 8-10 cases, possibly a number more. When
> > running a balance, the operation fails with ENOSPC when there's plenty
> > of space remaining unallocated. This happens on full balance, filtered
> > balance, and device delete. Other than the ENOSPC on balance, the FS
> > seems to work OK. It seems to be more prevalent on filesystems
> > converted from ext*. The first few or more reports of this didn't make
> > it to bugzilla, but a few of them since then have gone in.
> >
> > Problem 2: Unexplained zeroes
> >
> > Failure to mount. Transid failure, "expected xyz, have 0". Chris
> > looked at an early one of these (for Ke, on IRC) back in September
> > (the 27th -- sadly, the public IRC logs aren't there for it, but I can
> > supply a copy of the private log). He rapidly came to the conclusion
> > that it was something bad going on with TRIM, replacing some blocks
> > with zeroes. Since then, I've seen a bunch of these coming past on
> > IRC. It seems to be a 3.17 thing. I can successfully predict the
> > presence of an SSD and -odiscard from the "have 0". I've successfully
> > persuaded several people to put this into bugzilla and capture
> > btrfs-images. btrfs recover doesn't generally seem to be helpful in
> > recovering data.
> >
> > I think Josef had problem 1 in his sights, but I don't know if
> > additional images or reports are helpful at this point. For problem 2,
> > there's obviously something bad going on, but there's not much else to
> > go on -- and the inability to recover data isn't good.
> >
> > For each of these, what more information should I be trying to
> > collect from any future reporters?
> >
>
> So for #2 I've been looking at that the last two weeks. I'm always
> paranoid we're screwing up one of our data integrity sort of things,
> either not waiting on IO to complete properly or something like
> that. I've built a dm target to be as evil as possible and have been
> running it trying to make bad things happen. I got slightly side
> tracked since my stress test exposed a bug in the tree log stuff and
> csums which I just fixed. Now that I've fixed that I'm going back
> to try and make the "expected blah, have 0" type errors happen.

I've searched the bugzilla archive and found the two reports that I
know of (87061 and 87021); I couldn't see any others. I've requested
more information on both -- nothing obviously in common, except SSD
and (probably) discard. I tried to tag them both with "trim" for easy
finding, but that seems to have been lost somewhere. I'll try that
again when I get home this evening and have access to my password.

> As for the ENOSPC I keep meaning to look into it and I keep getting
> distracted with other more horrible things. Ideally I'd like to
> reproduce it myself, so more info on that front would be good, like
> do all reports use RAID/compression/some other odd set of features?

> Thanks for taking care of this stuff Hugo, #2 is the worst one and
> I'd like to be absolutely sure it's not our bug, once I'm happy we
> aren't I'll look at the balance thing.

OK, good to know you're on both of these.

I think the "easy" solution to reproduce the ENOSPC is to convert
an ext4 filesystem. It doesn't seem to be a unique characteristic, but
it is a frequent correlation. We had another one today, after an FS
conversion -- I've asked them to attach a btrfs-image dump and the
enospc_debug log to the bugzilla report.

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- 2 + 2 = 5, for sufficiently large values of 2. ---