On Thu, 13 Aug 2015 09:55:10 +0000 Hugo Mills <h...@carfax.org.uk> wrote:
> On Thu, Aug 13, 2015 at 01:33:22PM +1000, David Seikel wrote:
> > I don't actually think that this is a BTRFS problem, but it's
> > showing symptoms within BTRFS, and I have no other clues, so maybe
> > the BTRFS experts can help me figure out what is actually going
> > wrong.
> >
> > I'm a sysadmin working for a company that does scientific
> > modelling.  They have many TBs of data.  We use two servers running
> > Ubuntu 14.04 LTS to back up all of this data.  One of them includes
> > 16 spinning rust disks hooked to a RAID controller running in JBOD
> > mode (in other words, as far as Linux is concerned, they are just
> > 16 ordinary disks).  They are /dev/sdc to /dev/sdr, all being used
> > as a single BTRFS file system.
> >
> > I have been having no end of trouble with this system recently.
> > Keep in mind that due to the huge amount of data we deal with,
> > doing anything takes a long time, so "recently" means "in the last
> > several months".
> >
> > My latest attempt to beat some sense into this server was to
> > upgrade it to the latest officially backported kernel from Ubuntu,
> > and to compile my own copy of btrfs-progs from source (the latest
> > release from github).  Then I recreated the 16-disk BTRFS file
> > system and started the backup software running again from scratch.
> > The next day, /dev/sdc had vanished, replaced by a phantom
> > /dev/sds.  There is no such disk as /dev/sds, yet /dev/sds is now
> > included in the BTRFS file system in place of /dev/sdc.  In /dev,
> > sdc does indeed vanish and sds does indeed appear.  This was
> > happening before.  /dev/sds then starts to fill up with errors,
> > since no such disk actually exists.
>
> Sounds like the kind of behaviour when the disk has vanished from
> the system for long enough to drop out and be recreated by the
> driver.  The renaming may (possibly) be down to a poor error-handling
> path in btrfs -- we see this happening on USB sometimes, where the
> original device node is still hung on to by the FS on a hardware
> error, and so when the device comes back it's given a different name.

So that part may need some fixing in BTRFS.

> > I don't know what is actually causing the problem.  The disks are
> > in a hot swap backplane, and if I actually pulled sdc out, then it
> > would still be listed as part of the BTRFS file system, wouldn't
> > it?
>
> With btrfs fi show, no, you'd get ** some devices missing ** in the
> output.

Which is different from what I'm getting.

> > If I were then to plug some new disk into the same spot, it would
> > not be recognised as part of the file system?
>
> Correct... Unless the device had a superblock with the same UUID in
> it (like, say, the new device is just the old one reappearing
> again).  In that case, udev would trigger a btrfs dev scan, and the
> "new" device would rejoin the FS -- probably a little out of date,
> but that would be caught by checksums and be fixed if you have
> redundancy in the storage.

But btrfs is thinking it's a different device, hence all the errors
as it gets confused.

> > So assuming that the RAID controller is getting confused and
> > thinking that sdc has been pulled, then replaced by sds, it should
> > not be showing up as part of the BTRFS file system?  Or maybe
> > there's a signature on sdc that BTRFS notices, which makes it part
> > of the file system even though BTRFS is now confused about its
> > location?
>
> See above.
>
> > After a reboot, sdc returns and sds is gone again.
>
> Expected.
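For anyone else chasing the same rename confusion, here is roughly
what I use to cross-check which physical disk is behind which kernel
name, and to rescan by hand (the same thing the udev rule triggers):

    # map kernel names to drive serial numbers, to see whether sds
    # is really a new disk or just sdc coming back on another node
    ls -l /dev/disk/by-id/ | grep -E 'sd[cs]'

    # re-register btrfs superblocks after a device reappears
    sudo btrfs device scan

    # what btrfs currently believes the device list is
    sudo btrfs fi show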
> > The RAID controller has recently been replaced, but there were
> > similar problems with the old one as well.  A better model of RAID
> > controller was chosen this time.
> >
> > I've also not been able to complete a scrub on this system
> > recently.  The really odd thing is that I get messages saying the
> > scrub has aborted, yet the scrub continues, then much later (days
> > later) the scrub causes a kernel panic.  The "aborted" happens at
> > some random point, but usually in the early part of the scrub.
> > Mind you, if BTRFS is completely confused due to a problem
> > elsewhere, then maybe this can be excused.
>
> I think that means that it's aborting on one device but continuing
> on all the others.

Ah, it would be useful for scrub to say so, and to point out which
device(s) got aborted.

> > The other backup server is almost identical, though it has fewer
> > disks in the array.  It doesn't have any issues with its BTRFS
> > file system.
> >
> > Can anyone help shed some light on this please?  Hopefully with
> > some "quick" things to try, given that my definition of "recently"
> > above means most things take days, weeks, or even months for me
> > to try.
> >
> > I have attached the usual debugging info requested.  This is after
> > the bogus sds replaces sdc.
>
> The first thing would be to check your system logs for signs of
> hardware problems (ATA errors).  This sounds a lot like you've got a
> dodgy disk that needs to be replaced.

Just gotta figure out which one; I thought I had already replaced the
dodgy one.  Might be more than one.  Sigh.  I'm guessing /dev/sdc.
The commands I plan to use for the hunt are in the PS below.

-- 
A big old stinking pile of genius that no one wants coz there are too
many silver coated monkeys in the world.
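PS: For the record, the hunt.  The -d flag should print per-device
scrub statistics on a reasonably recent btrfs-progs, and /mnt/backups
is just a stand-in for our real mount point:

    # per-device scrub statistics, to see which device aborted
    sudo btrfs scrub status -d /mnt/backups

    # ATA errors in the kernel log usually name the failing port
    dmesg | grep -i 'ata[0-9]'

    # SMART health of the prime suspect (needs smartmontools)
    sudo smartctl -a /dev/sdc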