On Thu, 13 Aug 2015 09:55:10 +0000 Hugo Mills <h...@carfax.org.uk>
wrote:

> On Thu, Aug 13, 2015 at 01:33:22PM +1000, David Seikel wrote:
> > I don't actually think that this is a BTRFS problem, but it's
> > showing symptoms within BTRFS, and I have no other clues, so maybe
> > the BTRFS experts can help me figure out what is actually going
> > wrong.
> > 
> > I'm a sysadmin working for a company that does scientific modelling.
> > They have many TBs of data.  We use two servers running Ubuntu
> > 14.04 LTS to backup all of this data.  One of them includes 16
> > spinning rust disks hooked to a RAID controller running in JBOD
> > mode (in other words, as far as Linux is concerned, they are just
> > 16 ordinary disks).  They are /dev/sdc to /dev/sdr, all being used
> > as a single BTRFS file system.
> > 
> > I have been having no end of trouble with this system recently.
> > Keep in mind that due to the huge amount of data we deal with, doing
> > anything takes a long time.  So "recently" means "in the last
> > several months".
> > 
> > My latest attempt to beat some sense into this server was to
> > upgrade it to the latest officially backported kernel from Ubuntu,
> > and compile my own copy of btrfs-progs from source code (latest
> > release from GitHub).  Then I recreated the 16-disk BTRFS file
> > system, and started the backup software running again, from
> > scratch.  The next day, /dev/sdc has vanished, to be replaced by a
> > phantom /dev/sds.  There's no such disk as /dev/sds.  /dev/sds is
> > now included in the BTRFS file system, replacing /dev/sdc.  In /dev,
> > sdc does indeed vanish, and sds does indeed appear.  This was
> > happening before.  /dev/sds then starts to fill up with errors,
> > since no such disk actually exists.
> 
>    Sounds like the kind of behaviour you get when the disk has
> vanished from the system for long enough to drop out and be recreated
> by the driver. The renaming may (possibly) be down to a poor
> error-handling path in btrfs -- we see this happening on USB
> sometimes, where the FS hangs on to the original device node after a
> hardware error, and so when the device comes back it's given a
> different name.

So that part may need some fixing in BTRFS.
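Next time it happens I'll grab the kernel log before rebooting; if the
disk really is dropping out and coming back, I'd expect the SCSI or ATA
layer to complain just before sds appears.  Something like this (just a
sketch of what I plan to run):

    # look for resets/link errors around the time sdc became sds
    dmesg -T | grep -Ei 'sdc|sds|ata[0-9]+|reset|link'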

> > I don't know what is actually causing the problem.  The disks are
> > in a hot swap backplane, and if I actually pulled sdc out, then it
> > would still be listed as part of the BTRFS file system, wouldn't it?
> 
>    With btrfs fi show, no, you'd get ** some devices missing ** in the
> output.

Which is different from what I'm getting.
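(For the record, that's from plain:

    btrfs fi show

which in my case lists /dev/sds as an ordinary member of the FS, with
no "devices missing" line at all -- see the attached debugging info.)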

> >  If I
> > then were to plug some new disk into the same spot, it would not be
> > recognised as part of the file system?
> 
>    Correct... Unless the device had a superblock with the same UUID in
> it (like, say, the new device is just the old one reappearing
> again). In that case, udev would trigger a btrfs dev scan, and the
> "new" device would rejoin the FS -- probably a little out of date, but
> that would be caught by checksums and be fixed if you have redundancy
> in the storage.

But btrfs thinks it's a different device, hence all the errors as it
gets confused.
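If it really was the old disk reappearing under a new name, then from
what you describe something like this should reconcile it (a sketch
only, assuming the FS has redundancy to repair from, and with /backup
standing in for my real mount point):

    # re-scan block devices for btrfs superblocks, as udev would
    btrfs device scan
    # let checksums find and rewrite any stale blocks
    btrfs scrub start /backup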

> >  So assuming that the RAID
> > controller is getting confused and thinking that sdc has been
> > pulled, then replaced by sds, it should not be showing up as part
> > of the BTRFS file system?  Or maybe there's a signature on sdc that
> > BTRFS notices, marking it as part of the file system, even though
> > BTRFS is now confused about its location?
> 
>    See above.
> 
> > After a reboot, sdc returns and sds is gone again.
> 
>    Expected.
> 
> > The RAID controller has recently been replaced, but there were
> > similar problems with the old one as well.  A better model of RAID
> > controller was chosen this time.
> > 
> > I've also not been able to complete a scrub on this system recently.
> > The really odd thing is that I get messages that the scrub has
> > aborted, yet the scrub continues, then much later (days later) the
> > scrub causes a kernel panic.  The "aborted" happens at some random
> > time into the scrub, but usually in the early part of it.
> > Mind you, if BTRFS is completely confused due to a problem
> > elsewhere, then maybe this can be excused.
> 
>    I think that means that it's aborting on one device but continuing
> on all the others.

Ah, it would be useful for scrub to say so, and point out which
device(s) got aborted.
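Actually, for my own notes: it looks like scrub can be asked for
per-device statistics, which should at least show which device bailed
out (again, /backup is just a stand-in for my mount point):

    # -d prints separate stats for each device in the FS
    btrfs scrub status -d /backup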

> > The other backup server is almost identical, though it has fewer
> > disks in the array.  It doesn't have any issues with the BTRFS file
> > system.
> > 
> > Can anyone help shed some light on this please?  Hopefully some
> > "quick" things to try, given my definition of "recently" above means
> > that most things take days or weeks, or even months for me to try.
> > 
> > I have attached the usual debugging info requested.  This is after
> > the bogus sds replaces sdc.
> > 
> 
>    The first thing would be to check your system logs for signs of
> hardware problems (ATA errors). This sounds a lot like you've got a
> dodgy disk that needs to be replaced.

Just gotta figure out which one; I thought I'd already replaced the
dodgy one.  Might be more than one.  Sigh.

I'm guessing /dev/sdc.
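My plan for narrowing it down, assuming smartmontools is installed (the
smartctl -d option may need adjusting for the RAID controller's
passthrough, and /backup again stands in for the mount point):

    # per-device btrfs error counters (read/write/flush/corruption)
    btrfs device stats /backup
    # SMART health on the suspect disk
    smartctl -a /dev/sdc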

-- 
A big old stinking pile of genius that no one wants
coz there are too many silver coated monkeys in the world.
