Hi all,
Hoping you all can help. I have a strange problem; I think I know
what's going on, but I could use some verification. I set up a raid1
btrfs filesystem on an Ubuntu 16.04 system, and here's what it looks like:
btrfs fi show
Label: none uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
Total devices 10 FS bytes used 3.42TiB
devid 1 size 1.82TiB used 1.18TiB path /dev/sdd
devid 2 size 698.64GiB used 47.00GiB path /dev/sdk
devid 3 size 931.51GiB used 280.03GiB path /dev/sdm
devid 4 size 931.51GiB used 280.00GiB path /dev/sdl
devid 5 size 1.82TiB used 1.17TiB path /dev/sdi
devid 6 size 1.82TiB used 823.03GiB path /dev/sdj
devid 7 size 698.64GiB used 47.00GiB path /dev/sdg
devid 8 size 1.82TiB used 1.18TiB path /dev/sda
devid 9 size 1.82TiB used 1.18TiB path /dev/sdb
devid 10 size 1.36TiB used 745.03GiB path /dev/sdh
I added a couple of disks and then ran a balance operation, which took
about 3 days to finish. When it did finish, I tried a scrub and got the
message below. (The commands I used, more or less, are listed just after it.)
scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 01:16:35
total bytes scrubbed: 926.45GiB with 18849935 errors
error details: read=18849935
corrected errors: 5860, uncorrectable errors: 18844075, unverified errors: 0
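(For reference, the add/balance/scrub sequence was the standard one,
roughly the following, with /mnt/pool standing in for my actual mount
point and sdX/sdY for the new disks:

    btrfs device add /dev/sdX /dev/sdY /mnt/pool
    btrfs balance start /mnt/pool
    btrfs scrub start /mnt/pool
    btrfs scrub status /mnt/pool
)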
So that seems bad. I took a look at the per-device error counters, and a
few of the disks have errors:
...
[/dev/sdi].generation_errs 0
[/dev/sdj].write_io_errs 289436740
[/dev/sdj].read_io_errs 289492820
[/dev/sdj].flush_io_errs 12411
[/dev/sdj].corruption_errs 0
[/dev/sdj].generation_errs 0
[/dev/sdg].write_io_errs 0
...
[/dev/sda].generation_errs 0
[/dev/sdb].write_io_errs 3490143
[/dev/sdb].read_io_errs 111
[/dev/sdb].flush_io_errs 268
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
[/dev/sdh].write_io_errs 5839
[/dev/sdh].read_io_errs 2188
[/dev/sdh].flush_io_errs 11
[/dev/sdh].corruption_errs 1
[/dev/sdh].generation_errs 16373
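(That output is from 'btrfs device stats' on the mounted filesystem. My
understanding is that those counters are cumulative, so once the real
cause is fixed, something like the following should zero them out;
/mnt/pool is again a stand-in for my mount point:

    btrfs device stats -z /mnt/pool
)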
So I checked the SMART data for those disks, and they seem perfect: no
reallocated sectors, no problems. But one thing I did notice is that
they are all WD Green drives. So I'm guessing that if they spin down on
their idle timer and come back under a new /dev/sd* letter, that could
lead to data corruption. I used idle3ctl to turn off the idle/park
timer on all the Green drives in the system, but I'm still having
trouble getting the filesystem working without errors. I tried a
'btrfs check --repair' on it, and it finds a lot of verification
errors, but it doesn't look like anything is actually getting fixed. I
do have all the data backed up on another system, so I can recreate
the filesystem if I need to.
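In case the exact commands matter, this is roughly what I ran on each
Green drive (sdX is a placeholder), based on the idle3-tools docs:

    idle3ctl -g /dev/sdX    # show the current idle3 (head-park) timer
    idle3ctl -d /dev/sdX    # disable it; needs a power cycle to take effect

and the repair attempt was the usual one, run against one device of the
unmounted filesystem:

    btrfs check --repair /dev/sdX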
But here's what I want to know:
1. Am I right that the WD Green drives are the issue, i.e. if they drop
out and come back under a different /dev/sd* name during disk
operations, can that corrupt data?
2. If that is the case:
a.) Is there any way I can stop the /dev/sd* device names from
changing? Or can I set up the filesystem using UUIDs or something more
solid? I googled it, but found conflicting info. (Rough sketch of what
I mean just after this list.)
b.) Or, is there something else changing my drive device names? I have
most of the drives on an LSI SAS 9201-16i card; is there something I
need to do to make the names stay fixed?
c.) Or, is there a script or something I can use to figure out whether
the disks have changed device names? (Also covered in the sketch below.)
d.) Or, if I wipe everything and rebuild, will the disks behave now
that the idle3ctl fix is in place?
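Here's roughly what I had in mind for 2a/2c: mount by the filesystem
UUID from 'btrfs fi show' instead of a /dev/sd* name (as I understand
it, the UUID identifies the whole multi-device filesystem, so the sd*
letters shouldn't matter for mounting), and use the persistent
/dev/disk/by-id names to see what each sd* letter currently points at.
The mount point is a placeholder, not my real setup:

    # /etc/fstab entry using the filesystem UUID instead of /dev/sdX
    UUID=597ee185-36ac-4b68-8961-d4adc13f95d4  /mnt/pool  btrfs  defaults  0  0

    # map the stable by-id names to the current sdX letters,
    # so the mapping can be saved and diffed across reboots
    for d in /dev/disk/by-id/*; do
        echo "$d -> $(readlink -f "$d")"
    done > disk-map-$(date +%F).txt

Does that sound like the right direction?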
Regardless of whether or not it's a WD Green drive issue, should I just
wipefs all the disks and rebuild it? Is there any way to recover this?
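If wiping and rebuilding really is the sanest path, I assume the
sequence would be something like this (device names are placeholders;
I'd run the wipefs on every disk in the old filesystem) before
restoring from the backup:

    umount /mnt/pool
    wipefs -a /dev/sdX        # repeat for each of the ten disks
    mkfs.btrfs -m raid1 -d raid1 /dev/sdX /dev/sdY /dev/sdZ
    mount UUID=<new-fs-uuid> /mnt/pool

But I'd rather understand the device-name issue first so it doesn't
just happen again.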
Thanks for any help!
------- Corey