Re: Detailed RAID Status and Errors

Chris Murphy Tue, 25 Feb 2014 22:44:25 -0800

On Feb 25, 2014, at 11:19 PM, Justin Brown <justin.br...@fandingo.org> wrote:


> Chris,
> 
> Thanks for the reply.
> 
>> Total includes metadata.
> 
> It still doesn't seem to add up:
> 
> ~$ btrfs fi df t
> Data, single: total=8.00MiB, used=0.00
> Data, RAID6: total=2.17TiB, used=2.17TiB
> System, single: total=4.00MiB, used=0.00
> System, RAID6: total=9.56MiB, used=192.00KiB
> Metadata, single: total=8.00MiB, used=0.00
> Metadata, RAID6: total=4.03GiB, used=3.07GiB
> 
> Nonetheless, the scrub finished shortly after I started typing this
> response. Total was ~2.7TB if I remember correctly.

What do you get for btrfs fi show?

> 
>> All of this looks like a conventional bad sector read error. It's concerning 
>> why there'd be a bad sector after having just written to it when putting all 
>> your data on this volume. What do you get for:
> 
>> smartctl -x /dev/sdd
> 
> ...
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size     Value  Description
> 0x0001  2            0  Command failed due to ICRC error
> 0x0002  2            0  R_ERR response for data FIS
> 0x0003  2            0  R_ERR response for device-to-host data FIS
> 0x0004  2            0  R_ERR response for host-to-device data FIS
> 0x0005  2            0  R_ERR response for non-data FIS
> 0x0006  2            0  R_ERR response for device-to-host non-data FIS
> 0x0007  2            0  R_ERR response for host-to-device non-data FIS
> 0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
> 0x8000  4       185377  Vendor specific

You chopped out the important part. Post the whole thing.


>> smartctl -l scterc /dev/sdd
> 
> SCT Error Recovery Control:
>           Read: Disabled
>          Write: Disabled

It's possible for the drive's error recovery on a troublesome sector to take 
longer than the SCSI command timer value, which is 30 seconds by default. The 
timer is a kernel setting. You can check it with cat, or change it with 
echo <value> > /sys/block/<device-name>/device/timeout

You'd have to consult the model spec to find out what the drive's timeout is, 
but you want the kernel to wait at least, say, a second longer than the drive. 
So if the drive waits up to 120 seconds, have the kernel wait 121 seconds. 
Otherwise you get a reset instead of this:

end_request: I/O error, dev sdd, sector 1487873214

That message is important because the sector number in it is how to know where 
to write good data back to (once that's supported). If a reset happens first, 
this information is lost. So it's not related to this problem, but you'll want 
to change the command timer value.
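
A rough sketch of the check-then-change step described above, assuming you 
already know the drive's worst-case recovery time from its spec. The function 
name set_cmd_timeout and the SYSFS_ROOT override are made up for illustration 
(the override only exists so the function can be tried against a scratch 
directory), and 120 is just the example figure from above, not a value for any 
particular model.

```shell
# Hypothetical helper: set the kernel's SCSI command timer for one device
# to the drive's recovery time plus one second.
# SYSFS_ROOT is an illustrative override; it defaults to the real /sys.
set_cmd_timeout() {
    dev="$1"          # device name without /dev/, e.g. sdd
    drive_secs="$2"   # drive's own worst-case recovery time, in seconds
    f="${SYSFS_ROOT:-/sys}/block/$dev/device/timeout"

    # Writing the real sysfs file needs root.
    [ -w "$f" ] || { echo "cannot write $f" >&2; return 1; }

    echo "$dev: command timer was $(cat "$f")s"
    echo $((drive_secs + 1)) > "$f"
    echo "$dev: command timer now $(cat "$f")s"
}

# Example, as root:
#   set_cmd_timeout sdd 120
```

The value lives in sysfs, so it doesn't survive a reboot and has to be set 
again at boot, for each drive in the array.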

> 
>> btrfs device stats /dev/X
> 
> All drives except /dev/sdf1 have zeroes for all values. /dev/sdf1
> reports that same read error from the logs:
> 
> [/dev/sdf1].write_io_errs   0
> [/dev/sdf1].read_io_errs    1
> [/dev/sdf1].flush_io_errs   0
> [/dev/sdf1].corruption_errs 0
> [/dev/sdf1].generation_errs 0

Yeah I'm confused. Maybe the entire dmesg would be useful; or two separate ones:
dmesg | grep -i sdd
dmesg | grep -i sdf

Maybe there's another read error floating around here somewhere…


Chris Murphy

