On 2015-04-16 14:48, Miguel Negrão wrote:
Hello,

I'm running a laptop, a MacBook Pro 8,2, with Ubuntu on kernel
3.13.0-49-lowlatency. I have a USB enclosure containing two hard drives
(Icydock JBOD). Each hard drive holds its own btrfs filesystem on top of a
luks partition. I back up one hard drive to the other using btrfs
send/receive with incremental sends (tests I did indicated this setup
was too fragile for running btrfs RAID).
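
For reference, the incremental cycle is essentially the usual snapshot + send -p
+ receive pattern; something like the following, where the subvolume names and
mount points are placeholders rather than my exact paths:

  # read-only snapshot of the working subvolume on the main disk
  btrfs subvolume snapshot -r /mnt/huge-new/data /mnt/huge-new/data-NEW

  # incremental send against the previous snapshot, received on the backup disk
  btrfs send -p /mnt/huge-new/data-PREV /mnt/huge-new/data-NEW \
      | btrfs receive /mnt/huge-new-backup/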

I've noticed that files on one of the hard drives sometimes get corrupted.
It's not many files, but it does happen from time to time. On IRC I was told
it could be the USB enclosure, it could be memory, etc. The SMART data for
the hard drives says they are fine, and the quick SMART tests also pass
without problems.


  - Given that I'm running a laptop and communicating with the hard drives via
USB, is it expected that I will get some corruption from time to time, or is
this abnormal, meaning something is very wrong with some of my equipment?
And if so, how can I track down what is responsible?
  - Is it possible to extract a file that has csum errors? I work with audio
files; if I don't have a backup of a file I would still like to get the full
corrupted version, since most of the audio might still be perfectly fine
(see the sketch after this list). Can I tell btrfs to compute a new csum of
the file as it is now, and just live with the corruption?
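
What I have in mind for the second point is roughly the following, assuming
the bad blocks simply return read errors and can be skipped over (the paths
here are placeholders):

  # copy what is readable, padding unreadable 4 KiB blocks with zeros
  dd if=/mnt/huge-new/path/to/file.wav of=/tmp/file-salvaged.wav \
      bs=4096 conv=noerror,sync

  # or, with GNU ddrescue, which also keeps a map of the bad regions
  ddrescue /mnt/huge-new/path/to/file.wav /tmp/file-salvaged.wav /tmp/file.map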

I copied a file to the main USB hard drive on 2015-02-21, and the file was
backed up to the other hard drive via send/receive on 2015-02-23. Now
(yesterday), when I try to access the file on the main hard drive, it is corrupted:

Apr 16 19:20:35 miguel-MacBookPro kernel: [  835.944606] BTRFS info (device
dm-1): csum failed ino 136726 off 1067679744 csum 4135207512 expected csum
1128560616
Apr 16 19:20:35 miguel-MacBookPro kernel: [  835.948431] BTRFS info (device
dm-1): csum failed ino 136726 off 1067761664 csum 730461863 expected csum
1924299628
Apr 16 19:20:36 miguel-MacBookPro kernel: [  836.395372] BTRFS info (device
dm-1): csum failed ino 136726 off 1067679744 csum 4135207512 expected csum
1128560616
Apr 16 19:20:36 miguel-MacBookPro kernel: [  836.396682] BTRFS info (device
dm-1): csum failed ino 136726 off 1067679744 csum 4135207512 expected csum
1128560616
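
(The inode number in those messages can be mapped back to a path, by the way;
something like the following should work, with the mount point being a
placeholder for wherever the filesystem is mounted:)

  # resolve inode 136726 from the csum errors to its path(s)
  btrfs inspect-internal inode-resolve 136726 /mnt/huge-new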

I can access it fine on the backup hard drive.

Questions:

- Can I assume that the corruption happened after the file was sent to
the backup hard drive?
- Will btrfs send ever send a file with corrupted blocks?
- I kept running more backups, but that particular file has not changed
since. Am I correct in assuming that, since the file was not changed, it was
not sent again to the backup disk, and that therefore the version I have in
the backup should be a good copy? (See the check sketched after these questions.)
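
For the last point, I figure I can double-check the backup copy by forcing a
full read of it so its checksums get verified; something along these lines,
with placeholder paths (a scrub of the whole backup filesystem would check
every file at once):

  # read the whole file; a csum failure would show up as an I/O error
  sha256sum /mnt/huge-new-backup/path/to/file.wav

  # or verify the entire backup filesystem (-B waits for the scrub to finish)
  btrfs scrub start -B /mnt/huge-new-backup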

Best regards,
Miguel

Label: 'huge-new'  uuid: 21d841c9-7c30-4d1b-b4c2-8c0e59e8959a
        Total devices 1 FS bytes used 1.04TiB
        devid    1 size 2.73TiB used 1.06TiB path /dev/mapper/huge-new

[/dev/mapper/huge-new].write_io_errs   0
[/dev/mapper/huge-new].read_io_errs    0
[/dev/mapper/huge-new].flush_io_errs   0
[/dev/mapper/huge-new].corruption_errs 1970
[/dev/mapper/huge-new].generation_errs 0

Btrfs v0.20-rc1-335-gf00dd83

Label: 'huge-new-backup'  uuid: 9af299bc-48b0-4e52-8078-82749627d9f4
        Total devices 1 FS bytes used 1.04TiB
        devid    1 size 2.73TiB used 1.05TiB path /dev/mapper/huge-new-backup

[/dev/mapper/huge-new-backup].write_io_errs   0
[/dev/mapper/huge-new-backup].read_io_errs    0
[/dev/mapper/huge-new-backup].flush_io_errs   0
[/dev/mapper/huge-new-backup].corruption_errs 0
[/dev/mapper/huge-new-backup].generation_errs 0

Btrfs v0.20-rc1-335-gf00dd83


First, as mentioned in another reply to this, you should update your kernel. I don't think that the kernel is what is causing the issue, but it is an old kernel by BTRFS standards, and keeping up to date is important with a filesystem under such heavy development. The same actually goes for the userspace components as well, although that is less critical than the kernel side.

As to the corruption, this sounds like some kind of hardware issue to me. Assuming you can afford to wipe the filesystems, I would suggest running some tests on the disks with the program 'badblocks' (part of e2fsprogs). The fact that only the first disk is having issues would seem to indicate that either that port on the enclosure is intermittently bad, or the disk itself is having issues. The SMART tests passing just indicates that the disk doesn't think it is failing, not that it is perfectly reliable (I've had disks that pass all the SMART tests and then just randomly reset themselves from time to time). I would also look into the manufacturer and firmware version of the drives, as I know that some of the early Seagate and WD multi-terabyte drives had serious firmware bugs that could cause data corruption similar to this.
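
A rough outline of the sort of testing I have in mind, assuming the drives show up as /dev/sdb and /dev/sdc (the device names are placeholders, and note that badblocks -w is destructive, so double-check the target before running it):

  # destructive write-mode surface test of the whole disk (wipes everything on it)
  badblocks -wsv -b 4096 /dev/sdb

  # long SMART self-test, then dump the attributes and test results afterwards
  smartctl -t long /dev/sdb
  smartctl -a /dev/sdb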

