Hi all!

So it's been some time with btrfs, and so far I was very pleased, but
since I upgraded Ubuntu from 13.10 to 14.04 problems have started to
occur (YES, I know this might be unrelated).

In the past I've had problems with btrfs that turned out to be caused
by static from a printer corrupting RAM and producing checksum
failures on the filesystem, so I'm not going to assume from the start
that something is wrong with btrfs.

Anyway:
On my server I'm running 6 x 2TB disks in raid10 for general storage
and 2 x ~0.5TB in raid1 for the system. It might be unrelated, but
after upgrading to 14.04 I started using ownCloud, which uses Apache &
MySQL as its backing store - all data is stored on the storage array,
while MySQL lives on the system array.
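
(For reference, this is roughly how I check the layout - the mount
point is from my setup, so treat it as a sketch rather than exact
commands for anyone else:)

  btrfs filesystem show                  # lists both arrays and their member devices
  btrfs filesystem df /arrays/@storage   # shows the data / metadata raid profiles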

It all started with csum errors showing up in MySQL data files and in
some transactions. The kernel would generally force the btrfs
filesystems into read-only mode immediately (I don't have the dmesg /
syslog output right now). I removed the offending files, the problem
seemed to go away, and I started from scratch. After 5 days the
problem reappeared, now located around the same MySQL files and in
files managed by Apache as the "cloud". At this point, since these
files are rather dear to me, I decided to pull out all the stops and
try to rescue as much as I can.
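
(For anyone wondering how the csum errors map to files: dmesg reports
the inode of each failure and, if I read the tools right, btrfs can
resolve that back to a path - the inode number below is just a
placeholder:)

  dmesg | grep -i "csum failed"                                 # prints ino + offset for each failure
  btrfs inspect-internal inode-resolve 12345 /arrays/@storage   # resolve that inode to a file path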

As an exercise in btrfs management I ran btrfsck --repair - it did
not help. I repeated it with --init-csum-tree - that turned out to
leave me with a blank system array. Nice! I could have used some
warning here.
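
(In hindsight the safer order would have been a plain read-only check
first, before anything destructive - a sketch only, and the device
name is just an example:)

  btrfsck /dev/sdc1            # read-only check: reports problems without touching the fs
  btrfsck --repair /dev/sdc1   # only after reviewing the read-only output (and with backups)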

I've moved all the drives to my main rig, which has a nice 16GB of
ECC RAM, so errors from RAM, CPU or controller should, in theory, be
eliminated. I used the system array drives plus a spare drive to
extract all the "dear to me" files to a newly created array (1TB +
500GB + 640GB). I ran a scrub on it and everything seemed OK. At this
point I deleted the "dear to me" files from the storage array and ran
a scrub. The scrub now showed even more csum errors, in transactions
and in one large file (~1GB) that had not been touched for a very
long time. I deleted that file and ran a scrub - no errors. I copied
the "dear to me" files back to the storage array and ran a scrub - no
issues. I deleted the files from my backup array and decided to call
it a day.

The next day I ran a scrub once more "just to be sure", and this time
it discovered a myriad of errors in files and transactions. Since I
had no time to continue, I postponed things to the next day. The day
after, I started my rig and noticed that neither the backup array nor
the storage array would mount any more. I attempted to rescue the
situation without any luck. I power-cycled the PC and on the next
startup both arrays still failed to mount; when I tried to mount the
backup array, mount told me that this specific UUID DOES NOT EXIST!?

my fstab UUID:
fcf23e83-f165-4af0-8d1c-cd6f8d2788f4
new UUID:
771a4ed0-5859-4e10-b916-07aec4b1a60b
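
(This is roughly how I compared what is actually on the disks against
fstab - both blkid and btrfs itself report the filesystem UUID; the
device name is from my setup:)

  blkid /dev/sdb1           # UUID the kernel / libblkid currently sees on that member
  btrfs filesystem show     # fs UUID per array, with all member devices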


I tried to mount by /dev/sdb1 and it did mount. I tried by the new
UUID and it mounted as well. A scrub passes with flying colours on
the backup array, while the storage array still fails to mount with:

root@ubuntu-pc:~# mount /dev/sdd1 /arrays/@storage/
mount: wrong fs type, bad option, bad superblock on /dev/sdd1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

for any device in the array.
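
(For what it's worth, this is what I have on my list to try next -
read-only and hopefully non-destructive; a sketch only, I have not
confirmed that any of it helps here:)

  mount -o ro,recovery /dev/sdd1 /arrays/@storage   # try falling back to a backup tree root, read-only
  btrfs rescue super-recover -v /dev/sdd1           # check / recover from the backup superblocks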

Honestly, this is a question for the more senior guys - what should I do now?

Chris Mason - have you got any updates to your "old friend
stress.sh"? If not, I can try using the previous version that you
provided to stress test my system - but this is the second system
that exposes this erratic behaviour.

Anyone - what can I do to rescue my "beloved files"? (no sarcastic
zfs / ext4 / tapes / DVDs suggestions, please)
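
(My own fallback plan, unless someone has a better idea, is to pull
the files off the unmountable array with btrfs restore, which does
not need the fs to mount - flags as I read them from the man page,
and the rescue path is just an example:)

  btrfs restore -D -v /dev/sdd1 /mnt/rescue    # dry run: list what it would restore, write nothing
  btrfs restore -v /dev/sdd1 /mnt/rescue       # then actually copy the files out to another disk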

PS. Needless to say: SMART shows no SATA CRC errors, no reallocated
sectors, no errors whatsoever (as far as I can see).