Juergen Sauer posted on Tue, 11 Nov 2014 12:13:41 +0100 as excerpted:

> this event occurred today in the morning.
> Accidentally the Archive Machine was kicked into hibernation.
> 
> After reactivating, the archive btrfs filesystem was "readonly"; after
> rebooting the system, the "archive" btrfs filesystem was not mountable
> anymore.

FWIW, I've had similar issues with both mdraid in the past, and with 
btrfs now, with both hibernation and suspend-to-ram.

Tho after early experiences I switched to mdraid-1 some time back, and 
now run btrfs raid1 mode, which (even with the more mature mdraid) tends 
to be more resilient than raid5 and faster than raid6.  At least with 
raid1 there are multiple copies of the data, and at least in my 
experience, that dramatically increases the reliability of recovery from 
a temporary or permanent dropout of one device.

The general problem seems to be that in the resume process, some devices 
wake up faster than others, and even "awake" devices don't necessarily 
fully stabilize for a minute or two.  Back on mdraid, I noticed some 
devices coming up with model number strings and UIDs that would have 
incorrect characters in some position, tho they'd stabilize over time.  
Obviously, this plays havoc with kernel efforts to ensure the devices it 
woke up to are the same devices it had when it went to sleep (either 
suspend to ram or hibernate to disk).

And the same general problems continue to occur with the pair of SSDs I 
have now, with suspend-to-ram instead of hibernate, while the original 
devices I noticed the problem on were spinning rust of an entirely 
different brand.

So it's not a btrfs-specific issue, or a device-specific issue, or a 
motherboard-specific issue since I've upgraded since I first saw it too, 
or a suspend/hibernate-type-specific issue.  It's a general issue.  Tho I 
/have/ noticed on the current equipment that if I suspend for a 
relatively short period, an hour or two, it seems to come back with fewer 
problems than if I suspend for 6 hours or more... say if I suspend while 
I'm at work or overnite.  (FWIW, the old machine seemed to hibernate and 
resume reasonably well other than this, but couldn't reliably resume from 
suspend, while the new machine is the opposite: I never got it to resume 
from hibernation, but other than this, it reliably resumes from suspend.)

Unfortunately, the only reliable solution seems to be to fully shut down 
instead of suspending or hibernating, and obviously, after running into 
issues a few times, I eventually quit experimenting further.  But the 
fact that I'm running systemd on fast ssds now does ameliorate the 
problem quite a bit, both due to faster booting, and by making the lost 
cache of a reboot far less of an issue because reading the data back in 
is so much faster on ssd.

So both suspend and hibernate seem to work better with single devices, 
where one device being slower to stabilize won't be the issue it is with 
raid (either mdraid or btrfs raid); raid just doesn't combine well with 
suspend/hibernate. =:^(

Too bad, as being able to suspend and wake up right away was saving on 
the electric bill. =:^(

So if it's really critical, as it arguably might be on an archive 
machine, I'd consider pointing whatever suspend/hibernate triggers you 
have at shutdown or reboot instead.  If it's not possible to accidentally 
hibernate the thing, because the trigger does a shutdown/reboot instead, 
it won't/can't be accidentally hibernated. =:^)
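
Assuming the box runs systemd (mine does; adjust to whatever init/power 
manager it actually uses), here's a minimal sketch of what I mean.  The 
option names are real logind.conf settings, but treat the values as 
something to test on your setup, not a drop-in recipe:

  # /etc/systemd/logind.conf: turn the usual sleep triggers into poweroff
  [Login]
  HandleSuspendKey=poweroff
  HandleHibernateKey=poweroff

And to make it impossible to suspend/hibernate no matter what asks for it:

  systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

(Restart systemd-logind, or reboot, for the logind.conf change to take 
effect.)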

> I tried every recovery possibility I know of. Nothing worked.
> 
> Here I list the problem of the machine; it would be very ugly to lose
> that data.
> 
> Do you have any further ideas about what I may try to recover my
> archive filesystem?
> 
> The archive filesystem is a raid5 multi-device btrfs.

Btrfs raid5, or mdraid-5 with btrfs on top?  Because it's common 
knowledge that btrfs raid56 modes aren't yet fully implemented, and while 
they work in normal operation, recovery from a lost device is iffy at 
best because the code simply isn't complete for that yet.  As such, a 
raid5/6 mode btrfs is best considered a raid0 in terms of reliability: 
don't count on recovering anything if a single device is lost, even 
temporarily.  Depending on the circumstances it's not always /quite/ that 
bad, but raid0 reliability, or more accurately the lack thereof, is what 
you plan for when you set up a btrfs raid5 or raid6, because that's 
effectively what it is until the recovery code is complete and tested.  
That way you won't be caught with critical data on it if it does go 
south, any more than you would put critical data on a raid0.
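
If there's any doubt which layout it actually is, a couple of generic 
checks that work even with the filesystem unmountable:

  cat /proc/mdstat        # an md array listed here means mdraid underneath
  btrfs filesystem show   # multiple devices under one filesystem UUID
                          # means btrfs-native multi-device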

So I /hope/ you meant mdraid-5, on top of which you had btrfs.  With 
that, once the mdraid level is recovered, you are basically looking at a 
standard btrfs recovery as if it were a single device.  That's still not 
a great position to be in, as you are after all looking at a recovery 
with a non-zero chance of failure, but at, say, an 80% chance of recovery 
instead of a 10% chance, you're still in far better shape than with 
btrfs raid5/6 at that point.
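
For the mdraid-underneath case, the rough sequence would be something 
like the below.  It's only a sketch: /dev/md0 and /mnt/recovered are 
placeholders for your actual array and a scratch directory, and if the 
data really matters, dd-image the member devices before trying anything 
that writes.

  mdadm --assemble --scan                # reassemble; add --run --force
                                         # only for a degraded array
  mount -o ro,recovery /dev/md0 /mnt     # try the read-only recovery
                                         # mount first
  btrfs restore /dev/md0 /mnt/recovered  # if it still won't mount, copy
                                         # files off without mounting
  btrfs check /dev/md0                   # diagnose; --repair only as a
                                         # very last resort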

If you /did/ mean btrfs raid56 mode, then take a look at the raid56 
information on the wiki and the links from there to additional 
information on Marc MERLIN's site, as he is the regular around here who 
has done the most intensive testing of raid56 mode and has written about 
it extensively.  Other than getting one of the devs to take a personal 
interest in your special case, that's the best chance you have at 
recovery.

https://btrfs.wiki.kernel.org/index.php/RAID56

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
