Dmitry Katsubo posted on Thu, 15 Oct 2015 16:10:13 +0200 as excerpted:

> On 15 October 2015 at 02:48, Duncan <1i5t5.dun...@cox.net> wrote:
> 
>> [snipped] 
> 
> Thanks for this information. As far as I can see, btrfs-tools v4.1.2 is
> now in the experimental Debian repo (but you suggest at least 4.2.2
> anyway, which was released in master git just 10 days ago). Kernel
> image 3.18 is still not there, perhaps because Debian jessie was frozen
> before it was released (2014-12-07).

For userspace, as long as it supports the features you need at runtime 
(where it generally just has to know how to make the right call to the 
kernel, which does the actual work), and you're not trying to 
offline-recover anything really hairy (which is where the latest 
userspace code becomes critical), running a userspace series behind, or 
even more (as long as it's not /too/ far), isn't all /that/ critical a 
problem.

It generally becomes a problem in one of three ways: 1) You have a bad 
filesystem and want the best chance at fixing it, in which case you 
really want the latest code, including the absolute latest fixups for the 
most recently discovered possible problems. 2) You want/need a new 
feature that's simply not supported in your old userspace.  3) The 
userspace gets so old that the output from its diagnostics commands no 
longer easily compares with that of current tools, giving people on-list 
difficulties when trying to compare the output in your posts to the 
output they get.

As a very general rule, at least try to keep the userspace version 
comparable to the kernel version you are running.  The userspace version 
numbering is synced to the kernel's, and userspace of a particular 
version is normally released shortly after the similarly numbered kernel 
series, with a couple of minor updates before the next kernel-synced 
release.  So keeping userspace at or above your kernel's version means 
you're at least running the userspace release that was made with that 
kernel series in mind.

Then, as long as you don't get too far behind on kernel version, you 
should remain at least /somewhat/ current on userspace as well, since 
you'll be upgrading to near the same userspace (at least), when you 
upgrade the kernel.

Using that loose guideline, since you're aiming for the 3.18 stable 
kernel, you should be running at least a 3.18 btrfs-progs as well.
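
If you want to quickly confirm what pairing you're actually running, 
these two standard commands print the kernel release and the btrfs-progs 
version respectively (the versions shown here are just illustrative):

  uname -r          # kernel release, e.g. 3.18.x
  btrfs --version   # userspace, e.g. btrfs-progs v4.1.2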

In that context, btrfs-progs 4.1.2 should be fine, as long as you're not 
trying to fix any problems that a newer version fixed.  And, my 
recommendation of the latest 4.2.2 was in the "fixing problems" context, 
in which case, yes, getting your hands on 4.2.2, even if it means 
building from sources to do so, could be critical, depending of course on 
the problem you're trying to fix.  But otherwise, 4.1.2, or even back to 
the last 3.18.whatever release since that's the kernel version you're 
targeting, should be fine.

Just be sure that whenever you do upgrade to something later, you avoid 
the known-bad mkfs.btrfs in 4.2.0 and 4.2.1 -- if you're going with the 
btrfs-progs 4.2 series, make sure you get 4.2.2 or later.
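
If you do end up building 4.2.2 from source, the routine is roughly the 
following sketch.  The git URL is the btrfs-progs repo on kernel.org as 
I remember it, and the build wants the usual devel packages (libuuid, 
libblkid, zlib, lzo at minimum) -- check both against the install docs 
in the sources:

  git clone git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git
  cd btrfs-progs
  git checkout v4.2.2
  ./autogen.sh && ./configure
  make
  make install          # as root; installs under /usr/local by default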

As for finding a current 3.18 series kernel released for Debian, I'm not 
a Debian user, so my knowledge of the ecosystem around it is limited.  
But I've been very much under the impression that there are various 
optional repos you can choose to include and update from, and I'm quite 
sure, based on previous discussions with others, that there's a well 
recognized and fairly commonly enabled repo that carries Debian kernel 
updates through the current release, or close to it.

Of course you could also simply run a mainstream Linus kernel and build 
it yourself.  That's not too horribly hard to do either, as there are 
all sorts of places with instructions out there.  Back when I switched 
from MS to freedomware Linux in late 2001, I picked up the skill -- at 
least at the reasonably basic level of taking a working config from my 
distro's kernel and using it as the basis for my mainstream kernel 
config -- within about two months of switching.
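
For the curious, the "use the distro config as a basis" routine I mean 
is roughly this -- a sketch, not a step-by-step guide, since paths and 
bootloader handling vary by distro:

  cd linux-3.18.x                       # unpacked stable kernel sources
  cp /boot/config-$(uname -r) .config   # start from the running distro config
  make olddefconfig                     # take defaults for any new options
  make -j$(nproc)
  make modules_install install          # as root; then update the bootloader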

Tho of course just because you can doesn't mean you want to, and for 
many, finding their distro's experimental/current kernel repo and simply 
installing the packages from it will be far simpler.

But regardless of the method used, finding or building, and keeping 
current with, your own copy of at least the latest couple of LTS 
releases shouldn't be /horribly/ difficult.  While I've not used them as 
actual package resources in years, I do still know a couple of rpm-based 
package resources from my time back on Mandrake (and still check them in 
contexts like this for others, or to quickly see what files a package I 
don't have installed on Gentoo might include, etc).  I'd point you at 
them if Debian were an rpm-based distro, but of course it's not, so they 
won't do any good.  But I'd guess a google search might. =:^)

> If I may ask:
> 
> Provided that btrfs allowed a volume to be mounted in read-only mode --
> does that mean that all data blocks are present (i.e. is it assured that
> all files / directories can be read)?

I'm not /absolutely/ sure I understand your question, here.  But assuming 
it's what I believe it is... here's an answer in typical Duncan fashion, 
answering the question... and rather more! =:^)

In this particular scenario, yes, everything should still be accessible, 
as at least one copy of every raid1 chunk should exist on a still 
detected and included device.  That's because of the balance you ran 
after the loss of the first device, which made sure there were two 
copies of each chunk on the remaining devices before the second device 
was lost.  But because btrfs device delete missing didn't work, you 
couldn't remove that first device, even tho you now had two copies of 
each chunk on existing devices.  So when another device dropped, you had 
two missing devices, but because of the balance in between, you still 
had at least one copy of all chunks.
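
If you want to see where the chunks actually ended up per device, the 
usual tools are (the mountpoint is a placeholder, and filesystem usage 
needs reasonably current progs, 3.18 or newer IIRC):

  btrfs filesystem show /mnt
  btrfs filesystem usage /mnt    # per-device breakdown by chunk type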

The reason it's not letting you mount read-write is that btrfs now sees 
two devices missing on a raid1: the one that you actually replaced but 
couldn't device delete, and the newly missing one that it didn't detect 
this time.  To btrfs' rather simple way of thinking about it, that means 
anything with one of its only two raid1 copies on each of the two 
missing devices is now entirely gone, and to avoid making changes that 
would complicate things and prevent the return of one of those missing 
devices, it won't let you mount writable, even in degraded mode.  It 
doesn't understand that there's actually still at least one copy of 
everything available; it simply sees the two missing devices and gives 
up without actually checking.

And in the situation where btrfs' fears were correct -- where chunks 
existed with one copy on each of the two now-missing devices -- no, not 
everything /would/ be accessible, and btrfs forcing read-only mounting 
is its way of not letting you make the problem even worse, forcing you 
to copy the data you can still actually get to off to somewhere else, 
while you can still get to it in read-only mode, at least.  Also, of 
course, forcing the filesystem read-only when two devices are missing at 
least in theory preserves a state where a device might be able to 
return, allowing repair of the filesystem, while allowing writes could 
prevent a returning device from healing the filesystem.
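
In practical terms, in that read-only situation you'd mount with both 
the degraded and ro options and copy off what you can; something like 
the following, with device node, mountpoint and paths obviously yours to 
fill in:

  mount -o degraded,ro /dev/sdb1 /mnt
  cp -a /mnt/important-stuff /somewhere/safe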

So in this particular scenario, yes, all your data should be there, 
intact.  However, a forced read-only mount normally indicates a serious 
issue, and in other scenarios, it could well indicate that some of the 
data is now indeed *NOT* accessible.

Which is where AJ's patch comes in.  That teaches btrfs to actually check 
each chunk.  Once it sees that there's actually at least one copy of each 
chunk available, it'll allow mounting degraded, writable, again, so you 
can fix the problem.

(Tho the more direct scenario that the patch addresses is a bit 
different, loss of one device of a two-device raid1, in which case 
mounting degraded writable will force new chunks to be written in single 
mode, because there's not a second device to write to so writing raid1 is 
no longer possible.  So far, so good.  But then on an unmount and attempt 
to mount again, btrfs sees single mode chunks on a two-device btrfs, and 
knows that single mode normally won't allow a missing device, so forces 
read-only, thus blocking adding a new device and rebalancing all the 
single chunks back to raid1.  But in actuality, the only single mode 
chunks there are the ones written when the second device wasn't 
available, so they HAD to be written to the available device, and it's 
not POSSIBLE for any to be on the missing device.  Again, the patch 
teaches btrfs to actually look at what's there and see that it can 
actually deal with it, thus allowing writable mounting, instead of 
jumping to conclusions and giving up, as soon as it sees a situation 
that /could/, in a different situation, mean entirely missing chunks with 
no available copies on remaining devices.)
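
(For completeness, once a kernel with that patch lets you mount degraded 
and writable, the usual recovery sequence runs along these lines -- the 
device names are placeholders, and whether you need device delete 
missing, the convert-balance, or both depends on the exact situation:

  mount -o degraded /dev/sdb1 /mnt
  btrfs device add /dev/sdd1 /mnt     # add the replacement device
  btrfs device delete missing /mnt    # drop the record of the missing one
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt
                                      # re-raid1 any single chunks

The "soft" filter only touches chunks not already in the target profile, 
so the balance stays reasonably quick.)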

Again, these patches are in newer kernel versions, so there (assuming no 
further bugs) they "just work".  On older kernels, however, you either 
have to cherry-pick the patches yourself, or manually avoid or work 
around the problem they fix.  This is why we typically stress new 
versions so much -- they really /do/ fix active bugs and make problems 
/much/ easier to deal with. =:^)

> Do you have any ideas why "btrfs balance" has pulled all data to two
> drives (and not balanced between three)?

Hugo did a much better job answering that than I would have initially 
done.  Most of my btrfs here are raid1, but they're all exactly 
two-device, with the two devices exactly the same size, so I'm not used 
to thinking in terms of different sizes and didn't actually notice the 
situation, leaving me clueless until Hugo pointed it out.

But he's right.  Here's my much more detailed way of saying the same 
thing, now that he's reminded me of why that would be the deciding 
factor here.

Given that (1) your devices are different sizes, (2) btrfs raid1 means 
exactly two copies, not one per device, and (3) the btrfs chunk 
allocator allocates chunks from the device with the most free space 
left, subject to the restriction that both copies of a raid1 chunk can't 
land on the same device...

A rebalance of raid1 chunks would indeed start filling the two biggest 
devices first, until the free space on the smaller of those two (thus 
the second largest device) dropped to match the free space on the third 
largest.  Only at that point would it continue taking one copy from the 
largest (until it too reached equivalent free space) while alternating 
between the other two for the second copy.

Since the amount of data you had fit with one copy each on the two 
largest devices before the free space on either of them dwindled to 
match that of the third largest, only the two largest devices actually 
got chunk allocations, leaving the third device -- with less total space 
than either of the other two still had remaining -- entirely empty.
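
To put illustrative numbers on it (made up, not your actual sizes): with 
devices of 1000, 800 and 400 GiB, each raid1 chunk takes one copy on 
each of the two devices with the most free space, so allocations go to 
the 1000 and 800 GiB devices until the 800 GiB one is down to 400 GiB 
free -- that is, until roughly 400 GiB of data has been written.  Only 
past that crossover does the 400 GiB device start taking one of the two 
copies.  If your total data fits before that point, the smallest device 
simply never gets a chunk, exactly as you saw.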

> Does btrfs have the following optimization for mirrored data: if a drive
> is non-rotational, then prefer reads from it? Or does it simply schedule
> the read to the drive that performs faster (irrespective of rotational
> status)?

Such optimizations have in general not yet been done to btrfs -- not even 
scheduling to the faster drive.  In fact, the lack of such optimizations 
is arguably the biggest "objective" proof that btrfs devs themselves 
don't yet consider btrfs truly stable.

As any good dev knows, there's a real danger to "premature 
optimization", with that danger appearing in one or both of two forms: 
(a) we've now severely limited the alternative code paths we can take, 
because implementing things differently would force throwing away all 
that optimization work, as it won't fit what would otherwise be the 
better alternative; and (b) we're now throwing away all that 
optimization work anyway, making it a waste, because the previous 
implementation didn't work out and the new one does, but isn't 
compatible with the existing optimization code, so that work must now be 
redone as well.

Thus, good devs tend to leave moderate to complex optimization code 
until they know the implementation is stable and won't be changing out 
from under the optimization.  To do otherwise is "premature 
optimization", and devs tend to be well aware of the problem, often 
because of the number of times they did it themselves earlier in their 
careers.

It follows that looking at whether the devs have actually /done/ that 
sort of optimization ends up being a pretty good indicator of whether 
they consider the code stable enough to avoid the dangers of premature 
optimization.  (That assumes you consider them good enough to be aware 
of those dangers in the first place -- and if they're writing the code 
that runs your filesystem, you'd better HOPE they're at least that good, 
or you and your data are in serious trouble!)

In this case, definitely not, since these sorts of optimizations in 
general remain to be done.

Meanwhile, the present btrfs raid1 read-scheduler is both pretty simple 
to code up and pretty simple to test -- it's easy to arrange tests that 
read only one copy, only the other, or both in a well-balanced way.  
However, it's pretty poor in terms of optimized real-world read 
scheduling.

What it does is simply this.  Remember, btrfs raid1 is specifically two 
copies.  It chooses which copy of the two will be read very simply, based 
on the PID making the request.  Odd PIDs get assigned one copy, even PIDs 
the other.  As I said, simple to code, great for ensuring testing of one 
copy or the other or both, but not really optimized at all for real-world 
usage.
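
(Purely as an illustration of how simple that is, the parity of a 
process' PID is all there is to it; from a shell you can see which side 
of the coin that shell would land on:

  echo $(( $$ % 2 ))    # 0 or 1: which raid1 copy this shell's reads hit

Not something you'd normally care about, but it makes the point that the 
choice has nothing to do with device speed or load.)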

If your workload happens to be a bunch of all odd or all even PIDs, well, 
enjoy your testing-grade read-scheduler, bottlenecking everything reading 
one copy, while the other sits entirely idle.

(Of course on fast SSDs with their zero seek-time, which is what I'm 
using for my own btrfs, that's not the issue it'd be on spinning rust.  
I'm still using my former standard, reiserfs, on spinning rust, which I 
use for backups and media files.  But normal operations are on btrfs on 
SSD, and despite btrfs' lack of optimization, on SSD it's fast /enough/ 
for my usage, and I particularly like the data integrity features of 
btrfs raid1 mode, so...)

> No, it was my own decision to use btrfs, for various reasons.
> First of all, I am using raid1 on all data. Second, I benefit from
> transparent compression. Third, I need CRC consistency: some of the
> drives (like /dev/sdd in my case) seem to be failing, and I also once
> had a buggy DIMM, so btrfs helps me not to lose data "silently". Anyway,
> it is much better than md-raid.

The fact that mdraid couldn't be configured to runtime-verify integrity 
using the parity or redundancy it already had available, let alone 
checksums (which it doesn't have at all), was a very strong 
disappointment for me.

To me, the fact that btrfs /does/ do runtime checksumming on write and 
data integrity checking on read, and in raid1/10 mode will actually fall 
back to the second copy if the first one fails checksum verification, is 
one of its best features, and why I use btrfs raid1 (or, on a couple of 
single-device btrfs, mixed-bg mode dup). =:^)
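
(And for proactive use of those checksums, a periodic scrub reads and 
verifies every copy, rewriting any bad copy from a good one where a good 
one exists.  Just a sketch, with the mountpoint as a placeholder; on big 
spinning-rust filesystems a scrub can take hours, so people typically 
schedule it:

  btrfs scrub start /mnt
  btrfs scrub status /mnt
)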

That's also why the feature I'm personally most hotly anticipating is 
N-way-mirroring, with 3-way being my ideal balance, since that will give 
me a fallback to the fallback if both the first read copy and the first 
fallback copy fail verification.  Four-way would be too much, but I just 
don't quite rest as easy as I otherwise could, because I know that if 
both the primary-read copy and the fallback happen to be bad in the same 
logical place at the same time, there's no third copy to fall back on!  
It seems as much of a shame not to have that on btrfs with its data 
integrity, as it did to have mdraid with N-way-mirroring but no runtime 
data integrity.  But at least btrfs does have N-way-mirroring on the 
roadmap, actually for after raid56, which is now done, so N-way-mirroring 
should be coming up rather soon (even if on btrfs, "soon" is relative), 
while AFAIK, mdraid has no plans to implement runtime data integrity 
checking.

> And dynamic assignment is not a problem since udev was introduced (so
> one can add extra persistent symlinks):
> 
> https://wiki.debian.org/Persistent_disk_names

FWIW, I actually use labels as my own form of "human-readable" UUID, 
here.  I came up with the scheme back when I was on reiserfs, with 15-
character label limits, so that's what mine are.  Using this scheme, I 
encode the purpose of the filesystem (root/home/media/whatever), the size 
and brand of the media, the sequence number of the media (since I often 
have more than one of the same brand and size), the machine the media is 
targeted at, the date I did the formatting, and the sequence-number of 
the partition (root-working, root-backup1, root-backup2, etc).

hm0238gcnx+35l0

home, on a 238 gig corsair neutron, #x (the filesystem is multidevice, 
across #0 and #1), targeted at + (the workstation), originally 
partitioned in (201)3, on May (5) 21 (l), working copy (0)

I use GPT partitioning, which takes partition labels (aka names) as 
well.  The two partitions hosting that filesystem are on identically 
partitioned corsair neutrons, 256 GB = 238 GiB.  The gpt labels on those 
two partitions are identical to the above, except one will have a 0 
replacing the x, while the other has a 1, as they are my first and second 
media of that size and brand.

hm0238gcn0+35l0
hm0238gcn1+35l0

The primary backup of home, on a different pair of partitions on the same 
physical devices, is labeled identically, except the partition number is 
one:

hm0238gcnx+35l1

... and its partitions:

hm0238gcn0+35l1
hm0238gcn1+35l1

The secondary backup is on a reiserfs, on spinning rust:

hm0465gsg0+47f0

In that case the partition label and filesystem label are the same, 
since the partition and its filesystem correspond 1:1.  It's home on the 
465 GiB (aka 500 GB) Seagate #0, targeted at the workstation, first 
formatted in (201)4, on July (7) 15 (f), first (0) copy there.  (I could 
have made it #3 instead of #0, indicating the second backup, but didn't, 
as I know that 0465gsg0+ is the media-and-backups spinning-rust device 
for the workstation.)

Both my internal and USB-attached devices use the same labeling scheme: 
media identified by size, brand, media sequence number and what it's 
targeting; partition/filesystem identified by purpose, original 
partition/format date, and partition sequence number.

As I said, it's effectively human-readable GUID, my own scheme for my own 
devices.

And I use LABEL= in fstab as well, running gdisk -l to get a listing of 
partitions with their gpt-labels when I need to associate actual sdN 
mapping to specific partitions (if I don't already have the mapping from 
mount or whatever).
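
To make that concrete (the labels here are just my own from above, while 
the device nodes, partition numbers and mountpoints are placeholders): 
the filesystem label is set at mkfs time or changed later, the gpt 
partition name is set from gdisk's c command (or non-interactively with 
sgdisk), and fstab then simply references the filesystem label:

  mkfs.btrfs -L hm0238gcnx+35l0 /dev/sdX5 /dev/sdY5   # set at creation
  btrfs filesystem label /hm hm0238gcnx+35l0          # or change it later
  sgdisk -c 5:hm0238gcn0+35l0 /dev/sdX                # gpt partition name
  LABEL=hm0238gcnx+35l0  /hm  btrfs  defaults  0 0    # fstab line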

Which makes it nice that btrfs fi show outputs the filesystem label as 
well. =:^)

The actual GUID, to me, is simply machine-readable "noise" that the 
human shouldn't need to deal with, as the label (of either the gpt 
partition or the filesystem it hosts) gives me *FAR* more, and more 
useful, information, while remaining entirely unique within my ID 
scheme.

> If "btrfs device scan" is user-space, then I think doing some output is
> better then outputting nothing :) (perhaps with "-v" flag). If it is
> kernel-space, then I agree that logging to dmesg is not very evident
> (from perspective that user should remember where to look),
> but I think has a value.

Well, btrfs is a userspace tool, but in this case, btrfs device scan's 
use is purely to make a particular kernel call, which triggers the btrfs 
module to do a device rescan to update its own records, *not* for human 
consumption.  -v to force output could work if it had been designed that 
way, but getting that output is precisely what btrfs filesystem show is 
for, printing for both mounted and unmounted filesystems unless told 
otherwise.

Put it this way.  If neither your initr* nor some service started before 
whatever mounts local filesystems does a btrfs device scan, then 
attempting to mount a multi-device btrfs will fail, unless all its 
component devices have been fed in using device= options.  Why?  Because 
mount takes exactly one device to mount.  With traditional filesystems, 
that's enough, since they only consist of a single device.  And with 
single-device btrfs, it's enough as well.  But with a multi-device btrfs, 
something has to supply the other devices to btrfs, along with the one 
that mount tells it about.  It is possible to list all those component 
devices in device= options, but those take /dev/sd* style device nodes, 
and those may change from boot to boot, so that's not very reliable.  
Which is where btrfs device scan comes in.  It tells the btrfs module to 
do a general scan and map out internally which devices belong to which 
filesystems, after which a mount supplying just one of them can work, 
since this internal map, the generation or refresh of which is triggered 
by btrfs device scan, supplies the others.
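
Concretely, either of these gets a multi-device btrfs mounted via a 
single device node (device paths and mountpoint are placeholders); the 
first is what initr*s and boot services normally do, the second is the 
manual fallback without a scan:

  btrfs device scan
  mount /dev/sdb1 /mnt

  mount -o device=/dev/sdb1,device=/dev/sdc1 /dev/sdb1 /mnt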

IOW, btrfs device scan needs no output, because all the userspace command 
does is call a kernel function, which triggers the mapping internal to 
the btrfs kernel module, so it can then handle mounts with just one of 
the possibly many devices handed to it from mount.

Outputting that mapping is an entirely different function, with the 
userspace side of that being btrfs filesystem show, which calls a kernel 
function that generates output back to the btrfs userspace app, which 
then further formats it for output back to the user.

> Thanks. I have carefully read changelog wiki page and found that:
> 
> btrfs-progs 4.2.2:
> scrub: report status 'running' until all devices are finished

Thanks.  As I said, I had seen the patch on the list, and /thought/ it 
was now in, but had lost track of specifically when it went in, or 
indeed, /whether/ it had gone in.

Now I know it's in 4.2.2, without having to actually go look it up in the 
git log again, myself.

> Idea concerning balance is listed on wiki page "Project ideas":
> 
> balance: allow to run it in background (fork) and report status
> periodically

FWIW, it sort of does that today, except that btrfs balance start 
doesn't actually return to the command prompt until the balance 
finishes.  But again, what it actually does is call a kernel function to 
initiate the balance, and then it simply waits.  On my relatively small 
btrfs on partitioned SSDs, the return is often within a minute or two 
anyway, but on multi-TB spinning rust...

In any case, once the kernel function has triggered the balance, ctrl-C 
should, I believe, terminate the userspace side and get you back to the 
prompt without terminating the balance itself, since that continues on 
in kernel space.

But it would still be useful to have balance start actually return 
quickly, instead of having to ctrl-C it.
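
In the meantime the shell gives you the fork for free, and the status 
and cancel subcommands query the kernel directly, so a workable 
approximation today (mountpoint is a placeholder) is:

  btrfs balance start /mnt &     # let the shell background the waiting process
  btrfs balance status /mnt      # poll progress whenever you like
  btrfs balance cancel /mnt      # if you change your mind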

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
