Kai Krakow posted on Fri, 07 Feb 2014 01:32:27 +0100 as excerpted:

> Duncan <1i5t5.dun...@cox.net> schrieb:
> 
>> That also explains the report of a NOCOW VM-image still triggering the
>> snapshot-aware-defrag-related pathology.  It was a _heavily_ auto-
>> snapshotted btrfs (thousands of snapshots, something like every 30
>> seconds or more frequent, without thinning them down right away), and
>> the continuing VM writes would nearly guarantee that many of those
>> snapshots had unique blocks, so the effect was nearly as bad as if it
>> wasn't NOCOW at all!
> 
> The question here is: does it really make sense to take such snapshots
> of disk images that are online and running a system at the time? They
> will probably be broken after a rollback anyway - or at least I'd not
> fully trust their contents.
> 
> VM images should not be part of a subvolume of which snapshots are taken
> at a regular and short interval. The problem will go away if you follow
> this rule.
> 
> The same probably applies to any kind of file which you make nocow -
> e.g. database files. The only use case is taking _controlled_ snapshots
> - and doing it every 30 seconds is by all means NOT controlled, it's
> completely nondeterministic.

I'd absolutely agree -- and that wasn't my report, I'm just recalling it.  
At the time I didn't understand the interaction between NOCOW and 
snapshots, and couldn't see how a NOCOW file was still triggering the 
snapshot-aware-defrag pathology, which in fact we were just beginning to 
recognize based on such reports.

In fact at the time I assumed it was because the NOCOW had been added 
after the file was originally written, such that btrfs couldn't NOCOW it 
properly.  That still might have been the case, but now that I understand 
the interaction between snapshots and NOCOW, I see that such heavy 
snapshotting on an actively written VM could trigger the same issue, even 
if the NOCOW file was created properly and was indeed NOCOW when content 
was actually first written into it.
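
For what it's worth, the way to get that ordering right is to set the 
NOCOW attribute (chattr +C) on an empty file, or better, on the parent 
directory, before any data is written.  In Python terms, what chattr +C 
does is roughly the ioctl sequence below -- a sketch only, assuming 
64-bit Linux and a btrfs mount; the set_nocow helper is my own 
illustration, not a tested tool:

```python
# Rough Python equivalent of "chattr +C" -- illustrative sketch only.
# Assumes 64-bit Linux; these are the standard FS_IOC_GETFLAGS /
# FS_IOC_SETFLAGS ioctl numbers and the btrfs NOCOW inode flag.
import fcntl
import os
import struct

FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL     = 0x00800000  # "C" attribute: no copy-on-write

def set_nocow(path):
    """Set NOCOW on path; only effective if done before data is written."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Read the current inode flags, then write them back with
        # the NOCOW bit added.
        buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack('L', 0))
        flags, = struct.unpack('L', buf)
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack('L', flags | FS_NOCOW_FL))
    finally:
        os.close(fd)
```

Setting the flag on the directory first means files created in it 
inherit NOCOW at creation time, which sidesteps the written-then-flagged 
ordering problem entirely.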

But definitely agreed.  30 second snapshotting, with a 30 second commit 
interval, is pretty much off the deep end regardless of the content.  I'd 
even argue that 1 minute snapshotting, without thinning the snapshots 
down to say 5 or 10 minute intervals after an hour or so, is too extreme 
to be practical.  Even after a couple days of that, how are you going to 
manage the thousands of snapshots, or know which precise snapshot to 
roll back to if you had to?  That's why, in the example I posted here 
some days ago of what I considered the extreme end of practical, IIRC I 
had it take 1 minute snapshots but thin them down to 5 or 10 minute 
intervals after a couple hours and to half-hour intervals after a couple 
days, keeping something like 90 day snapshots out to a decade.  Even 
that I considered extreme, although at least reasonably so.  But the 
point was that even starting with something as extreme as 1 minute 
snapshots and keeping snapshots for a decade, with reasonable thinning 
the result was still very manageable, something like 250 snapshots 
total, well below the thousands or tens of thousands we're sometimes 
seeing in reports.  Those numbers are hardly practical no matter how you 
slice it: how likely are you to know the exact minute to roll back to, 
even a month out?  And if you can survive a month before detecting the 
problem, how important can rolling back to precisely the last minute 
before it really be?  At a month out, perhaps the hour matters, but the 
minute?
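
To put rough numbers on that kind of schedule: the exact tier boundaries 
below are my own illustrative spacings, not the precise ones from the 
earlier post, but any thinning schedule of this shape lands in the low 
hundreds of retained snapshots rather than thousands:

```python
# Count the snapshots retained under a tiered thinning schedule.
# The tiers are an assumption (illustrative spacings/windows), chosen
# to match the shape described above: 1-minute snaps thinned to coarser
# spacing with age, out to a decade.
MIN, HOUR, DAY = 1, 60, 60 * 24   # everything in minutes
YEAR = 365 * DAY

tiers = [
    # (spacing, window end): keep one snapshot per <spacing>
    # for snapshots aged up to <window end>.
    (1 * MIN,  1 * HOUR),    # last hour: every minute
    (10 * MIN, 6 * HOUR),    # up to 6 hours: every 10 minutes
    (30 * MIN, 2 * DAY),     # up to 2 days: every half hour
    (6 * HOUR, 14 * DAY),    # up to 2 weeks: every 6 hours
    (7 * DAY,  90 * DAY),    # up to 90 days: weekly
    (90 * DAY, 10 * YEAR),   # out to a decade: every 90 days
]

def snapshot_count(tiers):
    total, prev_end = 0, 0
    for spacing, end in tiers:
        total += (end - prev_end) // spacing
        prev_end = end
    return total

print(snapshot_count(tiers))  # 271 with these tiers -- same ballpark as ~250
```

Compare that to per-minute snapshots with no thinning at all: over 1400 
snapshots per day, tens of thousands within a month.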

But some of the snapshotting scripts out there, and the admins running 
them, seem to operate on the idea that just because it's possible, it 
must be done: they take snapshots every minute or more frequently, with 
no automated snapshot thinning at all.  IMO that's pathology run amok 
even if btrfs /were/ stable and mature and /could/ handle it properly.

That's regardless of the content, so it's a different angle than the one 
you were attacking the problem from.  But if admins aren't able to 
recognize the problem with per-minute snapshots, without any thinning at 
all, for days, weeks, months on end, I doubt they'll be any better at 
recognizing that VMs, databases, etc., should have a dedicated subvolume.  
Taking the long view, with a bit of luck we'll get to the point where 
database and VM setup scripts and/or documentation recommend setting NOCOW 
on the directory the VMs/DBs/etc. will live in, but in practice even 
that's pushing it, and will take some time (2-5 years) as btrfs stabilizes 
and mainstreams, taking over from ext4 as the assumed Linux default.  
Other than that, I guess it'll be handled case by case as people report 
problems here.  But with a snapshot-aware defrag that actually scales, 
hopefully there won't be so many people reporting problems.  True, they 
might not have the best optimized systems and may have some minor 
pathologies in their admin practices, but as long as those remain /minor/ 
pathologies, because btrfs can deal with them better than it does now, 
thus keeping them from becoming /major/ pathologies...


But be that as it may, since such extreme snapshotting /is/ possible, and 
with automation and downloadable snapper scripts somebody WILL be doing 
it, btrfs should scale to it if it is to be considered mature and 
stable.  People don't want a filesystem that's going to fall over on them 
and lose data, or simply become unworkably live-locked, just because they 
didn't know what they were doing when they set up the snapper script and 
set it to 1 minute snaps without any corresponding thinning after an hour 
or a day or whatever.


Anyway, the temporary snapshot-aware-defrag disable commit is now in 
mainline, committed shortly after 3.14-rc1 so it'll be in rc2, giving the 
devs some breathing room to work out a solution that scales rather better 
than what we had.  So defragging is (hopefully temporarily) not snapshot-
aware again ATM, but the pathological snapshot-aware-defrag scaling 
issues are at least confined to a bounded set of kernel releases now, so 
the immediately critical problem should die down to some extent as the 
related commits (the patches did need some backporting rework, 
apparently) hit stable.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
