On Wed, Oct 12, 2016 at 05:10:18PM -0400, Zygo Blaxell wrote:
> On Wed, Oct 12, 2016 at 09:55:28PM +0200, Adam Borowski wrote:
> > On Wed, Oct 12, 2016 at 01:19:37PM -0400, Zygo Blaxell wrote:
> > > I had been thinking that we could inject "plug" extents to fill up
> > > RAID5 stripes.
> > Your idea sounds good, but there's one problem: most real users don't
> > balance.  Ever.  Contrary to the tribal wisdom here, this actually works
> > fine, unless you had a pathological load skewed to either data or metadata
> > on the first write and then filled the disk to near-capacity with a load
> > skewed the other way.
> 
> > Most usage patterns produce a mix of transient and persistent data (and at
> > write time you don't know which file is which), meaning that with time every
> > stripe will contain a smidge of cold data plus a fill of plug extents.
> 
> Yes, it'll certainly reduce storage efficiency.  I think all the
> RMW-avoidance strategies have this problem.  The alternative is to risk
> losing data or the entire filesystem on disk failure, so any of the
> RMW-avoidance strategies are probably a worthwhile tradeoff.  Big RAID5/6
> arrays tend to be used mostly for storing large sequentially-accessed
> files which are less susceptible to this kind of problem.
> 
> If the pattern is lots of small random writes then performance on raid5
> will be terrible anyway (though it may even be improved by using plug
> extents, since RMW stripe updates would be replaced with pure CoW).

I've looked at some simple scenarios, and it appears that, with your scheme,
the total amount of I/O would increase, but performance would not suffer, as
the increases happen only when the disk would otherwise be idle.  There's
also a latency win and a fragmentation win -- all while fixing the write
hole!

Let's assume a leaf size of 16KB and a stripe size of 64KB.  The disk has
four stripes, each 75% full and 25% deleted.  '*' marks cold data, '.'
deleted/plug space, 'x' new data.  I'm not drawing entirely empty stripes.
***.
***.
***.
***.
The user wants to write 64KB of data.
RMW needs to read 12 leaves and write 16, no matter whether the data comes in
one commit or four.
***x
***x
***x
***x
Latency: 28 for a big commit, or 7 per commit for small commits; total I/O 28.
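
To make the arithmetic explicit, here's roughly how I'm counting -- a
throwaway Python sketch, not real code; parity traffic isn't counted, RMW is
modelled as "read the live leaves, rewrite the whole stripe", and the names
are invented for illustration:

# Back-of-envelope accounting for the RMW case (16KB leaves, 64KB data
# stripes, parity I/O ignored).
LEAVES_PER_STRIPE = 4          # 64KB stripe / 16KB leaf

def rmw_io(stripes_touched, live_leaves_per_stripe):
    reads = stripes_touched * live_leaves_per_stripe   # re-read the cold data
    writes = stripes_touched * LEAVES_PER_STRIPE       # rewrite the full stripe
    return reads, writes

reads, writes = rmw_io(stripes_touched=4, live_leaves_per_stripe=3)
print(reads, writes, reads + writes)                   # -> 12 16 28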

The plug-extent scheme requires compaction (a partial balance) first:
****
****
****
I/O so far: 24 (12 reads + 12 writes).
Big commit:
****
****
****
xxxx
Latency 4, total I/O 28.
If we had to compact on demand, the latency would be 28 (assuming we can do a
stripe-sized balance).

Small commits, no concurrent writes:
****
****
****
x...
x...
x...
x...
Latency 1 per commit, I/O so far 28; we then need another compaction:
****
****
****
xxxx
Total I/O 32.

Small I/O, with concurrent writes ('y') that peg the disk:
****
****
****
xyyy
xyyy
xyyy
xyyy
Total I/O 28 (not counting concurrent writes).
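
The same kind of tally for the plug-extent variants above -- again only a
sketch; parity and the concurrent 'y' writes aren't counted, and I'm assuming
the follow-up compaction in the small-commit case can write the still-cached
leaves without re-reading them (that's how I arrive at 32):

# Compaction: read 12 cold leaves, rewrite them as 3 full stripes.
compaction = 12 + 12

big_commit   = compaction + 4      # one full new stripe               -> 28
small_commit = compaction + 4 + 4  # 4 plugged stripes, then recompact -> 32
concurrent   = compaction + 4      # only our own 4 leaves counted     -> 28

print(big_commit, small_commit, concurrent)            # -> 28 32 28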


Other scenarios I've analyzed give similar results.

I'm not sure if my thinking is correct, but if it is, the outcome is quite
surprising: no performance loss even though we had to rewrite the stripes!

> > Thus, while the plug extents idea doesn't suffer from problems of big
> > sectors you just mentioned, we'd need some kind of auto-balance.
> 
> Another way to approach the problem is to relocate the blocks in
> partially filled RMW stripes so they can be effectively CoW stripes;
> however, the requirement to do full extent relocations leads to some
> nasty write amplification and performance ramifications.  Balance is
> hugely heavy I/O load and there are good reasons not to incur it at
> unexpected times.

We don't need balance in the btrfs sense; it's enough to compact stripes --
i.e., something akin to balance, except done at the stripe level rather than
at the allocation block level.
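
Something like this rough Python sketch is what I have in mind (hypothetical
structures only, nothing resembling actual btrfs code):

LEAVES_PER_STRIPE = 4

class Stripe:
    def __init__(self, live_leaves=()):
        self.live = list(live_leaves)       # cold data still referenced
    def has_plugs(self):
        return len(self.live) < LEAVES_PER_STRIPE

def compact(stripes):
    victims = [s for s in stripes if s.has_plugs()]
    live = [leaf for s in victims for leaf in s.live]
    # relocate as full-stripe CoW writes, so no RMW of parity is needed
    new = [Stripe(live[i:i + LEAVES_PER_STRIPE])
           for i in range(0, len(live), LEAVES_PER_STRIPE)]
    keep = [s for s in stripes if not s.has_plugs()]
    return keep + new                       # victims' space is entirely free

# e.g. four 75%-full stripes compact into three full ones:
print([len(s.live) for s in compact([Stripe("abc") for _ in range(4)])])
# -> [4, 4, 4]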

As for write amplification, the F2FS folks solved the issue by having two
types of cleaning (their equivalent of balancing):
* on demand (when there is no free space and it thus needs to be done NOW)
* in the background (done only on cold data)

The on-demand clean goes for the juiciest targets first (the least live data
per stripe); the background clean, on the other hand, uses a formula that
takes into account both the amount of space to reclaim and the age of the
stripe.  If the data is hot, it shouldn't be cleaned yet -- it's likely to be
deleted or modified soon.
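
For illustration, here are the two victim-selection policies as I understand
them -- not F2FS code; the background score (1 - u) * age / (1 + u) is the
classic LFS-style cost-benefit formula, with u being the fraction of live
data:

def pick_victim(candidates, on_demand):
    """candidates: list of (live_fraction, age) tuples."""
    if on_demand:
        # greedy: the most reclaimable space per unit of work, right now
        return min(candidates, key=lambda c: c[0])
    # background: prefer old, mostly-empty victims; hot data can wait
    return max(candidates, key=lambda c: (1 - c[0]) * c[1] / (1 + c[0]))

print(pick_victim([(0.25, 10), (0.50, 500), (0.90, 1000)], on_demand=True))
# -> (0.25, 10): the emptiest stripe wins
print(pick_victim([(0.25, 10), (0.50, 500), (0.90, 1000)], on_demand=False))
# -> (0.5, 500): age outweighs the smaller space gain of the young stripe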


Meow!
-- 
A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol, 1kg
raspberries, 0.4kg sugar; put into a big jar for 1 month.  Filter out and
throw away the fruits (can dump them into a cake, etc), let the drink age
at least 3-6 months.