Dmitry Katsubo posted on Sun, 18 Oct 2015 11:44:08 +0200 as excerpted:

>> Meanwhile, the present btrfs raid1 read-scheduler is both pretty simple
>> to code up and pretty simple to arrange tests for that run either one
>> side or the other, but not both, or that are well balanced to both.
>> However, it's pretty poor in terms of ensuring optimized real-world
>> deployment read-scheduling.
>> 
>> What it does is simply this.  Remember, btrfs raid1 is specifically two
>> copies.  It chooses which copy of the two will be read very simply,
>> based on the PID making the request.  Odd PIDs get assigned one copy,
>> even PIDs the other.  As I said, simple to code, great for ensuring
>> testing of one copy or the other or both, but not really optimized at
>> all for real-world usage.
>> 
>> If your workload happens to be a bunch of all odd or all even PIDs,
>> well, enjoy your testing-grade read-scheduler, bottlenecking everything
>> reading one copy, while the other sits entirely idle.
> 
> I think the PID-based solution is not the best one. Why not simply take
> a random device? Then at least all the drives in the volume are equally
> loaded (on average).

Nobody argues that the even/odd-PID-based read-scheduling solution is 
/optimal/, in a production sense at least.  But at the time, and for the 
purpose it was written, it was pretty good, arguably reasonably close to 
"best": the implementation is at once simple and transparent for 
debugging purposes, and it is really easy to test either one side or the 
other, or both, and, equally important, to duplicate the results of 
those tests, simply by arranging for the testing processes to have 
either all even or all odd PIDs, or a mix of both.  And for ordinary use 
it's good /enough/, since ordinarily PIDs will be evenly distributed 
between even and odd.
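
For anyone who hasn't read the code, the idea boils down to something 
like the following userspace sketch (my own illustration, *not* the 
actual kernel code, which lives in fs/btrfs/volumes.c and, if memory 
serves, does the equivalent of the task's PID modulo the number of 
copies):

#include <stdio.h>
#include <unistd.h>

#define NUM_COPIES 2	/* btrfs raid1 is specifically two copies */

/* Even PIDs read copy 0, odd PIDs read copy 1. */
static int pick_mirror(pid_t pid)
{
	return pid % NUM_COPIES;
}

int main(void)
{
	pid_t pid = getpid();

	printf("PID %d reads copy %d\n", (int)pid, pick_mirror(pid));
	return 0;
}

So a single-process workload, or one whose PIDs all happen to share the 
same parity, hits exactly one of the two copies every time, which is 
exactly what makes it so handy for testing and so unimpressive as a 
production load-balancer.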

In that context, your random-device read-scheduling algorithm would be 
far worse: while it is reasonably simple, it's anything *but* easy to 
ensure reads go only to one side, or equally to both, or, for that 
matter, to duplicate the tests, because randomization by definition does 
/not/ lend itself to duplication.
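
To make the testing point concrete, here's a hypothetical test helper 
(a sketch of mine, not anything that actually exists in btrfs-progs or 
the kernel selftests): with the PID-parity policy a test can force every 
read onto one specific copy purely from the outside, by making sure the 
reading process has the desired PID parity, for instance by forking 
until it gets one.  A randomized policy offers no equivalent external 
knob, short of adding one.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

/* Run fn() in a child whose PID has the requested parity (0=even, 1=odd). */
static void run_with_pid_parity(int parity, void (*fn)(void))
{
	for (;;) {
		pid_t child = fork();

		if (child < 0) {
			perror("fork");
			exit(1);
		}
		if (child == 0) {
			if (getpid() % 2 == parity) {
				fn();	/* right parity: do the reads */
				_exit(0);
			}
			_exit(2);	/* wrong parity: parent retries */
		}
		int status;
		waitpid(child, &status, 0);
		if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
			return;
	}
}

static void do_reads(void)
{
	printf("PID %d: raid1 reads from this process all hit one copy\n",
	       (int)getpid());
}

int main(void)
{
	run_with_pid_parity(1, do_reads);	/* all-odd-PID test run */
	run_with_pid_parity(0, do_reads);	/* all-even-PID test run */
	return 0;
}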

And both simplicity/transparency/debuggability and reproducibility of 
testing were primary factors when the code went in...

And again, the fact that it hasn't been optimized since then, in the 
context of "premature optimization", really says quite a bit about what 
the btrfs devs themselves consider btrfs' status to be -- obviously *not* 
production-grade stable and mature, or optimizations like this would have 
already been done.

Like it or not, that's btrfs' status at the moment.

Actually, the coming N-way-mirroring may very well be why they haven't 
yet optimized the even/odd-PID mechanism: an optimized two-way scheduler 
would obviously be premature optimization given the coming N-way 
support, while an optimized N-way scheduler clearly couldn't be properly 
tested at present, because only two-way is possible.  Introducing an 
optimized N-way scheduler together with the N-way-mirroring code 
necessary to properly test it thus becomes a no-brainer.

> From what you said I believe that certain servers will not benefit from
> btrfs, e.g. a dedicated server that runs only one "fat" Java process,
> or one "huge" MySQL database.

Indeed.  But with btrfs still "stabilizing, but not entirely stable and 
mature", with various features still yet to land, and with various 
optimizations, including this one, still to be done, nobody, least of 
all the btrfs devs and the knowledgeable regulars on this list, is 
/claiming/ that btrfs is at this time the be-all and end-all optimal 
solution for every single use-case.  Rather far from it!

As for the claims of salespeople... should any of them be making wild 
claims about btrfs, who in their right mind takes a salesperson's claims 
at face value in any case?

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
