Hi,
On 09/02/2015 08:49 PM, Greg Stark wrote:
> On 2 Sep 2015 14:54, "Andres Freund" <and...@anarazel.de> wrote:
>> + /*----------
>> +  * The user-visible GUC parameter is the number of drives (spindles),
>> +  * which we need to translate to a number-of-pages-to-prefetch target.
>> +  * The target value is stashed in *extra and then assigned to the actual
>> +  * variable by assign_effective_io_concurrency.
>> +  *
>> +  * The expected number of prefetch pages needed to keep N drives busy is:
>> +  *
>> +  *  drives |   I/O requests
>> +  *  -------+----------------
>> +  *       1 |   1
>> +  *       2 |   2/1 + 2/2 = 3
>> +  *       3 |   3/1 + 3/2 + 3/3 = 5 1/2
>> +  *       4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>> +  *       n |   n * H(n)
>>
>> I know you just moved this code. But: I don't buy this formula. Like at
>> all. Doesn't queuing and reordering entirely invalidate the logic here?
> I can take the blame for this formula.
>
> It's called the "Coupon Collector Problem". If you get a random coupon
> from a set of n possible coupons, how many coupons would you have to
> collect before you expect to have at least one of each?
>
> This computation model assumes we have no information about which
> spindle each block will hit. That's basically true for bitmapheapscan
> in most cases, because the idea of bitmapheapscan is to pick a sparse
> set of blocks and there's no reason the blocks being read will have any
> regularity that causes them all to fall on the same spindles. If in
> fact you're reading a fairly dense set then bitmapheapscan probably is
> a waste of time and simply reading sequentially would be exactly as
> fast or even faster.
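
For concreteness, the translation the quoted comment describes boils down
to a simple harmonic sum. A minimal standalone sketch (illustrative only,
not the actual GUC check/assign code):

    #include <stdio.h>

    /* target = n * H(n), computed as the sum of n/i for i = 1..n */
    static double
    prefetch_target(int drives)
    {
        double      target = 0.0;

        for (int i = 1; i <= drives; i++)
            target += (double) drives / (double) i;

        return target;
    }

    int
    main(void)
    {
        for (int n = 1; n <= 4; n++)
            printf("%d drive(s) -> %.2f prefetch pages\n",
                   n, prefetch_target(n));
        /* prints 1.00, 3.00, 5.50, 8.33 -- matching the table above */
        return 0;
    }

For larger n the target grows roughly as n * ln(n), since H(n) is
approximately ln(n) + 0.577.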
There are different meanings of "busy". If I get the coupon collector
problem right (after quickly skimming the Wikipedia article today), it
effectively makes sure that each "spindle" has at least 1 request in the
queue. Which sucks in practice, because on spinning rust it makes
queuing (TCQ/NCQ) totally inefficient, and on SSDs it saturates only one
of the multiple channels.
On spinning drives, it's usually good to keep the iodepth >= 4. For
example, this 10k Seagate drive [1] can do ~450 random IOPS with
iodepth=16, while a 10k drive should only be able to do ~150 IOPS with
iodepth=1. The other SAS drives in that review behave quite similarly.

[1] http://www.storagereview.com/seagate_enterprise_performance_10k_hdd_savvio_10k6_review
On SSDs, good values usually start at 16, depending on the model (and
controller) and on the size (large SSDs are basically multiple small
ones glued together, and thus have more channels).

This is why the numbers from the coupon collector formula are way too
low in many cases. (OTOH this is done per backend, so if there are
multiple backends doing prefetching ...)
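
To put rough numbers on that: dividing the n * H(n) target by the number
of drives leaves just H(n), i.e. the per-drive queue depth the formula
actually aims for. A quick illustrative sketch:

    #include <stdio.h>

    int
    main(void)
    {
        /* per-drive depth implied by the n * H(n) formula is simply H(n) */
        for (int n = 1; n <= 16; n *= 2)
        {
            double      h = 0.0;

            for (int i = 1; i <= n; i++)
                h += 1.0 / i;

            printf("%2d drives -> ~%.1f requests per drive\n", n, h);
        }
        /* roughly 1.0, 1.5, 2.1, 2.7, 3.4 -- well below iodepth 4, let alone 16 */
        return 0;
    }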
> We talked about this quite a bit back then and there was no dispute
> that the aim is to provide GUCs that mean something meaningful to the
> DBA who can actually measure them. They know how many spindles they
> have. They do not know what the optimal prefetch depth is and the only
> way to determine it would be to experiment with Postgres.
As I explained, spindles have very little to do with it - you need
multiple I/O requests per device to get the benefit. Sure, the DBAs
should know how many spindles they have and should be able to determine
the optimal IO depth. But we actually say this in the docs:

    A good starting point for this setting is the number of separate
    drives comprising a RAID 0 stripe or RAID 1 mirror being used for
    the database. (For RAID 5 the parity drive should not be counted.)
    However, if the database is often busy with multiple queries
    issued in concurrent sessions, lower values may be sufficient to
    keep the disk array busy. A value higher than needed to keep the
    disks busy will only result in extra CPU overhead.

So we recommend the number of drives as a good starting value, and then
warn against increasing the value further.
Moreover, ISTM it's very unclear what value to use even if you know the
number of devices and the optimal iodepth. Setting (devices * iodepth)
doesn't really make much sense, because that effectively computes

    (devices*iodepth) * H(devices*iodepth)

which says "there are (devices*iodepth) devices, make sure there's at
least one request for each of them", right? I guess we actually want

    (devices*iodepth) * H(devices)

Sadly that means we'd have to introduce another GUC, because we'd need
to track both ndevices and iodepth.

There probably is a value X so that

    X * H(X) ~= (devices*iodepth) * H(devices)

but it's far from clear that's what we need (and it's certainly not
what the docs describe).
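
To make the gap concrete, here's a sketch with hypothetical numbers - 4
devices with an optimal iodepth of 16 (made-up values, not
measurements):

    #include <stdio.h>

    static double
    harmonic(int n)
    {
        double      h = 0.0;

        for (int i = 1; i <= n; i++)
            h += 1.0 / i;

        return h;
    }

    int
    main(void)
    {
        int         devices = 4;       /* hypothetical */
        int         iodepth = 16;      /* hypothetical */
        int         n = devices * iodepth;
        int         x;

        /* what the current translation computes if the GUC is set to d*q */
        printf("(d*q) * H(d*q) = %.1f\n", n * harmonic(n));         /* ~303.6 */

        /* what we arguably want: iodepth requests per device */
        printf("(d*q) * H(d)   = %.1f\n", n * harmonic(devices));   /* ~133.3 */

        /* the GUC setting X whose translation comes closest to that target */
        for (x = 1; x * harmonic(x) < n * harmonic(devices); x++)
            ;
        printf("X ~= %d\n", x);                                      /* ~33 */

        return 0;
    }

With those numbers, setting the GUC to devices*iodepth = 64 translates
into ~300 outstanding requests, more than double the ~133 we'd
presumably want, while a setting of about 33 lands on that target.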
> Worse, I think the above formula works for essentially random I/O but
> for more predictable I/O it might be possible to use a different
> formula. But if we made the GUC something low level like "how many
> blocks to prefetch" then we're left in the dark about how to handle
> that different access pattern.
Maybe. We only use this prefetching in bitmap heap scans at this point,
and I don't see any proposals to introduce it elsewhere. So no opinion.
> I did speak to a dm developer and he suggested that the kernel could
> help out with an API. He suggested something of the form "how many
> blocks do I have to read before the end of the current device". I
> wasn't sure exactly what we would do with something like that but it
> would be better than just guessing how many I/O operations we need
> to issue to keep all the spindles busy.
I don't really see how that would help us?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services