Hi,
On 09/02/2015 08:49 PM, Greg Stark wrote:
> On 2 Sep 2015 14:54, "Andres Freund" <and...@anarazel.de> wrote:
>> + /*----------
>> +  * The user-visible GUC parameter is the number of drives (spindles),
>> +  * which we need to translate to a number-of-pages-to-prefetch target.
>> +  * The target value is stashed in *extra and then assigned to the actual
>> +  * variable by assign_effective_io_concurrency.
>> +  *
>> +  * The expected number of prefetch pages needed to keep N drives busy is:
>> +  *
>> +  *  drives |   I/O requests
>> +  *  -------+----------------
>> +  *       1 |   1
>> +  *       2 |   2/1 + 2/2 = 3
>> +  *       3 |   3/1 + 3/2 + 3/3 = 5 1/2
>> +  *       4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>> +  *       n |   n * H(n)
>>
>> I know you just moved this code. But: I don't buy this formula. Like at
>> all. Doesn't queuing and reordering entirely invalidate the logic here?
> I can take the blame for this formula.
>
> It's called the "Coupon Collector Problem". If you get a random coupon
> from a set of n possible coupons, how many coupons would you have to
> collect before you expect to have at least one of each?
>
> This computation model assumes we have no information about which
> spindle each block will hit. That's basically true for bitmapheapscan
> in most cases, because the idea of bitmapheapscan is to pick a sparse
> set of blocks and there's no reason the blocks being read will have any
> regularity that causes them all to fall on the same spindles. If in
> fact you're reading a fairly dense set then bitmapheapscan probably is
> a waste of time and simply reading sequentially would be exactly as
> fast or even faster.
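
For concreteness, the translation the quoted comment describes boils down
to a simple harmonic sum. A minimal standalone sketch (illustrative only,
not the actual GUC check/assign code):

    #include <stdio.h>

    /* target = n * H(n), computed as the sum of n/i for i = 1..n */
    static double
    prefetch_target(int drives)
    {
        double      target = 0.0;

        for (int i = 1; i <= drives; i++)
            target += (double) drives / (double) i;

        return target;
    }

    int
    main(void)
    {
        for (int n = 1; n <= 4; n++)
            printf("%d drive(s) -> %.2f prefetch pages\n",
                   n, prefetch_target(n));
        /* prints 1.00, 3.00, 5.50, 8.33 -- matching the table above */
        return 0;
    }

For larger n the target grows roughly as n * ln(n), since H(n) is
approximately ln(n) + 0.577.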
There are different meanings of "busy". If I get the coupon collector
problem right (after quickly skimming the Wikipedia article today), it
effectively makes sure that each "spindle" has at least 1 request in the
queue. Which sucks in practice, because on spinning rust it makes
queuing (TCQ/NCQ) totally inefficient, and on SSDs it saturates only one
of the multiple channels.
On spinning drives, it's usually good to keep the iodepth >= 4. For
example, this 10k Seagate drive [1] can do ~450 random IOPS with
iodepth=16, while a 10k drive should only be able to do ~150 IOPS with
iodepth=1. The other SAS drives in that review behave quite similarly.

[1] http://www.storagereview.com/seagate_enterprise_performance_10k_hdd_savvio_10k6_review
On SSDs, good values usually start at 16, depending on the model (and
controller) and on the size (large SSDs are basically multiple small
ones glued together, and thus have more channels).

This is why the numbers from the coupon collector formula are way too
low in many cases. (OTOH this is done per backend, so if there are
multiple backends doing prefetching ...)
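
To put rough numbers on that: dividing the n * H(n) target by the number
of drives leaves just H(n), i.e. the per-drive queue depth the formula
actually aims for. A quick illustrative sketch:

    #include <stdio.h>

    int
    main(void)
    {
        /* per-drive depth implied by the n * H(n) formula is simply H(n) */
        for (int n = 1; n <= 16; n *= 2)
        {
            double      h = 0.0;

            for (int i = 1; i <= n; i++)
                h += 1.0 / i;

            printf("%2d drives -> ~%.1f requests per drive\n", n, h);
        }
        /* roughly 1.0, 1.5, 2.1, 2.7, 3.4 -- well below iodepth 4, let alone 16 */
        return 0;
    }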
> We talked about this quite a bit back then and there was no dispute
> that the aim is to provide GUCs that mean something meaningful to the
> DBA who can actually measure them. They know how many spindles they
> have. They do not know what the optimal prefetch depth is and the only
> way to determine it would be to experiment with Postgres.
As I explained, spindles have very little to do with it - you need
multiple I/O requests per device to get the benefit. Sure, the DBAs
should know how many spindles they have and should be able to determine
the optimal IO depth. But we actually say this in the docs:

    A good starting point for this setting is the number of separate
    drives comprising a RAID 0 stripe or RAID 1 mirror being used for
    the database. (For RAID 5 the parity drive should not be counted.)
    However, if the database is often busy with multiple queries
    issued in concurrent sessions, lower values may be sufficient to
    keep the disk array busy. A value higher than needed to keep the
    disks busy will only result in extra CPU overhead.

So we recommend the number of drives as a good starting value, and then
warn against increasing the value further.
Moreover, ISTM it's very unclear what value to use even if you know the
number of devices and the optimal iodepth. Setting (devices * iodepth)
doesn't really make much sense, because that effectively computes

    (devices*iodepth) * H(devices*iodepth)

which says "there are (devices*iodepth) devices, make sure there's at
least one request for each of them", right? I guess we actually want

    (devices*iodepth) * H(devices)

Sadly that means we'd have to introduce another GUC, because we'd need
to track both ndevices and iodepth.

There probably is a value X so that

    X * H(X) ~= (devices*iodepth) * H(devices)

but it's far from clear that's what we need (and it's certainly not
what the docs describe).
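
To make the gap concrete, here's a sketch with hypothetical numbers - 4
devices with an optimal iodepth of 16 (made-up values, not
measurements):

    #include <stdio.h>

    static double
    harmonic(int n)
    {
        double      h = 0.0;

        for (int i = 1; i <= n; i++)
            h += 1.0 / i;

        return h;
    }

    int
    main(void)
    {
        int         devices = 4;       /* hypothetical */
        int         iodepth = 16;      /* hypothetical */
        int         n = devices * iodepth;
        int         x;

        /* what the current translation computes if the GUC is set to d*q */
        printf("(d*q) * H(d*q) = %.1f\n", n * harmonic(n));         /* ~303.6 */

        /* what we arguably want: iodepth requests per device */
        printf("(d*q) * H(d)   = %.1f\n", n * harmonic(devices));   /* ~133.3 */

        /* the GUC setting X whose translation comes closest to that target */
        for (x = 1; x * harmonic(x) < n * harmonic(devices); x++)
            ;
        printf("X ~= %d\n", x);                                      /* ~33 */

        return 0;
    }

With those numbers, setting the GUC to devices*iodepth = 64 translates
into ~300 outstanding requests, more than double the ~133 we'd
presumably want, while a setting of about 33 lands on that target.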
> Worse, I think the above formula works for essentially random I/O but
> for more predictable I/O it might be possible to use a different
> formula. But if we made the GUC something low level like "how many
> blocks to prefetch" then we're left in the dark about how to handle
> that different access pattern.
Maybe. We only use this prefetching in bitmap heap scans at this point,
and I don't see any proposals to introduce it elsewhere. So no opinion.
> I did speak to a dm developer and he suggested that the kernel could
> help out with an API. He suggested something of the form "how many
> blocks do I have to read before the end of the current device". I
> wasn't sure exactly what we would do with something like that but it
> would be better than just guessing how many I/O operations we need
> to issue to keep all the spindles busy.
I don't really see how that would help us?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services