Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
Julien Rouhaud wrote:
> Hi,
>
> Please find attached a v2 of the patch. See below for changes.

Pushed after smallish tweaks. Please test to verify I didn't break anything.

(It's a pity that we can't add a regression test with a value other than 0.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
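For anyone wanting to try the committed feature, the per-tablespace override goes through the tablespace options; a minimal sketch (the tablespace name and value are illustrative):

```sql
-- Override the global effective_io_concurrency for one tablespace;
-- bitmap heap scans on relations stored there will use this value.
ALTER TABLESPACE ssd_space SET (effective_io_concurrency = 32);

-- Revert to the global GUC value.
ALTER TABLESPACE ssd_space RESET (effective_io_concurrency);
```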
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On 08/09/2015 18:00, Alvaro Herrera wrote:
> Julien Rouhaud wrote:
>> Hi,
>>
>> Please find attached a v2 of the patch. See below for changes.
>
> Pushed after smallish tweaks. Please test to verify I didn't break
> anything.

I just tried with all the cases I could think of, everything works fine. Thanks!

> (It's a pity that we can't add a regression test with a value other than
> 0.)

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
Hi,

Please find attached a v2 of the patch. See below for changes.

On 02/09/2015 15:53, Andres Freund wrote:
> Hi,
>
> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
>> I didn't know that the thread must exist on -hackers to be able to add
>> a commitfest entry, so I transfer the thread here.
>
> Please, in the future, also update the title of the thread to something
> fitting.
>
>> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>  {
>>      BitmapHeapScanState *scanstate;
>>      Relation    currentRelation;
>> +#ifdef USE_PREFETCH
>> +    int         new_io_concurrency;
>> +#endif
>>
>>      /* check for unsupported flags */
>>      Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
>> @@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>       */
>>      currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
>>
>> +#ifdef USE_PREFETCH
>> +    /* check if the effective_io_concurrency has been overloaded for the
>> +     * tablespace storing the relation and compute the target_prefetch_pages,
>> +     * or just get the current target_prefetch_pages
>> +     */
>> +    new_io_concurrency = get_tablespace_io_concurrency(
>> +        currentRelation->rd_rel->reltablespace);
>> +
>> +    scanstate->target_prefetch_pages = target_prefetch_pages;
>> +
>> +    if (new_io_concurrency != effective_io_concurrency)
>> +    {
>> +        double prefetch_pages;
>> +
>> +        if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
>> +            scanstate->target_prefetch_pages = rint(prefetch_pages);
>> +    }
>> +#endif
>
> Maybe it's just me - but imo there should be as few USE_PREFETCH
> dependant places in the code as possible. It'll just be 0 when not
> supported, that's fine? Especially changing the size of externally
> visible structs depending on a configure detected ifdef seems wrong to
> me.

I removed these ifdefs, and the more problematic one in the struct.
>> +bool
>> +compute_io_concurrency(int io_concurrency, double *target_prefetch_pages)
>> +{
>> +    double  new_prefetch_pages = 0.0;
>> +    int     i;
>> +
>> +    /* make sure the io_concurrency value is correct, it may have been forced
>> +     * with a pg_tablespace UPDATE
>> +     */
>
> Nitpick: Wrong comment style (/* stands on its own line).
>
>> +    if (io_concurrency > MAX_IO_CONCURRENCY)
>> +        io_concurrency = MAX_IO_CONCURRENCY;
>> +
>> +    /*----------
>> +     * The user-visible GUC parameter is the number of drives (spindles),
>> +     * which we need to translate to a number-of-pages-to-prefetch target.
>> +     * The target value is stashed in *extra and then assigned to the actual
>> +     * variable by assign_effective_io_concurrency.
>> +     *
>> +     * The expected number of prefetch pages needed to keep N drives busy is:
>> +     *
>> +     *  drives |   I/O requests
>> +     *  -------+----------------
>> +     *       1 |   1
>> +     *       2 |   2/1 + 2/2 = 3
>> +     *       3 |   3/1 + 3/2 + 3/3 = 5 1/2
>> +     *       4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>> +     *       n |   n * H(n)
>
> I know you just moved this code. But: I don't buy this formula. Like at
> all. Doesn't queuing and reordering entirely invalidate the logic here?

Changing the formula, or changing the GUC to a number of pages to prefetch, is still discussed, so no change here.

> Perhaps more relevantly: Imo nodeBitmapHeapscan.c is the wrong place for
> this. bufmgr.c maybe?

Moved to bufmgr.c

> You also didn't touch
> /*
>  * How many buffers PrefetchBuffer callers should try to stay ahead of their
>  * ReadBuffer calls by.  This is maintained by the assign hook for
>  * effective_io_concurrency.  Zero means "never prefetch".
>  */
> int target_prefetch_pages = 0;
> which surely doesn't make sense anymore after these changes.
>
> But do we even need that variable now?

I slightly updated the comment. If the table doesn't belong to a tablespace with an overloaded effective_io_concurrency, keeping this pre-computed target_prefetch_pages can save a few cycles on each execution, so I think it's better to keep it.
>> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
>> index dc167f9..57008fc 100644
>> --- a/src/include/utils/guc.h
>> +++ b/src/include/utils/guc.h
>> @@ -26,6 +26,9 @@
>>  #define MAX_KILOBYTES (INT_MAX / 1024)
>>  #endif
>>
>> +/* upper limit for effective_io_concurrency */
>> +#define MAX_IO_CONCURRENCY 1000
>> +
>>  /*
>>   * Automatic configuration file name for ALTER SYSTEM.
>>   * This file will be used to store values of configuration parameters
>> @@ -256,6 +259,8 @@
>>  extern int temp_file_limit;
>>
>>  extern int num_temp_buffers;
>>
>> +extern int effective_io_concurrency;
>> +
>
> target_prefetch_pages is
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On Fri, Sep 4, 2015 at 11:21 AM, Greg Stark wrote:
> On Thu, Sep 3, 2015 at 2:13 PM, Merlin Moncure wrote:
>> I find this talk of platters and spindles to be somewhat
>> baroque; for a $200 part I have to work pretty hard to max out the
>> drive when reading and I'm still not completely sure if it's the drive
>> itself, postgres, cpu, or sata interface bottlenecking me. This will
>> require a rethink of e_i_o configuration; in the old days there were
>> physical limitations of the drives that were in the way regardless of
>> the software stack but we are in a new era, I think. I'm convinced
>> prefetching works and we're going to want to aggressively prefetch
>> anything and everything possible. SSD controllers (at least the intel
>> ones) are very smart.
>
> Wouldn't SSDs need much *less* aggressive prefetching? There's still
> latency and there are multiple I/O channels so they will still need
> some. But spinning media gives latencies measured in milliseconds. You
> can process a lot of tuples in milliseconds. If you have a hundred
> spindles you want them all busy doing seeks because in the 5ms it
> takes them to do that you can process all the results on a single cpu
> and the rest of the time is spent waiting.
>
> When your media has latency on the order of microseconds then you only
> need to have a small handful of I/O requests in flight to keep your
> processor busy.

I'm not sure I agree with that. 100 microsecond latency is still a pretty big deal when memory latency is measured in nanoseconds. This is why we have pcie storage devices. (If anyone has a fast pcie flash device I'd be interested in seeing some e_i_o tests... I bet they'd still show some impact, but it'd be less.)

For spinning media, any random I/O workload on large datasets is a freight train to 100% iowait, as each device has to process each data set synchronously, minus what small gains can be eked out by reordering and combining writes. This drives data transfer rates under a megabyte per second.
Flash devices OTOH are essentially giant raid 0 devices of electronic media underneath the flash controller. Each device can functionally do many read operations in parallel, so it makes perfect sense that they perform better when given multiple overlapping requests; unlike spinning media they can actually make use of that information, and those assumptions are well supported by measurement.

Flash is getting better all the time, and storage technologies that remove its one weakness, random writes, appear to be right around the corner. The age of storage being a bottleneck for database systems without massive high-cost engineering and optimization has come to an end. This is good for software developers and especially good for databases, because most of the perceived mythology of databases being 'slow' is in fact storage issues. With that problem gone, focus will return to clean data processing models, which is where database systems excel with their many decades of refinement.

merlin
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On 2015-09-04 17:21:38 +0100, Greg Stark wrote:
> Wouldn't SSDs need much *less* aggressive prefetching? There's still
> latency and there are multiple I/O channels so they will still need
> some. But spinning media gives latencies measured in milliseconds. You
> can process a lot of tuples in milliseconds. If you have a hundred
> spindles you want them all busy doing seeks because in the 5ms it
> takes them to do that you can process all the results on a single cpu
> and the rest of the time is spent waiting.
>
> When your media has latency on the order of microseconds then you only
> need to have a small handful of I/O requests in flight to keep your
> processor busy.

Most(?) SSDs have latencies between 0.1 and 0.4 ms for individual random reads, often significantly more when a lot of IO is going on. In that time you can still process a good number of pages on the CPU level.

Sure, there are few workloads where you need dozens of SSDs to parallelize random reads like in the rotating media days. But that doesn't mean you don't need prefetching?

Greetings,

Andres Freund
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On Fri, Sep 4, 2015 at 05:21:38PM +0100, Greg Stark wrote:
> Wouldn't SSDs need much *less* aggressive prefetching? There's still
> latency and there are multiple I/O channels so they will still need
> some. But spinning media gives latencies measured in milliseconds. You
> can process a lot of tuples in milliseconds. If you have a hundred
> spindles you want them all busy doing seeks because in the 5ms it
> takes them to do that you can process all the results on a single cpu
> and the rest of the time is spent waiting.
>
> When your media has latency on the order of microseconds then you only
> need to have a small handful of I/O requests in flight to keep your
> processor busy.

Well, there is still the processing time of getting that data ready. All I know is that people have reported that prefetching is even more useful for SSDs.

-- 
Bruce Momjian                 http://momjian.us
EnterpriseDB                  http://enterprisedb.com

+ Everyone has their own god. +
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On Thu, Sep 3, 2015 at 2:13 PM, Merlin Moncure wrote:
> I find this talk of platters and spindles to be somewhat
> baroque; for a $200 part I have to work pretty hard to max out the
> drive when reading and I'm still not completely sure if it's the drive
> itself, postgres, cpu, or sata interface bottlenecking me. This will
> require a rethink of e_i_o configuration; in the old days there were
> physical limitations of the drives that were in the way regardless of
> the software stack but we are in a new era, I think. I'm convinced
> prefetching works and we're going to want to aggressively prefetch
> anything and everything possible. SSD controllers (at least the intel
> ones) are very smart.

Wouldn't SSDs need much *less* aggressive prefetching? There's still latency and there are multiple I/O channels, so they will still need some. But spinning media gives latencies measured in milliseconds. You can process a lot of tuples in milliseconds. If you have a hundred spindles you want them all busy doing seeks, because in the 5ms it takes them to do that you can process all the results on a single cpu, and the rest of the time is spent waiting.

When your media has latency on the order of microseconds then you only need to have a small handful of I/O requests in flight to keep your processor busy.

-- 
greg
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On Wed, Sep 2, 2015 at 5:38 PM, Andres Freund wrote:
> If you additionally take into account hardware realities where you have
> multiple platters, multiple spindles, command queueing etc, that's even
> more true. A single rotation of a single platter with command queuing
> can often read several non consecutive blocks if they're on a similar

Yeah. And in the case of solid state disks, it's really a much more simple case of "synchronously reading from the disk block by block does not fully utilize the drive because of various introduced latencies".

I find this talk of platters and spindles to be somewhat baroque; for a $200 part I have to work pretty hard to max out the drive when reading, and I'm still not completely sure if it's the drive itself, postgres, cpu, or sata interface bottlenecking me. This will require a rethink of e_i_o configuration; in the old days there were physical limitations of the drives that were in the way regardless of the software stack, but we are in a new era, I think. I'm convinced prefetching works and we're going to want to aggressively prefetch anything and everything possible. SSD controllers (at least the intel ones) are very smart.

merlin
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
Hi

On 09/02/2015 03:53 PM, Andres Freund wrote:
> Hi,
>
> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
>> I didn't know that the thread must exist on -hackers to be able to add
>> a commitfest entry, so I transfer the thread here.
>
> Please, in the future, also update the title of the thread to something
> fitting.
>
>> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>  {
>>      BitmapHeapScanState *scanstate;
>>      Relation    currentRelation;
>> +#ifdef USE_PREFETCH
>> +    int         new_io_concurrency;
>> +#endif
>>
>>      /* check for unsupported flags */
>>      Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
>> @@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>       */
>>      currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
>>
>> +#ifdef USE_PREFETCH
>> +    /* check if the effective_io_concurrency has been overloaded for the
>> +     * tablespace storing the relation and compute the target_prefetch_pages,
>> +     * or just get the current target_prefetch_pages
>> +     */
>> +    new_io_concurrency = get_tablespace_io_concurrency(
>> +        currentRelation->rd_rel->reltablespace);
>> +
>> +    scanstate->target_prefetch_pages = target_prefetch_pages;
>> +
>> +    if (new_io_concurrency != effective_io_concurrency)
>> +    {
>> +        double prefetch_pages;
>> +
>> +        if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
>> +            scanstate->target_prefetch_pages = rint(prefetch_pages);
>> +    }
>> +#endif
>
> Maybe it's just me - but imo there should be as few USE_PREFETCH
> dependant places in the code as possible. It'll just be 0 when not
> supported, that's fine?

It's not just you. Dealing with code with plenty of ifdefs is annoying - it compiles just fine most of the time, until you compile it with some rare configuration. Then it either starts producing strange warnings or the compilation fails entirely.

It might make a tiny difference on builds without prefetching support because of code size, but realistically I think it's noise.
> Especially changing the size of externally visible structs depending on
> a configure detected ifdef seems wrong to me.

+100 to that

>> +    /*----------
>> +     * The user-visible GUC parameter is the number of drives (spindles),
>> +     * which we need to translate to a number-of-pages-to-prefetch target.
>> +     * The target value is stashed in *extra and then assigned to the actual
>> +     * variable by assign_effective_io_concurrency.
>> +     *
>> +     * The expected number of prefetch pages needed to keep N drives busy is:
>> +     *
>> +     *  drives |   I/O requests
>> +     *  -------+----------------
>> +     *       1 |   1
>> +     *       2 |   2/1 + 2/2 = 3
>> +     *       3 |   3/1 + 3/2 + 3/3 = 5 1/2
>> +     *       4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>> +     *       n |   n * H(n)
>
> I know you just moved this code. But: I don't buy this formula. Like at
> all. Doesn't queuing and reordering entirely invalidate the logic here?

Well, even the comment right after the formula says that:

 * Experimental results show that both of these formulas aren't
 * aggressive enough, but we don't really have any better proposals.

That's the reason why users generally either use 0 or some rather high value (16 or 32 are the most common values seen). The problem is that we don't really care about the number of spindles (and not just because SSDs don't have them at all), but about the target queue length per device. Spinning rust uses TCQ/NCQ to optimize the head movement, SSDs are parallel by nature (stacking multiple chips with separate channels).

I doubt we can really improve the formula, except maybe for saying "we want 16 requests per device" and multiplying the number by that. We don't really have the necessary introspection to determine better values (and it's not really possible, because the devices may be hidden behind a RAID controller or a SAN). So we can't really do much.

Maybe the best thing we can do is just completely abandon the "number of spindles" idea, and just say "number of I/O requests to prefetch". Possibly with an explanation of how to estimate it (devices * queue length).
regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote:
> Maybe the best thing we can do is just completely abandon the "number of
> spindles" idea, and just say "number of I/O requests to prefetch".
> Possibly with an explanation of how to estimate it (devices * queue
> length).

I think that'd be a lot better.
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On 2015-09-03 01:59:13 +0200, Tomas Vondra wrote:
> That's a bit surprising, especially considering that e_i_c=30 means ~100
> pages to prefetch if I'm doing the math right.
>
> AFAIK queue depth for SATA drives generally is 32 (so prefetching 100
> pages should not make a difference), 256 for SAS drives and ~1000 for
> most current RAID controllers.

I think the point is that after an initial buildup we'll again only prefetch pages in smaller increments, because we already prefetched most pages. The actual number of concurrent reads will then be determined by how fast pages are processed. A prefetch_target of 100 does *not* mean 100 requests are continuously in flight. And even if it did, the OS won't issue many more requests than the queue depth, so they'll just sit in the OS queue.

> So instead of "How many blocks I need to prefetch to saturate the
> devices?" you're asking "How many blocks I need to prefetch to never
> actually wait for the I/O?"

Yes, pretty much.

> I do like this view, but I'm not really sure how we could determine the
> right value? It seems to be very dependent on hardware and workload.

Right.

> For spinning drives the speedup comes from optimizing random seeks to a
> more optimal path (thanks to NCQ/TCQ), and on SSDs thanks to using the
> parallel channels (and possibly faster access to the same block).

+ OS reordering & coalescing. Don't forget that the OS processes the OS IO queues while the userland process isn't scheduled - in a concurrent workload with more processes than hardware threads, that means pending OS requests are being issued while the query isn't actively being processed.

> I guess the best thing we could do at this level is simply keep the
> on-device queues fully saturated, no?

Well, being too aggressive can hurt throughput and latency of concurrent processes, without being beneficial.
Andres Freund
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On 09/03/2015 12:23 AM, Andres Freund wrote:
> On 2015-09-02 14:31:35 -0700, Josh Berkus wrote:
>> On 09/02/2015 02:25 PM, Tomas Vondra wrote:
>>> As I explained, spindles have very little to do with it - you need
>>> multiple I/O requests per device, to get the benefit. Sure, the DBAs
>>> should know how many spindles they have and should be able to
>>> determine optimal IO depth. But we actually say this in the docs:
>>
>> My experience with performance tuning is that values above 3 have no
>> real effect on how queries are executed.
>
> I saw pretty much the opposite - the benefits seldom were significant
> below 30 or so. Even on single disks.

That's a bit surprising, especially considering that e_i_c=30 means ~100 pages to prefetch if I'm doing the math right.

AFAIK queue depth for SATA drives generally is 32 (so prefetching 100 pages should not make a difference), 256 for SAS drives and ~1000 for most current RAID controllers.

I'm not entirely surprised that values beyond 30 make a difference, but that you seldom saw significant improvements below this value. No doubt there are workloads like that, but I'd expect them to be quite rare and not prevalent as you're claiming.

> Which actually isn't that surprising - to be actually beneficial (that
> is, turn an IO into a CPU bound workload) the prefetched buffer needs to
> actually have been read in by the time it's needed. In many queries
> processing a single heap page takes far shorter than prefetching the
> data from storage, even if it's on good SSDs.
>
> Therefore what you actually need is a queue of prefetches for the next
> XX buffers, so that between starting a prefetch and actually needing the
> buffer enough time has passed that the data is completely read in. And
> the point is that that's the case even for a single rotating disk!

So instead of "How many blocks I need to prefetch to saturate the devices?" you're asking "How many blocks I need to prefetch to never actually wait for the I/O?"
I do like this view, but I'm not really sure how we could determine the right value? It seems to be very dependent on hardware and workload.

For spinning drives the speedup comes from optimizing random seeks to a more optimal path (thanks to NCQ/TCQ), and on SSDs thanks to using the parallel channels (and possibly faster access to the same block).

I guess the best thing we could do at this level is simply keep the on-device queues fully saturated, no?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
That doesn't match any of the empirical tests I did at the time. I posted graphs of the throughput for varying numbers of spindles with varying amounts of prefetch. In every case more prefetching increases throughput up to N times the single-platter throughput, where N was the number of spindles.

There can be a small speedup from overlapping CPU with I/O, but that's a really small effect. At most that can be a single request, and only very rarely would the amount of CPU time be even a moderate fraction of the I/O latency. The only case where they're comparable is when you're reading sequentially, and then hopefully you wouldn't be using postgres prefetching at all, which is only really intended to help random I/O.

On 2 Sep 2015 23:38, "Andres Freund" wrote:
> On 2015-09-02 19:49:13 +0100, Greg Stark wrote:
>> I can take the blame for this formula.
>>
>> It's called the "Coupon Collector Problem". If you get a random
>> coupon from a set of n possible coupons, how many random coupons would
>> you have to collect before you expect to have at least one of each.
>
> My point is that that's just the entirely wrong way to model
> prefetching. Prefetching can be massively beneficial even if you only
> have a single platter! Even if there were no queues on the hardware or
> OS level! Concurrency isn't the right way to look at prefetching.
>
> You need to prefetch so far ahead that you'll never block on reading
> heap pages - and that's only the case if processing the next N heap
> blocks takes longer than the prefetch of the N+1 th page. That doesn't
> mean there continuously have to be N+1 prefetches in progress - in fact
> that actually often will only be the case for the first few, after that
> you hopefully are bottlenecked on CPU.
>
> If you additionally take into account hardware realities where you have
> multiple platters, multiple spindles, command queueing etc, that's even
> more true.
> A single rotation of a single platter with command queuing
> can often read several non consecutive blocks if they're on a similar
>
> - Andres
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On 2 Sep 2015 14:54, "Andres Freund" wrote:
>> +    /*----------
>> +     * The user-visible GUC parameter is the number of drives (spindles),
>> +     * which we need to translate to a number-of-pages-to-prefetch target.
>> +     * The target value is stashed in *extra and then assigned to the actual
>> +     * variable by assign_effective_io_concurrency.
>> +     *
>> +     * The expected number of prefetch pages needed to keep N drives busy is:
>> +     *
>> +     *  drives |   I/O requests
>> +     *  -------+----------------
>> +     *       1 |   1
>> +     *       2 |   2/1 + 2/2 = 3
>> +     *       3 |   3/1 + 3/2 + 3/3 = 5 1/2
>> +     *       4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>> +     *       n |   n * H(n)
>
> I know you just moved this code. But: I don't buy this formula. Like at
> all. Doesn't queuing and reordering entirely invalidate the logic here?

I can take the blame for this formula.

It's called the "Coupon Collector Problem". If you get a random coupon from a set of n possible coupons, how many random coupons would you have to collect before you expect to have at least one of each?

This computation model assumes we have no information about which spindle each block will hit. That's basically true for the case of bitmapheapscan for most cases, because the idea of bitmapheapscan is to be picking a sparse set of blocks, and there's no reason the blocks being read will have any regularity that causes them all to fall on the same spindles. If in fact you're reading a fairly dense set, then bitmapheapscan probably is a waste of time and simply reading sequentially would be exactly as fast or even faster.

We talked about this quite a bit back then, and there was no dispute that the aim is to provide GUCs that mean something meaningful to the DBA, who can actually measure them. They know how many spindles they have. They do not know what the optimal prefetch depth is, and the only way to determine it would be to experiment with Postgres.
Worse, I think the above formula works for essentially random I/O, but for more predictable I/O it might be possible to use a different formula. But if we made the GUC something low level like "how many blocks to prefetch", then we're left in the dark about how to handle that different access pattern.

I did speak to a dm developer and he suggested that the kernel could help out with an API. He suggested something of the form "how many blocks do I have to read before the end of the current device". I wasn't sure exactly what we would do with something like that, but it would be better than just guessing how many I/O operations we need to issue to keep all the spindles busy.
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
Hi,

On 02/09/2015 18:06, Tomas Vondra wrote:
> Hi
>
> On 09/02/2015 03:53 PM, Andres Freund wrote:
>> Hi,
>>
>> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
>>> I didn't know that the thread must exist on -hackers to be able to add
>>> a commitfest entry, so I transfer the thread here.
>>
>> Please, in the future, also update the title of the thread to
>> something fitting.

Sorry for that.

>>> [ExecInitBitmapHeapScan patch hunk snipped; quoted in full upthread]
>>
>> Maybe it's just me - but imo there should be as few USE_PREFETCH
>> dependant places in the code as possible. It'll just be 0 when not
>> supported, that's fine?
>
> It's not just you. Dealing with code with plenty of ifdefs is annoying -
> it compiles just fine most of the time, until you compile it with some
> rare configuration.
> Then it either starts producing strange warnings or the compilation
> fails entirely.
>
> It might make a tiny difference on builds without prefetching support
> because of code size, but realistically I think it's noise.
>
>> Especially changing the size of externally visible structs depending
>> on a configure detected ifdef seems wrong to me.
>
> +100 to that

I totally agree. I'll remove the ifdefs.

>> Nitpick: Wrong comment style (/* stands on its own line).

I did run pgindent before submitting the patch, but apparently I picked the wrong one. Already fixed in local branch.

>>> [n * H(n) formula comment snipped; quoted in full upthread]
>>
>> I know you just moved this code. But: I don't buy this formula. Like at
>> all. Doesn't queuing and reordering entirely invalidate the logic here?
>
> Well, even the comment right after the formula says that:
>
>  * Experimental results show that both of these formulas aren't
>  * aggressive enough, but we don't really have any better proposals.
>
> That's the reason why users generally either use 0 or some rather high
> value (16 or 32 are the most common values seen). The problem is that we
> don't really care about the number of spindles (and not just because
> SSDs don't have them at all), but about the target queue length per
> device. Spinning rust uses TCQ/NCQ to optimize the head movement, SSDs
> are parallel by nature (stacking multiple chips with separate channels).
> I doubt we can really improve the formula, except maybe for saying
> "we want 16 requests per device" and multiplying the number by
> that. We don't really have the necessary introspection to determine
> better values (and it's not really possible, because the devices
> may be hidden behind a RAID controller or a SAN). So we can't
> really do much.
>
> Maybe the best thing we can do is just completely abandon the
> "number of spindles" idea, and just say "number of I/O requests to
> prefetch". Possibly with an explanation of how to estimate it
> (devices * queue length).
>
>> I think that'd be a lot better.

+1 for that too.

If everyone's ok with this change, I can submit a patch for that too.
Should I split that into two patches, and/or start a new thread?

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On 02/09/2015 15:53, Andres Freund wrote:
> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
>
> You also didn't touch
>
> /*
>  * How many buffers PrefetchBuffer callers should try to stay ahead of their
>  * ReadBuffer calls by. This is maintained by the assign hook for
>  * effective_io_concurrency. Zero means "never prefetch".
>  */
> int target_prefetch_pages = 0;
>
> which surely doesn't make sense anymore after these changes.
>
> But do we even need that variable now?
>

I thought this was related to the effective_io_concurrency GUC (possibly overloaded by the per-tablespace setting), so I didn't make any change on that.

I also just found an issue with my previous patch: the global effective_io_concurrency GUC was ignored if the tablespace had a specific seq_page_cost or random_page_cost setting. I just fixed that in my local branch.

>> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
>> index dc167f9..57008fc 100644
>> --- a/src/include/utils/guc.h
>> +++ b/src/include/utils/guc.h
>> @@ -26,6 +26,9 @@
>>  #define MAX_KILOBYTES (INT_MAX / 1024)
>>  #endif
>>
>> +/* upper limit for effective_io_concurrency */
>> +#define MAX_IO_CONCURRENCY 1000
>> +
>>  /*
>>   * Automatic configuration file name for ALTER SYSTEM.
>>   * This file will be used to store values of configuration parameters
>> @@ -256,6 +259,8 @@ extern int temp_file_limit;
>>
>>  extern int num_temp_buffers;
>>
>> +extern int effective_io_concurrency;
>> +
>
> target_prefetch_pages is declared in bufmgr.h - that seems like a
> better place for these.
>

I was rather sceptical about that too. I'll move these in bufmgr.h.

Regards.
--
Julien Rouhaud
http://dalibo.com - http://dalibo.org

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On Wed, Sep 2, 2015 at 4:31 PM, Josh Berkus wrote:
> On 09/02/2015 02:25 PM, Tomas Vondra wrote:
>>
>> As I explained, spindles have very little to do with it - you need
>> multiple I/O requests per device, to get the benefit. Sure, the DBAs
>> should know how many spindles they have and should be able to determine
>> optimal IO depth. But we actually say this in the docs:
>
> My experience with performance tuning is that values above 3 have no
> real effect on how queries are executed.

That's the exact opposite of my findings on the Intel S3500 (see:
http://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azs...@mail.gmail.com).

merlin
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On 09/02/2015 02:25 PM, Tomas Vondra wrote:
>
> As I explained, spindles have very little to do with it - you need
> multiple I/O requests per device, to get the benefit. Sure, the DBAs
> should know how many spindles they have and should be able to determine
> optimal IO depth. But we actually say this in the docs:

My experience with performance tuning is that values above 3 have no
real effect on how queries are executed.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On Wed, Sep 2, 2015 at 2:31 PM, Josh Berkus wrote:
> On 09/02/2015 02:25 PM, Tomas Vondra wrote:
>>
>> As I explained, spindles have very little to do with it - you need
>> multiple I/O requests per device, to get the benefit. Sure, the DBAs
>> should know how many spindles they have and should be able to determine
>> optimal IO depth. But we actually say this in the docs:
>
> My experience with performance tuning is that values above 3 have no
> real effect on how queries are executed.

Perhaps one reason is that the planner assumes it will get no benefit from this setting, meaning it is somewhat unlikely to choose the types of plans which would actually show a benefit from higher values.

Cheers,

Jeff
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On 2015-09-02 19:49:13 +0100, Greg Stark wrote:
> I can take the blame for this formula.
>
> It's called the "Coupon Collector Problem". If you get a random
> coupon from a set of n possible coupons, how many random coupons would
> you have to collect before you expect to have at least one of each.

My point is that that's just the entirely wrong way to model prefetching. Prefetching can be massively beneficial even if you only have a single platter! Even if there were no queues on the hardware or OS level!

Concurrency isn't the right way to look at prefetching. You need to prefetch so far ahead that you'll never block on reading heap pages - and that's only the case if processing the next N heap blocks takes longer than the prefetch of the N+1 th page. That doesn't mean there continuously have to be N+1 prefetches in progress - in fact that actually often will only be the case for the first few, after that you hopefully are bottlenecked on CPU.

If you additionally take into account hardware realities where you have multiple platters, multiple spindles, command queueing etc, that's even more true. A single rotation of a single platter with command queuing can often read several non-consecutive blocks if they're on a similar

- Andres
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
On 2015-09-02 14:31:35 -0700, Josh Berkus wrote:
> On 09/02/2015 02:25 PM, Tomas Vondra wrote:
>>
>> As I explained, spindles have very little to do with it - you need
>> multiple I/O requests per device, to get the benefit. Sure, the DBAs
>> should know how many spindles they have and should be able to determine
>> optimal IO depth. But we actually say this in the docs:
>
> My experience with performance tuning is that values above 3 have no
> real effect on how queries are executed.

I saw pretty much the opposite - the benefits seldom were significant below 30 or so. Even on single disks.

Which actually isn't that surprising - to be actually beneficial (that is, turn an I/O-bound workload into a CPU-bound one) the prefetched buffer needs to actually have been read in by the time it's needed. In many queries processing a single heap page takes far shorter than prefetching the data from storage, even if it's on good SSDs.

Therefore what you actually need is a queue of prefetches for the next XX buffers so that between starting a prefetch and actually needing the buffer enough time has passed that the data is completely read in. And the point is that that's the case even for a single rotating disk!

Greetings,

Andres Freund
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
Hi,

On 09/02/2015 08:49 PM, Greg Stark wrote:
> On 2 Sep 2015 14:54, "Andres Freund" wrote:
>>> + /*--
>>> +  * The user-visible GUC parameter is the number of drives (spindles),
>>> +  * which we need to translate to a number-of-pages-to-prefetch target.
>>> +  * The target value is stashed in *extra and then assigned to the actual
>>> +  * variable by assign_effective_io_concurrency.
>>> +  *
>>> +  * The expected number of prefetch pages needed to keep N drives busy is:
>>> +  *
>>> +  * drives |   I/O requests
>>> +  * -------+----------------
>>> +  *      1 |   1
>>> +  *      2 |   2/1 + 2/2 = 3
>>> +  *      3 |   3/1 + 3/2 + 3/3 = 5 1/2
>>> +  *      4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>>> +  *      n |   n * H(n)
>>
>> I know you just moved this code. But: I don't buy this formula. Like at
>> all. Doesn't queuing and reordering entirely invalidate the logic here?
>
> I can take the blame for this formula.
>
> It's called the "Coupon Collector Problem". If you get a random
> coupon from a set of n possible coupons, how many random coupons would
> you have to collect before you expect to have at least one of each.
>
> This computation model assumes we have no information about which
> spindle each block will hit. That's basically true for the case of
> bitmapheapscan for most cases because the idea of bitmapheapscan is to
> be picking a sparse set of blocks and there's no reason the blocks
> being read will have any regularity that causes them all to fall on the
> same spindles. If in fact you're reading a fairly dense set then
> bitmapheapscan probably is a waste of time and simply reading
> sequentially would be exactly as fast or even faster.

There are different meanings of "busy". If I get the coupon collector problem right (after quickly skimming the wikipedia article today), it effectively makes sure that each "spindle" has at least 1 request in the queue. Which sucks in practice, because on spinning rust it makes queuing (TCQ/NCQ) totally inefficient, and on SSDs it only saturates one of the multiple channels.

On spinning drives, it's usually good to keep the iodepth >= 4.
For example this 10k Seagate drive [1] can do ~450 random IOPS with iodepth=16, while a 10k drive should be able to do ~150 IOPS (with iodepth=1). The other SAS drives behave quite similarly.

[1] http://www.storagereview.com/seagate_enterprise_performance_10k_hdd_savvio_10k6_review

On SSDs the good values usually start at 16, depending on the model (and controller), and size (large SSDs are basically multiple small ones glued together, thus have more channels).

This is why the numbers from coupon collector are way too low in many cases. (OTOH this is done per backend, so if there are multiple backends doing prefetching ...)

> We talked about this quite a bit back then and there was no dispute
> that the aim is to provide GUCs that mean something meaningful to the
> DBA who can actually measure them. They know how many spindles they
> have. They do not know what the optimal prefetch depth is and the only
> way to determine it would be to experiment with Postgres. Worse, I

As I explained, spindles have very little to do with it - you need multiple I/O requests per device, to get the benefit. Sure, the DBAs should know how many spindles they have and should be able to determine optimal IO depth. But we actually say this in the docs:

    A good starting point for this setting is the number of separate
    drives comprising a RAID 0 stripe or RAID 1 mirror being used for the
    database. (For RAID 5 the parity drive should not be counted.)
    However, if the database is often busy with multiple queries issued
    in concurrent sessions, lower values may be sufficient to keep the
    disk array busy. A value higher than needed to keep the disks busy
    will only result in extra CPU overhead.

So we recommend number of drives as a good starting value, and then warn against increasing the value further.

Moreover, ISTM it's very unclear what value to use even if you know the number of devices and optimal iodepth.
Setting (devices * iodepth) doesn't really make much sense, because that effectively computes

    (devices * iodepth) * H(devices * iodepth)

which says "there are (devices * iodepth) devices, make sure there's at least one request for each of them", right? I guess we actually want

    (devices * iodepth) * H(devices)

Sadly that means we'd have to introduce another GUC, because we need to track both ndevices and iodepth. There probably is a value X so that

    X * H(X) ~= (devices * iodepth) * H(devices)

but it's far from clear that's what we need (it surely is not in the docs).

> think the above formula works for essentially random I/O but for more
> predictable I/O it might be possible to use a different formula. But
> if we made the GUC something low level like "how many blocks to
> prefetch" then we're left in the dark about how to handle that
Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting
Hi,

On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
> I didn't know that the thread must exist on -hackers to be able to add
> a commitfest entry, so I transferred the thread here.

Please, in the future, also update the title of the thread to something fitting.

> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
> {
>     BitmapHeapScanState *scanstate;
>     Relation    currentRelation;
> +#ifdef USE_PREFETCH
> +   int         new_io_concurrency;
> +#endif
>
>     /* check for unsupported flags */
>     Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
> @@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>      */
>     currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
>
> +#ifdef USE_PREFETCH
> +   /* check if the effective_io_concurrency has been overloaded for the
> +    * tablespace storing the relation and compute the target_prefetch_pages,
> +    * or just get the current target_prefetch_pages
> +    */
> +   new_io_concurrency = get_tablespace_io_concurrency(
> +       currentRelation->rd_rel->reltablespace);
> +
> +   scanstate->target_prefetch_pages = target_prefetch_pages;
> +
> +   if (new_io_concurrency != effective_io_concurrency)
> +   {
> +       double prefetch_pages;
> +       if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
> +           scanstate->target_prefetch_pages = rint(prefetch_pages);
> +   }
> +#endif

Maybe it's just me - but imo there should be as few USE_PREFETCH dependent places in the code as possible. It'll just be 0 when not supported, that's fine? Especially changing the size of externally visible structs depending on a configure detected ifdef seems wrong to me.
> +bool
> +compute_io_concurrency(int io_concurrency, double *target_prefetch_pages)
> +{
> +   double  new_prefetch_pages = 0.0;
> +   int     i;
> +
> +   /* make sure the io_concurrency value is correct, it may have been forced
> +    * with a pg_tablespace UPDATE
> +    */

Nitpick: Wrong comment style (/* stands on its own line).

> +   if (io_concurrency > MAX_IO_CONCURRENCY)
> +       io_concurrency = MAX_IO_CONCURRENCY;
> +
> +   /*--
> +    * The user-visible GUC parameter is the number of drives (spindles),
> +    * which we need to translate to a number-of-pages-to-prefetch target.
> +    * The target value is stashed in *extra and then assigned to the actual
> +    * variable by assign_effective_io_concurrency.
> +    *
> +    * The expected number of prefetch pages needed to keep N drives busy is:
> +    *
> +    * drives |   I/O requests
> +    * -------+----------------
> +    *      1 |   1
> +    *      2 |   2/1 + 2/2 = 3
> +    *      3 |   3/1 + 3/2 + 3/3 = 5 1/2
> +    *      4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
> +    *      n |   n * H(n)

I know you just moved this code. But: I don't buy this formula. Like at all. Doesn't queuing and reordering entirely invalidate the logic here?

Perhaps more relevantly: Imo nodeBitmapHeapscan.c is the wrong place for this. bufmgr.c maybe?

You also didn't touch

/*
 * How many buffers PrefetchBuffer callers should try to stay ahead of their
 * ReadBuffer calls by. This is maintained by the assign hook for
 * effective_io_concurrency. Zero means "never prefetch".
 */
int target_prefetch_pages = 0;

which surely doesn't make sense anymore after these changes.

But do we even need that variable now?

> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
> index dc167f9..57008fc 100644
> --- a/src/include/utils/guc.h
> +++ b/src/include/utils/guc.h
> @@ -26,6 +26,9 @@
>  #define MAX_KILOBYTES   (INT_MAX / 1024)
>  #endif
>
> +/* upper limit for effective_io_concurrency */
> +#define MAX_IO_CONCURRENCY 1000
> +
>  /*
>   * Automatic configuration file name for ALTER SYSTEM.
>   * This file will be used to store values of configuration parameters
> @@ -256,6 +259,8 @@ extern int temp_file_limit;
>
>  extern int num_temp_buffers;
>
> +extern int effective_io_concurrency;
> +

target_prefetch_pages is declared in bufmgr.h - that seems like a better place for these.

Greetings,

Andres Freund