Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-08 Thread Alvaro Herrera
Julien Rouhaud wrote:
> Hi,
> 
> Please find attached a v2 of the patch. See below for changes.

Pushed after smallish tweaks.  Please test to verify I didn't break
anything.

(It's a pity that we can't add a regression test with a value other than
0.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-08 Thread Julien Rouhaud
On 08/09/2015 18:00, Alvaro Herrera wrote:
> Julien Rouhaud wrote:
>> Hi,
>>
>> Please find attached a v2 of the patch. See below for changes.
> 
> Pushed after smallish tweaks.  Please test to verify I didn't break
> anything.
> 

I just tried all the cases I could think of, and everything works fine.
Thanks!

> (It's a pity that we can't add a regression test with a value other than
> 0.)
> 


-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-06 Thread Julien Rouhaud
Hi,

Please find attached a v2 of the patch. See below for changes.


On 02/09/2015 15:53, Andres Freund wrote:
> 
> Hi,
> 
> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
>> I didn't know that the thread must exist on -hackers to be able to add
>> a commitfest entry, so I transfer the thread here.
> 
> Please, in the future, also update the title of the thread to something
> fitting.
> 
>> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>  {
>>  BitmapHeapScanState *scanstate;
>>  Relation    currentRelation;
>> +#ifdef USE_PREFETCH
>> +int new_io_concurrency;
>> +#endif
>>  
>>  /* check for unsupported flags */
>>  Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
>> @@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>   */
>>  currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
>>  
>> +#ifdef USE_PREFETCH
>> +/* check if the effective_io_concurrency has been overloaded for the
>> + * tablespace storing the relation and compute the target_prefetch_pages,
>> + * or just get the current target_prefetch_pages
>> + */
>> +new_io_concurrency = get_tablespace_io_concurrency(
>> +    currentRelation->rd_rel->reltablespace);
>> +
>> +
>> +scanstate->target_prefetch_pages = target_prefetch_pages;
>> +
>> +if (new_io_concurrency != effective_io_concurrency)
>> +{
>> +    double prefetch_pages;
>> +    if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
>> +        scanstate->target_prefetch_pages = rint(prefetch_pages);
>> +}
>> +#endif
> 
> Maybe it's just me - but imo there should be as few USE_PREFETCH
> dependent places in the code as possible. It'll just be 0 when not
> supported, that's fine? Especially changing the size of externally
> visible structs depending on a configure detected ifdef seems wrong to
> me.
> 

I removed these ifdefs, and the more problematic one in the struct.

>> +bool
>> +compute_io_concurrency(int io_concurrency, double *target_prefetch_pages)
>> +{
>> +double  new_prefetch_pages = 0.0;
>> +int i;
>> +
>> +/* make sure the io_concurrency value is correct, it may have been forced
>> + * with a pg_tablespace UPDATE
>> + */
> 
> Nitpick: Wrong comment style (/* stands on its own line).
> 
>> +if (io_concurrency > MAX_IO_CONCURRENCY)
>> +io_concurrency = MAX_IO_CONCURRENCY;
>> +
>> +/*----------
>> + * The user-visible GUC parameter is the number of drives (spindles),
>> + * which we need to translate to a number-of-pages-to-prefetch target.
>> + * The target value is stashed in *extra and then assigned to the actual
>> + * variable by assign_effective_io_concurrency.
>> + *
>> + * The expected number of prefetch pages needed to keep N drives busy is:
>> + *
>> + * drives |   I/O requests
>> + * -------+----------------
>> + *      1 |   1
>> + *      2 |   2/1 + 2/2 = 3
>> + *      3 |   3/1 + 3/2 + 3/3 = 5 1/2
>> + *      4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>> + *      n |   n * H(n)
> 
> I know you just moved this code. But: I don't buy this formula. Like at
> all. Doesn't queuing and reordering entirely invalidate the logic here?
> 

Changing the formula, or changing the GUC to a number of pages to
prefetch, is still being discussed, so no change here.
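
For reference, what the current translation computes can be sketched as a
tiny standalone program (illustrative only - the helper name is made up,
and the loop simply evaluates the n/1 + n/2 + ... + n/n sum from the
comment quoted above):

#include <stdio.h>

/* Illustrative sketch: number of drives -> pages-to-prefetch target,
 * using the n * H(n) formula discussed in this thread. */
static double
prefetch_target_from_drives(int drives)
{
    double      target = 0.0;
    int         i;

    for (i = 1; i <= drives; i++)
        target += (double) drives / i;

    return target;
}

int
main(void)
{
    int         values[] = {1, 2, 3, 4, 10, 30};
    int         i;

    for (i = 0; i < 6; i++)
        printf("drives = %2d -> target_prefetch_pages ~= %.1f\n",
               values[i], prefetch_target_from_drives(values[i]));
    return 0;
}

With effective_io_concurrency = 30 this comes out around 120 pages, which
is the ballpark behind the "~100 pages" figure mentioned later in the
thread.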

> Perhaps more relevantly: Imo nodeBitmapHeapscan.c is the wrong place for
> this. bufmgr.c maybe?
> 

Moved to bufmgr.c


> You also didn't touch
> /*
>  * How many buffers PrefetchBuffer callers should try to stay ahead of their
>  * ReadBuffer calls by.  This is maintained by the assign hook for
>  * effective_io_concurrency.  Zero means "never prefetch".
>  */
> int   target_prefetch_pages = 0;
> which surely doesn't make sense anymore after these changes.
> 
> But do we even need that variable now?

I slightly updated the comment.  If the table doesn't belong to a
tablespace with an overloaded effective_io_concurrency, keeping this
pre-computed target_prefetch_pages can save a few cycles on each
execution, so I think it's better to keep it.
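
In other words, the init-time logic ends up shaped roughly like the
fragment below (a sketch based on the patch excerpts in this thread, with
the ifdefs dropped; not the committed code):

    int         io_concurrency;

    /* Common case: no per-tablespace override, so just reuse the value
     * precomputed by the effective_io_concurrency assign hook. */
    scanstate->target_prefetch_pages = target_prefetch_pages;

    io_concurrency =
        get_tablespace_io_concurrency(currentRelation->rd_rel->reltablespace);

    if (io_concurrency != effective_io_concurrency)
    {
        /* The tablespace overrides the GUC: recompute the prefetch target. */
        double      prefetch_pages;

        if (compute_io_concurrency(io_concurrency, &prefetch_pages))
            scanstate->target_prefetch_pages = rint(prefetch_pages);
    }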

> 
>> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
>> index dc167f9..57008fc 100644
>> --- a/src/include/utils/guc.h
>> +++ b/src/include/utils/guc.h
>> @@ -26,6 +26,9 @@
>>  #define MAX_KILOBYTES   (INT_MAX / 1024)
>>  #endif
>>  
>> +/* upper limit for effective_io_concurrency */
>> +#define MAX_IO_CONCURRENCY 1000
>> +
>>  /*
>>   * Automatic configuration file name for ALTER SYSTEM.
>>   * This file will be used to store values of configuration parameters
>> @@ -256,6 +259,8 @@ extern int   temp_file_limit;
>>  
>>  extern int  num_temp_buffers;
>>  
>> +extern int  effective_io_concurrency;
>> +
> 
> target_prefetch_pages is declared in bufmgr.h - that seems like a better
> place for these.

Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-05 Thread Merlin Moncure
On Fri, Sep 4, 2015 at 11:21 AM, Greg Stark  wrote:
> On Thu, Sep 3, 2015 at 2:13 PM, Merlin Moncure  wrote:
>>   I find this talk of platters and spindles to be somewhat
>> baroque; for a 200$ part I have to work pretty hard to max out the
>> drive when reading and I'm still not completely sure if it's the drive
>> itself, postgres, cpu, or sata interface bottlenecking me.  This will
>> require a rethink of e_i_c configuration; in the old days there were
>> physical limitations of the drives that were in the way regardless of
>> the software stack but we are in a new era, I think.  I'm convinced
>> prefetching works and we're going to want to aggressively prefetch
>> anything and everything possible.  SSD controllers (at least the intel
>> ones) are very smart.
>
>
> Wouldn't SSDs need much *less* aggressive prefetching? There's still
> latency and there are multiple I/O channels so they will still need
> some. But spinning media gives latencies measured in milliseconds. You
> can process a lot of tuples in milliseconds. If you have a hundred
> spindles you want them all busy doing seeks because in the 5ms it
> takes them to do that you can process all the results on a single cpu
> and the rest of the time is spent waiting.
>
> When your media has latency on the order of microseconds then you only
> need to have a small handful of I/O requests in flight to keep your
> processor busy.

I'm not sure I agree with that.  100 microsecond latency is still a
pretty big deal when memory latency is measured in nanoseconds.  This
is why we have pcie storage devices.  (if anyone has a fast pcie flash
device I'd be interested in seeing some e_i_c tests...I bet they'd
still show some impact but it'd be less).

For spinning media, any random i/o workload on large datasets is a
freight train to 100% iowait, as each device has to process each data
set synchronously minus what small gains can be eked out by
reordering and combining writes.  This drives data transfer rates
under a megabyte per second.

Flash devices OTOH are essentially giant raid 0 devices of electronic
media underneath the flash controller.  Each device can functionally
do many read operations in parallel so it makes perfect sense that
they perform better when given multiple overlapping requests; unlike
spinning media they can actually make use of that information and
those assumptions are well supported by measurement.

Flash is getting better all the time and storage technologies that
remove its one weakness, random writes, appear to be right around the
corner.   The age of storage being a bottleneck for database systems
without massive high cost engineering and optimization has come to an
end.  This is good for software developers and especially good for
databases, because most of the perceived mythology of databases being
'slow' is in fact a storage issue.  With that problem gone, focus will
return to clean data processing models, which is where database systems
excel with their many decades of refinement.

merlin




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-04 Thread Andres Freund
On 2015-09-04 17:21:38 +0100, Greg Stark wrote:
> Wouldn't SSDs need much *less* aggressive prefetching? There's still
> latency and there are multiple I/O channels so they will still need
> some. But spinning media gives latencies measured in milliseconds. You
> can process a lot of tuples in milliseconds. If you have a hundred
> spindles you want them all busy doing seeks because in the 5ms it
> takes them to do that you can process all the results on a single cpu
> and the rest of the time is spent waiting.
> 
> When your media has latency on the order of microseconds then you only
> need to have a small handful of I/O requests in flight to keep your
> processor busy.

Most(?) SSDs have latencies between 0.1 and 0.4 ms for individual random
reads, often significantly more when a lot of IO is going on. In that
time you can still process a good number of pages on the CPU level.

Sure, there are few workloads where you need dozens of SSDs to parallelize
random reads like in the rotating media days. But that doesn't mean you
don't need prefetching?

Greetings,

Andres Freund




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-04 Thread Bruce Momjian
On Fri, Sep  4, 2015 at 05:21:38PM +0100, Greg Stark wrote:
> Wouldn't SSDs need much *less* aggressive prefetching? There's still
> latency and there are multiple I/O channels so they will still need
> some. But spinning media gives latencies measured in milliseconds. You
> can process a lot of tuples in milliseconds. If you have a hundred
> spindles you want them all busy doing seeks because in the 5ms it
> takes them to do that you can process all the results on a single cpu
> and the rest of the time is spent waiting.
> 
> When your media has latency on the order of microseconds then you only
> need to have a small handful of I/O requests in flight to keep your
> processor busy.

Well, there is still the processing time of getting that data ready. 
All I know is that people have reported that prefetching is even more
useful for SSDs.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-04 Thread Greg Stark
On Thu, Sep 3, 2015 at 2:13 PM, Merlin Moncure  wrote:
>   I find this talk of platters and spindles to be somewhat
> baroque; for a 200$ part I have to work pretty hard to max out the
> drive when reading and I'm still not completely sure if it's the drive
> itself, postgres, cpu, or sata interface bottlenecking me.  This will
> require a rethink of e_i_c configuration; in the old days there were
> physical limitations of the drives that were in the way regardless of
> the software stack but we are in a new era, I think.  I'm convinced
> prefetching works and we're going to want to aggressively prefetch
> anything and everything possible.  SSD controllers (at least the intel
> ones) are very smart.


Wouldn't SSDs need much *less* aggressive prefetching? There's still
latency and there are multiple I/O channels so they will still need
some. But spinning media gives latencies measured in milliseconds. You
can process a lot of tuples in milliseconds. If you have a hundred
spindles you want them all busy doing seeks because in the 5ms it
takes them to do that you can process all the results on a single cpu
and the rest of the time is spent waiting.

When your media has latency on the order of microseconds then you only
need to have a small handful of I/O requests in flight to keep your
processor busy.

-- 
greg




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-03 Thread Merlin Moncure
On Wed, Sep 2, 2015 at 5:38 PM, Andres Freund  wrote:
> If you additionally take into account hardware realities where you have
> multiple platters, multiple spindles, command queueing etc, that's even
> more true. A single rotation of a single platter with command queuing
> can often read several non consecutive blocks if they're on a similar

Yeah.  And in the case of solid state disks, it's really a much simpler
case of, "synchronously reading from the disk block by block
does not fully utilize the drive because of various introduced
latencies".  I find this talk of platters and spindles to be somewhat
baroque; for a 200$ part I have to work pretty hard to max out the
drive when reading and I'm still not completely sure if it's the drive
itself, postgres, cpu, or sata interface bottlenecking me.  This will
require a rethink of e_i_c configuration; in the old days there were
physical limitations of the drives that were in the way regardless of
the software stack but we are in a new era, I think.  I'm convinced
prefetching works and we're going to want to aggressively prefetch
anything and everything possible.  SSD controllers (at least the intel
ones) are very smart.

merlin




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Tomas Vondra

Hi

On 09/02/2015 03:53 PM, Andres Freund wrote:


Hi,

On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:

I didn't know that the thread must exist on -hackers to be able to add
a commitfest entry, so I transfer the thread here.


Please, in the future, also update the title of the thread to something
fitting.


@@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState 
*estate, int eflags)
  {
BitmapHeapScanState *scanstate;
Relation    currentRelation;
+#ifdef USE_PREFETCH
+   int new_io_concurrency;
+#endif

/* check for unsupported flags */
Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState 
*estate, int eflags)
 */
currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, 
eflags);

+#ifdef USE_PREFETCH
+   /* check if the effective_io_concurrency has been overloaded for the
+* tablespace storing the relation and compute the target_prefetch_pages,
+* or just get the current target_prefetch_pages
+*/
+   new_io_concurrency = get_tablespace_io_concurrency(
+   currentRelation->rd_rel->reltablespace);
+
+
+   scanstate->target_prefetch_pages = target_prefetch_pages;
+
+   if (new_io_concurrency != effective_io_concurrency)
+   {
+   double prefetch_pages;
+  if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
+   scanstate->target_prefetch_pages = rint(prefetch_pages);
+   }
+#endif


Maybe it's just me - but imo there should be as few USE_PREFETCH
dependent places in the code as possible. It'll just be 0 when not
supported, that's fine?


It's not just you. Dealing with code with plenty of ifdefs is annoying - 
it compiles just fine most of the time, until you compile it with some 
rare configuration. Then it either starts producing strange warnings or 
the compilation fails entirely.


It might make a tiny difference on builds without prefetching support 
because of code size, but realistically I think it's noise.



Especially changing the size of externally visible structs depending
on a configure detected ifdef seems wrong to me.


+100 to that


+   /*--
+* The user-visible GUC parameter is the number of drives (spindles),
+* which we need to translate to a number-of-pages-to-prefetch target.
+* The target value is stashed in *extra and then assigned to the actual
+* variable by assign_effective_io_concurrency.
+*
+* The expected number of prefetch pages needed to keep N drives busy is:
+*
+* drives |   I/O requests
+* ---+
+*  1 |   1
+*  2 |   2/1 + 2/2 = 3
+*  3 |   3/1 + 3/2 + 3/3 = 5 1/2
+*  4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
+*  n |   n * H(n)


I know you just moved this code. But: I don't buy this formula. Like at
all. Doesn't queuing and reordering entirely invalidate the logic here?


Well, even the comment right after the formula says that:

* Experimental results show that both of these formulas aren't
* aggressive enough, but we don't really have any better proposals.

That's the reason why users generally either use 0 or some rather high 
value (16 or 32 are the most common values seen). The problem is that we 
don't really care about the number of spindles (and not just because 
SSDs don't have them at all), but about the target queue length per 
device. Spinning rust uses TCQ/NCQ to optimize the head movement, SSDs 
are parallel by nature (stacking multiple chips with separate channels).


I doubt we can really improve the formula, except maybe for saying "we 
want 16 requests per device" and multiplying the number by that. We 
don't really have the necessary introspection to determine better values 
(and it's not really possible, because the devices may be hidden behind 
a RAID controller or a SAN). So we can't really do much.


Maybe the best thing we can do is just completely abandon the "number of 
spindles" idea, and just say "number of I/O requests to prefetch". 
Possibly with an explanation of how to estimate it (devices * queue length).
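
For example, under that scheme someone with (say) four data SSDs and a
target queue depth of 16 per device would simply set the knob to about
4 * 16 = 64 outstanding prefetch requests, instead of picking a "spindle
count" and letting the n * H(n) translation produce a similar number.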


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Andres Freund
On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote:
> Maybe the best thing we can do is just completely abandon the "number of
> spindles" idea, and just say "number of I/O requests to prefetch". Possibly
> with an explanation of how to estimate it (devices * queue length).

I think that'd be a lot better.




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Andres Freund
On 2015-09-03 01:59:13 +0200, Tomas Vondra wrote:
> That's a bit surprising, especially considering that e_i_c=30 means ~100
> pages to prefetch if I'm doing the math right.
> 
> AFAIK queue depth for SATA drives generally is 32 (so prefetching 100 pages
> should not make a difference), 256 for SAS drives and ~1000 for most current
> RAID controllers.

I think the point is that after an initial buildup - we'll again only
prefetch pages in smaller increments because we already prefetched most
pages. The actual number of concurrent reads will then be determined by
how fast pages are processed.  A prefetch_target of 100 does *not*
mean 100 requests are continuously in flight. And even if it did, the
OS won't issue many more requests than the queue depth, so they'll just
sit in the OS queue.

> So instead of "How many blocks I need to prefetch to saturate the devices?"
> you're asking "How many blocks I need to prefetch to never actually wait for
> the I/O?"

Yes, pretty much.

> I do like this view, but I'm not really sure how could we determine the
> right value? It seems to be very dependent on hardware and workload.

Right.

> For spinning drives the speedup comes from optimizing random seeks to a more
> optimal path (thanks to NCQ/TCQ), and on SSDs thanks to using the parallel
> channels (and possibly faster access to the same block).

+ OS reordering & coalescing.

Don't forget that the OS processes the OS IO queues while the userland
process isn't scheduled - in a concurrent workload with more processes
than hardware threads that means that pending OS requests are being
issued while the query isn't actively being processed.

> I guess the best thing we could do at this level is simply keep the
> on-device queues fully saturated, no?

Well, being too aggressive can hurt throughput and latency of concurrent
processes, without being beneficial.

Andres Freund




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Tomas Vondra



On 09/03/2015 12:23 AM, Andres Freund wrote:

On 2015-09-02 14:31:35 -0700, Josh Berkus wrote:

On 09/02/2015 02:25 PM, Tomas Vondra wrote:


As I explained, spindles have very little to do with it - you need
multiple I/O requests per device, to get the benefit. Sure, the DBAs
should know how many spindles they have and should be able to determine
optimal IO depth. But we actually say this in the docs:


My experience with performance tuning is that values above 3 have no
real effect on how queries are executed.


I saw pretty much the opposite - the benefits were seldom
significant below 30 or so. Even on single disks.


That's a bit surprising, especially considering that e_i_c=30 means ~100 
pages to prefetch if I'm doing the math right.


AFAIK queue depth for SATA drives generally is 32 (so prefetching 100 
pages should not make a difference), 256 for SAS drives and ~1000 for 
most current RAID controllers.


I'm not entirely surprised that values beyond 30 make a difference, but 
that you seldom saw significant improvements below this value.


No doubt there are workloads like that, but I'd expect them to be quite 
rare and not as prevalent as you're claiming.



Which actually isn't that surprising - to be actually beneficial
(that is, turn an IO into a CPU bound workload) the prefetched buffer
needs to actually have been read in by the time it's needed. In many
queries processing a single heap page takes far less time than
prefetching the data from storage, even if it's on good SSDs.


Therefore what you actually need is a queue of prefetches for the
next XX buffers so that between starting a prefetch and actually
needing the buffer enough time has passed that the data is
completely read in. And the point is that that's the case even for a
single rotating disk!


So instead of "How many blocks I need to prefetch to saturate the 
devices?" you're asking "How many blocks I need to prefetch to never 
actually wait for the I/O?"


I do like this view, but I'm not really sure how could we determine the 
right value? It seems to be very dependent on hardware and workload.


For spinning drives the speedup comes from optimizing random seeks to a 
more optimal path (thanks to NCQ/TCQ), and on SSDs thanks to using the 
parallel channels (and possibly faster access to the same block).


I guess the best thing we could do at this level is simply keep the 
on-device queues fully saturated, no?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Greg Stark
That doesn't match any of the empirical tests I did at the time. I posted
graphs of the throughput for varying numbers of spindles with varying
amount of prefetch. In every case more prefetching increased throughput up to
N times the single platter throughput where N was the number of spindles.

There can be a small speedup from overlapping CPU with I/O but that's a
really small effect. At most that can be a single request, and only very
rarely would the amount of CPU time be even a moderate fraction of the
I/O latency. The only case where they're comparable is when you're reading
sequentially and then hopefully you wouldn't be using postgres prefetching
at all which is only really intended to help random I/O.
On 2 Sep 2015 23:38, "Andres Freund"  wrote:

> On 2015-09-02 19:49:13 +0100, Greg Stark wrote:
> > I can take the blame for this formula.
> >
> > It's called the "Coupon Collector Problem". If you get a random
> > coupon from a set of n possible coupons, how many random coupons would
> > you have to collect before you expect to have at least one of each.
>
> My point is that that's just the entirely wrong way to model
> prefetching. Prefetching can be massively beneficial even if you only
> have a single platter! Even if there were no queues on the hardware or
> OS level! Concurrency isn't the right way to look at prefetching.
>
> You need to prefetch so far ahead that you'll never block on reading
> heap pages - and that's only the case if processing the next N heap
> blocks takes longer than the prefetch of the (N+1)th page. That doesn't
> mean there continuously have to be N+1 prefetches in progress - in fact
> that actually often will only be the case for the first few, after that
> you hopefully are bottlenecked on CPU.
>
> If you additionally take into account hardware realities where you have
> multiple platters, multiple spindles, command queueing etc, that's even
> more true. A single rotation of a single platter with command queuing
> can often read several non consecutive blocks if they're on a similar
>
> - Andres
>


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Greg Stark
On 2 Sep 2015 14:54, "Andres Freund"  wrote:
>
>
> > + /*--
> > +  * The user-visible GUC parameter is the number of drives (spindles),
> > +  * which we need to translate to a number-of-pages-to-prefetch target.
> > +  * The target value is stashed in *extra and then assigned to the actual
> > +  * variable by assign_effective_io_concurrency.
> > +  *
> > +  * The expected number of prefetch pages needed to keep N drives busy is:
> > +  *
> > +  * drives |   I/O requests
> > +  * ---+
> > +  *  1 |   1
> > +  *  2 |   2/1 + 2/2 = 3
> > +  *  3 |   3/1 + 3/2 + 3/3 = 5 1/2
> > +  *  4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
> > +  *  n |   n * H(n)
>
> I know you just moved this code. But: I don't buy this formula. Like at
> all. Doesn't queuing and reordering entirely invalidate the logic here?

I can take the blame for this formula.

It's called the "Coupon Collector Problem". If you get a random
coupon from a set of n possible coupons, how many random coupons would
you have to collect before you expect to have at least one of each.
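
(For completeness: once k distinct coupons have been seen, the chance that
the next draw is new is (n - k)/n, so the expected wait for it is n/(n - k);
summing over k = 0 .. n-1 gives n/n + n/(n-1) + ... + n/1 = n * H(n), which
is exactly the table in the comment being discussed.)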

This computation model assumes we have no information about which
spindle each block will hit. That's basically true for the case of
bitmapheapscan for most cases because the idea of bitmapheapscan is to
be picking a sparse set of blocks and there's no reason the blocks
being read will have any regularity that causes them all to fall on
the same spindles. If in fact you're reading a fairly dense set then
bitmapheapscan probably is a waste of time and simply reading
sequentially would be exactly as fast or even faster.

We talked about this quite a bit back then and there was no dispute
that the aim is to provide GUCs that mean something meaningful to the
DBA who can actually measure them. They know how many spindles they
have. They do not know what the optimal prefetch depth is and the only
way to determine it would be to experiment with Postgres. Worse, I
think the above formula works for essentially random I/O but for more
predictable I/O it might be possible to use a different formula. But
if we made the GUC something low level like "how many blocks to
prefetch" then we're left in the dark about how to handle that
different access pattern.

I did speak to a dm developer and he suggested that the kernel could
help out with an API. He suggested something of the form "how many
blocks do I have to read before the end of the current device". I
wasn't sure exactly what we would do with something like that but it
would be better than just guessing how many I/O operations we need to
issue to keep all the spindles busy.




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Julien Rouhaud

Hi,

On 02/09/2015 18:06, Tomas Vondra wrote:
> Hi
> 
> On 09/02/2015 03:53 PM, Andres Freund wrote:
>> 
>> Hi,
>> 
>> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
>>> I didn't know that the thread must exist on -hackers to be
>>> able to add a commitfest entry, so I transfer the thread here.
>> 
>> Please, in the future, also update the title of the thread to
>> something fitting.
>> 

Sorry for that.

>>> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>> {
>>> BitmapHeapScanState *scanstate;
>>> Relation    currentRelation;
>>> +#ifdef USE_PREFETCH
>>> +int new_io_concurrency;
>>> +#endif
>>> 
>>> /* check for unsupported flags */
>>> Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
>>> @@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>>  */
>>> currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
>>> 
>>> +#ifdef USE_PREFETCH
>>> +/* check if the effective_io_concurrency has been overloaded for the
>>> + * tablespace storing the relation and compute the target_prefetch_pages,
>>> + * or just get the current target_prefetch_pages
>>> + */
>>> +new_io_concurrency = get_tablespace_io_concurrency(
>>> +    currentRelation->rd_rel->reltablespace);
>>> +
>>> +
>>> +scanstate->target_prefetch_pages = target_prefetch_pages;
>>> +
>>> +if (new_io_concurrency != effective_io_concurrency)
>>> +{
>>> +    double prefetch_pages;
>>> +    if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
>>> +        scanstate->target_prefetch_pages = rint(prefetch_pages);
>>> +}
>>> +#endif
>> 
>> Maybe it's just me - but imo there should be as few USE_PREFETCH 
>> dependent places in the code as possible. It'll just be 0 when
>> not supported, that's fine?
> 
> It's not just you. Dealing with code with plenty of ifdefs is
> annoying - it compiles just fine most of the time, until you
> compile it with some rare configuration. Then it either starts
> producing strange warnings or the compilation fails entirely.
> 
> It might make a tiny difference on builds without prefetching
> support because of code size, but realistically I think it's
> noise.
> 
>> Especially changing the size of externally visible structs
>> depending ona configure detected ifdef seems wrong to me.
> 
> +100 to that
> 

I totally agree. I'll remove the ifdefs.

>> Nitpick: Wrong comment style (/* stands on its own line).

I did run pgindent before submitting the patch, but apparently I picked
the wrong one. Already fixed in local branch.

>>> +/*----------
>>> + * The user-visible GUC parameter is the number of drives (spindles),
>>> + * which we need to translate to a number-of-pages-to-prefetch target.
>>> + * The target value is stashed in *extra and then assigned to the actual
>>> + * variable by assign_effective_io_concurrency.
>>> + *
>>> + * The expected number of prefetch pages needed to keep N drives busy is:
>>> + *
>>> + * drives |   I/O requests
>>> + * -------+----------------
>>> + *      1 |   1
>>> + *      2 |   2/1 + 2/2 = 3
>>> + *      3 |   3/1 + 3/2 + 3/3 = 5 1/2
>>> + *      4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>>> + *      n |   n * H(n)
>> 
>> I know you just moved this code. But: I don't buy this formula.
>> Like at all. Doesn't queuing and reordering entirely invalidate
>> the logic here?
> 
> Well, even the comment right after the formula says that:
> 
> * Experimental results show that both of these formulas aren't
> * aggressive enough, but we don't really have any better proposals.
> 
> That's the reason why users generally either use 0 or some rather
> high value (16 or 32 are the most common values seen). The problem
> is that we don't really care about the number of spindles (and not
> just because SSDs don't have them at all), but about the target
> queue length per device. Spinning rust uses TCQ/NCQ to optimize the
> head movement, SSDs are parallel by nature (stacking multiple chips
> with separate channels).
> 
> I doubt we can really improve the formula, except maybe for saying
> "we want 16 requests per device" and multiplying the number by
> that. We don't really have the necessary introspection to determine
> better values (and it's not really possible, because the devices
> may be hidden behind a RAID controller or a SAN). So we can't
> really do much.
> 
> Maybe the best thing we can do is just completely abandon the
> "number of spindles" idea, and just say "number of I/O requests to
> prefetch". Possibly with an explanation of how to estimate it
> (devices * queue length).
> 
>> I think that'd be a lot better.

+1 for that too.

If everyone's ok with this change, I can submit a patch for that too.
Should I split that into two patches, and/or start a new thread?



- -- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Julien Rouhaud

On 02/09/2015 15:53, Andres Freund wrote:
> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
> 
> You also didn't touch
> /*
>  * How many buffers PrefetchBuffer callers should try to stay ahead of their
>  * ReadBuffer calls by.  This is maintained by the assign hook for
>  * effective_io_concurrency.  Zero means "never prefetch".
>  */
> int target_prefetch_pages = 0;
> which surely doesn't make sense anymore after these changes.
> 
> But do we even need that variable now?
> 

I thought this was related to the effective_io_concurrency GUC
(possibly overloaded by the per-tablespace setting), so I didn't make
any change on that.

I also just found an issue with my previous patch: the global
effective_io_concurrency GUC was ignored if the tablespace had a
specific seq_page_cost or random_page_cost setting. I just fixed that
in my local branch.
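
For what it's worth, the shape such a lookup needs is roughly the sketch
below (hypothetical code, not necessarily the actual fix; it just
illustrates falling back to the GUC whenever the tablespace options exist
but don't override io_concurrency itself):

int
get_tablespace_io_concurrency(Oid spcid)
{
    TableSpaceCacheEntry *spc = get_tablespace(spcid);  /* cached options */

    /* assumed sentinel: < 0 means "not set for this tablespace" */
    if (spc->opts == NULL || spc->opts->effective_io_concurrency < 0)
        return effective_io_concurrency;    /* global GUC */

    return spc->opts->effective_io_concurrency;
}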


>> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
>> index dc167f9..57008fc 100644
>> --- a/src/include/utils/guc.h
>> +++ b/src/include/utils/guc.h
>> @@ -26,6 +26,9 @@
>>  #define MAX_KILOBYTES   (INT_MAX / 1024)
>>  #endif
>>  
>> +/* upper limit for effective_io_concurrency */
>> +#define MAX_IO_CONCURRENCY 1000
>> +
>>  /*
>>   * Automatic configuration file name for ALTER SYSTEM.
>>   * This file will be used to store values of configuration parameters
>> @@ -256,6 +259,8 @@ extern int   temp_file_limit;
>>  
>>  extern int  num_temp_buffers;
>>  
>> +extern int  effective_io_concurrency;
>> +
> 
> target_prefetch_pages is declared in bufmgr.h - that seems like a
> better place for these.
> 

I was rather sceptical about that too. I'll move these to bufmgr.h.


Regards.


- -- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Merlin Moncure
On Wed, Sep 2, 2015 at 4:31 PM, Josh Berkus  wrote:
> On 09/02/2015 02:25 PM, Tomas Vondra wrote:
>>
>> As I explained, spindles have very little to do with it - you need
>> multiple I/O requests per device, to get the benefit. Sure, the DBAs
>> should know how many spindles they have and should be able to determine
>> optimal IO depth. But we actually say this in the docs:
>
> My experience with performance tuning is that values above 3 have no
> real effect on how queries are executed.

That's the exact opposite of my findings on intel S3500 (see:
http://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azs...@mail.gmail.com).

merlin




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Josh Berkus
On 09/02/2015 02:25 PM, Tomas Vondra wrote:
> 
> As I explained, spindles have very little to do with it - you need
> multiple I/O requests per device, to get the benefit. Sure, the DBAs
> should know how many spindles they have and should be able to determine
> optimal IO depth. But we actually say this in the docs:

My experience with performance tuning is that values above 3 have no
real effect on how queries are executed.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Jeff Janes
On Wed, Sep 2, 2015 at 2:31 PM, Josh Berkus  wrote:

> On 09/02/2015 02:25 PM, Tomas Vondra wrote:
> >
> > As I explained, spindles have very little to do with it - you need
> > multiple I/O requests per device, to get the benefit. Sure, the DBAs
> > should know how many spindles they have and should be able to determine
> > optimal IO depth. But we actually say this in the docs:
>
> My experience with performance tuning is that values above 3 have no
> real effect on how queries are executed.
>

Perhaps one reason is that the planner assumes it will get no benefit from
this setting, meaning it is somewhat unlikely to choose the types of plans
which would actually show a benefit from higher values.

Cheers,

Jeff


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Andres Freund
On 2015-09-02 19:49:13 +0100, Greg Stark wrote:
> I can take the blame for this formula.
> 
> It's called the "Coupon Collector Problem". If you get a random
> coupon from a set of n possible coupons, how many random coupons would
> you have to collect before you expect to have at least one of each.

My point is that that's just the entirely wrong way to model
prefetching. Prefetching can be massively beneficial even if you only
have a single platter! Even if there were no queues on the hardware or
OS level! Concurrency isn't the right way to look at prefetching.

You need to prefetch so far ahead that you'll never block on reading
heap pages - and that's only the case if processing the next N heap
blocks takes longer than the prefetch of the (N+1)th page. That doesn't
mean there continuously have to be N+1 prefetches in progress - in fact
that actually often will only be the case for the first few, after that
you hopefully are bottlenecked on CPU.
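
Put as a rough back-of-the-envelope condition: if t_cpu is the time to
process one heap page and t_io is the latency of a prefetched read, the
prefetch distance only has to be on the order of t_io / t_cpu. With
made-up numbers, a 200 us SSD read against ~10 us of per-page CPU work
calls for roughly 20 pages of lead, independent of how many spindles sit
behind the device.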

If you additionally take into account hardware realities where you have
multiple platters, multiple spindles, command queueing etc, that's even
more true. A single rotation of a single platter with command queuing
can often read several non consecutive blocks if they're on a similar

- Andres




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Andres Freund
On 2015-09-02 14:31:35 -0700, Josh Berkus wrote:
> On 09/02/2015 02:25 PM, Tomas Vondra wrote:
> > 
> > As I explained, spindles have very little to do with it - you need
> > multiple I/O requests per device, to get the benefit. Sure, the DBAs
> > should know how many spindles they have and should be able to determine
> > optimal IO depth. But we actually say this in the docs:
> 
> My experience with performance tuning is that values above 3 have no
> real effect on how queries are executed.

I saw pretty much the opposite - the benefits were seldom significant
below 30 or so. Even on single disks. Which actually isn't that
surprising - to be actually beneficial (that is, turn an IO into a CPU
bound workload) the prefetched buffer needs to actually have been read
in by the time it's needed. In many queries processing a single heap page
takes far less time than prefetching the data from storage, even if it's
on good SSDs.

Therefore what you actually need is a queue of prefetches for the next
XX buffers so that between starting a prefetch and actually needing the
buffer enough time has passed that the data is completely read in.  And
the point is that that's the case even for a single rotating disk!

Greetings,

Andres Freund




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Tomas Vondra

Hi,

On 09/02/2015 08:49 PM, Greg Stark wrote:

On 2 Sep 2015 14:54, "Andres Freund"  wrote:




+ /*--
+  * The user-visible GUC parameter is the number of drives (spindles),
+  * which we need to translate to a number-of-pages-to-prefetch target.
+  * The target value is stashed in *extra and then assigned to the actual
+  * variable by assign_effective_io_concurrency.
+  *
+  * The expected number of prefetch pages needed to keep N drives busy is:
+  *
+  * drives |   I/O requests
+  * ---+
+  *  1 |   1
+  *  2 |   2/1 + 2/2 = 3
+  *  3 |   3/1 + 3/2 + 3/3 = 5 1/2
+  *  4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
+  *  n |   n * H(n)


I know you just moved this code. But: I don't buy this formula. Like at
all. Doesn't queuing and reordering entirely invalidate the logic here?


I can take the blame for this formula.

It's called the "Coupon Collector Problem". If you get a random
coupon from a set of n possible coupons, how many random coupons would
you have to collect before you expect to have at least one of each.

This computation model assumes we have no information about which
spindle each block will hit. That's basically true for the case of
bitmapheapscan for most cases because the idea of bitmapheapscan is to
be picking a sparse set of blocks and there's no reason the blocks
being read will have any regularity that causes them all to fall on
the same spindles. If in fact you're reading a fairly dense set then
bitmapheapscan probably is a waste of time and simply reading
sequentially would be exactly as fast or even faster.


There are different meanings of "busy". If I get the coupon collector 
problem right (after quickly skimming the wikipedia article today), it 
effectively makes sure that each "spindle" has at least 1 request in the 
queue. Which sucks in practice, because on spinning rust it makes 
queuing (TCQ/NCQ) totally inefficient, and on SSDs it only saturates one 
of the multiple channels.


On spinning drives, it's usually good to keep the iodepth>=4. For 
example this 10k Seagate drive [1] can do ~450 random IOPS with 
iodepth=16, while a 10k drive should be able to do ~150 IOPS (with 
iodepth=1). The other SAS drives behave quite similarly.


[1] 
http://www.storagereview.com/seagate_enterprise_performance_10k_hdd_savvio_10k6_review


On SSDs the good values usually start at 16, depending on the model (and 
controller), and size (large SSDs are basically multiple small ones 
glued together, thus have more channels).


This is why the numbers from the coupon collector are way too low in many 
cases. (OTOH this is done per backend, so if there are multiple backends 
doing prefetching ...)




We talked about this quite a bit back then and there was no dispute
that the aim is to provide GUCs that mean something meaningful to the
DBA who can actually measure them. They know how many spindles they
have. They do not know what the optimal prefetch depth is and the only
way to determine it would be to experiment with Postgres. Worse, I


As I explained, spindles have very little to do with it - you need 
multiple I/O requests per device, to get the benefit. Sure, the DBAs 
should know how many spindles they have and should be able to determine 
optimal IO depth. But we actually say this in the docs:


 A good starting point for this setting is the number of separate
 drives comprising a RAID 0 stripe or RAID 1 mirror being used for
 the database. (For RAID 5 the parity drive should not be counted.)
 However, if the database is often busy with multiple queries
 issued in concurrent sessions, lower values may be sufficient to
 keep the disk array busy. A value higher than needed to keep the
 disks busy will only result in extra CPU overhead.

So we recommend number of drives as a good starting value, and then warn 
against increasing the value further.


Moreover, ISTM it's very unclear what value to use even if you know the 
number of devices and optimal iodepth. Setting (devices * iodepth) 
doesn't really make much sense, because that effectively computes


(devices*iodepth) * H(devices*iodepth)

which says "there are (devices*iodepth) devices, make sure there's at 
least one request for each of them", right? I guess we actually want


(devices*iodepth) * H(devices)

Sadly that means we'd have to introduce another GUC, because we need to 
track both ndevices and iodepth.


There probably is a value X so that

 X * H(X) ~= (devices*iodepth) * H(devices)

but it's far from clear that's what we need (it surely is not in the docs).
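
A quick back-of-the-envelope helper (illustrative only, reusing the
thread's formula with assumed example numbers) shows what such an X looks
like:

#include <stdio.h>

/* H(n) = 1 + 1/2 + ... + 1/n */
static double
harmonic(int n)
{
    double      h = 0.0;
    int         i;

    for (i = 1; i <= n; i++)
        h += 1.0 / i;
    return h;
}

int
main(void)
{
    int         devices = 4;
    int         iodepth = 16;
    double      target = devices * iodepth * harmonic(devices);
    int         X;

    /* smallest X whose X * H(X) reaches the desired prefetch target */
    for (X = 1; X * harmonic(X) < target; X++)
        ;
    printf("devices=%d iodepth=%d -> %.0f pages -> e_i_c ~ %d\n",
           devices, iodepth, target, X);
    return 0;
}

For 4 devices at queue depth 16 that prints a target of ~133 pages and an
equivalent effective_io_concurrency of roughly 33, i.e. well above the
value most people would guess from "number of spindles".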



think the above formula works for essentially random I/O but for
more predictable I/O it might be possible to use a different formula.
But if we made the GUC something low level like "how many blocks to
prefetch" then we're left in the dark about how to handle that

Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Andres Freund

Hi,

On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
> I didn't know that the thread must exist on -hackers to be able to add
> a commitfest entry, so I transfer the thread here.

Please, in the future, also update the title of the thread to something
fitting.

> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>  {
>   BitmapHeapScanState *scanstate;
>   Relation    currentRelation;
> +#ifdef USE_PREFETCH
> + int new_io_concurrency;
> +#endif
>  
>   /* check for unsupported flags */
>   Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
> @@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>    */
>   currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
>  
> +#ifdef USE_PREFETCH
> + /* check if the effective_io_concurrency has been overloaded for the
> +  * tablespace storing the relation and compute the target_prefetch_pages,
> +  * or just get the current target_prefetch_pages
> +  */
> + new_io_concurrency = get_tablespace_io_concurrency(
> + currentRelation->rd_rel->reltablespace);
> +
> +
> + scanstate->target_prefetch_pages = target_prefetch_pages;
> +
> + if (new_io_concurrency != effective_io_concurrency)
> + {
> + double prefetch_pages;
> +		if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
> + scanstate->target_prefetch_pages = rint(prefetch_pages);
> + }
> +#endif

Maybe it's just me - but imo there should be as few USE_PREFETCH
dependent places in the code as possible. It'll just be 0 when not
supported, that's fine? Especially changing the size of externally
visible structs depending on a configure detected ifdef seems wrong to
me.

> +bool
> +compute_io_concurrency(int io_concurrency, double *target_prefetch_pages)
> +{
> + double  new_prefetch_pages = 0.0;
> + int i;
> +
> + /* make sure the io_concurrency value is correct, it may have been forced
> +  * with a pg_tablespace UPDATE
> +  */

Nitpick: Wrong comment style (/* stands on its own line).

> + if (io_concurrency > MAX_IO_CONCURRENCY)
> + io_concurrency = MAX_IO_CONCURRENCY;
> +
> + /*--
> +  * The user-visible GUC parameter is the number of drives (spindles),
> +  * which we need to translate to a number-of-pages-to-prefetch target.
> +  * The target value is stashed in *extra and then assigned to the actual
> +  * variable by assign_effective_io_concurrency.
> +  *
> +  * The expected number of prefetch pages needed to keep N drives busy is:
> +  *
> +  * drives |   I/O requests
> +  * -------+----------------
> +  *  1 |   1
> +  *  2 |   2/1 + 2/2 = 3
> +  *  3 |   3/1 + 3/2 + 3/3 = 5 1/2
> +  *  4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
> +  *  n |   n * H(n)

I know you just moved this code. But: I don't buy this formula. Like at
all. Doesn't queuing and reordering entirely invalidate the logic here?

Perhaps more relevantly: Imo nodeBitmapHeapscan.c is the wrong place for
this. bufmgr.c maybe?

You also didn't touch
/*
 * How many buffers PrefetchBuffer callers should try to stay ahead of their
 * ReadBuffer calls by.  This is maintained by the assign hook for
 * effective_io_concurrency.  Zero means "never prefetch".
 */
int target_prefetch_pages = 0;
which surely doesn't make sense anymore after these changes.

But do we even need that variable now?

> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
> index dc167f9..57008fc 100644
> --- a/src/include/utils/guc.h
> +++ b/src/include/utils/guc.h
> @@ -26,6 +26,9 @@
>  #define MAX_KILOBYTES   (INT_MAX / 1024)
>  #endif
>  
> +/* upper limit for effective_io_concurrency */
> +#define MAX_IO_CONCURRENCY 1000
> +
>  /*
>   * Automatic configuration file name for ALTER SYSTEM.
>   * This file will be used to store values of configuration parameters
> @@ -256,6 +259,8 @@ extern int   temp_file_limit;
>  
>  extern int   num_temp_buffers;
>  
> +extern int   effective_io_concurrency;
> +

target_prefetch_pages is declared in bufmgr.h - that seems like a better
place for these.

Greetings,

Andres Freund

