Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-13 Thread Jim Nasby

On 3/13/16 6:30 PM, Peter Geoghegan wrote:

On Sat, Mar 12, 2016 at 5:21 PM, Jeff Janes  wrote:

Would the wiki be a good place for such tips?  Not as formal as the
documentation, and more centralized (and editable) than a collection
of blog posts.


That general direction makes sense, but I'm not sure if the Wiki is
something that this will work for. I fear that it could become
something like the TODO list page: a page that contains theoretically
accurate information, but isn't very helpful. The TODO list needs to
be heavily pruned, but that seems like something that will never
happen.

A centralized location for performance tips will probably only work
well if there are still high standards that are actively enforced.
There still needs to be tight editorial control.


I think there are ways to significantly restrict who can edit a page, so 
this could probably still be done via the wiki. IMO we should also be 
encouraging users to test various tips and provide feedback, so maybe a 
wiki page with a big fat request at the top asking users to submit any 
feedback about the page to -performance.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-13 Thread Peter Geoghegan
On Sat, Mar 12, 2016 at 5:21 PM, Jeff Janes  wrote:
> Would the wiki be a good place for such tips?  Not as formal as the
> documentation, and more centralized (and editable) than a collection
> of blog posts.

That general direction makes sense, but I'm not sure if the Wiki is
something that this will work for. I fear that it could become
something like the TODO list page: a page that contains theoretically
accurate information, but isn't very helpful. The TODO list needs to
be heavily pruned, but that seems like something that will never
happen.

A centralized location for performance tips will probably only work
well if there are still high standards that are actively enforced.
There still needs to be tight editorial control.

-- 
Peter Geoghegan




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-12 Thread Jeff Janes
On Thu, Mar 10, 2016 at 11:25 PM, Peter Geoghegan  wrote:
> On Thu, Mar 10, 2016 at 11:18 PM, Fabien COELHO  wrote:
>> I can only concur!
>>
>> The "Performance Tips" chapter (II.14) is more user/query oriented. The
>> "Server Administration" book (III) does not discuss this much.
>
> That's definitely one area in which the docs are lacking -- I've heard
> several complaints about this myself. I think we've been hesitant to
> do more in part because the docs must always be categorically correct,
> and must not use weasel words. I think it's hard to talk about
> performance while maintaining the general tone of the documentation. I
> don't know what can be done about that.

Would the wiki be a good place for such tips?  Not as formal as the
documentation, and more centralized (and editable) than a collection
of blog posts.

Cheers,

Jeff




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Peter Geoghegan
On Thu, Mar 10, 2016 at 11:18 PM, Fabien COELHO  wrote:
> I can only concur!
>
> The "Performance Tips" chapter (II.14) is more user/query oriented. The
> "Server Administration" book (III) does not discuss this much.

That's definitely one area in which the docs are lacking -- I've heard
several complaints about this myself. I think we've been hesitant to
do more in part because the docs must always be categorically correct,
and must not use weasel words. I think it's hard to talk about
performance while maintaining the general tone of the documentation. I
don't know what can be done about that.

-- 
Peter Geoghegan




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Fabien COELHO



As you wish. I thought that understanding the underlying performance model,
with sequential writes written in chunks, is important for the admin, and as
this GUC has an impact on performance it should be hinted at, including the
limits of its effect, where large databases will converge to random IO
performance. But maybe that is not the right place.


I do agree that that's something interesting to document somewhere. But
I don't think any of the current places in the documentation are a good
fit, and it's a topic much more general than the feature we're debating
here.  I'm not volunteering, but a good discussion of storage and the
interactions with postgres surely would be a significant improvement to
the postgres docs.


I can only concur!

The "Performance Tips" chapter (II.14) is more user/query oriented. The 
"Server Administration" book (III) does not discuss this much.


There is a wiki about performance tuning, but it is not integrated into 
the documentation. It could be a first documentation source.


Also the READMEs in some development directories are very interesting, 
although they contain too many details about the implementation.


There have been a lot of presentations over the years, and blog posts.

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Andres Freund
On 2016-03-11 00:23:56 +0100, Fabien COELHO wrote:
> As you wish. I thought that understanding the underlying performance model,
> with sequential writes written in chunks, is important for the admin, and as
> this GUC has an impact on performance it should be hinted at, including the
> limits of its effect, where large databases will converge to random IO
> performance. But maybe that is not the right place.

I do agree that that's something interesting to document somewhere. But
I don't think any of the current places in the documentation are a good
fit, and it's a topic much more general than the feature we're debating
here.  I'm not volunteering, but a good discussion of storage and the
interactions with postgres surely would be a significant improvement to
the postgres docs.


- Andres




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Fabien COELHO


[...]


If the default is in pages, maybe you could state it and afterwards
translate it into a size.


Hm, I think that's more complicated for users than it's worth.


As you wish. I liked the number of pages you used initially because it 
really gives a hint of how many random IOs are avoided when they are 
contiguous, and I do not have the same intuition with sizes. Also it 
is related to the IO queue length managed by the OS.



The text could say something about sequential write performance because
pages are sorted..., but that it is lost for large databases and/or short
checkpoints?


I think that's an implementation detail.


As you wish. I thought that understanding the underlying performance model, 
with sequential writes written in chunks, is important for the admin, and 
as this GUC has an impact on performance it should be hinted at, including 
the limits of its effect, where large databases will converge to random IO 
performance. But maybe that is not the right place.


--
Fabien




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Fabien COELHO


Hello Andres,


I'm not sure I've seen these performance figures... If you have hard evidence,
please feel free to share it.


Man, are you intentionally trying to be hard to work with?


Sorry, I do not understand this remark.

You were referring to some latency measures in your answer, and I was just 
stating that I was interested in seeing the figures that were used to 
justify your choice to keep a shared writeback context.


I did not intend this wish to be an issue, I was expressing an interest.


To quote the email you responded to:

My current plan is to commit this with the current behaviour (as in 
this week[end]), and then do some actual benchmarking on this specific 
part. It's imo a relatively minor detail.


Good.

From the evidence in the thread, I would have given preference to the 
per-tablespace context, but this is just a personal opinion and I agree 
that it can work the other way around.


I look forward to seeing these benchmarks later on, when you have them.

So all is well, and hopefully will be even better later on.

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Andres Freund
On 2016-03-10 23:43:46 +0100, Fabien COELHO wrote:
> 
> >   
> >Whenever more than bgwriter_flush_after bytes have
> >been written by the bgwriter, attempt to force the OS to issue these
> >writes to the underlying storage.  Doing so will limit the amount of
> >dirty data in the kernel's page cache, reducing the likelihood of
> >stalls when an fsync is issued at the end of a checkpoint, or when
> >the OS writes data back  in larger batches in the background.  Often
> >that will result in greatly reduced transaction latency, but there
> >also are some cases, especially with workloads that are bigger than
> >shared_buffers, but smaller than the OS's page
> >cache, where performance might degrade.  This setting may have no
> >effect on some platforms.  0 disables controlled
> >writeback. The default is 256Kb on Linux, 0
> >otherwise. This parameter can only be set in the
> >postgresql.conf file or on the server command line.
> >   
> >
> >(plus adjustments for the other gucs)

> What about the maximum value?

Added.

  
   bgwriter_flush_after (int)

 Whenever more than bgwriter_flush_after bytes have
 been written by the bgwriter, attempt to force the OS to issue these
 writes to the underlying storage.  Doing so will limit the amount of
 dirty data in the kernel's page cache, reducing the likelihood of
 stalls when an fsync is issued at the end of a checkpoint, or when
 the OS writes data back in larger batches in the background.  Often
 that will result in greatly reduced transaction latency, but there
 also are some cases, especially with workloads that are bigger than
 shared_buffers, but smaller than the OS's page
 cache, where performance might degrade.  This setting may have no
 effect on some platforms.  The valid range is between
 0, which disables controlled writeback, and
 2MB.  The default is 256Kb on Linux,
 0 elsewhere.  (Non-default values of
 BLCKSZ change the default and maximum.)
 This parameter can only be set in the postgresql.conf
 file or on the server command line.
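
To make the mechanism concrete, here is a minimal, self-contained Linux sketch
of what "attempt to force the OS to issue these writes" amounts to (an
illustrative example, not PostgreSQL source; the file name, sizes and loop are
arbitrary assumptions): after every flush_after bytes written,
sync_file_range() asks the kernel to start writeback for that range without
waiting for it, so dirty data cannot pile up until the final fsync().

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const size_t flush_after = 256 * 1024;   /* 256kB, as in the Linux default */
    char         buf[8192];                  /* one 8kB block */
    size_t       written_since_flush = 0;
    off_t        flush_start = 0;
    int          fd = open("datafile", O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return 1;
    memset(buf, 'x', sizeof(buf));

    for (int i = 0; i < 1024; i++)           /* write 8MB in 8kB blocks */
    {
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
            break;
        written_since_flush += sizeof(buf);

        if (written_since_flush >= flush_after)
        {
            /* start writeback for the range just written; does not block */
            sync_file_range(fd, flush_start, written_since_flush,
                            SYNC_FILE_RANGE_WRITE);
            flush_start += written_since_flush;
            written_since_flush = 0;
        }
    }

    fsync(fd);   /* the final fsync now has far less accumulated work to do */
    close(fd);
    return 0;
}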

   
  
 


> If the default is in pages, maybe you could state it and afterwards
> translate it into a size.

Hm, I think that's more complicated for users than it's worth.


> The text could say something about sequential write performance because
> pages are sorted..., but that it is lost for large databases and/or short
> checkpoints?

I think that's an implementation detail.


- Andres




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Fabien COELHO



   
Whenever more than bgwriter_flush_after bytes have
been written by the bgwriter, attempt to force the OS to issue these
writes to the underlying storage.  Doing so will limit the amount of
dirty data in the kernel's page cache, reducing the likelihood of
stalls when an fsync is issued at the end of a checkpoint, or when
the OS writes data back  in larger batches in the background.  Often
that will result in greatly reduced transaction latency, but there
also are some cases, especially with workloads that are bigger than
shared_buffers, but smaller than the OS's page
cache, where performance might degrade.  This setting may have no
effect on some platforms.  0 disables controlled
writeback. The default is 256Kb on Linux, 0
otherwise. This parameter can only be set in the
postgresql.conf file or on the server command line.
   

(plus adjustments for the other gucs)


Some suggestions:

What about the maximum value?

If the default is in pages, maybe you could state it and afterwards 
translate it into a size.


"The default is 64 pages on Linux (usually 256Kb)..."

The text could say something about sequential write performance because 
pages are sorted..., but that it is lost for large databases and/or short 
checkpoints?


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Andres Freund
On 2016-03-10 23:38:38 +0100, Fabien COELHO wrote:
> I'm not sure I've seen these performance figures... If you have hard evidence,
> please feel free to share it.

Man, are you intentionally trying to be hard to work with?  To quote the
email you responded to:

> My current plan is to commit this with the current behaviour (as in this
> week[end]), and then do some actual benchmarking on this specific
> part. It's imo a relatively minor detail.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Fabien COELHO


[...]

I had originally kept it with one context per tablespace after 
refactoring this, but found that it gave worse results in rate limited 
loads even over only two tablespaces. That's on SSDs though.


Might just mean that a smaller context size is better on SSD, and it could 
still be better per table space.



The number of pages still in writeback (i.e. for which sync_file_range
has been issued, but which haven't finished running yet) at the end of
the checkpoint matters for the latency hit incurred by the fsync()s from
smgrsync(); at least by my measurement.


I'm not sure I've seen these performance figures... If you have hard evidence, 
please feel free to share it.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Andres Freund
On 2016-03-10 17:33:33 -0500, Robert Haas wrote:
> On Thu, Mar 10, 2016 at 5:24 PM, Andres Freund  wrote:
> > On 2016-02-21 09:49:53 +0530, Robert Haas wrote:
> >> I think there might be a semantic distinction between these two terms.
> >> Doesn't writeback mean writing pages to disk, and flushing mean making
> >> sure that they are durably on disk?  So for example when the Linux
> >> kernel thinks there is too much dirty data, it initiates writeback,
> >> not a flush; on the other hand, at transaction commit, we initiate a
> >> flush, not writeback.
> >
> > I don't think terminology is sufficiently clear to make such a
> > distinction. Take e.g. our FlushBuffer()...
> 
> Well then we should clarify it!

Trying that as we speak, err, write. How about:

 Whenever more than bgwriter_flush_after bytes have
 been written by the bgwriter, attempt to force the OS to issue these
 writes to the underlying storage.  Doing so will limit the amount of
 dirty data in the kernel's page cache, reducing the likelihood of
 stalls when an fsync is issued at the end of a checkpoint, or when
 the OS writes data back  in larger batches in the background.  Often
 that will result in greatly reduced transaction latency, but there
 also are some cases, especially with workloads that are bigger than
 shared_buffers, but smaller than the OS's page
 cache, where performance might degrade.  This setting may have no
 effect on some platforms.  0 disables controlled
 writeback. The default is 256Kb on Linux, 0
 otherwise. This parameter can only be set in the
 postgresql.conf file or on the server command line.


(plus adjustments for the other gucs)




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Robert Haas
On Thu, Mar 10, 2016 at 5:24 PM, Andres Freund  wrote:
> On 2016-02-21 09:49:53 +0530, Robert Haas wrote:
>> I think there might be a semantic distinction between these two terms.
>> Doesn't writeback mean writing pages to disk, and flushing mean making
>> sure that they are durably on disk?  So for example when the Linux
>> kernel thinks there is too much dirty data, it initiates writeback,
>> not a flush; on the other hand, at transaction commit, we initiate a
>> flush, not writeback.
>
> I don't think terminology is sufficiently clear to make such a
> distinction. Take e.g. our FlushBuffer()...

Well then we should clarify it!

:-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Andres Freund
On 2016-02-21 09:49:53 +0530, Robert Haas wrote:
> I think there might be a semantic distinction between these two terms.
> Doesn't writeback mean writing pages to disk, and flushing mean making
> sure that they are durably on disk?  So for example when the Linux
> kernel thinks there is too much dirty data, it initiates writeback,
> not a flush; on the other hand, at transaction commit, we initiate a
> flush, not writeback.

I don't think terminology is sufficiently clear to make such a
distinction. Take e.g. our FlushBuffer()...




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Andres Freund
On 2016-03-08 09:28:15 +0100, Fabien COELHO wrote:
> 
> >>>Now I cannot see how having one context per table space would have a
> >>>significant negative performance impact.
> >>
> >>The 'dirty data' etc. limits are global, not per block device. By having
> >>several contexts with unflushed dirty data the total amount of dirty
> >>data in the kernel increases.
> >
> >Possibly, but how much?  Do you have experimental data to back up that
> >this is really an issue?
> >
> >We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB
> >of dirty buffers to manage for 16 table spaces, I do not see that as a
> >major issue for the kernel.

We flush in those increments; that doesn't mean there's only that much
dirty data. I regularly see one order of magnitude more being dirty.


I had originally kept it with one context per tablespace after
refactoring this, but found that it gave worse results in rate limited
loads even over only two tablespaces. That's on SSDs though.


> To complete the argument, the 4MB is just a worst-case scenario; in reality,
> flushing the different contexts would be randomized over time, so the
> frequency of flushing a context would be exactly the same in both cases
> (shared or per-tablespace context) if the checkpoints are the same size. The
> difference is that with a shared context each flush potentially targets all
> tablespaces with a few pages each, while with the other version each flush
> targets one tablespace only.

The number of pages still in writeback (i.e. for which sync_file_range
has been issued, but which haven't finished running yet) at the end of
the checkpoint matters for the latency hit incurred by the fsync()s from
smgrsync(); at least by my measurement.


My current plan is to commit this with the current behaviour (as in this
week[end]), and then do some actual benchmarking on this specific
part. It's imo a relatively minor detail.

Greetings,

Andres Freund




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-08 Thread Fabien COELHO



Now I cannot see how having one context per table space would have a
significant negative performance impact.


The 'dirty data' etc. limits are global, not per block device. By having
several contexts with unflushed dirty data the total amount of dirty
data in the kernel increases.


Possibly, but how much?  Do you have experimental data to back up that this 
is really an issue?


We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB of 
dirty buffers to manage for 16 table spaces, I do not see that as a major 
issue for the kernel.


More thoughts about your theoretical argument:

To complete the argument, the 4MB is just a worst-case scenario; in 
reality, flushing the different contexts would be randomized over time, so 
the frequency of flushing a context would be exactly the same in both 
cases (shared or per-tablespace context) if the checkpoints are the same 
size. The difference is that with a shared context each flush potentially 
targets all tablespaces with a few pages each, while with the other 
version each flush targets one tablespace only.


So my handwaving analysis is that the flow of dirty buffers is the same 
with both approaches, but with the shared version buffers are more equally 
distributed across tablespaces, hence reducing sequential write 
effectiveness, while with the other the dirty buffers are grouped more 
clearly per tablespace, so it should get better sequential write 
performance.
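
To illustrate the aggregation being argued about, here is a simplified sketch
(not the patch's actual WritebackContext code; the names and the fixed BLCKSZ
are assumptions for the example): with writes sorted by file and block number,
one sync_file_range() call can cover each maximal run of contiguous blocks, so
a sequential run reaches the device as a single writeback request instead of
many small random ones.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192                 /* assumed PostgreSQL block size */

typedef struct PendingWrite
{
    int      fd;                    /* file the dirty block belongs to */
    uint32_t blkno;                 /* block number within that file */
} PendingWrite;

/* 'writes' must already be sorted by (fd, blkno) */
static void
flush_coalesced(const PendingWrite *writes, size_t n)
{
    size_t i = 0;

    while (i < n)
    {
        size_t j = i + 1;

        /* extend the run while blocks are adjacent within the same file */
        while (j < n &&
               writes[j].fd == writes[i].fd &&
               writes[j].blkno == writes[j - 1].blkno + 1)
            j++;

        /* one writeback request for the whole contiguous run */
        (void) sync_file_range(writes[i].fd,
                               (off_t) writes[i].blkno * BLCKSZ,
                               (off_t) (j - i) * BLCKSZ,
                               SYNC_FILE_RANGE_WRITE);
        i = j;
    }
}

With a shared context the sorted array interleaves many files, so each run is
short; with one context per tablespace (one device) the runs stay long, which
is exactly the sequential-write effect described above.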



--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-07 Thread Fabien COELHO


Hello Andres,


Now I cannot see how having one context per table space would have a
significant negative performance impact.


The 'dirty data' etc. limits are global, not per block device. By having
several contexts with unflushed dirty data the total amount of dirty
data in the kernel increases.


Possibly, but how much?  Do you have experimental data to back up that 
this is really an issue?


We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB 
of dirty buffers to manage for 16 table spaces, I do not see that as a 
major issue for the kernel.


Thus you're more likely to see stalls by the kernel moving pages into 
writeback.


I do not see the above data having a 30% negative impact on tps, given the 
quite small amount of data under discussion, and switching to random IOs 
costs so much that it must really be avoided.


Without further experimental data, I still think that one context per 
tablespace is the reasonable choice.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-07 Thread Andres Freund
On 2016-03-07 21:10:19 +0100, Fabien COELHO wrote:
> Now I cannot see how having one context per table space would have a
> significant negative performance impact.

The 'dirty data' etc. limits are global, not per block device. By having
several contexts with unflushed dirty data the total amount of dirty
data in the kernel increases. Thus you're more likely to see stalls by
the kernel moving pages into writeback.
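
The global nature of that accounting is easy to see on Linux: /proc/meminfo
exposes single system-wide "Dirty" and "Writeback" counters, regardless of how
many block devices or tablespaces are involved. A tiny illustrative program
(not part of the patch) that prints them, e.g. while a benchmark runs:

#include <stdio.h>
#include <string.h>

int
main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char  line[256];

    if (f == NULL)
        return 1;

    while (fgets(line, sizeof(line), f) != NULL)
    {
        /* system-wide counters: all dirty pages and all pages under writeback */
        if (strncmp(line, "Dirty:", 6) == 0 ||
            strncmp(line, "Writeback:", 10) == 0)
            fputs(line, stdout);
    }

    fclose(f);
    return 0;
}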

Andres




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-07 Thread Fabien COELHO


Hello Andres,


(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
   per second avg, stddev [ min q1 median d3 max ] <=300tps
   679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%

(2) with 1 tablespace on 1 disk : 956.0 tps
   per second avg, stddev [ min q1 median d3 max ] <=300tps
   956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%


Well, that's not a particularly meaningful workload. You considerably 
increased the number of flushes to the same number of disks.


It is just a simple workload designed to emphasize the effect of having 
one context shared across all tablespaces instead of one per tablespace, 
without rewriting the patch and without a large host with multiple disks.


For a meaningful comparison you'd have to compare using one writeback 
context for N tablespaces on N separate disks/raids, and using N 
writeback contexts for the same.


Sure, it would be better to do that, but that would require (1) rewriting 
the patch, which is a small amount of work, and also (2) having access to a 
machine with a number of disks/RAIDs, which I do NOT have available.



What happens in the 16-tablespace workload is that much smaller flushes are 
performed on the 16 files written in parallel, so the tps performance is 
significantly degraded, despite the writes being sorted in each file. With 
one tablespace, all flushed buffers are in the same file, so flushes are 
much more effective.


When the context is shared and checkpointer buffer writes are balanced 
across tablespaces, then when the limit is reached each flush gets only a 
few buffers per tablespace, so sequential writes are limited to a few 
buffers, hence the performance degradation.


So I can explain the performance degradation *because* the flush context 
is shared between the tablespaces, which is a logical argument backed 
by experimental data, so it is better than handwaving. Given the 
available hardware, this is the best proof I can offer that the context 
should be per tablespace.


Now I cannot see how having one context per table space would have a 
significant negative performance impact.


So the logical conclusion for me is that without further experimental data 
it is better to have one context per table space.


If you have hardware with plenty of disks available for testing, that would 
provide better data, obviously.


--
Fabien.


Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-07 Thread Andres Freund
On 2016-02-22 20:44:35 +0100, Fabien COELHO wrote:
> 
> >>Random updates on 16 tables which total 1.1GB of data, so this is all in
> >>buffers; no significant "read" traffic.
> >>
> >>(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
> >>per second avg, stddev [ min q1 median d3 max ] <=300tps
> >>679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
> >>
> >>(2) with 1 tablespace on 1 disk : 956.0 tps
> >>per second avg, stddev [ min q1 median d3 max ] <=300tps
> >>956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%
> >
> >Interesting. That doesn't reflect my own tests, even on rotating media,
> >at all. I wonder if it's related to:
> >https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
> >
> >If you use your 12.04 kernel, that'd not be fixed. Which might be a
> >reason to do it as you suggest.
> >
> >Could you share the exact details of that workload?
> 
> See attached scripts (sh to create the 16 tables in the default or 16 table
> spaces, small sql bench script, stat computation script).
> 
> The per-second stats were computed with:
> 
>   grep progress: pgbench.out | cut -d' ' -f4 | avg.py --length=1000 
> --limit=300
> 
> Host is 8 cpu 16 GB, 2 HDD in RAID 1.

Well, that's not a particularly meaningful workload. You considerably
increased the number of flushes to the same number of disks. For a
meaningful comparison you'd have to compare using one writeback context
for N tablespaces on N separate disks/raids, and using N writeback
contexts for the same.

Andres




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-22 Thread Fabien COELHO



Random updates on 16 tables which total 1.1GB of data, so this is all in
buffers; no significant "read" traffic.

(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
per second avg, stddev [ min q1 median d3 max ] <=300tps
679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%

(2) with 1 tablespace on 1 disk : 956.0 tps
per second avg, stddev [ min q1 median d3 max ] <=300tps
956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%


Interesting. That doesn't reflect my own tests, even on rotating media,
at all. I wonder if it's related to:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5

If you use your 12.04 kernel, that'd not be fixed. Which might be a
reason to do it as you suggest.

Could you share the exact details of that workload?


See attached scripts (a shell script to create the 16 tables in the default 
tablespace or in 16 tablespaces, a small SQL bench script, and a stat 
computation script).


The per-second stats were computed with:

  grep progress: pgbench.out | cut -d' ' -f4 | avg.py --length=1000 --limit=300

Host is 8 cpu 16 GB, 2 HDD in RAID 1.

--
Fabien.

ts_create.sh
Description: Bourne shell script


ts_test.sql
Description: application/sql
#! /usr/bin/env python
# -*- coding: utf-8 -*-
#
# $Id: avg.py 1242 2016-02-06 14:44:02Z coelho $
#

import argparse
ap = argparse.ArgumentParser(description='show stats about data: count average stddev [min q1 median q3 max]...')
ap.add_argument('--median', default=True, action='store_true',
help='compute median and quartile values')
ap.add_argument('--no-median', dest='median', default=True,
action='store_false',
help='do not compute median and quartile values')
ap.add_argument('--more', default=False, action='store_true',
help='show some more stats')
ap.add_argument('--limit', type=float, default=None,
help='set limit for counting below limit values')
ap.add_argument('--length', type=int, default=None,
help='set expected length, assume 0 if beyond')
ap.add_argument('--precision', type=int, default=1,
help='floating point precision')
ap.add_argument('file', nargs='*', help='list of files to process')
opt = ap.parse_args()

# option consistency
if opt.limit != None:
	opt.more = True
if opt.more:
	opt.median = True

# reset arguments for fileinput
import sys
sys.argv[1:] = opt.file

import fileinput

n, skipped, vals = 0, 0, []
k, vmin, vmax = None, None, None
sum1, sum2 = 0.0, 0.0

for line in fileinput.input():
	try:
		v = float(line)
		if opt.median: # keep track only if needed
			vals.append(v)
		if k is None: # first time
			k, vmin, vmax = v, v, v
		else: # next time
			vmin = min(vmin, v)
			vmax = max(vmax, v)
		n += 1
		vmk = v - k
		sum1 += vmk
		sum2 += vmk * vmk
	except ValueError: # float conversion failed
		skipped += 1

if n == 0:
	# avoid ops on None below
	k, vmin, vmax = 0.0, 0.0, 0.0

if opt.length:
	assert n > 0, "some data seen"
	missing = int(opt.length) - len(vals)
	assert missing >= 0, "positive number of missing data"
	if missing > 0:
		print("warning: %d missing data, expanding with zeros" % missing)
		if opt.median:
			vals += [ 0.0 ] * missing
		vmin = min(vmin, 0.0)
		sum1 += - k * missing
		sum2 += k * k * missing
		n += missing
		assert len(vals) == int(opt.length)

if opt.median:
	assert len(vals) == n, "consistent length"

# five numbers...
# numpy.percentile requires numpy at least 1.9 to use 'midpoint'
# statistics.median requires python 3.4 (?)
def median(vals, start, length):
	if len(vals) == 1:
		start, length = 0, 1
	m, odd = divmod(length, 2)
	#return 0.5 * (vals[start + m + odd - 1] + vals[start + m])
	return  vals[start + m] if odd else \
		0.5 * (vals[start + m-1] + vals[start + m])

# return ratio of below limit (limit included) values
def below(vals, limit):
	# hmmm... short but generates a list
	#return float(len([ v for v in vals if v <= limit ])) / len(vals)
	below_limit = 0
	for v in vals:
		if v <= limit:
			below_limit += 1
	return float(below_limit) / len(vals)

# float prettyprint with precision
def f(v):
	return ('%.' + str(opt.precision) + 'f') % v

# output
if skipped:
	print("warning: %d lines skipped" % skipped)

if n > 0:
	# show result (hmmm, precision is truncated...)
	from math import sqrt
	avg, stddev = k + sum1 / n, sqrt((sum2 - (sum1 * sum1) / n) / n)
	if opt.median:
		vals.sort()
		med = median(vals, 0, len(vals))
		# not sure about odd/even issues here... q3 needs fixing if len is 1
		q1 = median(vals, 0, len(vals) // 2)
		q3 = median(vals, (len(vals)+1) // 2, len(vals) // 2)
		# build summary message
		msg = "avg over %d: %s ± %s [%s, %s, %s, %s, %s]" % \
			  (n, f(avg), f(stddev), f(vmin), f(q1), f(med), f(q3), f(vmax))
		if opt.more:
			limit = opt.limit if opt.limit != None else 0.1 * med
			# msg += " <=%s:" % f(limit)
			msg += " %s%%" % f(100.0 * below(vals, limit))
	else:
		msg = "avg over %d: %s ± %s [%s, %s]" % \
			  (n, f(avg), f(stddev), f(vmin), f(vmax))
else:
	msg = "no data"

# the rest of the script was truncated in the archived message; printing
# the summary line computed above is assumed to be the final step
print(msg)

Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-22 Thread Andres Freund
On 2016-02-22 11:05:20 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Interesting. That doesn't reflect my own tests, even on rotating media,
> > at all. I wonder if it's related to:
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
> 
> > If you use your 12.04 kernel, that'd not be fixed. Which might be a
> > reason to do it as you suggest.
> 
> Hmm ... that kernel commit is less than 4 months old.  Would it be
> reflected in *any* production kernels yet?

Probably not - so far I thought it mainly has some performance benefits
on relatively extreme workloads, where without the patch, flushing still
is better performance-wise than not flushing. But in the scenario Fabien
has brought up it seems quite possible that sync_file_range emitting
"storage cache flush" instructions could explain the rather large
performance difference between his and my experiments.

Regards,

Andres




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-22 Thread Tom Lane
Andres Freund  writes:
> Interesting. That doesn't reflect my own tests, even on rotating media,
> at all. I wonder if it's related to:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5

> If you use your 12.04 kernel, that'd not be fixed. Which might be a
> reason to do it as you suggest.

Hmm ... that kernel commit is less than 4 months old.  Would it be
reflected in *any* production kernels yet?

regards, tom lane




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-22 Thread Andres Freund
On 2016-02-22 14:11:05 +0100, Fabien COELHO wrote:
> 
> >I did a quick & small test with random updates on 16 tables with
> >checkpoint_flush_after=16 checkpoint_timeout=30
> 
> Another run with more "normal" settings and over 1000 seconds, so less
> "quick & small" than the previous one.
> 
>  checkpoint_flush_after = 16
>  checkpoint_timeout = 5min # default
>  shared_buffers = 2GB # 1/8 of available memory
> 
> Random updates on 16 tables which total 1.1GB of data, so this is all in
> buffers; no significant "read" traffic.
> 
> (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
> per second avg, stddev [ min q1 median d3 max ] <=300tps
> 679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
> 
> (2) with 1 tablespace on 1 disk : 956.0 tps
> per second avg, stddev [ min q1 median d3 max ] <=300tps
> 956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%

Interesting. That doesn't reflect my own tests, even on rotating media,
at all. I wonder if it's related to:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5

If you use your 12.04 kernel, that'd not be fixed. Which might be a
reason to do it as you suggest.

Could you share the exact details of that workload?

Greetings,

Andres Freund




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-22 Thread Fabien COELHO


I did a quick & small test with random updates on 16 tables with 
checkpoint_flush_after=16 checkpoint_timeout=30


Another run with more "normal" settings and over 1000 seconds, so less 
"quick & small" than the previous one.


 checkpoint_flush_after = 16
 checkpoint_timeout = 5min # default
 shared_buffers = 2GB # 1/8 of available memory

Random updates on 16 tables which total 1.1GB of data, so this is all in 
buffers; no significant "read" traffic.


(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
per second avg, stddev [ min q1 median d3 max ] <=300tps
679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%

(2) with 1 tablespace on 1 disk : 956.0 tps
per second avg, stddev [ min q1 median d3 max ] <=300tps
956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%

--
Fabien.


Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-22 Thread Fabien COELHO


Hallo Andres,


AFAICR I used a "flush context" for each tablespace in some version
I submitted, because I do think that this whole writeback logic really
does make sense *per tablespace*, which suggests that there should be as
many writeback contexts as tablespaces, otherwise the positive effect
may be totally lost when tablespaces are used. Any thoughts?


Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other.  For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.


I did a quick & small test with random updates on 16 tables with 
checkpoint_flush_after=16 checkpoint_timeout=30


(1) with 16 tablespaces (1 per table, but same disk) :
tps = 1100, 27% time under 100 tps

(2) with 1 tablespace :
tps = 1200,  3% time under 100 tps

This result is logical: with one writeback context shared between 
tablespaces the sync_file_range is issued on a few buffers per file at a 
time on the 16 files, no coalescing occurs there, so this results in random 
IOs, while with one tablespace all writes are aggregated per file.


ISTM that this quick test shows that writeback contexts are relevant per 
tablespace, as I expected.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-21 Thread Fabien COELHO



ISTM that "progress" and "progress_slice" only depend on num_scanned and
per-tablespace num_to_scan and total num_to_scan, so they are somehow
redundant and the progress could be recomputed from the initial figures
when needed.


They don't cause much space usage, and we access the values frequently. 
So why not store them?


The same question would work the other way around: these values are one 
division away, why not compute them when needed? No big deal.


[...] Given realistic amounts of memory the max potential "skew" seems 
fairly small with float8. If we ever flush one buffer "too much" for a 
tablespace it's pretty much harmless.


I do agree. I'm suggesting that a comment should be added to justify why 
float8 accuracy is okay.



I see a binary_heap_allocate but no corresponding deallocation, this
looks like a memory leak... or is there some magic involved?


Hm. I think we really should use a memory context for all of this - we
could after all error out somewhere in the middle...


I'm not sure that a memory context is justified here, there are only two 
mallocs and the checkpointer works for very long times. I think that it is 
simpler to just get the malloc/free right.


[...] I'm not arguing for ripping it out, what I mean is that we don't 
set a nondefault value for the GUCs on platforms with just 
posix_fadvise available...


Ok with that.

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-21 Thread Fabien COELHO



[...] I do think that this whole writeback logic really does make
sense *per table space*,


Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other.  For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.


I do not understand/buy this argument.

The underlying IO queue is per device, and table spaces should be per device
as well (otherwise what's the point?), so you should want to coalesce and
"writeback" pages per device as well. Calls to sync_file_range on distinct
devices should probably be issued more or less randomly, and should not
interfere with one another.


The kernel's dirty buffer accounting is global, not per block device.


Sure, but this is not my point. My point is that "sync_file_range" moves 
buffers to the device io queues, which are per device. If there is one 
queue in pg and many queues on many devices, the whole point of coalescing 
to get sequential writes is somehow lost.



It's also actually rather common to have multiple tablespaces on a
single block device. Especially if SANs and such are involved; where you
don't even know which partitions are on which disks.


Ok, some people would not benefit if they use many tablespaces on one 
device; too bad, but that does not look like a very useful setting anyway, 
and I do not think it would harm much in this case.



If you use just one context, the more tablespaces the smaller the performance
gains, because there is less and less aggregation and thus fewer sequential
writes per device.

So for me there should really be one context per tablespace. That would
suggest a hashtable or some other structure to keep and retrieve them, which
would not be that bad, and I think that it is what is needed.


That'd be much easier to do by just keeping the context in the
per-tablespace struct. But anyway, I'm really doubtful about going for
that; I had it that way earlier, and observing IO showed it not being
beneficial.


ISTM that you would need a significant number of tablespaces to see the 
benefit. If you do not do that, the more table spaces the more random the 
IOs, which is disappointing. Also, "the cost is marginal", so I do not see 
any good argument not to do it.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-21 Thread Andres Freund
Hi,

On 2016-02-21 10:52:45 +0100, Fabien COELHO wrote:
> * CkptSortItem:
> 
> I think that allocating 20 bytes per buffer in shared memory is a little on
> the heavy side. Some compression can be achieved: sizeof(ForkNum) is 4 bytes
> to hold 4 values, could be one byte or even 2 bits somewhere. Also, there
> are very few tablespaces, they could be given a small number and this number
> could be used instead of the Oid, so the space requirement could be reduced
> to say 16 bytes per buffer by combining space & fork in 2 shorts and keeping
> 4 bytes alignment and also getting 8 byte alignment... If this is too
> much, I have shown that it can work with only 4 bytes per buffer, as the
> sorting is really just a performance optimisation and is not broken if some
> stuff changes between sorting & writeback, but you did not like the idea. If
> the amount of shared memory required is a significant concern, it could be
> resurrected, though.

This is less than 0.2 % of memory related to shared buffers. We have the
same amount of memory allocated in CheckpointerShmemSize(), and nobody
has complained so far.  And sorry, going back to the previous approach
isn't going to fly, and I've no desire to discuss that *again*.


> ISTM that "progress" and "progress_slice" only depend on num_scanned and
> per-tablespace num_to_scan and total num_to_scan, so they are somehow
> redundant and the progress could be recomputed from the initial figures
> when needed.

They don't cause much space usage, and we access the values
frequently. So why not store them?


> If these fields are kept, I think that a comment should justify why float8
> precision is okay for the purpose. I think it is quite certainly fine in the
> worst case with 32 bits buffer_ids, but it would not be if this size is
> changed someday.

That seems pretty much unrelated to having the fields - the question of
accuracy plays a role regardless, no? Given realistic amounts of memory
the max potential "skew" seems fairly small with float8. If we ever
flush one buffer "too much" for a tablespace it's pretty much harmless.

> ISTM that nearly all of the collected data on the second sweep could be
> collected on the first sweep, so that this second sweep could be avoided
> altogether. The only missing data is the index of the first buffer in the
> array, which can be computed by considering tablespaces only, sweeping over
> buffers is not needed. That would suggest creating the heap or using a hash
> in the initial buffer sweep to keep this information. This would also
> provide a point where to number tablespaces for compressing the CkptSortItem
> struct.

Doesn't seem worth the complexity to me.


> I'm wondering about calling CheckpointWriteDelay on each round, maybe
> a minimum amount of write would make sense.

Why? There's not really much benefit of doing more work than needed. I
think we should sleep far shorter in many cases, but that's indeed a
separate issue.

> I see a binary_heap_allocate but no corresponding deallocation, this
> looks like a memory leak... or is there some magic involved?

Hm. I think we really should use a memory context for all of this - we
could after all error out somewhere in the middle...


> >I think this patch primarily needs:
> >* Benchmarking on FreeBSD/OSX to see whether we should enable the
> > mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
> > inclined to leave it off till then.
> 
> I do not have that. As "msync" seems available on Linux, it is possible to
> force using it with a "ifdef 0" to skip sync_file_range and check whether it
> does some good there.

Unfortunately it doesn't work well on linux:
 * On many OSs msync() on a mmap'ed file triggers writeback. On linux
 * it only does so when MS_SYNC is specified, but then it does the
 * writeback synchronously. Luckily all common linux systems have
 * sync_file_range().  This is preferable over FADV_DONTNEED because
 * it doesn't flush out clean data.

I've verified beforehand, with a simple demo program, that
msync(MS_ASYNC) does something reasonable on FreeBSD...
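
A sketch of the kind of demo program mentioned (an assumed shape, not the
actual program used): dirty an mmap'ed file and request asynchronous writeback
with msync(MS_ASYNC), which is the behaviour that would need benchmarking on
FreeBSD/OSX before enabling the method by default.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    const size_t len = 1024 * 8192;          /* 8MB test file */
    int          fd = open("mmap_test", O_RDWR | O_CREAT, 0600);
    char        *p;

    if (fd < 0 || ftruncate(fd, (off_t) len) != 0)
        return 1;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    memset(p, 'x', len);                      /* dirty every page */

    /* request writeback of the range without waiting for completion */
    if (msync(p, len, MS_ASYNC) != 0)
        return 1;

    munmap(p, len);
    close(fd);
    return 0;
}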


> Idem for the "posix_fadvise" stuff. I can try to do
> that, but it takes time to do so, if someone can test on other OS it would
> be much better. I think that if it works it should be kept in, so it is just
> a matter of testing it.

I'm not arguing for ripping it out, what I mean is that we don't set a
nondefault value for the GUCs on platforms with just posix_fadivise
available...

Greetings,

Andres Freund




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-21 Thread Andres Freund
On 2016-02-21 08:26:28 +0100, Fabien COELHO wrote:
> >>In the discussion in the wal section, I'm not sure about the effect of
> >>setting writebacks on SSD, [...]
> >
> >Yea, that paragraph needs some editing. I think we should basically
> >remove that last sentence.
> 
> Ok, fine with me. Does that mean that flushing has a significant positive
> impact on SSD in your tests?

Yes. The reason we need flushing is that the kernel amasses dirty pages,
and then flushes them at once. That hurts for both SSDs and rotational
media. Sorting is the bigger question, but I've seen it have clearly
beneficial performance impacts. I guess if you look at devices with an
internal block size bigger than 8k, you'd even see larger differences.

> >>Maybe the merging strategy could be more aggressive than just strict
> >>neighbors?
> >
> >I don't think so. If you flush more than neighbouring writes you'll
> >often end up flushing buffers dirtied by another backend, causing
> >additional stalls.
> 
> Ok. Maybe the neighbor definition could be relaxed just a little bit so
> that small holes are bridged over, but not large holes? If there are only a
> few pages in between, even if written by another process, then writing them
> together should be better? Well, this can wait for a clear case, because
> hopefully the OS will recoalesce them behind the scenes anyway.

I'm against doing so without clear measurements of a benefit.

> >Also because the infrastructure is used for more than checkpoint
> >writes. There's absolutely no ordering guarantees there.
> 
> Yep, but not much benefit to expect from a few dozen random pages either.

Actually, there's kinda frequently a benefit observable. Even if few
requests can be merged, doing IO requests in an order more likely doable
within a few rotations is beneficial. Also, the cost is marginal, so why
worry?

> >>[...] I do think that this whole writeback logic really does make
> >>sense *per table space*,
> >
> >Leads to less regular IO, because if your tablespaces are evenly sized
> >(somewhat common) you'll sometimes end up issuing sync_file_range's
> >shortly after each other.  For latency outside checkpoints it's
> >important to control the total amount of dirty buffers, and that's
> >obviously independent of tablespaces.
> 
> I do not understand/buy this argument.
> 
> The underlying IO queue is per device, and table spaces should be per device
> as well (otherwise what's the point?), so you should want to coalesce and
> "writeback" pages per device as well. Calls to sync_file_range on distinct
> devices should probably be issued more or less randomly, and should not
> interfere with one another.

The kernel's dirty buffer accounting is global, not per block device.
It's also actually rather common to have multiple tablespaces on a
single block device. Especially if SANs and such are involved; where you
don't even know which partitions are on which disks.


> If you use just one context, the more tablespaces the smaller the performance
> gains, because there is less and less aggregation and thus fewer sequential
> writes per device.
> 
> So for me there should really be one context per tablespace. That would
> suggest a hashtable or some other structure to keep and retrieve them, which
> would not be that bad, and I think that it is what is needed.

That'd be much easier to do by just keeping the context in the
per-tablespace struct. But anyway, I'm really doubtful about going for
that; I had it that way earlier, and observing IO showed it not being
beneficial.


> >>For the checkpointer, a key aspect is that the scheduling process goes
> >>to sleep from time to time, and this sleep time looked like a great
> >opportunity to do this kind of flushing. You chose not to take advantage
> >>of the behavior, why?
> >
> >Several reasons: Most importantly there's absolutely no guarantee that
> >you'll ever end up sleeping, it's quite common to happen only seldomly.
> 
> Well, that would be under a situation when pg is completely unresponsive.
> More so, this behavior *makes* pg unresponsive.

No. The checkpointer being bottlenecked on actual IO performance doesn't
impact production that badly. It'll just sometimes block in
sync_file_range(), but the IO queues will have enough space to
frequently give way to other backends, particularly to synchronous reads
(most pg reads) and synchronous writes (fdatasync()).  So a single
checkpoint will take a bit longer, but otherwise the system will mostly
keep up the work in a regular manner.  Without the sync_file_range()
calls the kernel will amass dirty buffers until global dirty limits are
reached, which then will bring the whole system to a standstill.
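
The "global dirty limits" referred to here are the system-wide Linux vm
thresholds; a tiny illustrative reader (not part of the patch) shows the knobs
involved, the background-writeback threshold and the hard throttle point:

#include <stdio.h>

/* print one value from /proc */
static void
show(const char *path)
{
    FILE *f = fopen(path, "r");
    char  buf[64];

    if (f == NULL)
        return;
    if (fgets(buf, sizeof(buf), f) != NULL)
        printf("%s = %s", path, buf);
    fclose(f);
}

int
main(void)
{
    /* threshold at which background writeback starts */
    show("/proc/sys/vm/dirty_background_ratio");
    show("/proc/sys/vm/dirty_background_bytes");
    /* threshold at which writers are throttled outright */
    show("/proc/sys/vm/dirty_ratio");
    show("/proc/sys/vm/dirty_bytes");
    return 0;
}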

It's pretty common that checkpoint_timeout is too short to be able to
write all shared_buffers out, in that case it's much better to slow down
the whole checkpoint, instead of being incredibly slow at the end.

> >I also don't really believe it helps that much, although that's a complex
> >argument to make.
> 
> Yep. My 

Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-21 Thread Fabien COELHO


Hallo Andres,

Here is a review for the second patch.


For 0002 I've recently changed:
* Removed the sort timing information, we've proven sufficiently that
 it doesn't take a lot of time.


I put it there initially to demonstrate that there was no cache performance 
issue when sorting on just buffer indexes. As it is always small, I agree 
that it is not needed. Well, it could still be in seconds on a very 
large shared buffers setting with a very large checkpoint, but then the 
checkpoint would be tremendously huge...



* Minor comment polishing.


Patch applies and checks on Linux.

* CkptSortItem:

I think that allocating 20 bytes per buffer in shared memory is a little 
on the heavy side. Some compression can be achieved: sizeof(ForkNum) is 4 
bytes to hold 4 values, could be one byte or even 2 bits somewhere. Also, 
there are very few tablespaces, they could be given a small number and 
this number could be used instead of the Oid, so the space requirement 
could be reduced to say 16 bytes per buffer by combining space & fork in 2 
shorts and keeping 4 bytes alignment and also getting 8 byte 
alignment... If this is too much, I have shown that it can work with only 
4 bytes per buffer, as the sorting is really just a performance 
optimisation and is not broken if some stuff changes between sorting & 
writeback, but you did not like the idea. If the amount of shared memory 
required is a significant concern, it could be resurrected, though.
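
A hedged sketch of the space argument (the field names and exact layout are
assumptions for illustration, not the patch's actual CkptSortItem): storing
the fork as a small integer and replacing the tablespace Oid with a
per-checkpoint tablespace index shrinks the per-buffer entry from about 20
bytes to 16 while keeping 4-byte alignment.

#include <stdint.h>

/* roughly the layout under discussion: two Oids, fork, block, buffer id */
typedef struct SortItemWide
{
    uint32_t tsId;        /* tablespace Oid */
    uint32_t relNode;     /* relation Oid */
    uint32_t forkNum;     /* fork number: only a handful of possible values */
    uint32_t blockNum;    /* block within the fork */
    int32_t  buf_id;      /* buffer id */
} SortItemWide;           /* 5 x 4 = 20 bytes */

/* the suggested compression: fork and a small tablespace number as shorts */
typedef struct SortItemPacked
{
    uint32_t relNode;     /* relation Oid */
    uint32_t blockNum;    /* block within the fork */
    uint16_t tsIndex;     /* per-checkpoint tablespace number, not the Oid */
    uint16_t forkNum;     /* fits easily in 16 bits (or even 2) */
    int32_t  buf_id;      /* buffer id */
} SortItemPacked;         /* 4 + 4 + 2 + 2 + 4 = 16 bytes */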


* CkptTsStatus:

As I suggested in the other mail, I think that this structure should also keep
a per tablespace WritebackContext so that coalescing is done per tablespace.

ISTM that "progress" and "progress_slice" only depend on num_scanned and
per-tablespace num_to_scan and total num_to_scan, so they are somehow
redundant and the progress could be recomputed from the initial figures
when needed.

If these fields are kept, I think that a comment should justify why float8 
precision is okay for the purpose. I think it is quite certainly fine in 
the worst case with 32 bits buffer_ids, but it would not be if this size 
is changed someday.


* BufferSync

After a first sweep to collect buffers to write, they are sorted, and then 
those buffers are swept again to compute some per-tablespace data 
and organise a heap.


ISTM that nearly all of the collected data on the second sweep could be 
collected on the first sweep, so that this second sweep could be avoided 
altogether. The only missing data is the index of the first buffer in the 
array, which can be computed by considering tablespaces only; sweeping 
over buffers is not needed for that. That would suggest creating the heap or 
using a hash in the initial buffer sweep to keep this information. This would 
also provide a natural point at which to number tablespaces for compressing 
the CkptSortItem struct.
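
Something like this could be maintained during the first sweep (a sketch
under my assumptions, with made-up names, not the patch's code); the caller
would do ts_lookup(tab, &nts, ts_oid)->num_to_scan++ for each collected
buffer and call ts_assign_offsets() once after the sort, which orders by
tablespace first:

#include <stdint.h>

typedef struct TsAcct
{
    uint32_t    ts_oid;         /* tablespace Oid */
    int         num_to_scan;    /* buffers collected for this tablespace */
    int         first_index;    /* start offset in the sorted item array */
} TsAcct;

/* Tablespaces are few, so a linear search over a small array is good
 * enough; the array index can double as the "tablespace number" used by
 * a compressed sort item.  The array is assumed large enough. */
static TsAcct *
ts_lookup(TsAcct *tab, int *nts, uint32_t ts_oid)
{
    for (int i = 0; i < *nts; i++)
        if (tab[i].ts_oid == ts_oid)
            return &tab[i];
    tab[*nts].ts_oid = ts_oid;
    tab[*nts].num_to_scan = 0;
    tab[*nts].first_index = 0;
    return &tab[(*nts)++];
}

/* After sorting: the start offsets follow from the tablespaces alone,
 * no second sweep over the buffer array is needed. */
static void
ts_assign_offsets(TsAcct *tab, int nts)
{
    int         offset = 0;

    for (int i = 0; i < nts; i++)
    {
        tab[i].first_index = offset;
        offset += tab[i].num_to_scan;
    }
}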


I'm wondering about calling CheckpointWriteDelay on each round; maybe
a minimum amount of writes would make sense. This remark is independent of 
this patch. Probably it works fine because after a sleep the checkpointer 
is behind enough that it will write a bunch of buffers before sleeping
again.

I see a binary_heap_allocate but no corresponding deallocation; this
looks like a memory leak... or is there some magic involved?

There are some debug stuff to remove in #ifdefs.

I think that the buffer/README should be updated with explanations about
sorting in the checkpointer.


I think this patch primarily needs:
* Benchmarking on FreeBSD/OSX to see whether we should enable the
 mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
 inclined to leave it off till then.


I do not have that. As "msync" seems available on Linux, it is possible to 
force using it with an "#if 0" to skip sync_file_range and check whether 
it does some good there. Idem for the "posix_fadvise" stuff. I can try to 
do that, but it takes time to do so; if someone can test on another OS it 
would be much better. I think that if it works it should be kept in, so it 
is just a matter of testing it.
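
For reference, the msync path boils down to something like this (a sketch
only, error handling trimmed and offsets assumed page-aligned; not the
patch's pg_flush_data):

#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>

static int
flush_range_msync(int fd, off_t offset, size_t nbytes)
{
    void       *p = mmap(NULL, nbytes, PROT_WRITE, MAP_SHARED, fd, offset);

    if (p == MAP_FAILED)
        return -1;
    /* MS_ASYNC: schedule writeback of the dirty pages, do not wait */
    if (msync(p, nbytes, MS_ASYNC) != 0)
    {
        (void) munmap(p, nbytes);
        return -1;
    }
    return munmap(p, nbytes);
}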


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-21 Thread Fabien COELHO


Hallo Andres,

[...] I do think that this whole writeback logic really does make sense 
*per table space*,


Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other.  For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.


I do not understand/buy this argument.

The underlying IO queue is per device, and table spaces should be per device 
as well (otherwise what's the point?), so you should want to coalesce and 
"writeback" pages per device as well. sync_file_range calls on distinct 
devices can probably be issued more or less independently, and should not 
interfere with one another.


If you use just one context, the more table spaces there are the smaller the 
performance gain, because there is less and less aggregation, and thus fewer 
sequential writes, per device.


So for me there should really be one context per tablespace. That would 
suggest a hashtable or some other structure to keep and retrieve them, which 
would not be that bad, and I think that it is what is needed.


Note: I think that an easy way to do that in the "checkpoint sort" patch 
is simply to keep a WritebackContext in CkptTsStatus structure which is 
per table space in the checkpointer.
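
Roughly like this (a sketch with made-up field names, not the patch's
structs):

#include <stdint.h>

typedef struct SketchWritebackContext
{
    int        *max_pending;    /* points at the relevant *_flush_after GUC */
    int         nr_pending;     /* writes accumulated, not yet flushed */
    /* ... array of pending buffer tags ... */
} SketchWritebackContext;

typedef struct SketchCkptTsStatus
{
    uint32_t    tsId;           /* tablespace Oid */
    int         num_to_scan;    /* buffers to write for this tablespace */
    int         num_scanned;    /* buffers processed so far */
    SketchWritebackContext wb_context;  /* per-tablespace coalescing state */
} SketchCkptTsStatus;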


For bgwriter & backends it can wait: there is little "writeback" coalescing 
because the IO should be pretty random, so it does not matter much.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-20 Thread Fabien COELHO


Hallo Andres,


In some previous version I think a warning was shown if the feature was
requested but not available.


I think we should either silently ignore it, or error out. Warnings
somewhere in the background aren't particularly meaningful.


I like "ignoring with a warning" in the log file, because when things do 
not behave as expected that is where I'll be looking. I do not think that 
it should error out.


The sgml documentation about "*_flush_after" configuration parameter 
talks about bytes, but the actual unit should be buffers.


The unit actually is buffers, but you can configure it using
bytes. We've done the same for other GUCs (shared_buffers, wal_buffers,
...). Referring to bytes is easier because you don't have to explain that
how much data it actually is depends on compilation settings and
such.


So I understand that it works with kB as well. Now I do not think that it 
would need a lot of explanation if you say that it is a number of pages, 
and I think that a number of pages is significant because it is a number 
of IO requests to be coalesced, eventually.
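
For example, with the default 8kB block size both spellings below should come
down to the same thing (GUC name as in the patch's commit message, values
purely illustrative):

    checkpointer_flush_after = 32       # 32 buffers
    # or equivalently, with 8kB blocks:
    checkpointer_flush_after = 256kB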



In the discussion in the wal section, I'm not sure about the effect of
setting writebacks on SSD, [...]


Yea, that paragraph needs some editing. I think we should basically
remove that last sentence.


Ok, fine with me. Does that mean that flushing has a significant positive 
impact on SSD in your tests?


However it does not address the point that bgwriter and backends 
basically issue random writes, [...]


The benefit is primarily that you don't collect large amounts of dirty
buffers in the kernel page cache. In most cases the kernel will not be
able to coalesce these writes either...  I've measured *massive*
performance latency differences for workloads that are bigger than
shared buffers - because suddenly bgwriter / backends do the majority of
the writes. Flushing in the checkpoint quite possibly makes nearly no
difference in such cases.


So I understand that there is a positive impact under some load. Good!


Maybe the merging strategy could be more aggressive than just strict
neighbors?


I don't think so. If you flush more than neighbouring writes you'll
often end up flushing buffers dirtied by another backend, causing
additional stalls.


Ok. Maybe the neighbour definition could be relaxed just a little bit so 
that small holes are skipped over, but not large ones? If there are only a few 
pages in between, even if written by another process, then writing them 
together should be better? Well, this can wait for a clear case, because 
hopefully the OS will recoalesce them behind the scenes anyway.
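
To make the idea concrete, a coalescing pass over the already-sorted pending
writes could look like this (a sketch with made-up types, not the patch's
IssuePendingWritebacks; max_gap = 0 is the strict-neighbour rule, a small
positive value would also swallow small holes):

#include <stddef.h>
#include <stdint.h>

typedef struct PendingWrite
{
    int         fd;             /* file the block was written to */
    uint32_t    block;          /* block number within that file */
} PendingWrite;

typedef void (*issue_fn) (int fd, uint32_t first_block, uint32_t nblocks);

/* Walk a sorted array of pending writes and report coalesced ranges. */
static void
coalesce_writes(const PendingWrite *w, size_t n, uint32_t max_gap,
                issue_fn issue)
{
    size_t      i = 0;

    while (i < n)
    {
        int         fd = w[i].fd;
        uint32_t    first = w[i].block;
        uint32_t    last = first;

        /* extend the run while blocks stay in the same file and close by */
        while (i + 1 < n && w[i + 1].fd == fd &&
               w[i + 1].block <= last + 1 + max_gap)
        {
            last = w[i + 1].block;
            i++;
        }
        issue(fd, first, last - first + 1);     /* one writeback request */
        i++;
    }
}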



struct WritebackContext: keeping a pointer to guc variables is a kind of
trick, I think it deserves a comment.


It has, it's just in WritebackContextInit(). Can duplicate it.


I missed it, I expected something in the struct definition. Do not 
duplicate, but cross reference it?



IssuePendingWritebacks: I understand that qsort is needed "again"
because when balancing writes over tablespaces they may be intermixed.


Also because the infrastructure is used for more than checkpoint
writes. There's absolutely no ordering guarantees there.


Yep, but not much benefit to expect from a few dozen random pages either.

[...] I do think that this whole writeback logic really does make sense 
*per table space*,


Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other.  For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.


I do not understand/buy this argument.

The underlying IO queue is per device, and table spaces should be per 
device as well (otherwise what's the point?), so you should want to coalesce 
and "writeback" pages per device as well. sync_file_range calls on 
distinct devices can probably be issued more or less independently, and 
should not interfere with one another.


If you use just one context, the more table spaces there are the smaller 
the performance gain, because there is less and less aggregation, and thus 
fewer sequential writes, per device.


So for me there should really be one context per tablespace. That would 
suggest a hashtable or some other structure to keep and retrieve them, 
which would not be that bad, and I think that it is what is needed.



For the checkpointer, a key aspect is that the scheduling process goes
to sleep from time to time, and this sleep time looked like a great
opportunity to do this kind of flushing. You choose not to take advantage
of the behavior, why?


Several reasons: Most importantly there's absolutely no guarantee that 
you'll ever end up sleeping, it's quite common for it to happen only seldom.


Well, that would be in a situation where pg is completely unresponsive. 
More so, this behavior *makes* pg unresponsive.



If you're bottlenecked on IO, you can end up being behind all the time.


Hopefully 

Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-20 Thread Robert Haas
On Sun, Feb 21, 2016 at 3:37 AM, Andres Freund  wrote:
>> The documentation seems to use "flush" but the code talks about "writeback"
>> or "flush", depending. I think one vocabulary, whichever it is, should be
>> chosen and everything should stick to it, otherwise everything looks kind of
>> fuzzy and raises doubt for the reader (is it the same thing? is it something
>> else?). I initially used "flush", but it seems a bad idea because it has
>> nothing to do with the flush function, so I'm fine with writeback or anything
>> else, I just think that *one* word should be chosen and used everywhere.
>
> Hm.

I think there might be a semantic distinction between these two terms.
Doesn't writeback mean writing pages to disk, and flushing mean making
sure that they are durably on disk?  So for example when the Linux
kernel thinks there is too much dirty data, it initiates writeback,
not a flush; on the other hand, at transaction commit, we initiate a
flush, not writeback.
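
In Linux terms the distinction would be roughly the following (illustrative
only, not code taken from the patch):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* "writeback": ask the kernel to start writing the range out, without
 * waiting and without any durability guarantee. */
static void
start_writeback(int fd, off_t offset, off_t nbytes)
{
    (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/* "flush": make the data durable before returning, as at commit. */
static void
ensure_durable(int fd)
{
    (void) fdatasync(fd);
}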

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-20 Thread Andres Freund
Hi,

On 2016-02-20 20:56:31 +0100, Fabien COELHO wrote:
> >* Currently *_flush_after can be set to a nonzero value, even if there's
> > no support for flushing on that platform. Imo that's ok, but perhaps
> > other people's opinion differ.
> 
> In some previous version I think a warning was shown if the feature was
> requested but not available.

I think we should either silently ignore it, or error out. Warnings
somewhere in the background aren't particularly meaningful.

> Here are some quick comments on the patch:
> 
> Patch applies cleanly on head. Compiled and checked on Linux. Compilation
> issues on other systems, see below.

For those I've already pushed a small fixup commit to git... Stupid
mistake.


> The documentation seems to use "flush" but the code talks about "writeback"
> or "flush", depending. I think one vocabulary, whichever it is, should be
> chosen and everything should stick to it, otherwise everything looks kind of
> fuzzy and raises doubt for the reader (is it the same thing? is it something
> else?). I initially used "flush", but it seems a bad idea because it has
> nothing to do with the flush function, so I'm fine with writeback or anything
> else, I just think that *one* word should be chosen and used everywhere.

Hm.


> The sgml documentation about "*_flush_after" configuration parameter talks
> about bytes, but the actual unit should be buffers.

The unit actually is buffers, but you can configure it using
bytes. We've done the same for other GUCs (shared_buffers, wal_buffers,
...). Referring to bytes is easier because you don't have to explain that
how much data it actually is depends on compilation settings and
such.

> Also, the maximum value (128 ?) should appear in the text.

Right.


> In the discussion in the wal section, I'm not sure about the effect of
> setting writebacks on SSD, but I think that you have made some tests
> so maybe you have an answer and the corresponding section could be
> written with some more definitive text than "probably brings no
> benefit on SSD".

Yea, that paragraph needs some editing. I think we should basically
remove that last sentence.


> A good point of the whole approach is that it is available to all kinds
> of pg processes.

Exactly.


> However it does not address the point that bgwriter and
> backends basically issue random writes, so I would not expect much positive
> effect before these writes are somehow sorted, which means doing some
> compromise in the LRU/LFU logic...

The benefit is primarily that you don't collect large amounts of dirty
buffers in the kernel page cache. In most cases the kernel will not be
able to coalesce these writes either...  I've measured *massive*
performance latency differences for workloads that are bigger than
shared buffers - because suddenly bgwriter / backends do the majority of
the writes. Flushing in the checkpoint quite possibly makes nearly no
difference in such cases.


> well, all this is best kept for later, and I'm fine to have the
> flushing logic there. I'm wondering why you choose 16 & 64 as default
> for backends & bgwriter, though.

I chose a small value for backends because there often are a large
number of backends, and thus the amount of dirty data of each adds up. I
used a larger value for bgwriter because I saw it ending up using
bigger IOs.


> IssuePendingWritebacks: you merge only strictly neighbouring writes.
> Maybe the merging strategy could be more aggressive than just strict
> neighbors?

I don't think so. If you flush more than neighbouring writes you'll
often end up flushing buffers dirtied by another backend, causing
additional stalls. And if the writes aren't actually neighbouring
there's not much gained from issuing them in one sync_file_range call.


> mdwriteback: all variables could be declared within the while, I do not
> understand why some are in and some are out.

Right.


> ISTM that putting writeback management at the relation level does not
> help a lot, because you have to translate again from relation to
> files.

Sure, but what's the problem with that? That's how normal read/write IO
works as well?


> struct WritebackContext: keeping a pointer to guc variables is a kind of
> trick, I think it deserves a comment.

It has, it's just in WritebackContextInit(). Can duplicate it.


> ScheduleBufferTagForWriteback: the "pending" variable is not very
> useful.

Shortens line length a good bit, at no cost.



> IssuePendingWritebacks: I understand that qsort is needed "again"
> because when balancing writes over tablespaces they may be intermixed.

Also because the infrastructure is used for more than checkpoint
writes. There's absolutely no ordering guarantees there.


> AFAICR I used a "flush context" for each table space in some version
> I submitted, because I do think that this whole writeback logic really
> does make sense *per table space*, which suggest that there should be as
> many write backs contexts as table spaces, otherwise the 

Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-20 Thread Fabien COELHO


Hello Andres,


For 0001 I've recently changed:
* Don't schedule writeback after smgrextend() - that defeats linux
 delayed allocation mechanism, increasing fragmentation noticeably.
* Add docs for the new GUC variables
* comment polishing
* BackendWritebackContext now isn't dynamically allocated anymore


I think this patch primarily needs:
* review of the docs, not sure if they're easy enough to
 understand. Some language polishing might also be needed.


Yep, see below.


* review of the writeback API, combined with the smgr/md.c changes.


See various comments below.


* Currently *_flush_after can be set to a nonzero value, even if there's
 no support for flushing on that platform. Imo that's ok, but perhaps
 other people's opinion differ.


In some previous version I think a warning was shown if the feature was
requested but not available.


Here are some quick comments on the patch:

Patch applies cleanly on head. Compiled and checked on Linux. Compilation 
issues on other systems, see below.


When pages are written by a process (checkpointer, bgwriter, backend worker),
the list of recently written pages is kept and every so often an advisory
fsync (sync_file_range, other options for other systems) is issued so that
the data is sent to the io system without relying on more or less
(un)controllable os policy.
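
In pseudo-C the pattern amounts to something like the following (a sketch
under my assumptions, with made-up names and sizes, not the patch's code):

#include <stdlib.h>

typedef struct PendingTag
{
    int         fd;
    unsigned    block;
} PendingTag;

#define MAX_PENDING 256

static PendingTag pending[MAX_PENDING];
static int  npending = 0;
static int  flush_after = 32;   /* stands in for a *_flush_after GUC,
                                 * assumed <= MAX_PENDING */

extern void issue_writeback(const PendingTag *tags, int n); /* e.g. sync_file_range */

static int
cmp_tag(const void *a, const void *b)
{
    const PendingTag *ta = a;
    const PendingTag *tb = b;

    if (ta->fd != tb->fd)
        return (ta->fd < tb->fd) ? -1 : 1;
    if (ta->block != tb->block)
        return (ta->block < tb->block) ? -1 : 1;
    return 0;
}

/* Called after each write; once enough writes have accumulated, sort
 * them so neighbours can be coalesced, hand them to the kernel, reset. */
static void
schedule_writeback(int fd, unsigned block)
{
    pending[npending].fd = fd;
    pending[npending].block = block;
    npending++;

    if (npending >= flush_after)
    {
        qsort(pending, npending, sizeof(PendingTag), cmp_tag);
        issue_writeback(pending, npending);
        npending = 0;
    }
}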

The documentation seems to use "flush" but the code talks about "writeback"
or "flush", depending. I think one vocabulary, whichever it is, should be
chosen and everything should stick to it, otherwise everything looks kind of
fuzzy and raises doubt for the reader (is it the same thing? is it something
else?). I initially used "flush", but it seems a bad idea because it has
nothing to do with the flush function, so I'm fine with writeback or anything
else, I just think that *one* word should be chosen and used everywhere.

The sgml documentation about "*_flush_after" configuration parameter talks
about bytes, but the actual unit should be buffers. I think that keeping
a number of buffers should be fine, because that is what the internal stuff
will manage, not bytes. Also, the maximum value (128 ?) should appear in
the text. In the discussion in the wal section, I'm not sure about the effect
of setting writebacks on SSD, but I think that you have made some tests so
maybe you have an answer and the corresponding section could be written with
some more definitive text than "probably brings no benefit on SSD".

A good point of the whole approach is that it is available to all kinds
of pg processes. However it does not address the point that bgwriter and
backends basically issue random writes, so I would not expect much positive
effect before these writes are somehow sorted, which means doing some
compromise in the LRU/LFU logic... well, all this is best kept for later,
and I'm fine to have the flushing logic there. I'm wondering why you
choose 16 & 64 as default for backends & bgwriter, though.

IssuePendingWritebacks: you merge only strictly neighbouring writes.
Maybe the merging strategy could be more aggressive than just strict
neighbors?

mdwriteback: all variables could be declared within the while, I do not
understand why some are in and some are out. ISTM that putting writeback
management at the relation level does not help a lot, because you have to
translate again from relation to files. The good news is that it should work
as well, and that it does avoid the issue that the file may have been closed
in between, so why not.

The PendingWriteback struct looks useless. I think it should be removed,
and maybe put back one day if it is needed, which I rather doubt.

struct WritebackContext: keeping a pointer to guc variables is a kind of
trick, I think it deserves a comment.

ScheduleBufferTagForWriteback: the "pending" variable is not very useful.
Maybe consider shortening the "pending_writebacks" field name to "writebacks"?

IssuePendingWritebacks: I understand that qsort is needed "again"
because when balancing writes over tablespaces they may be intermixed.
AFAICR I used a "flush context" for each table space in some version
I submitted, because I do think that this whole writeback logic really
does make sense *per table space*, which suggests that there should be as
many writeback contexts as table spaces, otherwise the positive effect
may be totally lost if table spaces are used. Any thoughts?

Assert(*context->max_pending <= WRITEBACK_MAX_PENDING_FLUSHES); is always
true, I think; it is already checked at initialization and when setting
GUCs.

SyncOneBuffer: I wonder why you copy the tag after releasing the lock.
I guess it is okay because it is still pinned.

pg_flush_data: in the first #elif, "context" is undeclared line 446.
Label "out" is not defined line 455. In the second #elif, "context" is
undeclared line 490 and label "out" line 500 is not defined either.

For the checkpointer, a key aspect is that the scheduling process goes
to sleep from time to 

Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-19 Thread Andres Freund
On 2016-02-19 22:46:44 +0100, Fabien COELHO wrote:
> 
> Hello Andres,
> 
> >Here's the next two (the most important) patches of the series:
> >0001: Allow to trigger kernel writeback after a configurable number of 
> >writes.
> >0002: Checkpoint sorting and balancing.
> 
> I will look into these two in depth.
> 
> Note that I would have ordered them in reverse because sorting is nearly
> always very beneficial, and "writeback" (formerly called flushing) is then
> nearly always very beneficial on sorted buffers.

I had it that way earlier. I actually saw pretty large regressions from
sorting alone in some cases as well, apparently because the kernel
submits much larger IOs to disk; although that probably only shows on
SSDs.  This way the modifications imo look a trifle better ;). I'm
intending to commit both at the same time, keeping them separate only
because they're easier to understand separately.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-19 Thread Fabien COELHO


Hello Andres,


Here's the next two (the most important) patches of the series:
0001: Allow to trigger kernel writeback after a configurable number of writes.
0002: Checkpoint sorting and balancing.


I will look into these two in depth.

Note that I would have ordered them in reverse because sorting is nearly 
always very beneficial, and "writeback" (formerly called flushing) is then 
nearly always very beneficial on sorted buffers.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-19 Thread Andres Freund
On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> Hi,
> 
> Fabien asked me to post a new version of the checkpoint flushing patch
> series. While this isn't entirely ready for commit, I think we're
> getting closer.
> 
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

I've updated the git tree.

Here's the next two (the most important) patches of the series:
0001: Allow to trigger kernel writeback after a configurable number of writes.
0002: Checkpoint sorting and balancing.

For 0001 I've recently changed:
* Don't schedule writeback after smgrextend() - that defeats linux
  delayed allocation mechanism, increasing fragmentation noticeably.
* Add docs for the new GUC variables
* comment polishing
* BackendWritebackContext now isn't dynamically allocated anymore


I think this patch primarily needs:
* review of the docs, not sure if they're easy enough to
  understand. Some language polishing might also be needed.
* review of the writeback API, combined with the smgr/md.c changes.
* Currently *_flush_after can be set to a nonzero value, even if there's
  no support for flushing on that platform. Imo that's ok, but perhaps
  other people's opinion differ.


For 0002 I've recently changed:
* Removed the sort timing information, we've proven sufficiently that
  it doesn't take a lot of time.
* Minor comment polishing.

I think this patch primarily needs:
* Benchmarking on FreeBSD/OSX to see whether we should enable the
  mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
  inclined to leave it off till then.


Regards,

Andres
>From 58aee659417372f3dda4420d8f2a4f4d41c56d31 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 19 Feb 2016 12:13:05 -0800
Subject: [PATCH 1/4] Allow to trigger kernel writeback after a configurable
 number of writes.

Currently writes to the main data files of postgres all go through the
OS page cache. This means that currently several operating systems can
end up collecting a large number of dirty buffers in their respective
page caches.  When these dirty buffers are flushed to storage rapidly,
be it because of fsync(), timeouts, or dirty ratios, latency for other
writes can increase massively.  This is the primary reason for regular
massive stalls observed in real world scenarios and artificial
benchmarks; on rotating disks stalls on the order of hundreds of seconds
have been observed.

On linux it is possible to control this by reducing the global dirty
limits significantly, reducing the above problem. But global
configuration is rather problematic because it'll affect other
applications; also PostgreSQL itself doesn't always generally want this
behavior, e.g. for temporary files it's undesirable.

Several operating systems allow some control over the kernel page
cache. Linux has sync_file_range(2), several posix systems have msync(2)
and posix_fadvise(2). sync_file_range(2) is preferable because it
requires no special setup, whereas msync() requires the to-be-flushed
range to be mmap'ed. For the purpose of flushing dirty data
posix_fadvise(2) is the worst alternative, as flushing dirty data is
just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages
from the page cache.  Thus the feature is enabled by default only on
linux, but can be enabled on all systems that have any of the above
APIs.

With the infrastructure added, writes made via checkpointer, bgwriter
and normal user backends can be flushed after a configurable number of
writes. Each of these sources of writes is controlled by a separate GUC,
checkpointer_flush_after, bgwriter_flush_after and backend_flush_after
respectively; they're separate because the number of flushes that is
beneficial differs for each, and because the performance considerations of
controlled flushing for each of these are different.

A later patch will add checkpoint sorting - after that flushes from the
checkpoint will almost always be desirable. Bgwriter flushes are most of
the time going to be random, which is slow on lots of storage hardware.
Flushing in backends works well if the storage and bgwriter can keep up,
but if not it can have negative consequences.  This patch is likely to
have negative performance consequences without checkpoint sorting, but
unfortunately so has sorting without flush control.

TODO:
* verify msync codepath
* properly detect mmap() && msync(MS_ASYNC) support, use it by default
  if available and sync_file_range is *not* available

Discussion: alpine.DEB.2.10.150601132.28433@sto
Author: Fabien Coelho and Andres Freund
---
 doc/src/sgml/config.sgml  |  81 +++
 doc/src/sgml/wal.sgml |  13 +++
 src/backend/postmaster/bgwriter.c |   8 +-
 src/backend/storage/buffer/buf_init.c |   5 +