Re: [HACKERS] checkpointer continuous flushing - V18
On 3/13/16 6:30 PM, Peter Geoghegan wrote:
> On Sat, Mar 12, 2016 at 5:21 PM, Jeff Janes wrote:
>> Would the wiki be a good place for such tips? Not as formal as the documentation, and more centralized (and editable) than a collection of blog posts.
> That general direction makes sense, but I'm not sure if the Wiki is something that this will work for. I fear that it could become something like the TODO list page: a page that contains theoretically accurate information, but isn't very helpful. The TODO list needs to be heavily pruned, but that seems like something that will never happen. A centralized location for performance tips will probably only work well if there are still high standards that are actively enforced. There still needs to be tight editorial control.

I think there are ways to significantly restrict who can edit a page, so this could probably still be done via the wiki. IMO we should also be encouraging users to test various tips and provide feedback, so maybe a wiki page with a big fat request at the top asking users to submit any feedback about the page to -performance.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] checkpointer continuous flushing - V18
On Sat, Mar 12, 2016 at 5:21 PM, Jeff Janes wrote:
> Would the wiki be a good place for such tips? Not as formal as the documentation, and more centralized (and editable) than a collection of blog posts.

That general direction makes sense, but I'm not sure if the Wiki is something that this will work for. I fear that it could become something like the TODO list page: a page that contains theoretically accurate information, but isn't very helpful. The TODO list needs to be heavily pruned, but that seems like something that will never happen. A centralized location for performance tips will probably only work well if there are still high standards that are actively enforced. There still needs to be tight editorial control.
--
Peter Geoghegan
Re: [HACKERS] checkpointer continuous flushing - V18
On Thu, Mar 10, 2016 at 11:25 PM, Peter Geoghegan wrote:
> On Thu, Mar 10, 2016 at 11:18 PM, Fabien COELHO wrote:
>> I can only concur!
>>
>> The "Performance Tips" chapter (II.14) is more user/query oriented. The "Server Administration" book (III) does not discuss this much.
>
> That's definitely one area in which the docs are lacking -- I've heard several complaints about this myself. I think we've been hesitant to do more in part because the docs must always be categorically correct, and must not use weasel words. I think it's hard to talk about performance while maintaining the general tone of the documentation. I don't know what can be done about that.

Would the wiki be a good place for such tips? Not as formal as the documentation, and more centralized (and editable) than a collection of blog posts.

Cheers,

Jeff
Re: [HACKERS] checkpointer continuous flushing - V18
On Thu, Mar 10, 2016 at 11:18 PM, Fabien COELHO wrote:
> I can only concur!
>
> The "Performance Tips" chapter (II.14) is more user/query oriented. The "Server Administration" book (III) does not discuss this much.

That's definitely one area in which the docs are lacking -- I've heard several complaints about this myself. I think we've been hesitant to do more in part because the docs must always be categorically correct, and must not use weasel words. I think it's hard to talk about performance while maintaining the general tone of the documentation. I don't know what can be done about that.
--
Peter Geoghegan
Re: [HACKERS] checkpointer continuous flushing - V18
>> As you wish. I thought that understanding the underlying performance model with sequential writes written in chunks is important for the admin, and as this guc would have an impact on performance it should be hinted about, including the limits of its effect where large bases will converge to random io performance. But maybe that is not the right place.
>
> I do agree that that's something interesting to document somewhere. But I don't think any of the current places in the documentation are a good fit, and it's a topic much more general than the feature we're debating here. I'm not volunteering, but a good discussion of storage and the interactions with postgres surely would be a significant improvement to the postgres docs.

I can only concur!

The "Performance Tips" chapter (II.14) is more user/query oriented. The "Server Administration" book (III) does not discuss this much.

There is a wiki about performance tuning, but it is not integrated into the documentation. It could be a first documentation source. Also the READMEs in some development directories are very interesting, although they contain too many details about the implementation. There have been a lot of presentations over the years, and blog posts.
--
Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-11 00:23:56 +0100, Fabien COELHO wrote:
> As you wish. I thought that understanding the underlying performance model with sequential writes written in chunks is important for the admin, and as this guc would have an impact on performance it should be hinted about, including the limits of its effect where large bases will converge to random io performance. But maybe that is not the right place.

I do agree that that's something interesting to document somewhere. But I don't think any of the current places in the documentation are a good fit, and it's a topic much more general than the feature we're debating here. I'm not volunteering, but a good discussion of storage and the interactions with postgres surely would be a significant improvement to the postgres docs.

- Andres
Re: [HACKERS] checkpointer continuous flushing - V18
>> [...] If the default is in pages, maybe you could state it and afterwards translate it in size.
>
> Hm, I think that's more complicated for users than it's worth.

As you wish. I liked the number of pages you used initially because it really gives a hint of how many random IOs are avoided when they are contiguous, and I just do not have the same intuition with sizes. Also it is related to the IO queue length managed by the OS.

>> The text could say something about sequential writes performance because pages are sorted.., but that it is lost for large bases and/or short checkpoints ?
>
> I think that's an implementation detail.

As you wish. I thought that understanding the underlying performance model with sequential writes written in chunks is important for the admin, and as this guc would have an impact on performance it should be hinted about, including the limits of its effect where large bases will converge to random io performance. But maybe that is not the right place.
--
Fabien
Re: [HACKERS] checkpointer continuous flushing - V18
Hello Andres,

>> I'm not sure I've seen these performance... If you have hard evidence, please feel free to share it.
>
> Man, are you intentionally trying to be hard to work with?

Sorry, I do not understand this remark. You were referring to some latency measures in your answer, and I was just stating that I was interested in seeing these figures which were used to justify your choice to keep a shared writeback context. I did not intend this wish to be an issue, I was expressing an interest.

> To quote the email you responded to:
>> My current plan is to commit this with the current behaviour (as in this week[end]), and then do some actual benchmarking on this specific part. It's imo a relatively minor detail.

Good. From the evidence in the thread, I would have given the per-tablespace context the preference, but this is just a personal opinion and I agree that it can work the other way around. I look forward to seeing these benchmarks later on, when you have them.

So all is well, and hopefully will be even better later on.
--
Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-10 23:43:46 +0100, Fabien COELHO wrote:
>> Whenever more than bgwriter_flush_after bytes have been written by the bgwriter, attempt to force the OS to issue these writes to the underlying storage. Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of a checkpoint, or when the OS writes data back in larger batches in the background. Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than shared_buffers, but smaller than the OS's page cache, where performance might degrade. This setting may have no effect on some platforms. 0 disables controlled writeback. The default is 256kB on Linux, 0 otherwise. This parameter can only be set in the postgresql.conf file or on the server command line.
>>
>> (plus adjustments for the other gucs)
>
> What about the maximum value?

Added.

bgwriter_flush_after (int)
bgwriter_flush_after configuration parameter

Whenever more than bgwriter_flush_after bytes have been written by the bgwriter, attempt to force the OS to issue these writes to the underlying storage. Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of a checkpoint, or when the OS writes data back in larger batches in the background. Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than shared_buffers, but smaller than the OS's page cache, where performance might degrade. This setting may have no effect on some platforms. The valid range is between 0, which disables controlled writeback, and 2MB. The default is 256kB on Linux, 0 elsewhere. (Non-default values of BLCKSZ change the default and maximum.) This parameter can only be set in the postgresql.conf file or on the server command line.

> If the default is in pages, maybe you could state it and afterwards translate it in size.

Hm, I think that's more complicated for users than it's worth.

> The text could say something about sequential writes performance because pages are sorted.., but that it is lost for large bases and/or short checkpoints ?

I think that's an implementation detail.

- Andres
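The "attempt to force the OS to issue these writes" behaviour described in the text above can be sketched in a few lines of standalone C. This is an illustration only, not the actual bgwriter code: the function name is invented, error handling is omitted, and an 8kB block size and Linux's sync_file_range() are assumed.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Illustrative only: write 8kB pages and, once more than flush_after bytes
 * have accumulated, ask the kernel to start writing them back so dirty data
 * does not pile up until the final fsync. */
static void
write_with_writeback_limit(int fd, const char *page, int npages,
                           size_t flush_after)
{
    size_t  pending = 0;

    for (int i = 0; i < npages; i++)
    {
        pwrite(fd, page, 8192, (off_t) i * 8192);
        pending += 8192;

        if (flush_after > 0 && pending >= flush_after)
        {
            /* start writeback of the whole file; offset=0, nbytes=0 means
             * "from offset to the end of the file" */
            sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
            pending = 0;
        }
    }

    fsync(fd);      /* durability still requires the final flush */
}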
Re: [HACKERS] checkpointer continuous flushing - V18
> Whenever more than bgwriter_flush_after bytes have been written by the bgwriter, attempt to force the OS to issue these writes to the underlying storage. Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of a checkpoint, or when the OS writes data back in larger batches in the background. Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than shared_buffers, but smaller than the OS's page cache, where performance might degrade. This setting may have no effect on some platforms. 0 disables controlled writeback. The default is 256kB on Linux, 0 otherwise. This parameter can only be set in the postgresql.conf file or on the server command line.
>
> (plus adjustments for the other gucs)

Some suggestions:

What about the maximum value?

If the default is in pages, maybe you could state it and afterwards translate it in size. "The default is 64 pages on Linux (usually 256kB)..."

The text could say something about sequential writes performance because pages are sorted.., but that it is lost for large bases and/or short checkpoints ?
--
Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-10 23:38:38 +0100, Fabien COELHO wrote:
> I'm not sure I've seen these performance... If you have hard evidence, please feel free to share it.

Man, are you intentionally trying to be hard to work with? To quote the email you responded to:

> My current plan is to commit this with the current behaviour (as in this week[end]), and then do some actual benchmarking on this specific part. It's imo a relatively minor detail.
Re: [HACKERS] checkpointer continuous flushing - V18
> [...] I had originally kept it with one context per tablespace after refactoring this, but found that it gave worse results in rate limited loads even over only two tablespaces. That's on SSDs though.

Might just mean that a smaller context size is better on SSD, and it could still be better per table space.

> The number of pages still in writeback (i.e. for which sync_file_range has been issued, but which haven't finished running yet) at the end of the checkpoint matters for the latency hit incurred by the fsync()s from smgrsync(); at least by my measurement.

I'm not sure I've seen these performance... If you have hard evidence, please feel free to share it.
--
Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-10 17:33:33 -0500, Robert Haas wrote:
> On Thu, Mar 10, 2016 at 5:24 PM, Andres Freund wrote:
>> On 2016-02-21 09:49:53 +0530, Robert Haas wrote:
>>> I think there might be a semantic distinction between these two terms. Doesn't writeback mean writing pages to disk, and flushing mean making sure that they are durably on disk? So for example when the Linux kernel thinks there is too much dirty data, it initiates writeback, not a flush; on the other hand, at transaction commit, we initiate a flush, not writeback.
>>
>> I don't think terminology is sufficiently clear to make such a distinction. Take e.g. our FlushBuffer()...
>
> Well then we should clarify it!

Trying that as we speak, err, write. How about:

Whenever more than bgwriter_flush_after bytes have been written by the bgwriter, attempt to force the OS to issue these writes to the underlying storage. Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of a checkpoint, or when the OS writes data back in larger batches in the background. Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than shared_buffers, but smaller than the OS's page cache, where performance might degrade. This setting may have no effect on some platforms. 0 disables controlled writeback. The default is 256kB on Linux, 0 otherwise. This parameter can only be set in the postgresql.conf file or on the server command line.

(plus adjustments for the other gucs)
Re: [HACKERS] checkpointer continuous flushing - V18
On Thu, Mar 10, 2016 at 5:24 PM, Andres Freund wrote:
> On 2016-02-21 09:49:53 +0530, Robert Haas wrote:
>> I think there might be a semantic distinction between these two terms. Doesn't writeback mean writing pages to disk, and flushing mean making sure that they are durably on disk? So for example when the Linux kernel thinks there is too much dirty data, it initiates writeback, not a flush; on the other hand, at transaction commit, we initiate a flush, not writeback.
>
> I don't think terminology is sufficiently clear to make such a distinction. Take e.g. our FlushBuffer()...

Well then we should clarify it! :-)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-21 09:49:53 +0530, Robert Haas wrote:
> I think there might be a semantic distinction between these two terms. Doesn't writeback mean writing pages to disk, and flushing mean making sure that they are durably on disk? So for example when the Linux kernel thinks there is too much dirty data, it initiates writeback, not a flush; on the other hand, at transaction commit, we initiate a flush, not writeback.

I don't think terminology is sufficiently clear to make such a distinction. Take e.g. our FlushBuffer()...
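For what it's worth, the distinction Robert draws maps directly onto Linux syscalls, as in this minimal standalone sketch (made-up file name, error handling omitted, not PostgreSQL code): sync_file_range() only starts writeback of a range and returns, while fdatasync() blocks until the data is durable.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    char    page[8192] = {0};
    int     fd = open("some.data", O_RDWR | O_CREAT, 0600);    /* made-up file */

    pwrite(fd, page, sizeof(page), 0);          /* dirties the kernel page cache */

    /* "writeback": ask the kernel to start writing the range out, return
     * immediately; the data is not necessarily durable afterwards */
    sync_file_range(fd, 0, sizeof(page), SYNC_FILE_RANGE_WRITE);

    /* "flush": block until the file's data is durably on storage */
    fdatasync(fd);

    close(fd);
    return 0;
}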
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-08 09:28:15 +0100, Fabien COELHO wrote:
>>> Now I cannot see how having one context per table space would have a significant negative performance impact.
>>
>> The 'dirty data' etc. limits are global, not per block device. By having several contexts with unflushed dirty data the total amount of dirty data in the kernel increases.
>
> Possibly, but how much? Do you have experimental data to back up that this is really an issue?
>
> We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB of dirty buffers to manage for 16 table spaces, I do not see that as a major issue for the kernel.

We flush in those increments, that doesn't mean there's only that much dirty data. I regularly see one order of magnitude more being dirty.

I had originally kept it with one context per tablespace after refactoring this, but found that it gave worse results in rate limited loads even over only two tablespaces. That's on SSDs though.

> To complete the argument, the 4MB is just a worst case scenario, in reality flushing the different contexts would be randomized over time, so the frequency of flushing a context would be exactly the same in both cases (shared or per table space context) if the checkpoints are the same size, just that with a shared context each flushing potentially targets all tablespaces with a few pages, while with the other version each flushing targets one table space only.

The number of pages still in writeback (i.e. for which sync_file_range has been issued, but which haven't finished running yet) at the end of the checkpoint matters for the latency hit incurred by the fsync()s from smgrsync(); at least by my measurement.

My current plan is to commit this with the current behaviour (as in this week[end]), and then do some actual benchmarking on this specific part. It's imo a relatively minor detail.

Greetings,

Andres Freund
Re: [HACKERS] checkpointer continuous flushing - V18
>>> Now I cannot see how having one context per table space would have a significant negative performance impact.
>>
>> The 'dirty data' etc. limits are global, not per block device. By having several contexts with unflushed dirty data the total amount of dirty data in the kernel increases.
>
> Possibly, but how much? Do you have experimental data to back up that this is really an issue?
>
> We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB of dirty buffers to manage for 16 table spaces, I do not see that as a major issue for the kernel.

More thoughts about your theoretical argument:

To complete the argument, the 4MB is just a worst case scenario, in reality flushing the different contexts would be randomized over time, so the frequency of flushing a context would be exactly the same in both cases (shared or per table space context) if the checkpoints are the same size, just that with a shared context each flushing potentially targets all tablespaces with a few pages, while with the other version each flushing targets one table space only.

So my handwaving analysis is that the flow of dirty buffers is the same with both approaches, but for the shared version buffers are more equally distributed across table spaces, hence reducing sequential write effectiveness, and for the other the dirty buffers are grouped more clearly per table space, so it should get better sequential write performance.
--
Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
Hello Andres,

>> Now I cannot see how having one context per table space would have a significant negative performance impact.
>
> The 'dirty data' etc. limits are global, not per block device. By having several contexts with unflushed dirty data the total amount of dirty data in the kernel increases.

Possibly, but how much? Do you have experimental data to back up that this is really an issue?

We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB of dirty buffers to manage for 16 table spaces, I do not see that as a major issue for the kernel.

> Thus you're more likely to see stalls by the kernel moving pages into writeback.

I do not see the above data having a 30% negative impact on tps, given the quite small amount of data under discussion, and switching to random IOs costs so much that it must really be avoided.

Without further experimental data, I still think that one context per table space is the reasonable choice.
--
Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-07 21:10:19 +0100, Fabien COELHO wrote:
> Now I cannot see how having one context per table space would have a significant negative performance impact.

The 'dirty data' etc. limits are global, not per block device. By having several contexts with unflushed dirty data the total amount of dirty data in the kernel increases. Thus you're more likely to see stalls by the kernel moving pages into writeback.

Andres
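A side note for readers: on Linux those limits are system-wide sysctls under /proc/sys/vm rather than per-device settings, which is why dirty data from several writeback contexts is accounted together. A tiny Linux-only C program (illustration, not part of the patch) can print the relevant knobs:

#include <stdio.h>

/* Print the system-wide dirty-writeback limits; they are global knobs,
 * not per block device, so dirty data from many files is counted together. */
int
main(void)
{
    const char *knobs[] = {
        "/proc/sys/vm/dirty_background_ratio",  /* background writeback starts here */
        "/proc/sys/vm/dirty_ratio",             /* writers start blocking here */
        "/proc/sys/vm/dirty_expire_centisecs",  /* age at which dirty data is written */
    };

    for (int i = 0; i < 3; i++)
    {
        FILE   *f = fopen(knobs[i], "r");
        char    buf[64];

        if (f != NULL && fgets(buf, sizeof(buf), f) != NULL)
            printf("%s = %s", knobs[i], buf);
        if (f != NULL)
            fclose(f);
    }
    return 0;
}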
Re: [HACKERS] checkpointer continuous flushing - V18
Hello Andres,

>> (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
>> per second avg, stddev [ min q1 median d3 max ] <=300tps
>> 679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
>>
>> (2) with 1 tablespace on 1 disk : 956.0 tps
>> per second avg, stddev [ min q1 median d3 max ] <=300tps
>> 956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%
>
> Well, that's not a particularly meaningful workload. You increased the number of flushes to the same number of disks considerably.

It is just a simple workload designed to emphasize the effect of having one context shared for all table spaces instead of one per tablespace, without rewriting the patch and without a large host with multiple disks.

> For a meaningful comparison you'd have to compare using one writeback context for N tablespaces on N separate disks/raids, and using N writeback contexts for the same.

Sure, it would be better to do that, but that would require (1) rewriting the patch, which is a small amount of work, and also (2) having access to a machine with a number of disks/raids, which I do NOT have available.

What happens in the 16-tablespace workload is that much smaller flushes are performed on the 16 files written in parallel, so the tps performance is significantly degraded, despite the writes being sorted in each file. With one tablespace, all buffers flushed are in the same file, so flushes are much more effective.

When the context is shared and checkpointer buffer writes are balanced against table spaces, then when the limit is reached the flushing gets few buffers per tablespace, so this limits sequential writes to a few buffers, hence the performance degradation.

So I can explain the performance degradation *because* the flush context is shared between the table spaces, which is a logical argument backed with experimental data, so it is better than handwaving. Given the available hardware, this is the best proof I can have that the context should be per table space. Now I cannot see how having one context per table space would have a significant negative performance impact. So the logical conclusion for me is that without further experimental data it is better to have one context per table space.

If you have hardware with plenty of disks available for testing, that would provide better data, obviously.
--
Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-22 20:44:35 +0100, Fabien COELHO wrote:
>>> Random updates on 16 tables which total to 1.1GB of data, so this is in buffer, no significant "read" traffic.
>>>
>>> (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
>>> per second avg, stddev [ min q1 median d3 max ] <=300tps
>>> 679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
>>>
>>> (2) with 1 tablespace on 1 disk : 956.0 tps
>>> per second avg, stddev [ min q1 median d3 max ] <=300tps
>>> 956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%
>>
>> Interesting. That doesn't reflect my own tests, even on rotating media, at all. I wonder if it's related to:
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
>>
>> If you use your 12.04 kernel, that'd not be fixed. Which might be a reason to do it as you suggest.
>>
>> Could you share the exact details of that workload?
>
> See attached scripts (sh to create the 16 tables in the default or 16 table spaces, small sql bench script, stat computation script).
>
> The per-second stats were computed with:
>
> grep progress: pgbench.out | cut -d' ' -f4 | avg.py --length=1000 --limit=300
>
> Host is 8 cpu 16 GB, 2 HDD in RAID 1.

Well, that's not a particularly meaningful workload. You increased the number of flushes to the same number of disks considerably.

For a meaningful comparison you'd have to compare using one writeback context for N tablespaces on N separate disks/raids, and using N writeback contexts for the same.

Andres
Re: [HACKERS] checkpointer continuous flushing - V18
>> Random updates on 16 tables which total to 1.1GB of data, so this is in buffer, no significant "read" traffic.
>>
>> (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
>> per second avg, stddev [ min q1 median d3 max ] <=300tps
>> 679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
>>
>> (2) with 1 tablespace on 1 disk : 956.0 tps
>> per second avg, stddev [ min q1 median d3 max ] <=300tps
>> 956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%
>
> Interesting. That doesn't reflect my own tests, even on rotating media, at all. I wonder if it's related to:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
>
> If you use your 12.04 kernel, that'd not be fixed. Which might be a reason to do it as you suggest.
>
> Could you share the exact details of that workload?

See attached scripts (sh to create the 16 tables in the default or 16 table spaces, small sql bench script, stat computation script).

The per-second stats were computed with:

grep progress: pgbench.out | cut -d' ' -f4 | avg.py --length=1000 --limit=300

Host is 8 cpu 16 GB, 2 HDD in RAID 1.
--
Fabien.

[Attachment: ts_create.sh -- Bourne shell script creating the 16 tables in the default or 16 tablespaces]
[Attachment: ts_test.sql -- SQL benchmark script]
[Attachment: avg.py -- Python script computing count/average/stddev/quartiles of the per-second tps; its inlined source is truncated in the archive]
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-22 11:05:20 -0500, Tom Lane wrote:
> Andres Freund writes:
>> Interesting. That doesn't reflect my own tests, even on rotating media, at all. I wonder if it's related to:
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
>>
>> If you use your 12.04 kernel, that'd not be fixed. Which might be a reason to do it as you suggest.
>
> Hmm ... that kernel commit is less than 4 months old. Would it be reflected in *any* production kernels yet?

Probably not - so far I thought it mainly has some performance benefits on relatively extreme workloads; where without the patch, flushing still is better performancewise than not flushing. But in the scenario Fabien has brought up it seems quite possible that sync_file_range emitting "storage cache flush" instructions could explain the rather large performance difference between his and my experiments.

Regards,

Andres
Re: [HACKERS] checkpointer continuous flushing - V18
Andres Freund writes:
> Interesting. That doesn't reflect my own tests, even on rotating media, at all. I wonder if it's related to:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
>
> If you use your 12.04 kernel, that'd not be fixed. Which might be a reason to do it as you suggest.

Hmm ... that kernel commit is less than 4 months old. Would it be reflected in *any* production kernels yet?

regards, tom lane
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-22 14:11:05 +0100, Fabien COELHO wrote:
>> I did a quick & small test with random updates on 16 tables with checkpoint_flush_after=16 checkpoint_timeout=30
>
> Another run with more "normal" settings and over 1000 seconds, so less "quick & small" than the previous one.
>
> checkpoint_flush_after = 16
> checkpoint_timeout = 5min # default
> shared_buffers = 2GB # 1/8 of available memory
>
> Random updates on 16 tables which total to 1.1GB of data, so this is in buffer, no significant "read" traffic.
>
> (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
> per second avg, stddev [ min q1 median d3 max ] <=300tps
> 679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
>
> (2) with 1 tablespace on 1 disk : 956.0 tps
> per second avg, stddev [ min q1 median d3 max ] <=300tps
> 956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%

Interesting. That doesn't reflect my own tests, even on rotating media, at all. I wonder if it's related to:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5

If you use your 12.04 kernel, that'd not be fixed. Which might be a reason to do it as you suggest.

Could you share the exact details of that workload?

Greetings,

Andres Freund
Re: [HACKERS] checkpointer continuous flushing - V18
> I did a quick & small test with random updates on 16 tables with checkpoint_flush_after=16 checkpoint_timeout=30

Another run with more "normal" settings and over 1000 seconds, so less "quick & small" than the previous one.

checkpoint_flush_after = 16
checkpoint_timeout = 5min # default
shared_buffers = 2GB # 1/8 of available memory

Random updates on 16 tables which total to 1.1GB of data, so this is in buffer, no significant "read" traffic.

(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
per second avg, stddev [ min q1 median d3 max ] <=300tps
679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%

(2) with 1 tablespace on 1 disk : 956.0 tps
per second avg, stddev [ min q1 median d3 max ] <=300tps
956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%
--
Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
Hallo Andres,

>> AFAICR I used a "flush context" for each table space in some version I submitted, because I do think that this whole writeback logic really does make sense *per table space*, which suggests that there should be as many writeback contexts as table spaces, otherwise the positive effect is going to be totally lost if table spaces are used. Any thoughts?
>
> Leads to less regular IO, because if your tablespaces are evenly sized (somewhat common) you'll sometimes end up issuing sync_file_range's shortly after each other. For latency outside checkpoints it's important to control the total amount of dirty buffers, and that's obviously independent of tablespaces.

I did a quick & small test with random updates on 16 tables with checkpoint_flush_after=16 checkpoint_timeout=30

(1) with 16 tablespaces (1 per table, but same disk) : tps = 1100, 27% time under 100 tps

(2) with 1 tablespace : tps = 1200, 3% time under 100 tps

This result is logical: with one writeback context shared between tablespaces the sync_file_range is issued on a few buffers per file at a time on the 16 files, no coalescing occurs there, so this results in random IOs, while with one table space all writes are aggregated per file.

ISTM that this quick test shows that writeback contexts are relevant per tablespace, as I expected.
--
Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
>> ISTM that "progress" and "progress_slice" only depend on num_scanned and per-tablespace num_to_scan and total num_to_scan, so they are somehow redundant and the progress could be recomputed from the initial figures when needed.
>
> They don't cause much space usage, and we access the values frequently. So why not store them?

The same question would work the other way around: these values are one division away, why not compute them when needed? No big deal.

> [...] Given realistic amounts of memory the max potential "skew" seems fairly small with float8. If we ever flush one buffer "too much" for a tablespace it's pretty much harmless.

I do agree. I'm suggesting that a comment should be added to justify why float8 accuracy is okay.

>> I see a binary_heap_allocate but no corresponding deallocation, this looks like a memory leak... or is there some magic involved?
>
> Hm. I think we really should use a memory context for all of this - we could after all error out somewhere in the middle...

I'm not sure that a memory context is justified here, there are only two mallocs and the checkpointer works for very long times. I think that it is simpler to just get the malloc/free right.

> [...] I'm not arguing for ripping it out, what I mean is that we don't set a nondefault value for the GUCs on platforms with just posix_fadvise available...

Ok with that.
--
Fabien.
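For readers following along, the progress/progress_slice balancing under discussion can be modelled with a short standalone C sketch. This is not the actual BufferSync()/CkptTsStatus code: the struct name, the three hard-coded tablespaces and the linear minimum search are invented for illustration (the patch uses a binary heap keyed on the same kind of progress value).

#include <stdio.h>

/* Hedged sketch of proportional scheduling across tablespaces. */
typedef struct
{
    int     num_to_scan;     /* buffers this tablespace must write */
    int     num_scanned;     /* buffers written so far */
    double  progress;        /* num_scanned * progress_slice */
    double  progress_slice;  /* total_to_scan / num_to_scan */
} TsProgress;

int
main(void)
{
    TsProgress  ts[3] = {{600, 0, 0, 0}, {300, 0, 0, 0}, {100, 0, 0, 0}};
    int         total = 600 + 300 + 100;

    for (int i = 0; i < 3; i++)
        ts[i].progress_slice = (double) total / ts[i].num_to_scan;

    for (int written = 0; written < total; written++)
    {
        int     pick = -1;

        /* choose the tablespace with the smallest progress that still has work */
        for (int i = 0; i < 3; i++)
            if (ts[i].num_scanned < ts[i].num_to_scan &&
                (pick < 0 || ts[i].progress < ts[pick].progress))
                pick = i;

        /* "write" one buffer of that tablespace */
        ts[pick].num_scanned++;
        ts[pick].progress += ts[pick].progress_slice;
    }

    for (int i = 0; i < 3; i++)
        printf("tablespace %d: wrote %d buffers\n", i, ts[i].num_scanned);
    return 0;
}

Each tablespace advances by total/num_to_scan per written buffer, so all of them reach the same final progress value when they finish; float8 rounding only shifts which tablespace is picked at the margin.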
Re: [HACKERS] checkpointer continuous flushing - V18
>> [...] I do think that this whole writeback logic really does make sense *per table space*,
>
> Leads to less regular IO, because if your tablespaces are evenly sized (somewhat common) you'll sometimes end up issuing sync_file_range's shortly after each other. For latency outside checkpoints it's important to control the total amount of dirty buffers, and that's obviously independent of tablespaces.

>> I do not understand/buy this argument. The underlying IO queue is per device, and table spaces should be per device as well (otherwise what's the point?), so you should want to coalesce and "writeback" pages per device as well. Calling sync_file_range on distinct devices should probably be issued more or less randomly, and should not interfere one with the other.
>
> The kernel's dirty buffer accounting is global, not per block device.

Sure, but this is not my point. My point is that "sync_file_range" moves buffers to the device IO queues, which are per device. If there is one queue in pg and many queues on many devices, the whole point of coalescing to get sequential writes is somehow lost.

> It's also actually rather common to have multiple tablespaces on a single block device. Especially if SANs and such are involved; where you don't even know which partitions are on which disks.

Ok, some people would not benefit if they use many tablespaces on one device, too bad, but that does not look like a very useful setting anyway, and I do not think it would harm much in this case.

>> If you use just one context, the more table spaces the less performance gains, because there is less and less aggregation thus sequential writes per device. So for me there should really be one context per tablespace. That would suggest a hashtable or some other structure to keep and retrieve them, which would not be that bad, and I think that it is what is needed.
>
> That'd be much easier to do by just keeping the context in the per-tablespace struct. But anyway, I'm really doubtful about going for that; I had it that way earlier, and observing IO showed it not being beneficial.

ISTM that you would need a significant number of tablespaces to see the benefit. If you do not do that, the more table spaces the more random the IOs, which is disappointing. Also, "the cost is marginal", so I do not see any good argument not to do it.
--
Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
Hi,

On 2016-02-21 10:52:45 +0100, Fabien COELHO wrote:
> * CkptSortItem:
>
> I think that allocating 20 bytes per buffer in shared memory is a little on the heavy side. Some compression can be achieved: sizeof(ForkNum) is 4 bytes to hold 4 values, could be one byte or even 2 bits somewhere. Also, there are very few tablespaces, they could be given a small number and this number could be used instead of the Oid, so the space requirement could be reduced to say 16 bytes per buffer by combining space & fork in 2 shorts and keeping 4 byte alignment and also getting 8 byte alignment... If this is too much, I have shown that it can work with only 4 bytes per buffer, as the sorting is really just a performance optimisation and is not broken if some stuff changes between sorting & writeback, but you did not like the idea. If the amount of shared memory required is a significant concern, it could be resurrected, though.

This is less than 0.2 % of memory related to shared buffers. We have the same amount of memory allocated in CheckpointerShmemSize(), and nobody has complained so far. And sorry, going back to the previous approach isn't going to fly, and I've no desire to discuss that *again*.

> ISTM that "progress" and "progress_slice" only depend on num_scanned and per-tablespace num_to_scan and total num_to_scan, so they are somehow redundant and the progress could be recomputed from the initial figures when needed.

They don't cause much space usage, and we access the values frequently. So why not store them?

> If these fields are kept, I think that a comment should justify why float8 precision is okay for the purpose. I think it is quite certainly fine in the worst case with 32 bits buffer_ids, but it would not be if this size is changed someday.

That seems pretty much unrelated to having the fields - the question of accuracy plays a role regardless, no? Given realistic amounts of memory the max potential "skew" seems fairly small with float8. If we ever flush one buffer "too much" for a tablespace it's pretty much harmless.

> ISTM that nearly all of the collected data on the second sweep could be collected on the first sweep, so that this second sweep could be avoided altogether. The only missing data is the index of the first buffer in the array, which can be computed by considering tablespaces only, sweeping over buffers is not needed. That would suggest creating the heap or using a hash in the initial buffer sweep to keep this information. This would also provide a point where to number tablespaces for compressing the CkptSortItem struct.

Doesn't seem worth the complexity to me.

> I'm wondering about calling CheckpointWriteDelay on each round, maybe a minimum amount of write would make sense.

Why? There's not really much benefit of doing more work than needed. I think we should sleep far shorter in many cases, but that's indeed a separate issue.

> I see a binary_heap_allocate but no corresponding deallocation, this looks like a memory leak... or is there some magic involved?

Hm. I think we really should use a memory context for all of this - we could after all error out somewhere in the middle...

>> I think this patch primarily needs:
>> * Benchmarking on FreeBSD/OSX to see whether we should enable the mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm inclined to leave it off till then.
>
> I do not have that. As "msync" seems available on Linux, it is possible to force using it with a "ifdef 0" to skip sync_file_range and check whether it does some good there.

Unfortunately it doesn't work well on linux:

* On many OSs msync() on a mmap'ed file triggers writeback. On linux
* it only does so when MS_SYNC is specified, but then it does the
* writeback synchronously. Luckily all common linux systems have
* sync_file_range(). This is preferable over FADV_DONTNEED because
* it doesn't flush out clean data.

I've verified beforehand, with a simple demo program, that msync(MS_ASYNC) does something reasonable on FreeBSD...

> Idem for the "posix_fadvise" stuff. I can try to do that, but it takes time to do so, if someone can test on other OSes it would be much better. I think that if it works it should be kept in, so it is just a matter of testing it.

I'm not arguing for ripping it out, what I mean is that we don't set a nondefault value for the GUCs on platforms with just posix_fadvise available...

Greetings,

Andres Freund
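To make the platform alternatives above easier to compare, here is a hedged sketch of a writeback hint that prefers sync_file_range() and falls back to msync(MS_ASYNC) or posix_fadvise(DONTNEED). The HAVE_* guards and the flush_hint() name are hypothetical, the offset is assumed to be page-aligned, and this is not PostgreSQL's actual pg_flush_data().

#define _GNU_SOURCE
#include <fcntl.h>      /* open, posix_fadvise, sync_file_range (Linux) */
#include <sys/mman.h>   /* mmap, msync, munmap */
#include <unistd.h>

/* Hypothetical configure macros; real builds detect availability at configure time. */
static void
flush_hint(int fd, off_t offset, off_t nbytes)
{
#if defined(HAVE_SYNC_FILE_RANGE)
    /* Linux: start writeback of just this range; does not flush clean data */
    sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
#elif defined(HAVE_MSYNC_MS_ASYNC)
    /* e.g. FreeBSD/OSX: map the range and request asynchronous writeback;
     * on Linux MS_ASYNC would not start writeback, as the quoted comment notes */
    void   *p = mmap(NULL, (size_t) nbytes, PROT_READ, MAP_SHARED, fd, offset);

    if (p != MAP_FAILED)
    {
        msync(p, (size_t) nbytes, MS_ASYNC);
        munmap(p, (size_t) nbytes);
    }
#else
    /* last resort: hints the kernel to write the range out, but may also
     * drop clean pages from the cache */
    posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
#endif
}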
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-21 08:26:28 +0100, Fabien COELHO wrote:
>>> In the discussion in the wal section, I'm not sure about the effect of setting writebacks on SSD, [...]
>>
>> Yea, that paragraph needs some editing. I think we should basically remove that last sentence.
>
> Ok, fine with me. Does that mean that flushing has a significant positive impact on SSD in your tests?

Yes. The reason we need flushing is that the kernel amasses dirty pages, and then flushes them at once. That hurts for both SSDs and rotational media.

Sorting is the bigger question, but I've seen it have clearly beneficial performance impacts. I guess if you look at devices with an internal block size bigger than 8k, you'd even see larger differences.

>>> Maybe the merging strategy could be more aggressive than just strict neighbors?
>>
>> I don't think so. If you flush more than neighbouring writes you'll often end up flushing buffers dirtied by another backend, causing additional stalls.
>
> Ok. Maybe the neighbor definition could be relaxed just a little bit so that small holes are bridged, but not large holes? If there are only a few pages in between, even if written by another process, then writing them together should be better? Well, this can wait for a clear case, because hopefully the OS will recoalesce them behind anyway.

I'm against doing so without clear measurements of a benefit.

>> Also because the infrastructure is used for more than checkpoint writes. There's absolutely no ordering guarantees there.
>
> Yep, but not much benefit to expect from a few dozen random pages either.

Actually, there's kinda frequently a benefit observable. Even if few requests can be merged, doing IO requests in an order more likely doable within a few rotations is beneficial. Also, the cost is marginal, so why worry?

>>> [...] I do think that this whole writeback logic really does make sense *per table space*,
>>
>> Leads to less regular IO, because if your tablespaces are evenly sized (somewhat common) you'll sometimes end up issuing sync_file_range's shortly after each other. For latency outside checkpoints it's important to control the total amount of dirty buffers, and that's obviously independent of tablespaces.
>
> I do not understand/buy this argument.
>
> The underlying IO queue is per device, and table spaces should be per device as well (otherwise what's the point?), so you should want to coalesce and "writeback" pages per device as well. Calling sync_file_range on distinct devices should probably be issued more or less randomly, and should not interfere one with the other.

The kernel's dirty buffer accounting is global, not per block device.

It's also actually rather common to have multiple tablespaces on a single block device. Especially if SANs and such are involved; where you don't even know which partitions are on which disks.

> If you use just one context, the more table spaces the less performance gains, because there is less and less aggregation thus sequential writes per device.
>
> So for me there should really be one context per tablespace. That would suggest a hashtable or some other structure to keep and retrieve them, which would not be that bad, and I think that it is what is needed.

That'd be much easier to do by just keeping the context in the per-tablespace struct. But anyway, I'm really doubtful about going for that; I had it that way earlier, and observing IO showed it not being beneficial.

>>> For the checkpointer, a key aspect is that the scheduling process goes to sleep from time to time, and this sleep time looked like a great opportunity to do this kind of flushing. You chose not to take advantage of the behavior, why?
>>
>> Several reasons: Most importantly there's absolutely no guarantee that you'll ever end up sleeping, it's quite common to happen only seldom.
>
> Well, that would be under a situation when pg is completely unresponsive. More so, this behavior *makes* pg unresponsive.

No. The checkpointer being bottlenecked on actual IO performance doesn't impact production that badly. It'll just sometimes block in sync_file_range(), but the IO queues will have enough space to frequently give way to other backends, particularly to synchronous reads (most pg reads) and synchronous writes (fdatasync()). So a single checkpoint will take a bit longer, but otherwise the system will mostly keep up the work in a regular manner. Without the sync_file_range() calls the kernel will amass dirty buffers until global dirty limits are reached, which then will bring the whole system to a standstill.

It's pretty common that checkpoint_timeout is too short to be able to write all shared_buffers out, in that case it's much better to slow down the whole checkpoint, instead of being incredibly slow at the end.

>> I also don't really believe it helps that much, although that's a complex argument to make.
>
> Yep. My thinkin
Re: [HACKERS] checkpointer continuous flushing - V18
Hallo Andres,

Here is a review for the second patch.

> For 0002 I've recently changed:
> * Removed the sort timing information, we've proven sufficiently that it doesn't take a lot of time.

I put it there initially to demonstrate that there was no cache performance issue when sorting on just buffer indexes. As it is always small, I agree that it is not needed. Well, it could still be in seconds on a very large shared buffers setting with a very large checkpoint, but then the checkpoint would be tremendously huge...

> * Minor comment polishing.

Patch applies and checks on Linux.

* CkptSortItem:

I think that allocating 20 bytes per buffer in shared memory is a little on the heavy side. Some compression can be achieved: sizeof(ForkNum) is 4 bytes to hold 4 values, could be one byte or even 2 bits somewhere. Also, there are very few tablespaces, they could be given a small number and this number could be used instead of the Oid, so the space requirement could be reduced to say 16 bytes per buffer by combining space & fork in 2 shorts and keeping 4 byte alignment and also getting 8 byte alignment... If this is too much, I have shown that it can work with only 4 bytes per buffer, as the sorting is really just a performance optimisation and is not broken if some stuff changes between sorting & writeback, but you did not like the idea. If the amount of shared memory required is a significant concern, it could be resurrected, though.

* CkptTsStatus:

As I suggested in the other mail, I think that this structure should also keep a per-tablespace WritebackContext so that coalescing is done per tablespace.

ISTM that "progress" and "progress_slice" only depend on num_scanned and per-tablespace num_to_scan and total num_to_scan, so they are somehow redundant and the progress could be recomputed from the initial figures when needed.

If these fields are kept, I think that a comment should justify why float8 precision is okay for the purpose. I think it is quite certainly fine in the worst case with 32 bits buffer_ids, but it would not be if this size is changed someday.

* BufferSync

After a first sweep to collect buffers to write, they are sorted, and then those buffers are swept again to compute some per tablespace data and organise a heap. ISTM that nearly all of the collected data on the second sweep could be collected on the first sweep, so that this second sweep could be avoided altogether. The only missing data is the index of the first buffer in the array, which can be computed by considering tablespaces only, sweeping over buffers is not needed. That would suggest creating the heap or using a hash in the initial buffer sweep to keep this information. This would also provide a point where to number tablespaces for compressing the CkptSortItem struct.

I'm wondering about calling CheckpointWriteDelay on each round, maybe a minimum amount of write would make sense. This remark is independent of this patch. Probably it works fine because after a sleep the checkpointer is behind enough so that it will write a bunch of buffers before sleeping again.

I see a binary_heap_allocate but no corresponding deallocation, this looks like a memory leak... or is there some magic involved?

There is some debug stuff to remove in #ifdefs.

I think that the buffer/README should be updated with explanations about sorting in the checkpointer.

> I think this patch primarily needs:
> * Benchmarking on FreeBSD/OSX to see whether we should enable the mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm inclined to leave it off till then.

I do not have that. As "msync" seems available on Linux, it is possible to force using it with a "ifdef 0" to skip sync_file_range and check whether it does some good there. Idem for the "posix_fadvise" stuff. I can try to do that, but it takes time to do so, if someone can test on other OSes it would be much better. I think that if it works it should be kept in, so it is just a matter of testing it.
--
Fabien.
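To visualise the 16-byte layout suggested in the review above, here is a hedged sketch. The PackedCkptSortItem type and its field names are invented (the actual CkptSortItem keeps a 4-byte tablespace Oid and a 4-byte fork number); the point is only that a small per-checkpoint tablespace index and the fork number fit into two 16-bit fields.

#include <stdint.h>

/* Hypothetical packed layout, 16 bytes per buffer instead of 20:
 * tablespaces are numbered per checkpoint into a small side array, and the
 * fork number (main/fsm/vm/init) easily fits in 16 bits. */
typedef struct PackedCkptSortItem
{
    uint32_t    relNode;    /* relation file node */
    uint32_t    blockNum;   /* block number within the fork */
    uint16_t    tsIndex;    /* index into a per-checkpoint tablespace array */
    uint16_t    forkNum;    /* fork number */
    uint32_t    buf_id;     /* shared buffer id to write out */
} PackedCkptSortItem;       /* 16 bytes, naturally aligned 4-byte members */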
Re: [HACKERS] checkpointer continuous flushing - V18
Hallo Andres,

>> [...] I do think that this whole writeback logic really does make sense
>> *per table space*,
>
> Leads to less regular IO, because if your tablespaces are evenly sized
> (somewhat common) you'll sometimes end up issuing sync_file_range's shortly
> after each other. For latency outside checkpoints it's important to control
> the total amount of dirty buffers, and that's obviously independent of
> tablespaces.

I do not understand/buy this argument.

The underlying IO queue is per device, and tablespaces should be per device as well (otherwise what's the point?), so you should want to coalesce and "writeback" pages per device as well. Calls to sync_file_range on distinct devices should probably be issued more or less randomly and should not interfere with one another.

If you use just one context, then the more tablespaces there are, the smaller the performance gain, because there is less and less aggregation and thus fewer sequential writes per device.

So for me there should really be one context per tablespace. That would suggest a hashtable or some other structure to keep and retrieve them, which would not be that bad, and I think that it is what is needed.

Note: I think that an easy way to do that in the "checkpoint sort" patch is simply to keep a WritebackContext in the CkptTsStatus structure, which is per tablespace in the checkpointer.

For bgwriter & backends it can wait; there is little "writeback" coalescing because the IO should be pretty random, so it does not matter much.

-- Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
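A minimal sketch of that note, assuming the WritebackContext type and the writeback calls introduced by patch 0001 are in scope; the surrounding field names are abridged and may not match the actual CkptTsStatus definition:

    /* Per-tablespace checkpoint state carrying its own writeback context, so
     * that coalescing happens per tablespace (and hence per device). */
    typedef struct CkptTsStatusSketch
    {
        Oid              tsId;          /* tablespace oid */
        int              num_to_scan;   /* buffers of this tablespace to write */
        int              num_scanned;   /* buffers written so far */
        WritebackContext wb_context;    /* per-tablespace coalescing state */
    } CkptTsStatusSketch;

    /* Roughly, when the checkpointer writes out one buffer belonging to this
     * tablespace it would do something like:
     *     ScheduleBufferTagForWriteback(&ts_stat->wb_context, &buf_tag);
     * and the pending requests of each context would then be coalesced into
     * sync_file_range() calls independently of the other tablespaces. */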
Re: [HACKERS] checkpointer continuous flushing - V18
Hallo Andres,

>> In some previous version I think a warning was shown if the feature was
>> requested but not available.
>
> I think we should either silently ignore it, or error out. Warnings
> somewhere in the background aren't particularly meaningful.

I like "ignoring with a warning" in the log file, because when things do not behave as expected that is where I'll be looking. I do not think that it should error out.

>> The sgml documentation about the "*_flush_after" configuration parameter
>> talks about bytes, but the actual unit should be buffers.
>
> The unit actually is buffers, but you can configure it using bytes. We've
> done the same for other GUCs (shared_buffers, wal_buffers, ...). Referring
> to bytes is easier because you don't have to explain that how much data it
> actually is depends on compilation settings, and such.

So I understand that it works with kB as well. Now I do not think that it would need a lot of explanation if you say that it is a number of pages, and I think that a number of pages is significant because it is, eventually, a number of IO requests to be coalesced.

>> In the discussion in the wal section, I'm not sure about the effect of
>> setting writebacks on SSD, [...]
>
> Yea, that paragraph needs some editing. I think we should basically remove
> that last sentence.

Ok, fine with me. Does that mean that flushing has a significant positive impact on SSD in your tests?

>> However it does not address the point that bgwriter and backends basically
>> issue random writes, [...]
>
> The benefit is primarily that you don't collect large amounts of dirty
> buffers in the kernel page cache. In most cases the kernel will not be able
> to coalesce these writes either... I've measured *massive* performance
> latency differences for workloads that are bigger than shared buffers -
> because suddenly bgwriter / backends do the majority of the writes.
> Flushing in the checkpoint quite possibly makes nearly no difference in
> such cases.

So I understand that there is a positive impact under some load. Good!

>> Maybe the merging strategy could be more aggressive than just strict
>> neighbors?
>
> I don't think so. If you flush more than neighbouring writes you'll often
> end up flushing buffers dirtied by another backend, causing additional
> stalls.

Ok. Maybe the neighbour definition could be relaxed just a little bit so that small holes are stepped over, but not large holes? If there are only a few pages in between, even if written by another process, then writing them together should be better? Well, this can wait for a clear case, because hopefully the OS will coalesce them again behind the scenes anyway.

>> struct WritebackContext: keeping a pointer to guc variables is a kind of
>> trick, I think it deserves a comment.
>
> It has, it's just in WritebackContextInit(). Can duplicate it.

I missed it, I expected something in the struct definition. Do not duplicate, but cross-reference it?

>> IssuePendingWritebacks: I understand that qsort is needed "again" because
>> when balancing writes over tablespaces they may be intermixed.
>
> Also because the infrastructure is used for more than checkpoint writes.
> There are absolutely no ordering guarantees there.

Yep, but there is not much benefit to expect from a few dozen random pages either.

>> [...] I do think that this whole writeback logic really does make sense
>> *per table space*,
>
> Leads to less regular IO, because if your tablespaces are evenly sized
> (somewhat common) you'll sometimes end up issuing sync_file_range's shortly
> after each other. For latency outside checkpoints it's important to control
> the total amount of dirty buffers, and that's obviously independent of
> tablespaces.

I do not understand/buy this argument. The underlying IO queue is per device, and tablespaces should be per device as well (otherwise what's the point?), so you should want to coalesce and "writeback" pages per device as well. Calls to sync_file_range on distinct devices should probably be issued more or less randomly and should not interfere with one another. If you use just one context, then the more tablespaces there are, the smaller the performance gain, because there is less and less aggregation and thus fewer sequential writes per device.

So for me there should really be one context per tablespace. That would suggest a hashtable or some other structure to keep and retrieve them, which would not be that bad, and I think that it is what is needed.

>> For the checkpointer, a key aspect is that the scheduling process goes to
>> sleep from time to time, and this sleep time looked like a great
>> opportunity to do this kind of flushing. You chose not to take advantage
>> of that behavior, why?
>
> Several reasons: Most importantly there's absolutely no guarantee that
> you'll ever end up sleeping; it's quite common for that to happen only
> seldom.

Well, that would be under a situation when pg is completely unresponsive.

> More so, this behavior *makes* pg unresponsive. If you're bottlenecked on
> IO, you can end up being behind all the time.

Hopefully sortin
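For what the "small holes" relaxation could mean in practice, here is a tiny standalone sketch; the tolerance value and names are invented for illustration and this is not code from the patch:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;   /* mirrors PostgreSQL's block number type */

    #define MAX_HOLE_BLOCKS 2       /* hypothetical tolerance, in blocks */

    typedef struct PendingRange
    {
        BlockNumber start;
        BlockNumber end;            /* inclusive */
    } PendingRange;

    /*
     * Try to fold the (sorted) next block into the current range, stepping
     * over a hole of at most MAX_HOLE_BLOCKS blocks possibly written by
     * another process; returns false when the hole is too large and a new
     * range (hence a separate sync_file_range() call) should be started.
     */
    static bool
    merge_with_small_hole(PendingRange *range, BlockNumber next)
    {
        if (next <= range->end + 1 + MAX_HOLE_BLOCKS)
        {
            if (next > range->end)
                range->end = next;
            return true;
        }
        return false;
    }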
Re: [HACKERS] checkpointer continuous flushing - V18
On Sun, Feb 21, 2016 at 3:37 AM, Andres Freund wrote:
>> The documentation seems to use "flush" but the code talks about "writeback"
>> or "flush", depending. I think one vocabulary, whichever it is, should be
>> chosen and everything should stick to it, otherwise everything looks kind of
>> fuzzy and raises doubts for the reader (is it the same thing? is it something
>> else?). I initially used "flush", but it seems a bad idea because it has
>> nothing to do with the flush function, so I'm fine with writeback or anything
>> else, I just think that *one* word should be chosen and used everywhere.
>
> Hm.

I think there might be a semantic distinction between these two terms. Doesn't writeback mean writing pages to disk, and flushing mean making sure that they are durably on disk? So for example when the Linux kernel thinks there is too much dirty data, it initiates writeback, not a flush; on the other hand, at transaction commit, we initiate a flush, not writeback.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
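On Linux the distinction can be illustrated directly; a minimal sketch (error handling omitted, offsets assumed valid) contrasting the two operations:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* "writeback": ask the kernel to start writing this range to disk,
     * without waiting and without any durability guarantee. */
    static void
    start_writeback(int fd, off_t offset, off_t nbytes)
    {
        (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    }

    /* "flush": do not return until the file's data is durably on disk,
     * which is what a transaction commit needs. */
    static void
    flush_durably(int fd)
    {
        (void) fsync(fd);
    }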
Re: [HACKERS] checkpointer continuous flushing - V18
Hi,

On 2016-02-20 20:56:31 +0100, Fabien COELHO wrote:
> > * Currently *_flush_after can be set to a nonzero value, even if there's
> >   no support for flushing on that platform. Imo that's ok, but perhaps
> >   other people's opinions differ.
>
> In some previous version I think a warning was shown if the feature was
> requested but not available.

I think we should either silently ignore it, or error out. Warnings somewhere in the background aren't particularly meaningful.

> Here are some quick comments on the patch:
>
> Patch applies cleanly on head. Compiled and checked on Linux. Compilation
> issues on other systems, see below.

For those I've already pushed a small fixup commit to git... Stupid mistake.

> The documentation seems to use "flush" but the code talks about "writeback"
> or "flush", depending. I think one vocabulary, whichever it is, should be
> chosen and everything should stick to it, otherwise everything looks kind of
> fuzzy and raises doubts for the reader (is it the same thing? is it something
> else?). I initially used "flush", but it seems a bad idea because it has
> nothing to do with the flush function, so I'm fine with writeback or anything
> else, I just think that *one* word should be chosen and used everywhere.

Hm.

> The sgml documentation about the "*_flush_after" configuration parameter
> talks about bytes, but the actual unit should be buffers.

The unit actually is buffers, but you can configure it using bytes. We've done the same for other GUCs (shared_buffers, wal_buffers, ...). Referring to bytes is easier because you don't have to explain that how much data it actually is depends on compilation settings, and such.

> Also, the maximum value (128 ?) should appear in the text.

Right.

> In the discussion in the wal section, I'm not sure about the effect of
> setting writebacks on SSD, but I think that you have made some tests
> so maybe you have an answer and the corresponding section could be
> written with some more definitive text than "probably brings no
> benefit on SSD".

Yea, that paragraph needs some editing. I think we should basically remove that last sentence.

> A good point of the whole approach is that it is available to all kinds
> of pg processes.

Exactly.

> However it does not address the point that bgwriter and
> backends basically issue random writes, so I would not expect much positive
> effect before these writes are somehow sorted, which means doing some
> compromise in the LRU/LFU logic...

The benefit is primarily that you don't collect large amounts of dirty buffers in the kernel page cache. In most cases the kernel will not be able to coalesce these writes either... I've measured *massive* performance latency differences for workloads that are bigger than shared buffers - because suddenly bgwriter / backends do the majority of the writes. Flushing in the checkpoint quite possibly makes nearly no difference in such cases.

> Well, all this is best kept for later, and I'm fine to have the flushing
> logic there. I'm wondering why you chose 16 & 64 as defaults for backends
> & bgwriter, though.

I chose a small value for backends because there often are a large number of backends, and thus the amount of dirty data of each adds up. I used a larger value for bgwriter because I saw it ending up using bigger IOs.

> IssuePendingWritebacks: you merge only strictly neighbouring writes.
> Maybe the merging strategy could be more aggressive than just strict
> neighbors?

I don't think so. If you flush more than neighbouring writes you'll often end up flushing buffers dirtied by another backend, causing additional stalls. And if the writes aren't actually neighbouring there's not much gained from issuing them in one sync_file_range call.

> mdwriteback: all variables could be declared within the while, I do not
> understand why some are in and some are out.

Right.

> ISTM that putting writeback management at the relation level does not
> help a lot, because you have to translate again from relation to
> files.

Sure, but what's the problem with that? That's how normal read/write IO works as well?

> struct WritebackContext: keeping a pointer to guc variables is a kind of
> trick, I think it deserves a comment.

It has, it's just in WritebackContextInit(). Can duplicate it.

> ScheduleBufferTagForWriteback: the "pending" variable is not very
> useful.

Shortens line length a good bit, at no cost.

> IssuePendingWritebacks: I understand that qsort is needed "again"
> because when balancing writes over tablespaces they may be intermixed.

Also because the infrastructure is used for more than checkpoint writes. There are absolutely no ordering guarantees there.

> AFAICR I used a "flush context" for each table space in some version
> I submitted, because I do think that this whole writeback logic really
> does make sense *per table space*, which suggests that there should be as
> many writeback contexts as table spaces, otherwise the pos
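To illustrate the bytes-vs-buffers point, a hypothetical postgresql.conf excerpt using the GUC names from this patch; the values are only examples (assuming the default 8kB block size), not recommendations:

    # stored internally as a number of buffers, but byte units are accepted
    checkpointer_flush_after = 256kB   # converted to 32 buffers with 8kB blocks
    bgwriter_flush_after = 64          # a plain number means 64 buffers = 512kB
    backend_flush_after = 16           # 16 buffers = 128kB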
Re: [HACKERS] checkpointer continuous flushing - V18
Hello Andres,

> For 0001 I've recently changed:
> * Don't schedule writeback after smgrextend() - that defeats linux delayed
>   allocation mechanism, increasing fragmentation noticeably.
> * Add docs for the new GUC variables
> * comment polishing
> * BackendWritebackContext now isn't dynamically allocated anymore
>
> I think this patch primarily needs:
> * review of the docs, not sure if they're easy enough to understand. Some
>   language polishing might also be needed.

Yep, see below.

> * review of the writeback API, combined with the smgr/md.c changes.

See various comments below.

> * Currently *_flush_after can be set to a nonzero value, even if there's
>   no support for flushing on that platform. Imo that's ok, but perhaps
>   other people's opinions differ.

In some previous version I think a warning was shown if the feature was requested but not available.

Here are some quick comments on the patch:

Patch applies cleanly on head. Compiled and checked on Linux. Compilation issues on other systems, see below.

When pages are written by a process (checkpointer, bgwriter, backend worker), the list of recently written pages is kept, and every so often an advisory fsync (sync_file_range, other options for other systems) is issued so that the data is sent to the IO system without relying on a more or less (un)controllable OS policy.

The documentation seems to use "flush" but the code talks about "writeback" or "flush", depending. I think one vocabulary, whichever it is, should be chosen and everything should stick to it, otherwise everything looks kind of fuzzy and raises doubts for the reader (is it the same thing? is it something else?). I initially used "flush", but it seems a bad idea because it has nothing to do with the flush function, so I'm fine with writeback or anything else, I just think that *one* word should be chosen and used everywhere.

The sgml documentation about the "*_flush_after" configuration parameter talks about bytes, but the actual unit should be buffers. I think that keeping a number of buffers should be fine, because that is what the internal stuff will manage, not bytes. Also, the maximum value (128 ?) should appear in the text.

In the discussion in the wal section, I'm not sure about the effect of setting writebacks on SSD, but I think that you have made some tests, so maybe you have an answer and the corresponding section could be written with some more definitive text than "probably brings no benefit on SSD".

A good point of the whole approach is that it is available to all kinds of pg processes. However it does not address the point that bgwriter and backends basically issue random writes, so I would not expect much positive effect before these writes are somehow sorted, which means doing some compromise in the LRU/LFU logic... Well, all this is best kept for later, and I'm fine to have the flushing logic there. I'm wondering why you chose 16 & 64 as defaults for backends & bgwriter, though.

IssuePendingWritebacks: you merge only strictly neighbouring writes. Maybe the merging strategy could be more aggressive than just strict neighbors?

mdwriteback: all variables could be declared within the while, I do not understand why some are in and some are out.

ISTM that putting writeback management at the relation level does not help a lot, because you have to translate again from relation to files. The good news is that it should work as well, and that it does avoid the issue that the file may have been closed in between, so why not.

The PendingWriteback struct looks useless. I think it should be removed, and maybe put back one day if it is needed, which I rather doubt.

struct WritebackContext: keeping a pointer to guc variables is a kind of trick, I think it deserves a comment.

ScheduleBufferTagForWriteback: the "pending" variable is not very useful. Maybe consider shortening the "pending_writebacks" field name to "writebacks"?

IssuePendingWritebacks: I understand that qsort is needed "again" because when balancing writes over tablespaces they may be intermixed.

AFAICR I used a "flush context" for each table space in some version I submitted, because I do think that this whole writeback logic really does make sense *per table space*, which suggests that there should be as many writeback contexts as table spaces, otherwise the positive effect may be totally lost if table spaces are used. Any thoughts?

Assert(*context->max_pending <= WRITEBACK_MAX_PENDING_FLUSHES); is always true, I think, as it is already checked in the initialization and when setting gucs.

SyncOneBuffer: I wonder why you copy the tag after releasing the lock. I guess it is okay because it is still pinned.

pg_flush_data: in the first #elif, "context" is undeclared at line 446 and label "out" is not defined at line 455. In the second #elif, "context" is undeclared at line 490 and label "out" at line 500 is not defined either.

For the checkpointer, a key aspect is that the scheduling process goes to sleep from time to t
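For readers wondering what the "pointer to guc variables" trick looks like, here is a simplified sketch; the struct is abridged (the real WritebackContext has more fields) and the function is invented for illustration:

    /* Keeping a pointer to the GUC, instead of a copy of its value, means a
     * later SET / configuration reload is picked up the next time the context
     * is consulted, without re-initialising the context. */
    typedef struct WritebackContextSketch
    {
        int        *max_pending;    /* points at the relevant *_flush_after GUC */
        int         nr_pending;     /* writeback requests accumulated so far */
    } WritebackContextSketch;

    static void
    maybe_issue_writebacks(WritebackContextSketch *context)
    {
        /* dereference the GUC every time so runtime changes take effect */
        if (*context->max_pending != 0 &&
            context->nr_pending >= *context->max_pending)
        {
            /* ... issue the coalesced sync_file_range() calls here ... */
            context->nr_pending = 0;
        }
    }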
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-19 22:46:44 +0100, Fabien COELHO wrote:
> Hello Andres,
>
> > Here's the next two (the most important) patches of the series:
> > 0001: Allow to trigger kernel writeback after a configurable number of
> > writes.
> > 0002: Checkpoint sorting and balancing.
>
> I will look into these two in depth.
>
> Note that I would have ordered them in reverse because sorting is nearly
> always very beneficial, and "writeback" (formerly called flushing) is then
> nearly always very beneficial on sorted buffers.

I had it that way earlier. I actually saw pretty large regressions from sorting alone in some cases as well, apparently because the kernel submits much larger IOs to disk; although that probably only shows on SSDs. This way the modifications imo look a trifle better ;). I'm intending to commit both at the same time, and keep them separate only because they're easier to understand separately.

Andres
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
Hello Andres,

> Here's the next two (the most important) patches of the series:
> 0001: Allow to trigger kernel writeback after a configurable number of
> writes.
> 0002: Checkpoint sorting and balancing.

I will look into these two in depth.

Note that I would have ordered them in reverse because sorting is nearly always very beneficial, and "writeback" (formerly called flushing) is then nearly always very beneficial on sorted buffers.

-- Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> Hi,
>
> Fabien asked me to post a new version of the checkpoint flushing patch
> series. While this isn't entirely ready for commit, I think we're
> getting closer.
>
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

I've updated the git tree.

Here's the next two (the most important) patches of the series:
0001: Allow to trigger kernel writeback after a configurable number of writes.
0002: Checkpoint sorting and balancing.

For 0001 I've recently changed:
* Don't schedule writeback after smgrextend() - that defeats linux delayed
  allocation mechanism, increasing fragmentation noticeably.
* Add docs for the new GUC variables
* comment polishing
* BackendWritebackContext now isn't dynamically allocated anymore

I think this patch primarily needs:
* review of the docs, not sure if they're easy enough to understand. Some
  language polishing might also be needed.
* review of the writeback API, combined with the smgr/md.c changes.
* Currently *_flush_after can be set to a nonzero value, even if there's
  no support for flushing on that platform. Imo that's ok, but perhaps
  other people's opinions differ.

For 0002 I've recently changed:
* Removed the sort timing information, we've proven sufficiently that it
  doesn't take a lot of time.
* Minor comment polishing.

I think this patch primarily needs:
* Benchmarking on FreeBSD/OSX to see whether we should enable the
  mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
  inclined to leave it off till then.

Regards,

Andres

>From 58aee659417372f3dda4420d8f2a4f4d41c56d31 Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Fri, 19 Feb 2016 12:13:05 -0800
Subject: [PATCH 1/4] Allow to trigger kernel writeback after a configurable
 number of writes.

Currently writes to the main data files of postgres all go through the OS page cache. This means that several operating systems can end up collecting a large number of dirty buffers in their respective page caches. When these dirty buffers are flushed to storage rapidly, be it because of fsync(), timeouts, or dirty ratios, latency for other writes can increase massively. This is the primary reason for regular massive stalls observed in real world scenarios and artificial benchmarks; on rotating disks stalls on the order of hundreds of seconds have been observed.

On linux it is possible to control this by reducing the global dirty limits significantly, reducing the above problem. But global configuration is rather problematic because it'll affect other applications; also PostgreSQL itself doesn't always generally want this behavior, e.g. for temporary files it's undesirable.

Several operating systems allow some control over the kernel page cache. Linux has sync_file_range(2), several posix systems have msync(2) and posix_fadvise(2). sync_file_range(2) is preferable because it requires no special setup, whereas msync() requires the to-be-flushed range to be mmap'ed. For the purpose of flushing dirty data posix_fadvise(2) is the worst alternative, as flushing dirty data is just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages from the page cache.

Thus the feature is enabled by default only on linux, but can be enabled on all systems that have any of the above APIs.

With the infrastructure added, writes made via checkpointer, bgwriter and normal user backends can be flushed after a configurable number of writes. Each of these sources of writes is controlled by a separate GUC, checkpointer_flush_after, bgwriter_flush_after and backend_flush_after respectively; they're separate because the number of flushes that is beneficial differs for each, and because the performance considerations of controlled flushing for each of these are different.

A later patch will add checkpoint sorting - after that, flushes from the checkpoint will almost always be desirable. Bgwriter flushes are most of the time going to be random, which is slow on lots of storage hardware. Flushing in backends works well if the storage and bgwriter can keep up, but if not it can have negative consequences. This patch is likely to have negative performance consequences without checkpoint sorting, but unfortunately so does sorting without flush control.

TODO:
* verify msync codepath
* properly detect mmap() && msync(MS_ASYNC) support, use it by default if
  available and sync_file_range is *not* available

Discussion: alpine.DEB.2.10.150601132.28433@sto
Author: Fabien Coelho and Andres Freund
---
 doc/src/sgml/config.sgml              | 81 +++
 doc/src/sgml/wal.sgml                 | 13 +++
 src/backend/postmaster/bgwriter.c     |  8 +-
 src/backend/storage/buffer/buf_init.c |  5 +
 src/backend/storage/buffer/bufmgr.c
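To make the comparison of the three APIs concrete, a rough illustrative sketch follows; the HAVE_* macros stand in for configure-time checks, the fd is assumed open read-write with a page-aligned offset, error handling is omitted, and this is not the patch's actual pg_flush_data() implementation:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void
    hint_kernel_writeback(int fd, off_t offset, off_t nbytes)
    {
    #if defined(HAVE_SYNC_FILE_RANGE)
        /* Linux: start writeback of just this range, non-blocking, and the
         * pages stay in the page cache. */
        (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    #elif defined(HAVE_MSYNC)
        /* POSIX: the range must be mmap'ed first; MS_ASYNC schedules the
         * write without waiting.  Whether this reliably initiates writeback
         * of pages dirtied through write() is OS-dependent (hence the TODO
         * about verifying the msync codepath). */
        void   *p = mmap(NULL, (size_t) nbytes, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, offset);

        if (p != MAP_FAILED)
        {
            (void) msync(p, (size_t) nbytes, MS_ASYNC);
            (void) munmap(p, (size_t) nbytes);
        }
    #elif defined(HAVE_POSIX_FADVISE)
        /* Worst option: flushing is only a side effect, and the pages are
         * also evicted from the page cache. */
        (void) posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
    #endif
    }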