Re: [HACKERS] Patch for fail-back without fresh backup
On Tuesday, June 18, 2013 12:18 AM Sawada Masahiko wrote:
> On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila wrote:
> > On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote:
> > On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila wrote:
> >> On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
> >> On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila wrote:
> >>> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote: Hello,
> >> We have already started a discussion on pgsql-hackers for the problem of taking a fresh backup during the failback operation; here is the link for that:
> >> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbjgwrfu513...@mail.gmail.com
> >> Let me again summarize the problem we are trying to address.
> > How will you take care of extra WAL on the old master during recovery? If it plays WAL which has not reached the new master, it can be a problem.
> you mean that it is possible that the old master's data is ahead of the new master's data.
> >>> What I mean to say is that the WAL of the old master can be ahead of the new master. I understood that data files of the old master can't be ahead, but I think the WAL can be ahead.
> so there is inconsistent data between those servers when failing back, right? if so, there is no possible inconsistency, because if you use the GUC option as he proposes (i.e., failback_safe_standby_mode = remote_flush), then while the old master is working fine, no file-system-level changes are done before the WAL is replicated.
> >>> Would the proposed patch take care that the old master's WAL is also not ahead in some way? If yes, I think I am missing some point.
> >> yes, it will happen that the old master's WAL is ahead of the new master's WAL, as you said. but I think that we can solve that by deleting all WAL files when the old master starts as the new standby.
> > I think ideally it should reset the WAL location at the point where the new master has forked off. In such a scenario it would be difficult for a user who wants to get a dump of some data in the old master which hasn't gone to the new master. I am not sure if such a need is there for real users, but if it is there, then providing this solution will have some drawbacks.
> I think that we can dump the data before deleting all the WAL files. The deletion of all WAL files is done when the old master starts as the new standby.

Can we dump data without starting the server?

With Regards,
Amit Kapila.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SET work_mem = '1TB';
On Tuesday, May 21, 2013, Simon Riggs wrote:
> I worked up a small patch to support Terabyte setting for memory.
> Which is OK, but it only works for 1TB, not for 2TB or above.

I've incorporated my review into a new version, attached. Added "TB" to the docs, added the macro KB_PER_TB, and made "show" print "1TB" rather than "1024GB".

I tested several of the memory settings to see that they can be set and retrieved. I haven't tested actual execution as I don't have that kind of RAM. I don't see how it could have a performance impact, it passes make check etc., and I don't think it warrants a new regression test.

I'll set it to ready for committer.

Cheers,

Jeff

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index c7d84b5..940ed6e
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 39,45
      For convenience, a different unit can also be specified explicitly.
      Valid memory units are kB (kilobytes), MB
!     (megabytes), and GB (gigabytes); valid time units
      are ms (milliseconds), s (seconds), min (minutes),
      h (hours), and d (days).  Note that the multiplier
--- 39,45
      For convenience, a different unit can also be specified explicitly.
      Valid memory units are kB (kilobytes), MB
!     (megabytes), GB (gigabytes), and TB (terabytes); valid time units
      are ms (milliseconds), s (seconds), min (minutes),
      h (hours), and d (days).  Note that the multiplier
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index ea16c64..0e5b0c9
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***
*** 105,110
--- 105,111
  #define KB_PER_MB (1024)
  #define KB_PER_GB (1024*1024)
+ #define KB_PER_TB (1024*1024*1024)

  #define MS_PER_S 1000
  #define S_PER_MIN 60
*** parse_int(const char *value, int *result
*** 4837,4843
  {
      /* Set hint for use if no match or trailing garbage */
      if (hintmsg)
!         *hintmsg = gettext_noop("Valid units for this parameter are \"kB\", \"MB\", and \"GB\".");

  #if BLCKSZ < 1024 || BLCKSZ > (1024*1024)
  #error BLCKSZ must be between 1KB and 1MB
--- 4838,4844
  {
      /* Set hint for use if no match or trailing garbage */
      if (hintmsg)
!         *hintmsg = gettext_noop("Valid units for this parameter are \"kB\", \"MB\", \"GB\" and \"TB\".");

  #if BLCKSZ < 1024 || BLCKSZ > (1024*1024)
  #error BLCKSZ must be between 1KB and 1MB
*** parse_int(const char *value, int *result
*** 4891,4896
--- 4892,4913
              break;
          }
      }
+     else if (strncmp(endptr, "TB", 2) == 0)
+     {
+         endptr += 2;
+         switch (flags & GUC_UNIT_MEMORY)
+         {
+             case GUC_UNIT_KB:
+                 val *= KB_PER_TB;
+                 break;
+             case GUC_UNIT_BLOCKS:
+                 val *= KB_PER_TB / (BLCKSZ / 1024);
+                 break;
+             case GUC_UNIT_XBLOCKS:
+                 val *= KB_PER_TB / (XLOG_BLCKSZ / 1024);
+                 break;
+         }
+     }
  }
  else if (flags & GUC_UNIT_TIME)
  {
*** _ShowOption(struct config_generic * reco
*** 7384,7390
              break;
          }
!         if (result % KB_PER_GB == 0)
          {
              result /= KB_PER_GB;
              unit = "GB";
--- 7401,7412
              break;
          }
!         if (result % KB_PER_TB == 0)
!         {
!             result /= KB_PER_TB;
!             unit = "TB";
!         }
!         else if (result % KB_PER_GB == 0)
          {
              result /= KB_PER_GB;
              unit = "GB";

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
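For readers following the thread, the behavior under review can be exercised with a couple of ordinary SQL commands once the patch is applied (a minimal sketch; the displayed value wording is as described in the review above):

    SET work_mem = '1TB';
    SHOW work_mem;   -- with the patch this reports "1TB" rather than "1024GB"
    RESET work_mem;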
[HACKERS] spurious wrap-around shutdown
On Sun, Jun 16, 2013 at 11:54 AM, Jeff Janes wrote:
> In 9.3 HEAD I am getting what seems to be spurious wrap-around shutdowns.
>
> postgres=# SELECT datname, datfrozenxid, age(datfrozenxid) FROM pg_database;
>
>   datname  | datfrozenxid |    age
>  ----------+--------------+-----------
>  template1 |   2621759843 |         0
>  template0 |   2621759843 |         0
>  postgres  |   2571759843 |      5000
>  jjanes    |   2437230921 | 184528922

While the behavior is weird, it is not a regression (also present in 9.2 with suitable changes in timing) and the shutdown is not spurious. If I execute the above query immediately after the shutdown, I see what I would expect, jjanes has an age of about 2^31. The one table that is holding everything back is already getting autovac for wraparound at that point, and eventually that vacuum finishes. When done, pg_class and pg_database are updated (I don't know how they get updated without trying to assign another transaction), and then I get the above query results.

I would think the database would re-allow new transactions at this point, but it does not. I don't know why. Since this isn't a regression in 9.3, I probably won't pursue it any more at this time, unless encouraged to.

Cheers,

Jeff
Re: [HACKERS] How do we track backpatches?
On Mon, 2013-06-17 at 17:11 -0700, Josh Berkus wrote: > Contributors, > > While going through this mailing list looking for missing 9.4 patches, I > realized that we don't track backpatches (that is, fixes to prior > versions) at all anywhere. Where backpatches are submitted by > committers this isn't an issue, but we had a couple major ones (like the > autovacuum fix) which were submitted by general contributors. The same > goes for beta fixes. > > Should we add a "prior version" category to the CF to make sure these > don't get dropped? Or do something else? A separate commit fest for tracking proposed backpatches might be useful. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic background workers
On Fri, 2013-06-14 at 17:00 -0400, Robert Haas wrote: > Alvaro's work on 9.3, we now have the ability to configure background > workers via shared_preload_libraries. But if you don't have the right > library loaded at startup time, and subsequently wish to add a > background worker while the server is running, you are out of luck. We could tweak shared_preload_libraries so that it reacts sensibly to reloads. I basically gave up on that by writing session_preload_libraries, but if there is more general use for that, we could try. (That doesn't invalidate your work, but it's a thought.) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch to add support of "IF NOT EXISTS" to others "CREATE" statements
On Wed, 2013-06-12 at 16:31 -0300, Fabrízio de Royes Mello wrote:
> > Btw., I also want REPLACE BUT DO NOT CREATE.
>
> Can you explain more about it?

Replace/alter the object if it already exists, but fail if it does not exist. The complete set of variants is:

- object does not exist:
  - proceed (normal CREATE)
  - error (my above description)
- object exists:
  - replace (CREATE OR REPLACE)
  - skip (CREATE IF NOT EXISTS)
  - error (normal CREATE)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
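For concreteness, the two variants that already exist look like the following (plain SQL; the object names are invented for illustration). The "replace but do not create" case discussed above has no syntax today and would need something new:

    CREATE TABLE IF NOT EXISTS counters (id int PRIMARY KEY, value bigint);    -- object exists: skip
    CREATE OR REPLACE VIEW counter_values AS SELECT id, value FROM counters;   -- object exists: replace; otherwise create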
Re: [HACKERS] Add regression tests for SET xxx
Thanks! PFA the updated patch. It also removes a trailing whitespace at the end of the SQL script. -- Robins Tharakan On 17 June 2013 17:29, Szymon Guz wrote: > On 26 May 2013 19:56, Robins Tharakan wrote: > >> Hi, >> >> Please find attached a patch to take code coverage of SET (SESSION / SEED >> / TRANSACTION / DATESTYLE / TIME ZONE) (src/backend/commands/variable.c) >> from 65% to 82%. >> >> Any and all feedback is welcome. >> -- >> Robins Tharakan >> > Hi, > the patch applies cleanly to code from trunk, however there are failing > tests, diff attached. > > regards > Szymon regress_variable_v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
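For context, the code paths in src/backend/commands/variable.c that such tests exercise are reached by ordinary settings commands like the ones below (an illustrative sample only; the statements in the actual patch may differ):

    SET SESSION datestyle = 'ISO, DMY';
    SET TIME ZONE 'UTC';
    SET seed = 0.5;
    BEGIN;
    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    COMMIT;
    RESET datestyle;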
[HACKERS] How do we track backpatches?
Contributors, While going through this mailing list looking for missing 9.4 patches, I realized that we don't track backpatches (that is, fixes to prior versions) at all anywhere. Where backpatches are submitted by committers this isn't an issue, but we had a couple major ones (like the autovacuum fix) which were submitted by general contributors. The same goes for beta fixes. Should we add a "prior version" category to the CF to make sure these don't get dropped? Or do something else? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [9.4 CF 1] Added in missing patches
On Tue, Jun 18, 2013 at 9:00 AM, Josh Berkus wrote: > >> Just wondering, how many patches did you add? 8? I saw a total of 98 >> patches a couple of days ago, now up to 106. > > Then it must be 8. That sounds a about right. Mind you, I immediately > marked 2 as already committed. Just did the same with the patch about removal of pageinspect.sql, it has already been committed by Heikki. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [9.4 CF 1] Added in missing patches
> Just wondering, how many patches did you add? 8? I saw a total of 98 > patches a couple of days ago, now up to 106. Then it must be 8. That sounds about right. Mind you, I immediately marked 2 as already committed. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [9.4 CF 1] Added in missing patches
On Tue, Jun 18, 2013 at 7:41 AM, Josh Berkus wrote: > Folks, > > At this stage, all of the patches which where not already added into CF1 > should be there. So look carefully and make sure *all* of your patches > are there. > > Amusingly, it's not the new submitters who forgot to add their patch to > the CF, but rather experienced contributors, and even a committer. > > In some cases, I may have accidentally added a patch whose status is > already resolved; if so, that's because I couldn't determine the > resolution status. Better to add some unnecessary patch entries than to > miss a patch. This is great. Thanks! Just wondering, how many patches did you add? 8? I saw a total of 98 patches a couple of days ago, now up to 106. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [9.4 CF 1] What the 5-day Deadline Means
Hackers, I got a question on RRR which I thought should be addressed on this list. Basically, the questioner asked me "I have a day job, I can't promise to review all of these patches in 5 days". The answer is: only put your name down on patches which you *can* review in the next 5 days. Don't claim every patch you might want to review sometime before July 15th; instead, claim patch reviews as you get to actually working on them. Of course, sometimes life gets in the way. At which point, I ask people to *remove* their own names as reviewers as soon as possible when they know they won't have time to review something. It doesn't benefit the CF process to have names down of people who aren't actually doing reviews. Thanks! -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic background workers
BTW, one of the ideas that popped up in the unConference session on replication was "why couldn't we use a background worker as a replication agent?" The main reason pointed out was 'because that means you have to restart the postmaster to add a replication agent.' (e.g. - like a Slony "slon" process) There may well be other better reasons not to do so, but it would be nice to eliminate this reason. It seems seriously limiting to the bg-worker concept for them to be thus restricted. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
[HACKERS] [9.4 CF 1] Added in missing patches
Folks, At this stage, all of the patches which were not already added into CF1 should be there. So look carefully and make sure *all* of your patches are there. Amusingly, it's not the new submitters who forgot to add their patch to the CF, but rather experienced contributors, and even a committer. In some cases, I may have accidentally added a patch whose status is already resolved; if so, that's because I couldn't determine the resolution status. Better to add some unnecessary patch entries than to miss a patch. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Add transforms feature
On 06/18/2013 04:58 AM, Peter Eisentraut wrote: > On 6/14/13 11:48 PM, Craig Ringer wrote: >> I wonder if that should be extended to install headers for hstore, >> ltree, and while we're at it, intarray as well? > Sure, if someone wants to go through and check which headers are > independently usable, and do the necessary cleanups where necessary. I can do that if there are no objections. It's only tangential to this work really, so I'll post a separate thread when I get on to it. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Batch API for After Triggers
On 06/18/2013 01:25 AM, Pavel Stehule wrote: >> > and also one called >> > UPDATED >> > which would have two row vars called OLD and NEW >> > so you would access it like e.g. IF UPDATED.OLD.id = 7 >> > > nice idea > > +1 Much better naming than OLD_AND_NEW. I'm not so sure about OLD / NEW / INSERTED / DELETED, in that I imagine we'd want to pick one pair and stick with it. Since using "INSERTED" / "DELETED" makes "UPDATED" make sense, and since "OLD" and "NEW" are already used to refer to the magic variables of those names in for each row triggers, I think INSERTED / UPDATED / DELETED is the way to go. INSERTED and DELETED could just be views of the same data as UPDATED that show only the OLD or only the NEW composite type fields. That'd allow you to write a trigger without TG_OP tests in many cases, as UPDATED would always contain what you wanted. It seems slightly weird to have INSERTED and DELETED populated for an UPDATE, but when an UPDATE is logically an INSERT+DELETE anyway... -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] Minmax indexes
Hi! This sounds really interesting, so a few quick comments.

On 15.6.2013 00:28, Alvaro Herrera wrote:
> In each index tuple (corresponding to one page range), we store: - first block this tuple applies to - last block this tuple applies to - for each indexed column: * min() value across all tuples in the range * max() value across all tuples in the range * nulls present in any tuple?

What about adding a counter of how many times the min/max value is present in the range? The other messages in this thread suggest that the refresh after UPDATE/DELETE is one of the concerns - as Greg Stark mentioned, the range invalidation may only happen when running DELETE on a row matching the min/max value. I believe having a counter would be an improvement - a refresh would be needed only if it reaches 0.

> Summarization - At index creation time, the whole table is scanned; for each page range the min() and max() values and nulls bitmap are collected, and stored in the index. The possibly-incomplete range at the end of the table is also included.

Would it make sense to do this using an existing (b-tree) index for very large tables? Clearly it doesn't make sense to create a b-tree index and then minmax on the same column, but for very large databases upgraded using pg_upgrade this might be interesting.

> Once in a while, it is necessary to summarize a bunch of unsummarized pages (because the table has grown since the index was created), or re-summarize a range that has been marked invalid. This is simple: scan the page range calculating the min() and max() for each indexed column, then insert the new index entry at the end of the index. The main interesting questions are:
>
> a) when to do it The perfect time to do it is as soon as a complete page range of the configured range size has been filled (assuming page ranges are constant size).
>
> b) who does it (what process) It doesn't seem a good idea to have a client-connected process do it; it would incur unwanted latency. Three other options are (i) to spawn a specialized process to do it, which perhaps can be signalled by a client-connected process that executes a scan and notices the need to run summarization; or (ii) to let autovacuum do it, as a separate new maintenance task. This seems simple enough to bolt on top of already existing autovacuum infrastructure. The timing constraints of autovacuum might be undesirable, though. (iii) wait for user command.
>
> The easiest way to go around this seems to have vacuum do it. That way we can simply do re-summarization on the amvacuumcleanup routine. Other answers would mean we need a separate AM routine, which appears unwarranted at this stage.

I don't think this should be added to the autovacuum daemon. It's quite tricky to tune autovacuum to be aggressive just enough, i.e. not to run too frequently / not to lag. I'm afraid this would add a task consuming unpredictable amounts of time, which would make the autovacuum tuning even trickier. I can live with non-summarized minmax indexes, but I can't live with non-vacuumed tables.

> Open questions --
>
> * Same-size page ranges? Current related literature seems to consider that each "index entry" in a minmax index must cover the same number of pages. There doesn't seem to be a hard reason for this to be so; it might make sense to allow the index to self-tune so that some index entries cover smaller page ranges, if this allows the min()/max() values to be more compact. This would incur larger space overhead for the index itself, but might allow better pruning of page ranges during scan. In the limit of one index tuple per page, the index itself would occupy too much space, even though we would be able to skip reading most heap pages, because the min()/max() ranges are tight; in the opposite limit of a single tuple that summarizes the whole table, we wouldn't be able to prune anything from the seqscan even though the index is very small.

I see no particular reason not to allow that, and having a variable range size would be a very nice feature IMHO. Do you have an idea on how the self-tuning might work?

I was thinking about something like this:

(1) estimate the initial range size, trying to keep good "selectivity" while not having very small ranges, using (a) average row length, (b) number of distinct values in the column, (c) MCV/histogram/correlation

(2) once the index is built, run a "merge" on the ranges - merging the adjacent pages if the resulting selectivity is not significantly worse (again, this might be evaluated using MCV, histogram)

But it should be possible to override this (initial range size, max range size) or disable the heuristics completely, as having large ranges probably makes the resummarizing more expensive. Although having the counter might fix that, as it's more likely there's another row with the min/max value.

BTW
Re: [HACKERS] [PATCH] Add transforms feature
Peter Eisentraut wrote: > A transform is an SQL object that supplies two functions for converting > between data types and procedural languages. For example, a transform > could arrange that hstore is converted to an appropriate hash or > dictionary object in PL/Perl or PL/Python. > > Externally visible changes: This is a large patch. Do you intend to push the whole thing as a single commit, or split it? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Add transforms feature
On 6/14/13 11:48 PM, Craig Ringer wrote: > I wonder if that should be extended to install headers for hstore, > ltree, and while we're at it, intarray as well? Sure, if someone wants to go through and check which headers are independently usable, and do the necessary cleanups where necessary. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On 2013-06-17 13:46:07 -0700, Josh Berkus wrote: > On 06/17/2013 01:40 PM, Alvaro Herrera wrote: > > Andres Freund wrote: > > > >> PS: Josh, minor thing, but could you please not trim the CC list, at > >> least when I am on it? > > > > Yes, it's annoying. > > I also get private comments from people who don't want me to cc them > when they are already on the list. I can't satisfy everyone. Given that nobody but you trims the CC list I don't find that a convincing argument. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On 06/17/2013 01:40 PM, Alvaro Herrera wrote: > Andres Freund wrote: > >> PS: Josh, minor thing, but could you please not trim the CC list, at >> least when I am on it? > > Yes, it's annoying. I also get private comments from people who don't want me to cc them when they are already on the list. I can't satisfy everyone. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for REINDEX CONCURRENTLY
Andres Freund wrote: > PS: Josh, minor thing, but could you please not trim the CC list, at > least when I am on it? Yes, it's annoying. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] Minmax indexes
>> This begins to sound like these indexes are only useful on append-only >> tables. Not that there aren't plenty of those, but ... > > But what? ... "but" the other comments further down in my email. Also, my successive comments in other emails. >> Why? Why can't we just update the affected pages in the index? > > The page range has to be scanned in order to find out the min/max values > for the indexed columns on the range; and then, with these data, update > the index. Seems like you could incrementally update the range, at least for inserts. If you insert a row which doesn't decrease the min or increase the max, you can ignore it, and if it does increase/decrease, you can change the min/max. No? For updates, things are more complicated. If the row you're updating was the min/max, in theory you should update it to adjust that, but you can't verify that it was the ONLY min/max row without doing a full scan. My suggestion would be to add a "dirty" flag which would indicate that that block could use a rescan next VACUUM, and otherwise ignore changing the min/max. After all, the only defect of having min too low or max too high for a block would be scanning too many blocks. Which you'd do anyway with it marked "invalid". > This is not a requirement. It merely makes the index more effective. Right. So I'm saying let's do this index without the FSM modifications, and then consider those as their own, separate patch, if we even do them. > Eh? Sure, my intention for this reloption is for the user to be able to > state their intention for the table, and each feature that has > append-only table optimization does its thing. I wasn't thinking of > anything automatic. 99.7% of our users have no idea what to do with reloptions. We'd have to expose it with an ALTER TABLE SET append_only=true. >> Also, I hate the name ... > > Feel free to propose other names; that way I can hate your proposals > too (or maybe not). Well, my first thought was "block-range indexing", which I think is the best description, but that isn't exactly an exciting name for a feature which will likely be worthy of short-listing for 9.4. I'd prefer it over minmax, which users will think only works on aggregates, but it's still not a great name. "Summary Index" also comes to mind, but really isn't a lot more exciting. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] Minmax indexes
Tom Lane wrote: > We've talked a lot about index-organized tables in the past. How much > of the use case for this would be subsumed by a feature like that? IOTs are not flexible enough. You can only have one index that you index-organize the table on; and you can search only based on a prefix of the index key. If you want to change the key, ... um. I don't even know what you'd do. With minmax indexes, on the other hand, you can create one or several, and they let you scan the table based on any of the indexed columns. So you can create a minmax index on creation_date, insertion_date, ship_date; and have it serve queries that use any of these columns. (You probably don't add key column(s) to the minmax index because you already have btrees on them.) On the other hand, IOTs are expensive to insert into. For each tuple to insert you need to start from the root and descend the tree, insert your tuple, then propagate splits upwards. If you have a 10 TB table, you cannot afford to have to do all that for each and every tuple. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] Minmax indexes
On 2013-06-17 16:23:40 -0400, Alvaro Herrera wrote: > > > Re-summarization is relatively expensive, because the complete page range > > > has > > > to be scanned. > > > > Why? Why can't we just update the affected pages in the index? > > The page range has to be scanned in order to find out the min/max values > for the indexed columns on the range; and then, with these data, update > the index. Why? Assuming the initial summarization has been performed, you can check the current min/max value, check whether it's still smaller/bigger than the value you're inserting, and if not update the index accordingly with the new value. You don't even need to wait for the new value to become visible since the ranges don't need to be minimal. I think the contention this possibly causes may be a better argument, but I am not sure how much of a problem that really becomes if we find a deadlock-free way to only lock the minmax pages exclusively if the range is violated. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] Minmax indexes
Josh Berkus wrote: > > Value changes in columns that are part of a minmax index, and tuple > > insertion > > in summarized pages, would invalidate the stored min/max values. To support > > this, each minmax index has a validity map; a range can only be considered > > in a > > scan if it hasn't been invalidated by such changes (A range "not > > considered" in > > the scan needs to be returned in whole regardless of the stored min/max > > values, > > that is, it cannot be pruned per query quals). The validity map is very > > similar to the visibility map in terms of performance characteristics: quick > > enough that it's not contentious, allowing updates and insertions to proceed > > even when data values violate the minmax index conditions. An invalidated > > range can be made valid by re-summarization (see below). > > This begins to sound like these indexes are only useful on append-only > tables. Not that there aren't plenty of those, but ... But what? This is a useful use-case; one that's not served by any other type of index. Sure, you can have btrees over append-only tables, but they are really large and have huge maintenance costs. A smaller lossy index is useful if it is low-maintenance and keeps the cost of scanning the table low enough. These indexes can be considered a sort of partitioning of a large table. Only instead of having to define CHECK (insert_date >= 'a month') for each partition, you just create the index on the insert_date column, and the index will allow a seqscan of the table to skip pages that don't match the query's WHERE clause on that column. > > Re-summarization is relatively expensive, because the complete page range > > has > > to be scanned. > > Why? Why can't we just update the affected pages in the index? The page range has to be scanned in order to find out the min/max values for the indexed columns on the range; and then, with these data, update the index. > > To avoid this, a table having a minmax index would be > > configured so that inserts only go to the page(s) at the end of the table; > > this > > avoids frequent invalidation of ranges in the middle of the table. We > > provide > > a table reloption that tweaks the FSM behavior, so that summarized pages are > > not candidates for insertion. > > We haven't had an index type which modifies table insertion behavior > before, and I'm not keen to start now; imagine having two indexes on the > same table each with their own, conflicting, requirements. This is not a requirement. It merely makes the index more effective. > If we're going to start adding reloptions for specific table behavior, > I'd rather think of all of the optimizations we might have for a > prospective "append-only table" and bundle those, rather than tying it > to whether a certain index exists or not. Eh? Sure, my intention for this reloption is for the user to be able to state their intention for the table, and each feature that has append-only table optimization does its thing. I wasn't thinking of anything automatic. > Also, I hate the name ... Feel free to propose other names; that way I can hate your proposals too (or maybe not). -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] Minmax indexes
Greg Stark wrote: > On Fri, Jun 14, 2013 at 11:28 PM, Alvaro Herrera > wrote: > > Re-summarization is relatively expensive, because the complete page range > > has > > to be scanned. > > That doesn't sound too bad to me. It just means there's a downside to > having larger page ranges. I would expect the page ranges to be > something in the ballpark of 32 pages -- scanning 32 pages to > resummarize doesn't sound that painful but sounds like it's large > enough that the resulting index would be a reasonable size. Actually, I'm thinking that a range is more like, say, 1280 pages, or 10 MB. My goal is to consider tables in the 10 TB magnitude; if I store one index tuple for every 32 pages, I would end up with too large an index. With 10 MBs per range I can index the whole table with an index of 50 MB, which seems reasonable to scan. But anyway my intention is that page range size is configurable. > But I don't understand why an insert would invalid a tuple. An insert > can just update the min and max incrementally. It's a delete that > invalidates the range but as you note it doesn't really invalidate it, > just mark it as needing a refresh -- and even then only if the value > being deleted is equal to either the min or max. No, I don't intend to update the index tuple with each heap insert. I think this will end up being too slow. The validity map is there to hold a single bit for each page saying whether the page range is known valid or not; one insert into any page in the range invalidates the range (and any scan of the table needs to scan that range as a whole). This way I can wait until the storm of inserts has passed from a range, and only then do the summarization for that range. This avoids having to summarize the range over and over. Alternatively, I could consider the index tuple always valid, and update it online as soon as we do an insert or update (i.e. examine the min/max values in the index tuple, and update it to match if the new value is out of bounds). This seems too slow, so I won't try. Also, a delete does not invalidate a range either. As Simon said elsewhere in a thread, if the range is not minimal, this is not a problem; we might have to scan some more ranges than absolutely necessary, but it's not a correctness problem. The heap scan recheck node will get rid of the unwanted tuples anyway. > > Same-size page ranges? > > Current related literature seems to consider that each "index entry" in a > > minmax index must cover the same number of pages. There doesn't seem to be > > a > > I assume the reason for this in the literature is the need to quickly > find the summary for a given page when you're handling an insert or > delete. Yeah, that's one thing to keep in mind. I haven't gotten too much into this; I only added these two entries at the end for my own future reference, because I will need to consider them at some point. Right now my intention is to have each index have a fixed page range size, which is defined at index creation time. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
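For anyone wanting to sanity-check the sizing claim above, the arithmetic behind "10 TB table, index of about 50 MB" can be reproduced with a throwaway query (a rough sketch: the 8 kB block size is the default, the 1280-page range size is from the message above, and the ~48 bytes per range entry is an assumed figure, not something taken from the patch):

    SELECT heap_pages,
           heap_pages / 1280 AS page_ranges,
           pg_size_pretty((heap_pages / 1280 * 48)::bigint) AS approx_index_size
    FROM (SELECT (10::bigint * 1024 * 1024 * 1024 * 1024) / 8192 AS heap_pages) s;

With those assumptions this comes out to roughly 1.3 billion heap pages, about one million page ranges, and an index in the tens of megabytes, which matches the ballpark given in the message.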
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On 2013-06-17 12:52:36 -0700, Josh Berkus wrote: > > > Well, it can do lots stuff that DROP/CREATE CONCURRENTLY can't: > > * reindex primary keys > > * reindex keys referenced by foreign keys > > * reindex exclusion constraints > > * reindex toast tables > > * do all that for a whole database > > so I don't think that comparison is fair. Having it would have made > > several previous point releases far less painful (e.g. 9.1.6/9.2.1). > > FWIW, I have a client who needs this implementation enough that we're > backporting it to 9.1 for them. Wait. What? Unless you break catalog compatibility that's not safely possible using this implementation. Greetings, Andres Freund PS: Josh, minor thing, but could you please not trim the CC list, at least when I am on it? -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Batch API for After Triggers
Simon Riggs wrote: > On 9 June 2013 12:58, Craig Ringer wrote: >> We don't currently have OLD and NEW relations so we're free to >> define how this works pretty freely. > I think the best way, if we did do this, would be to have a > number of different relations defined: > > OLD > NEW > INSERTED > DELETED > all of which would be defined same as main table > > and also one called > UPDATED > which would have two row vars called OLD and NEW > so you would access it like e.g. IF UPDATED.OLD.id = 7 Well, there is the SQL standard, which has a couple paragraphs on the topic which we might want to heed. For a delete there is just an old table; for an insert just a new one. For an update you have both, with the same cardinality. The rows in the old and new tables have a correspondence, but that is only visible to FOR EACH ROW triggers. For something like RI, why would you need to establish correspondence? A row with the referenced key either exists after the statement completes, or it doesn't -- why would we care whether it is an updated version of the same row? Syntax for how to refer to the these is defined by the standard. As usual, I don't object to adding capabilities as long as the standard syntax is also supported with standard semantics. -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
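For reference, the standard's spelling of these transition tables hangs off the trigger definition's REFERENCING clause, roughly as sketched below in PostgreSQL-style trigger syntax (PostgreSQL did not accept the REFERENCING clause at the time of this thread, and the trigger, table, and function names are made up):

    CREATE TRIGGER check_totals
        AFTER UPDATE ON orders
        REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
        FOR EACH STATEMENT
        EXECUTE PROCEDURE check_totals_fn();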
Re: [HACKERS] Support for REINDEX CONCURRENTLY
> Well, it can do lots stuff that DROP/CREATE CONCURRENTLY can't: > * reindex primary keys > * reindex keys referenced by foreign keys > * reindex exclusion constraints > * reindex toast tables > * do all that for a whole database > so I don't think that comparison is fair. Having it would have made > several previous point releases far less painful (e.g. 9.1.6/9.2.1). FWIW, I have a client who needs this implementation enough that we're backporting it to 9.1 for them. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] refresh materialized view concurrently
On 06/17/2013 04:13 AM, Heikki Linnakangas wrote: > On 14.06.2013 19:05, Kevin Grittner wrote: >> Attached is a patch for REFRESH MATERIALIZED VIEW CONCURRENTLY for >> 9.4 CF1. The goal of this patch is to allow a refresh without >> interfering with concurrent reads, using transactional semantics. >> >> It is my hope to get this committed during this CF to allow me to >> focus on incremental maintenance for the rest of the release cycle. > > I must say this seems a bit pointless on its own. But if it's a stepping > stone to incremental maintenance, I have no objections. Actually, CONCURRENTLY was cited as the #1 deficiency for the matview feature for 9.3. With it, the use-case for Postgres matviews is broadened considerably. So it's very valuable on its own, even if (for example) INCREMENTAL didn't get done for 9.3. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] Minmax indexes
> So there isn't a "fall down" thing here. We expect the recently > loaded/updated data to be scanned and that's OK. > > Having the minmax index updated greedily is just adding extra work for > fast diminishing returns. We can always add that later if really > needed, but I doubt it will be needed - in just the same way as mat > views aren't greedily updated. Ok, in that case, can we add the patch without messing with the FSM logic? It'll work out-of-the-box for append-only tables, and that's a pretty solid use case. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch for fail-back without fresh backup
On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila wrote: > On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote: > On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila wrote: >> >> On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote: >> On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila wrote: >>> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote: Hello, >>> >> We have already started a discussion on pgsql-hackers for the problem of taking a fresh backup during the failback operation; here is the link for that: >> >> >> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbjgwrfu513...@mail.gmail.com >> >> Let me again summarize the problem we are trying to address. >> >> > How will you take care of extra WAL on the old master during recovery? If it plays WAL which has not reached the new master, it can be a problem. >> you mean that it is possible that the old master's data is ahead of the new master's data. >> >>> What I mean to say is that the WAL of the old master can be ahead of the new master. I understood that >>> data files of the old master can't be ahead, but I think the WAL can be ahead. >> so there is inconsistent data between those servers when failing back, right? if so, there is no possible inconsistency, because if you use the GUC option as he proposes (i.e., failback_safe_standby_mode = remote_flush), then while the old master is working fine, no file-system-level changes are done before the WAL is replicated. >> >>> Would the proposed patch take care that the old master's WAL is also not ahead in some way? >>> If yes, I think I am missing some point. > >> yes, it will happen that the old master's WAL is ahead of the new master's WAL, as you said. >> but I think that we can solve that by deleting all WAL files when the old >> master starts as the new standby. > > I think ideally it should reset the WAL location at the point where the new master has forked off. > In such a scenario it would be difficult for a user who wants to get a dump of some data in > the old master which hasn't gone to the new master. I am not sure if such a need is there for real users, but if it > is there, then providing this solution will have some drawbacks. I think that we can dump the data before deleting all the WAL files. The deletion of all WAL files is done when the old master starts as the new standby. Regards, --- Sawada Masahiko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [9.4 CF 1] Commit Fest has started
Amit, > I am interested in assisting you for this CF. > Kindly let me know how can I add value for CommitFest management. Thank you for the offer! However, you're currently signed up to review several patches, and I'd rather have you doing that than sending out reminder emails. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 3: ordering in index
On 17.06.2013 15:56, Alexander Korotkov wrote:
> On Sat, Jun 15, 2013 at 3:02 AM, Alexander Korotkov wrote:
>> This patch introduces a new interface method of GIN which takes the same arguments as consistent but returns float8.
>>
>> float8 gin_ordering(bool check[], StrategyNumber n, Datum query, int32 nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool nullFlags[], Datum addInfo[], bool addInfoIsNull[])
>>
>> This patch implements the gingettuple method which can return ordering data using the KNN infrastructure. Also it introduces the >< operator for fts which supports ordering in a GIN index. Some example:
>>
>> postgres=# explain analyze select * from dblp_titles2 where tsvector @@ to_tsquery('english', 'statistics') order by tsvector >< to_tsquery('english', 'statistics') limit 10;
>>                                  QUERY PLAN
>> ------------------------------------------------------------------------
>>  Limit  (cost=12.00..48.22 rows=10 width=136) (actual time=6.999..7.120 rows=10 loops=1)
>>    ->  Index Scan using dblp_titles2_idx on dblp_titles2  (cost=12.00..43003.03 rows=11868 width=136) (actual time=6.996..7.115 rows=10 loops=1)
>>          Index Cond: (tsvector @@ '''statist'''::tsquery)
>>          Order By: (tsvector >< '''statist'''::tsquery)
>>  Total runtime: 7.556 ms
>> (5 rows)
>
> Attached version of patch has some refactoring and bug fixes.

Thanks. There are no docs changes and not many comments; that needs to be fixed, but I think I understand how it works: on the first call to gingettuple, the index is first scanned for all the matches, which are collected in an array in memory. Then the array is sorted with qsort(), and the matches are returned one by one from the sorted array.

That has some obvious limitations. First of all, you can run out of memory. Secondly, is that really any faster than the plan you get without this patch? I.e. scan all the matches with a bitmap index scan, and sort the results on the rank function. If it is faster, why?

What parts of the previous two patches does this rely on?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On Mon, Jun 17, 2013 at 9:23 PM, Michael Paquier wrote:
> On Mon, Jun 17, 2013 at 5:23 AM, Andres Freund wrote:
>> On 2013-06-17 04:20:03 +0900, Fujii Masao wrote:
>>> On Thu, Jun 6, 2013 at 1:29 PM, Michael Paquier wrote:
>>>> Hi all,
>>>> Please find attached the latest versions of REINDEX CONCURRENTLY for the 1st commit fest of 9.4:
>>>> - 20130606_1_remove_reltoastidxid_v9.patch, removing reltoastidxid, to allow a toast relation to have multiple indexes running in parallel (extra indexes could be created by a REINDEX CONCURRENTLY being processed)
>>>> - 20130606_2_reindex_concurrently_v26.patch, correcting some comments and fixing a lock in index_concurrent_create on an index relation not released at the end of a transaction
>>> Could you let me know how this patch has something to do with the MVCC catalog access patch? Should we wait for the MVCC catalog access patch to be committed before starting to review this patch?
>> I wondered the same. The MVCC catalog patch, if applied, would make it possible to make the actual relfilenode swap concurrently instead of requiring to take access exclusive locks, which obviously is way nicer. On the other hand, that function is only a really small part of this patch, so it seems quite possible to make another pass at it before relying on mvcc catalog scans.
> As mentioned by Andres, the only thing that the MVCC catalog patch can improve here is the index swap phase (index_concurrent_swap:index.c) where the relfilenodes of the old and new indexes are exchanged. Now an AccessExclusiveLock is taken on the 2 relations being swapped; we could lower that to a ShareUpdateExclusiveLock with the MVCC catalog access, I think.
>
> Also, with the MVCC catalog patch in, we could add some isolation tests for REINDEX CONCURRENTLY (there were some tests in one of the previous versions), which is currently not possible due to the exclusive lock taken at the swap phase.
>
> Btw, those are minor things in the patch, so I think that it would be better not to wait for the MVCC catalog patch. Even if you think that it would be better to wait for it, you could even begin with the 1st patch allowing a toast relation to have multiple indexes (removal of reltoastidxid), which does not depend at all on it.

Here are the review comments for the removal_of_reltoastidxid patch. I've not completed the review yet, but I'd like to post the current comments before going to bed ;)

*** a/src/backend/catalog/system_views.sql
-            pg_stat_get_blocks_fetched(X.oid) -
-                    pg_stat_get_blocks_hit(X.oid) AS tidx_blks_read,
-            pg_stat_get_blocks_hit(X.oid) AS tidx_blks_hit
+            pg_stat_get_blocks_fetched(X.indrelid) -
+                    pg_stat_get_blocks_hit(X.indrelid) AS tidx_blks_read,
+            pg_stat_get_blocks_hit(X.indrelid) AS tidx_blks_hit

ISTM that X.indrelid indicates the TOAST table, not the TOAST index. Shouldn't we use X.indexrelid instead of X.indrelid?

You changed some SQL because of the removal of reltoastidxid. Could you check again that the original SQL and the changed one return the same value?

doc/src/sgml/diskusage.sgml
> There will be one index on the TOAST table, if present.

I'm not sure if multiple indexes on a TOAST table are viewable by a user. If they are viewable, we need to correct the above description.

doc/src/sgml/monitoring.sgml
> tidx_blks_read bigint Number of disk blocks read from this table's TOAST table index (if any)
> tidx_blks_hit bigint Number of buffer hits in this table's TOAST table index (if any)

For the same reason as the above, do we need to change "index" to "indexes" in these descriptions?

*** a/src/bin/pg_dump/pg_dump.c
+            "SELECT c.reltoastrelid, t.indexrelid "
             "FROM pg_catalog.pg_class c LEFT JOIN "
-            "pg_catalog.pg_class t ON (c.reltoastrelid = t.oid) "
-            "WHERE c.oid = '%u'::pg_catalog.oid;",
+            "pg_catalog.pg_index t ON (c.reltoastrelid = t.indrelid) "
+            "WHERE c.oid = '%u'::pg_catalog.oid AND t.indisvalid "
+            "LIMIT 1",

Is there a case where a TOAST table has more than one *valid* index? If yes, is it really okay to choose just one index by using LIMIT 1? If no, i.e., a TOAST table should have only one valid index, we should get rid of LIMIT 1 and check that only one row is returned from this query. Fortunately, ISTM this check has already been done by the subsequent call of ExecuteSqlQueryForSingleRow(). Thoughts?

Regards,

-- Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Batch API for After Triggers
2013/6/17 Simon Riggs : > On 9 June 2013 12:58, Craig Ringer wrote: >> On 06/09/2013 04:58 PM, Simon Riggs wrote: >>> There are also difficulties in semantics, since when >>> we have OLD and NEW at row level we know we are discussing the same >>> row. With sets of OLD and NEW we'd need to be able to link the >>> relations back together somehow, which couldn't be done by PK since >>> that could change. >> >> We don't currently have OLD and NEW relations so we're free to define >> how this works pretty freely. >> >> Rather than having OLD and NEW as separate relations, we could just have >> one OLD_AND_NEW relation. In that relation we exploit Pg's composite >> types to nest the old and new tuples in a single outer change record. >> >> OLD_AND_NEW would look to PL/PgSQL as if it were: >> >> CREATE TEMPORARY TABLE OLD_AND_NEW ( >> OLD tabletype NOT NULL, >> NEW tabletype NOT NULL >> ); >> >> ...though presumably without the ability to create indexes on it and the >> other things you can do to a real temp table. Though I can see cases >> where that'd be awfully handy too. >> >> For DELETE and INSERT we'd either provide different relations named OLD >> and NEW respectively, or we'd use OLD_AND_NEW with one field or the >> other blank. I'm not sure which would be best. >> >> Alternately, we could break the usual rules for relations and define OLD >> and NEW as ordered, so lock-step iteration would always return matching >> pairs of rows. That's useless in SQL since there's no way to achieve >> lock-step iteration, but if we provide a >> "for_each_changed_row('some_function'::regproc)" that scans them in >> lock-step and invokes `some_function` for each one...? (I haven't yet >> done enough in the core to have any idea if this approach is completely >> and absurdly impossible, or just ugly. Figured I'd throw it out there >> anyway.) > > > I think the best way, if we did do this, would be to have a number of > different relations defined: > > OLD > NEW > INSERTED > DELETED > all of which would be defined same as main table > > and also one called > UPDATED > which would have two row vars called OLD and NEW > so you would access it like e.g. IF UPDATED.OLD.id = 7 > nice idea +1 Pavel > -- > Simon Riggs http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] matview incremental maintenance
On 17 June 2013 15:41, Kevin Grittner wrote: > While I have yet to look in detail at the mechanism for capturing > the initial delta on the base tables, the two fairly obvious > candidates are to stuff the before and after images into a > tuplestore or temp table as base table changes are written, > somewhere around the point that triggers would be fired, or to use > the WAL stream in some way. The advantages of the former are that > it would be hard to find a lower overhead way to capture the data, > nor a more certain way to get exactly the right data. The latter, > which Simon has been arguing is better than using triggers, would > have the advantage of not directly slowing down a process writing > to base tables, although for more eager modes transactions would > need to block waiting for the data to flow through the walsender, > be filtered and assembled as data of interest, and communicated > back to the transaction somehow before it could proceed. Assuming > that it can provide the changeset prior to the commit, and that it > can include "before" images, it could work, but the timing sure > seems dubious for the more eager modes. It isn't an unconditionally true statement to say "it would be hard to find a lower overhead way to capture the data", since there is strong experimental evidence from work on replication that shows that using the WAL is a very effective mechanism for changeset extraction. There is nothing to say the changeset must occur through the WalSender. That is just where it currently occurs, but it could easily occur elsewhere, if the requirement existed. Similarly, changeset extraction doesn't currently allow access to uncommitted rows, but it could do so, if required. Before images of changes could be provided by direct access to prior versions via their tid, just as they are with triggers. There are other advantages to using WAL that you don't mention, such as the avoidance of the need for the trigger queue to spill to disk, avoidance of memory overhead for large transactions and avoidance of random I/O. ISTM that using WAL has to be properly considered as a viable option, which is why open discussion makes sense. The timing of that discussion doesn't need to be immediate but certainly it should happen before any options are precluded because of the progress of other events. Let me know when that's appropriate, so we can discuss. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib
On Monday, June 17, 2013 at 18:41:32, Alvaro Herrera wrote: > Joe Conway wrote: > > On 06/15/2013 11:28 AM, Alvaro Herrera wrote: > > > This use case seems too narrow to me to justify the burden of > > > keeping PGXS-enabled makefiles in contrib. > > > > What was the burden of it? > > Per http://www.postgresql.org/message-id/1371093408.309.5.ca...@vanquo.pezone.net : > : 1. take up space > : 2. confuse users > : 3. produce broken external extension modules that take contrib as an > : example 4. break builds of PostgreSQL when users try to fix 3. by > : exporting USE_PGXS But: 4. can be fixed (see patches I sent) so it is not an excuse. I agree on the other points. My only grief is to lose the perfect regression tests for PGXS that those contribs are. -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: 24x7 Support - Development, Expertise and Training
Re: [HACKERS] matview incremental maintenance
On 17 June 2013 15:41, Kevin Grittner wrote: > Since there seems to be interest in discussing incremental > maintenance of materialized views *now* Since your earlier complaint, I specifically said I was happy to wait to discuss that. Why have you raised this now? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib
Joe Conway wrote: > On 06/15/2013 11:28 AM, Alvaro Herrera wrote: > > This use case seems too narrow to me to justify the burden of > > keeping PGXS-enabled makefiles in contrib. > > What was the burden of it? Per http://www.postgresql.org/message-id/1371093408.309.5.ca...@vanquo.pezone.net : : 1. take up space : 2. confuse users : 3. produce broken external extension modules that take contrib as an example : 4. break builds of PostgreSQL when users try to fix 3. by exporting USE_PGXS -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib
On Jun 16, 2013, at 9:20 AM, Cédric Villemain wrote: >> Then instead of the above you'd just be able to say something like >> >> MODULETEST = test > > or REGRESSDIR ? Yeah, that sounds perfect. > Also I suggest to remove the need to set REGRESS at all, and default to all > sql files in REGRESSDIR/sql (if REGRESSDIR is set) Yeah, that would be nice. If one has different file names or something, then one should still be able to set REGRESS. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] refresh materialized view concurrently
On Mon, Jun 17, 2013 at 11:21 AM, Kevin Grittner wrote: > Nicolas Barbier wrote: >> 2013/6/17 Heikki Linnakangas : > >> >>> +errmsg("concurrent refresh requires a >>> unique index on just columns for all rows of the materialized view"))); >> >> Maybe my english is failing me here, but I don’t understand the “just” part. > > It means that the index must not use any expressions in the list of > what it's indexing on -- only column names. Suggestions for better > wording would be welcome. Random idea: ERROR: materialized view \"%s\" does not have a unique key Perhaps augmented with: HINT: Create a UNIQUE btree index with no WHERE clause on one or more columns of the materialized view. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls
On Sat, Jun 15, 2013 at 9:37 PM, Alvaro Herrera wrote: > Nicholas White escribió: > >> For the parsing changes, it seems I can either make RESPECT and IGNORE >> reserved keywords, or add a lookahead to construct synthetic RESPECT NULLS >> and IGNORE NULLS keywords. The grammar wouldn't compile if RESPECT and >> IGNORE were just normal unreserved keywords due to ambiguities after a >> function definition (e.g. select abs(1) respect; - which is currently a >> valid statement). > > Well, making them reserved keywords is not that great, so maybe the > lookahead thingy is better. However, this patch introduces the third > and fourth uses of the "save the lookahead token" pattern in the > "default" switch cases. Can we refactor that bit somehow, to avoid so > many duplicates? (For a minute I thought that Andrew Gierth's WITH > ORDINALITY patch would add another one, but it seems not.) Making things reserved keywords is painful and I don't like it, but I've started to become skeptical of shifting the problem to the lexer, too. Sometimes special case logic in the lexer about token combining can have surprising consequences in other parts of the grammar. For example, with a lexer hack, what will happen if someone has a column named RESPECT and does SELECT ... ORDER BY respect NULLS LAST? I haven't studied the code in detail so maybe it's fine, but it's something to think about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
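To make the ambiguity concrete (table and column names invented; the IGNORE NULLS form is the syntax the patch is trying to add, not something the current grammar accepts):

    -- both legal today, because RESPECT is an ordinary identifier:
    SELECT abs(1) respect;                        -- "respect" used as a bare column alias
    SELECT * FROM t ORDER BY respect NULLS LAST;  -- "respect" is a column of t

    -- the new window-function syntax the parser must also recognize:
    SELECT lag(x) IGNORE NULLS OVER (ORDER BY y) FROM t;

A lexer hack that glues RESPECT/IGNORE and NULLS into one synthetic token has to avoid misreading the ORDER BY case, which is exactly the surprise Robert is worried about.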
Re: [HACKERS] refresh materialized view concurrently
Nicolas Barbier wrote: > 2013/6/17 Heikki Linnakangas : > >> + errmsg("concurrent refresh requires a >> unique index on just columns for all rows of the materialized view"))); > > Maybe my english is failing me here, but I don’t understand the “just” part. It means that the index must not use any expressions in the list of what it's indexing on -- only column names. Suggestions for better wording would be welcome. -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] matview incremental maintenance
On 2013-06-17 16:41 +02:00, Kevin Grittner wrote: Since there seems to be interest in discussing incremental maintenance of materialized views *now*, I'm starting this thread to try to avoid polluting unrelated threads with the discussion. I don't intend to spend a lot of time on it until the CF in progress completes, but at that point the work will start in earnest. So I'll say where I'm at, and welcome anyone who has time to spare outside of the CF to comment or contribute ideas. The paper at the core of the discussion can be found by searching for "maintaining views incrementally gupta mumick subrahmanian" -- it's on both the ACM and CiteSeerX websites. Of course, one i.e. Ashish Gupta, Inderpal Singh Mumick, and V. S. Subrahmanian. 1993. Maintaining views incrementally. In Proceedings of the 1993 ACM SIGMOD international conference on Management of data (SIGMOD '93), Peter Buneman and Sushil Jajodia (Eds.). ACM, New York, NY, USA, 157-166. DOI=10.1145/170035.170066 http://doi.acm.org/10.1145/170035.170066 just in case a direct reference might come in handy :-) ... All the best, Stefan. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
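For readers who skip the paper: its counting algorithm keeps a multiplicity column alongside the view's contents and applies deltas with set arithmetic. A purely illustrative SQL sketch (mv, d, the key columns a and b, and the count_t column are all hypothetical; nothing like this exists in PostgreSQL yet):

    -- d is a delta with the same columns as mv plus count_t, which is
    -- positive for derivations added by the base-table change and
    -- negative for derivations removed
    SELECT coalesce(mv.a, d.a) AS a,
           coalesce(mv.b, d.b) AS b,
           coalesce(mv.count_t, 0) + coalesce(d.count_t, 0) AS count_t
    FROM mv FULL JOIN d ON mv.a = d.a AND mv.b = d.b
    WHERE coalesce(mv.count_t, 0) + coalesce(d.count_t, 0) > 0;

The result is the refreshed matview: rows whose count drops to zero vanish, and rows appearing for the first time come in with their delta count.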
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On 2013-06-17 11:03:35 -0400, Peter Eisentraut wrote: > On 6/17/13 9:19 AM, Andres Freund wrote: > >> Without getting rid of the AccessExclusiveLock, REINDEX CONCURRENTLY is > >> not really concurrent, at least not concurrent to the standard set by > >> CREATE and DROP INDEX CONCURRENTLY. > > > > Well, it still does the main body of work in a concurrent fashion, so I > > still don't see how that argument holds that much water. > > The reason we added DROP INDEX CONCURRENTLY is so that you don't get > stuck in a lock situation like > > long-running-transaction <- DROP INDEX <- everything else > > If we accepted REINDEX CONCURRENTLY as currently proposed, then it would > have the same problem. > > I don't think we should accept a REINDEX CONCURRENTLY implementation > that is worse in that respect than a manual CREATE INDEX CONCURRENTLY + > DROP INDEX CONCURRENTLY combination. Well, it can do lots of stuff that DROP/CREATE CONCURRENTLY can't: * reindex primary keys * reindex keys referenced by foreign keys * reindex exclusion constraints * reindex toast tables * do all that for a whole database so I don't think that comparison is fair. Having it would have made several previous point releases far less painful (e.g. 9.1.6/9.2.1). But anyway, as I said, "the argument was only whether we could continue reviewing before the mvcc stuff goes in, not whether it can get committed before.". I don't think we have a need to decide whether REINDEX CONCURRENTLY can go in with the short exclusive lock unless we find unresolvable problems with the mvcc patch. Which I very, very much hope not to be the case. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
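To illustrate why constraint indexes are on that list: the closest manual equivalent for a primary key today needs heavyweight locks of its own and only works if nothing references the constraint (names invented; the column is assumed to be NOT NULL already):

    CREATE UNIQUE INDEX CONCURRENTLY t_pkey_new ON t (id);
    ALTER TABLE t
        DROP CONSTRAINT t_pkey,    -- AccessExclusiveLock, and fails outright if foreign keys point at it
        ADD CONSTRAINT t_pkey PRIMARY KEY USING INDEX t_pkey_new;

so a brief exclusive lock during REINDEX CONCURRENTLY's swap phase is not obviously worse for these cases.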
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On 6/17/13 9:19 AM, Andres Freund wrote: >> Without getting rid of the AccessExclusiveLock, REINDEX CONCURRENTLY is >> not really concurrent, at least not concurrent to the standard set by >> CREATE and DROP INDEX CONCURRENTLY. > > Well, it still does the main body of work in a concurrent fashion, so I > still don't see how that argument holds that much water. The reason we added DROP INDEX CONCURRENTLY is so that you don't get stuck in a lock situation like long-running-transaction <- DROP INDEX <- everything else If we accepted REINDEX CONCURRENTLY as currently proposed, then it would have the same problem. I don't think we should accept a REINDEX CONCURRENTLY implementation that is worse in that respect than a manual CREATE INDEX CONCURRENTLY + DROP INDEX CONCURRENTLY combination. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
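For reference, the manual combination being held up as the standard is simply (index and table names invented):

    CREATE INDEX CONCURRENTLY some_index_new ON some_table (some_col);
    DROP INDEX CONCURRENTLY some_index;

Neither statement needs more than a ShareUpdateExclusiveLock on the table, which is the bar Peter wants REINDEX CONCURRENTLY to clear.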
[HACKERS] matview incremental maintenance
Since there seems to be interest in discussing incremental maintenance of materialized views *now*, I'm starting this thread to try to avoid polluting unrelated threads with the discussion. I don't intend to spend a lot of time on it until the CF in progress completes, but at that point the work will start in earnest. So I'll say where I'm at, and welcome anyone who has time to spare outside of the CF to comment or contribute ideas. The paper at the core of the discussion can be found by searching for "maintaining views incrementally gupta mumick subrahmanian" -- it's on both the ACM and CiteSeerX websites. Of course, one doesn't need to understand that paper to discuss techniques for capturing the base deltas, but I'm hoping that's not what takes up most of the discussion. I expect the most important discussions to be around how best to handle the "count(t)" (or count_t) column, what form should be use for intermediate results, how to modify or add execution nodes which know how to deal with the count, how to generate set operations to use those nodes, and how to modify the planner to choose the best plan for these operations. Whether to pull the deltas off the WAL stream or stuff them into a tuplestore as they are written seems to me to be a relatively minor point. If properly abstracted, the performance and complexity of alternatives can be compared. The one thing that seems somewhat clear to me at the moment is that the complex set algebra needed to use the counting algorithm for incremental maintenance is not going to be something I want to handle by dynamically building up execution nodes. That way lies madness. SPI or something very similar to it should be used, probably with a layer or two above it to simplify working with the algebra separately from diddling around with strings for the query fragments. At the developer meeting last month, we talked about the special new count column for a bit, and everyone seemed to agree that adding such an animal, and creating execution nodes which were aware of it, would best be done on top of the patch Álvaro has been working on to replace attnum with three columns: a logical ID number for each column, the physical order of the attribute within the tuple image, and the display order (for SELECT *, INSERT without a column list, and similar cases). We seemed to have consensus that the count_t column would not display by default, but could be explicitly called out by a query, similar to the current handling of system columns. Nobody wanted to have a negative column number for the count or add it to the tuple header structure. Unfortunately I have heard from Álvaro that the patch is not complete and is not on his list of things to work on in the near future. 
Long term, timings for incremental maintenance that people would like to see (from most eager to least eager) are: - as part of completing each statement, so that the affect on the matview is immediately visible to the transaction which modifies a supporting table, and becomes visible at commit to other transactions - at transaction commit time, so that other transactions see the changes to the base tables and the referencing matviews at the same point in time - from a FIFO queue which is processed by a background process whenever data is present (possibly with pacing) - from a FIFO queue based on a schedule, so that matviews are stable between applications and/or to avoid burdening the machine during peak periods - incremental update, or even full refresh, on an attempt to query a "stale" matview - explicit request to apply incremental updates or refresh Incremental maintenance of a materialized view is a heuristic, to refresh contents more quickly than might happen by re-running the query which defines the matview. There will always be cases where the changes are so extensive that applying the delta will be slower than a refresh. At some point we should have a cost-based way to recognize when we have crossed that threshold, and fall back to the refresh technique. That's not for this release, though. In previous discussion there seemed to be a consensus that before incremental maintenance for a materialized view could be turned on, the matview would need to be populated and all referenced tables would need to be flagged as generating delta information, through a new ALTER TABLE option. While I have yet to look in detail at the mechanism for capturing the initial delta on the base tables, the two fairly obvious candidates are to stuff the before and after images into a tuplestore or temp table as base table changes are written, somewhere around the point that triggers would be fired, or to use the WAL stream in some way. The advantages of the former are that it would be hard to find a lower overhead way to capture the data, nor a more certain way to get exactly the right data. The latter, which Simon has been arguing is better than using triggers, would have the advantage of n
[HACKERS] PQConnectPoll, connect(2), EWOULDBLOCK and somaxconn
Hi, When postgres on linux receives connection on a high rate client connections sometimes error out with: could not send data to server: Transport endpoint is not connected could not send startup packet: Transport endpoint is not connected To reproduce start something like on a server with sufficiently high max_connections: pgbench -h /tmp -p 5440 -T 10 -c 400 -j 400 -n -f /tmp/simplequery.sql Now that's strange since that error should happen at connect(2) time, not when sending the startup packet. Some investigation led me to fe-secure.c's PQConnectPoll: if (connect(conn->sock, addr_cur->ai_addr, addr_cur->ai_addrlen) < 0) { if (SOCK_ERRNO == EINPROGRESS || SOCK_ERRNO == EWOULDBLOCK || SOCK_ERRNO == EINTR || SOCK_ERRNO == 0) { /* * This is fine - we're in non-blocking mode, and * the connection is in progress. Tell caller to * wait for write-ready on socket. */ conn->status = CONNECTION_STARTED; return PGRES_POLLING_WRITING; } /* otherwise, trouble */ } So, we're accepting EWOULDBLOCK as a valid return value for connect(2). Which it isn't. EAGAIN in contrast is on some BSDs and on linux. Unfortunately POSIX allows those two to share the same value... My manpage tells me: EAGAIN No more free local ports or insufficient entries in the routing cache. For AF_INET see the description of /proc/sys/net/ipv4/ip_local_port_range ip(7) for information on how to increase the number of local ports. So, the problem is that we took a failed connection as having been initially successfull but in progress. Not accepting EWOULDBLOCK in the above if() results in: could not connect to server: Resource temporarily unavailable Is the server running locally and accepting connections on Unix domain socket "/tmp/.s.PGSQL.5440"? which makes more sense. Trivial patch attached. Now, the question is why we cannot complete connections on unix sockets? Some code reading reading shows net/unix/af_unix.c:unix_stream_connect() shows: if (unix_recvq_full(other)) { err = -EAGAIN; if (!timeo) goto out_unlock; So, if we're in nonblocking mode - which we are - and the receive queue is full we return EAGAIN. The receive queue for unix sockets is defined as static inline int unix_recvq_full(struct sock const *sk) { return skb_queue_len(&sk->sk_receive_queue) > sk->sk_max_ack_backlog; } Where sk_max_ack_backlog is whatever has been passed to the listen(backlog) on the listening side. Question: But postgres does listen(fd, MaxBackends * 2), how can that be a problem? Answer: If the backlog argument is greater than the value in /proc/sys/net/core/somaxconn, then it is silently truncated to that value; the default value in this file is 128. In kernels before 2.4.25, this limit was a hard coded value, SOMAXCONN, with the value 128. Setting somaxconn to something higher indeed makes the problem go away. I'd guess that pretty much the same holds true for tcp connections, although I didn't verify that which would explain some previous reports on the lists. TLDR: Increase /proc/sys/net/core/somaxconn Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services >From da5cfb7d237a4a07b146fb9d255f0de72207de10 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 17 Jun 2013 16:00:58 +0200 Subject: [PATCH] libpq: Handle connect(2) returning EAGAIN/EWOULDBLOCK correctly libpq used to accept EWOULDBLOCK - which is allowed to have the same value as EAGAIN by posix - as a valid return code to connect(2) indicating that a connection is in progress. 
While posix doesn't specify either as a valid return code, BSD based systems and linux use it to indicate temporary resource exhaustion. Accepting either as a in-progress connection attempt leads to hard to diagnose errors when sending the startup packet: could not send data to server: Transport endpoint is not connected could not send startup packet: Transport endpoint is not connected Treating it as an error results in: could not connect to server: Resource temporarily unavailable Is the server running locally and accepting connections on Unix domain socket "..."? which is more accurate. --- src/interfaces/libpq/fe-connect.c | 1 - 1 file changed, 1 deletion(-) diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 0d729c8..c17c303 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -1780,7 +1780,6 @@ keep_going: /* We will come back to here until there is addr_cur->ai_addrlen) < 0) { if (SOCK_ERRNO == EINPROGRESS || - SOCK_ERRNO == EWOULDBLOCK || SOCK_ERRNO == EINTR || SOCK_ERRNO == 0) { -- 1
Re: [HACKERS] refresh materialized view concurrently
2013/6/17 Heikki Linnakangas : > +errmsg("concurrent refresh requires a > unique index on just columns for all rows of the materialized view"))); Maybe my English is failing me here, but I don’t understand the “just” part. Nicolas -- A. Because it breaks the logical sequence of discussion. Q. Why is top posting bad? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On 2013-06-17 09:12:12 -0400, Peter Eisentraut wrote: > On 6/17/13 8:23 AM, Michael Paquier wrote: > > As mentionned by Andres, the only thing that the MVCC catalog patch can > > improve here > > is the index swap phase (index_concurrent_swap:index.c) where the > > relfilenode of the > > old and new indexes are exchanged. Now an AccessExclusiveLock is taken > > on the 2 relations > > being swap, we could leverage that to ShareUpdateExclusiveLock with the > > MVCC catalog > > access I think. > > Without getting rid of the AccessExclusiveLock, REINDEX CONCURRENTLY is > not really concurrent, at least not concurrent to the standard set by > CREATE and DROP INDEX CONCURRENTLY. Well, it still does the main body of work in a concurrent fashion, so I still don't see how that argument holds that much water. But anyway, the argument was only whether we could continue reviewing before the mvcc stuff goes in, not whether it can get committed before. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] refresh materialized view concurrently
Simon Riggs wrote: > Kevin Grittner wrote: >> Changeset extraction has nothing to do with this patch, and >> cannot possibly be useful for it. Please keep discussion which >> is completely unrelated to this patch off this thread. > You mentioned "incremental maintenance" in your original post and > I have been discussing it. Had you not mentioned it, I doubt I > would have thought of it. > > But since you did mention it, and its clearly an important issue, > it seems relevant to have discussed it here and now. What I said was that I wanted to get this out of the way before I started working on incremental maintenance. > I'm happy to wait for you to start the thread discussing it > directly though. Cool. -Kevin -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On 6/17/13 8:23 AM, Michael Paquier wrote: > As mentionned by Andres, the only thing that the MVCC catalog patch can > improve here > is the index swap phase (index_concurrent_swap:index.c) where the > relfilenode of the > old and new indexes are exchanged. Now an AccessExclusiveLock is taken > on the 2 relations > being swap, we could leverage that to ShareUpdateExclusiveLock with the > MVCC catalog > access I think. Without getting rid of the AccessExclusiveLock, REINDEX CONCURRENTLY is not really concurrent, at least not concurrent to the standard set by CREATE and DROP INDEX CONCURRENTLY. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Batch API for After Triggers
On 9 June 2013 12:58, Craig Ringer wrote: > On 06/09/2013 04:58 PM, Simon Riggs wrote: >> There are also difficulties in semantics, since when >> we have OLD and NEW at row level we know we are discussing the same >> row. With sets of OLD and NEW we'd need to be able to link the >> relations back together somehow, which couldn't be done by PK since >> that could change. > > We don't currently have OLD and NEW relations so we're free to define > how this works pretty freely. > > Rather than having OLD and NEW as separate relations, we could just have > one OLD_AND_NEW relation. In that relation we exploit Pg's composite > types to nest the old and new tuples in a single outer change record. > > OLD_AND_NEW would look to PL/PgSQL as if it were: > > CREATE TEMPORARY TABLE OLD_AND_NEW ( > OLD tabletype NOT NULL, > NEW tabletype NOT NULL > ); > > ...though presumably without the ability to create indexes on it and the > other things you can do to a real temp table. Though I can see cases > where that'd be awfully handy too. > > For DELETE and INSERT we'd either provide different relations named OLD > and NEW respectively, or we'd use OLD_AND_NEW with one field or the > other blank. I'm not sure which would be best. > > Alternately, we could break the usual rules for relations and define OLD > and NEW as ordered, so lock-step iteration would always return matching > pairs of rows. That's useless in SQL since there's no way to achieve > lock-step iteration, but if we provide a > "for_each_changed_row('some_function'::regproc)" that scans them in > lock-step and invokes `some_function` for each one...? (I haven't yet > done enough in the core to have any idea if this approach is completely > and absurdly impossible, or just ugly. Figured I'd throw it out there > anyway.) I think the best way, if we did do this, would be to have a number of different relations defined: OLD NEW INSERTED DELETED all of which would be defined same as main table and also one called UPDATED which would have two row vars called OLD and NEW so you would access it like e.g. IF UPDATED.OLD.id = 7 -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
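Under that proposal, a statement-level trigger could correlate the two images of an updated row directly, along these lines (entirely hypothetical; none of these relations exist today, this only sketches the idea):

    -- inside a statement-level AFTER UPDATE trigger function
    SELECT count(*)
    FROM UPDATED u
    WHERE (u.OLD).id IS DISTINCT FROM (u.NEW).id;   -- rows whose id changed

which sidesteps the problem of matching rows between separate OLD and NEW relations when the key itself may have changed.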
Re: [HACKERS] GIN improvements part2: fast scan
On 17.06.2013 15:55, Alexander Korotkov wrote: On Sat, Jun 15, 2013 at 2:55 AM, Alexander Korotkovwrote: attached patch implementing "fast scan" technique for GIN. This is second patch of GIN improvements, see the 1st one here: http://www.postgresql.org/message-id/capphfduxv-il7aedwpw0w5fxrwgakfxijwm63_hzujacrxn...@mail.gmail.com This patch allow to skip parts of posting trees when their scan is not necessary. In particular, it solves "frequent_term& rare_term" problem of FTS. It introduces new interface method pre_consistent which behaves like consistent, but: 1) allows false positives on input (check[]) 2) allowed to return false positives Some example: "frequent_term& rare_term" becomes pretty fast. create table test as (select to_tsvector('english', 'bbb') as v from generate_series(1,100)); insert into test (select to_tsvector('english', 'ddd') from generate_series(1,10)); create index test_idx on test using gin (v); postgres=# explain analyze select * from test where v @@ to_tsquery('english', 'bbb& ddd'); QUERY PLAN --- Bitmap Heap Scan on test (cost=942.75..7280.63 rows=5000 width=17) (actual time=0.458..0.461 rows=10 loops=1) Recheck Cond: (v @@ '''bbb''& ''ddd'''::tsquery) -> Bitmap Index Scan on test_idx (cost=0.00..941.50 rows=5000 width=0) (actual time=0.449..0.449 rows=10 loops=1) Index Cond: (v @@ '''bbb''& ''ddd'''::tsquery) Total runtime: 0.516 ms (5 rows) Attached version of patch has some refactoring and bug fixes. Good timing, I just started looking at this. I think you'll need to explain how this works. There are no docs, and almost no comments. (and this shows how poorly I understand this, but) Why does this require the "additional information" patch? What extra information do you store on-disk, in the additional information? The pre-consistent method is like the consistent method, but it allows false positives. I think that's because during the scan, before having scanned for all the keys, the gin AM doesn't yet know if the tuple contains all of the keys. So it passes the keys it doesn't yet know about as 'true' to pre-consistent. Could that be generalized, to pass a tri-state instead of a boolean for each key to the pre-consistent method? For each key, you would pass "true", "false", or "don't know". I think you could then also speed up queries like "!english & bbb". - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
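Concretely, the extra class of query Heikki is alluding to would look something like this against the test table from Alexander's example (the choice of terms is only illustrative):

    SELECT count(*) FROM test
    WHERE v @@ to_tsquery('english', 'ddd & !bbb');

i.e. a rare term combined with a negated frequent one, which he suggests a tri-state ("true / false / don't know") pre-consistent interface could also speed up.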
Re: [HACKERS] refresh materialized view concurrently
Heikki Linnakangas wrote: > On 14.06.2013 19:05, Kevin Grittner wrote: >> Attached is a patch for REFRESH MATERIALIZED VIEW CONCURRENTLY >> for 9.4 CF1. The goal of this patch is to allow a refresh >> without interfering with concurrent reads, using transactional >> semantics. >> >> It is my hope to get this committed during this CF to allow me >> to focus on incremental maintenance for the rest of the release >> cycle. > > I must say this seems a bit pointless on its own. I completely disagree. When I read what people were posting about the materialized view creation that went into 9.3, there were many comments by people that they can't use it until the materialized views can be refreshed without blocking readers. There is a clear need for this. It doesn't do much to advance incremental maintenance, but it is a much smaller patch which will make matviews usable by a lot of people who can't use the initial feature set. > I didn't understand this error message: > > + if (!foundUniqueIndex) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("concurrent refresh requires a unique index on just > columns for all rows of the materialized view"))); > + > > What does that mean? It means that the REFRESH MATERIALIZED VIEW CONCURRENTLY command cannot be used on a materialized view unless it has at least one UNIQUE index which is not partial (i.e., there is no WHERE clause) and is not indexing on an expression (i.e., the index is entirely on bare column names). Set logic to do the "diff" is hard to get right if the tables are not proper sets (i.e., they contain duplicate rows). I can see at least three ways it *could* be done, but all of them are much more complex and significantly slower. With a UNIQUE index on some set of columns in all rows the correct guarantees exist to use fast set logic. It isn't that it's needed for access; it is needed to provide a guarantee that there is no row without NULLs that exactly duplicates another row. -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
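Spelled out with a toy example (names invented; REFRESH ... CONCURRENTLY is the syntax this patch adds):

    CREATE MATERIALIZED VIEW mv AS
        SELECT customer_id, sum(amount) AS total
        FROM orders
        GROUP BY customer_id;

    -- qualifies: plain column list, no WHERE clause
    CREATE UNIQUE INDEX mv_customer_idx ON mv (customer_id);

    -- a partial index (... WHERE total > 0) or an expression index
    -- (... ON mv ((customer_id::text))) would not qualify

    REFRESH MATERIALIZED VIEW CONCURRENTLY mv;

Without at least one qualifying unique index the command raises the error quoted above.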
Re: [HACKERS] GIN improvements part 3: ordering in index
On Sat, Jun 15, 2013 at 3:02 AM, Alexander Korotkov wrote: > attached patch implementing ordering inside GIN index. This is third patch > of GIN improvements, see previous two: > > http://www.postgresql.org/message-id/capphfduxv-il7aedwpw0w5fxrwgakfxijwm63_hzujacrxn...@mail.gmail.com > > http://www.postgresql.org/message-id/CAPpHfdvftaJq7www381naLw1=4u0h+qpxgwvnhceb9hmvyw...@mail.gmail.com > > This patch introduces new interface method of GIN which takes same > arguments as consistent but returns float8. > float8 gin_ordering(bool check[], StrategyNumber n, Datum query, int32 > nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool > nullFlags[], Datum addInfo[], bool addInfoIsNull[]) > This patch implements gingettuple method which can return ordering data > using KNN infrastructure. Also it introduces >< operator for fts which > support ordering in GIN index. Some example: > > postgres=# explain analyze select * from dblp_titles2 where tsvector @@ > to_tsquery('english', 'statistics') order by tsvector >< > to_tsquery('english', 'statistics') limit 10; >QUERY > PLAN > > - > Limit (cost=12.00..48.22 rows=10 width=136) (actual time=6.999..7.120 > rows=10 loops=1) >-> Index Scan using dblp_titles2_idx on dblp_titles2 > (cost=12.00..43003.03 rows=11868 width=136) (actual time=6.996..7.115 > rows=10 loops=1) > Index Cond: (tsvector @@ '''statist'''::tsquery) > Order By: (tsvector >< '''statist'''::tsquery) > Total runtime: 7.556 ms > (5 rows) > Attached version of patch has some refactoring and bug fixes. -- With best regards, Alexander Korotkov. gin_ordering.2.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part2: fast scan
On Sat, Jun 15, 2013 at 2:55 AM, Alexander Korotkov wrote: > attached patch implementing "fast scan" technique for GIN. This is second > patch of GIN improvements, see the 1st one here: > > http://www.postgresql.org/message-id/capphfduxv-il7aedwpw0w5fxrwgakfxijwm63_hzujacrxn...@mail.gmail.com > This patch allow to skip parts of posting trees when their scan is not > necessary. In particular, it solves "frequent_term & rare_term" problem of > FTS. > It introduces new interface method pre_consistent which behaves like > consistent, but: > 1) allows false positives on input (check[]) > 2) allowed to return false positives > > Some example: "frequent_term & rare_term" becomes pretty fast. > > create table test as (select to_tsvector('english', 'bbb') as v from > generate_series(1,100)); > insert into test (select to_tsvector('english', 'ddd') from > generate_series(1,10)); > create index test_idx on test using gin (v); > > postgres=# explain analyze select * from test where v @@ > to_tsquery('english', 'bbb & ddd'); > QUERY PLAN > > --- > Bitmap Heap Scan on test (cost=942.75..7280.63 rows=5000 width=17) > (actual time=0.458..0.461 rows=10 loops=1) >Recheck Cond: (v @@ '''bbb'' & ''ddd'''::tsquery) >-> Bitmap Index Scan on test_idx (cost=0.00..941.50 rows=5000 > width=0) (actual time=0.449..0.449 rows=10 loops=1) > Index Cond: (v @@ '''bbb'' & ''ddd'''::tsquery) > Total runtime: 0.516 ms > (5 rows) > Attached version of patch has some refactoring and bug fixes. -- With best regards, Alexander Korotkov. gin_fast_scan.2.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] refresh materialized view concurrently
On 17 June 2013 13:15, Kevin Grittner wrote: > Simon Riggs wrote: > >> There are multiple features all requiring efficient change set >> extraction. It seems extremely relevant to begin discussing what >> that mechanism might be in each case > > Changeset extraction has nothing to do with this patch, and cannot > possibly be useful for it. Please keep discussion which is > completely unrelated to this patch off this thread. Kevin, You mentioned "incremental maintenance" in your original post and I have been discussing it. Had you not mentioned it, I doubt I would have thought of it. But since you did mention it, and its clearly an important issue, it seems relevant to have discussed it here and now. I'm happy to wait for you to start the thread discussing it directly though. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Jun 14, 2013 at 12:09 AM, Alexander Korotkov wrote: > Revised version of patch for additional information storage in GIN is > attached. Changes are mostly bug fixes. > New version of patch is attached with some more refactoring and bug fixes. -- With best regards, Alexander Korotkov. ginaddinfo.5.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [BUGS] BUG #7873: pg_restore --clean tries to drop tables that don't exist
On Fri, Mar 8, 2013 at 11:58 AM, Pavel Stehule wrote: > I'll see - please, stay tuned to 9.4 first commitfest Hi Pavel, Just a reminder, I didn't see this patch in the current commitfest. I would be happy to spend some more time reviewing if you wish to pursue the patch. Josh -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add regression tests for DISCARD
On Mon, May 13, 2013 at 2:58 AM, Robins Tharakan wrote: > Please find attached a patch that adds basic regression tests for DISCARD > command. > > Any and all feedback is obviously welcome. Perhaps existing tests in guc.sql should be merged into it? -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
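For anyone not familiar with the command, the behaviour such tests would pin down is roughly this (a minimal sketch, not taken from the patch):

    SET work_mem = '64MB';
    CREATE TEMP TABLE discard_demo (x int);
    PREPARE q AS SELECT 1;
    DISCARD ALL;        -- resets GUCs, drops temp tables, deallocates prepared statements, ...
    SHOW work_mem;      -- back to the session default
    EXECUTE q;          -- ERROR: prepared statement "q" does not exist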
Re: [HACKERS] Department of Redundancy Department: makeNode(FuncCall) division
* David Fetter (da...@fetter.org) wrote: > On Mon, Feb 11, 2013 at 10:48:38AM -0800, David Fetter wrote: > > On Sun, Feb 10, 2013 at 10:09:19AM -0500, Tom Lane wrote: > > > David Fetter writes: > > > > Per suggestions and lots of help from Andrew Gierth, please find > > > > attached a patch to clean up the call sites for FuncCall nodes, which > > > > I'd like to expand centrally rather than in each of the 37 (or 38, but > > > > I only redid 37) places where it's called. The remaining one is in > > > > src/backend/nodes/copyfuncs.c, which has to be modified for any > > > > changes in the that struct anyhow. > > > > > > TBH, I don't think this is an improvement. > > > > > > The problem with adding a new field to any struct is that you have to > > > run around and examine (and, usually, modify) every place that > > > manufactures that type of struct. With a makeFuncCall defined like > > > this, you'd still have to do that; it would just become a lot easier > > > to forget to do so. I don't really see how finding all callers of makeFuncCall is particularly harder than finding the callers of makeNode(Func). If there were cases where we still wanted to use makeNode(Func), perhaps that would be annoying since you'd have to look for both- but, iiuc, this patch changes all of the callers to use makeFuncCall and it seems reasonable for all callers to do so in the future as well. It looks to me like the advantage of this patch is exactly that you *don't* have to run around and modify things which are completely unrelated to the new feature being added (eg: FILTER). Add the new field, set up the default/no-op case in makeFuncCall() and then only change those places that actually need to worry about your new field. > > > If the subroutine were defined like most other makeXXX subroutines, > > > ie you have to supply *all* the fields, that argument would go away, > > > but the notational advantage is then dubious. Having to supply all the fields certainly wouldn't make things any better. Providing the base set of fields which are required to be set for any FuncCall node does make sense though, which is what the patch does. The rest of the fields are all special cases for which a default value works perfectly fine (when the field isn't involved in the specific case being handled). > > > The bigger-picture point is that you're proposing to make the coding > > > conventions for building FuncCalls different from what they are for > > > any other grammar node. I don't think that's a great idea; it will > > > mostly foster confusion. > > > > The major difference between FuncCalls and others is that `while most > > raw-parsetree nodes are constructed only in their own syntax > > productions, FuncCall is constructed in many places unrelated to > > actual function call syntax. Yeah, FuncCall's are already rather special and they're built all over the place. That's my 2c on it anyhow. I don't see it as some kind of major milestone but it looks like improvement to me and likely to make things a bit easier on patch authors and reviewers who otherwise have to ponder a bunch of repeated 'x->q = false;' statements in areas which are completely unrelated to the new feature itself. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On Mon, Jun 17, 2013 at 5:23 AM, Andres Freund wrote: > On 2013-06-17 04:20:03 +0900, Fujii Masao wrote: > > On Thu, Jun 6, 2013 at 1:29 PM, Michael Paquier > > wrote: > > > Hi all, > > > > > > Please find attached the latest versions of REINDEX CONCURRENTLY for > the 1st > > > commit fest of 9.4: > > > - 20130606_1_remove_reltoastidxid_v9.patch, removing reltoastidxid, to > allow > > > a toast relation to have multiple indexes running in parallel (extra > indexes > > > could be created by a REINDEX CONCURRENTLY processed) > > > - 20130606_2_reindex_concurrently_v26.patch, correcting some comments > and > > > fixed a lock in index_concurrent_create on an index relation not > released at > > > the end of a transaction > > > > Could you let me know how this patch has something to do with MVCC > catalog > > access patch? Should we wait for MVCC catalog access patch to be > committed > > before starting to review this patch? > > I wondered the same. The MVCC catalog patch, if applied, would make it > possible to make the actual relfilenode swap concurrently instead of > requiring to take access exlusive locks which obviously is way nicer. On > the other hand, that function is only a really small part of this patch, > so it seems quite possible to make another pass at it before relying on > mvcc catalog scans. > As mentionned by Andres, the only thing that the MVCC catalog patch can improve here is the index swap phase (index_concurrent_swap:index.c) where the relfilenode of the old and new indexes are exchanged. Now an AccessExclusiveLock is taken on the 2 relations being swap, we could leverage that to ShareUpdateExclusiveLock with the MVCC catalog access I think. Also, with the MVCC catalog patch in, we could add some isolation tests for REINDEX CONCURRENTLY (there were some tests in one of the previous versions), what is currently not possible due to the exclusive lock taken at swap phase. Btw, those are minor things in the patch, so I think that it would be better to not wait for the MVCC catalog patch. Even if you think that it would be better to wait for it, you could even begin with the 1st patch allowing a toast relation to have multiple indexes (removal of reltoastidxid) which does not depend at all on it. Thanks, -- Michael
Re: [HACKERS] refresh materialized view concurrently
On 17 June 2013 12:13, Heikki Linnakangas wrote: > On 14.06.2013 19:05, Kevin Grittner wrote: >> >> Attached is a patch for REFRESH MATERIALIZED VIEW CONCURRENTLY for >> 9.4 CF1. The goal of this patch is to allow a refresh without >> interfering with concurrent reads, using transactional semantics. >> >> It is my hope to get this committed during this CF to allow me to >> focus on incremental maintenance for the rest of the release cycle. > > > I must say this seems a bit pointless on its own. But if it's a stepping > stone to incremental maintenance, I have no objections. There are generally 4 kinds of mat view 1. Transactionally updated 2. Incremental update, eventually consistent 3. Incremental update, regular refresh 4. Full refresh At the moment we only have type 4 and it holds a full lock while it runs. We definitely need a CONCURRENTLY option and this is it. Implementing the other types won't invalidate what we currently have, so this makes sense to me. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] refresh materialized view concurrently
Simon Riggs wrote: > There are multiple features all requiring efficient change set > extraction. It seems extremely relevant to begin discussing what > that mechanism might be in each case Changeset extraction has nothing to do with this patch, and cannot possibly be useful for it. Please keep discussion which is completely unrelated to this patch off this thread. -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MVCC catalog access
On 2013-06-03 14:57:12 -0400, Robert Haas wrote: > On Thu, May 30, 2013 at 1:39 AM, Michael Paquier > wrote: > > +1. > > Here's a more serious patch for MVCC catalog access. This one > involves more data copying than the last one, I think, because the > previous version did not register the snapshots it took, which I think > is not safe. So this needs to be re-tested for performance, which I > have so far made no attempt to do. Ok, I am starting to take a bit more serious look. Minor issues I noticed: * index.c:index_constraint_create()s - comments need to get updated * index.c:IndexCheckExclusion() - why do we still use a SnapshotNow? I'd rather not use *Now if it isn't necessary. * the * CONCURRENTLY infrastructure should be simplified once this has been applied, but I think it makes sense to keep that separate. * index.c:reindex_index() - SnapshotNow comment should be updated I still think that renaming SnapshotNow to something like SnapshotPerTuple to force everyone to reavaluate their usage would be good. So, the biggest issue with the patch seems to be performance worries. I tried to create a worst case scenario: postgres (patched and HEAD) running with: -c shared_buffers=4GB \ -c max_connections=2000 \ -c maintenance_work_mem=2GB \ -c checkpoint_segments=300 \ -c wal_buffers=64MB \ -c synchronous_commit=off \ -c autovacuum=off \ -p 5440 With one background pgbench running: pgbench -p 5440 -h /tmp -f /tmp/readonly-busy.sql -c 1000 -j 10 -T 100 postgres readonly-busy.sql: BEGIN; SELECT txid_current(); SELECT pg_sleep(0.0001); COMMIT; I measured the performance of one other pgbench: pgbench -h /tmp -p 5440 postgres -T 10 -c 100 -j 100 -n -f /tmp/simplequery.sql -C simplequery.sql: SELECT * FROM af1, af2 WHERE af1.x = af2.x; tables: create table af1 (x) as select g from generate_series(1,4) g; create table af2 (x) as select g from generate_series(4,7) g; With that setup one can create quite a noticeable overhead for the mvcc patch (best of 5): master-optimize: tps = 1261.629474 (including connections establishing) tps = 15121.648834 (excluding connections establishing) dev-optimize: tps = 773.719637 (including connections establishing) tps = 2804.239979 (excluding connections establishing) Most of the time in both, patched and unpatched is by far spent in GetSnapshotData. I think the reason this shows a far higher overhead than what you previously measured is that a) in your test the other backends were idle, in mine they actually modify PGXACT which causes noticeable cacheline bouncing b) I have higher numer of connections & #max_connections A quick test shows that even with max_connection=600, 400 background, and 100 foreground pgbenches there's noticeable overhead: master-optimize: tps = 2221.226711 (including connections establishing) tps = 31203.259472 (excluding connections establishing) dev-optimize: tps = 1629.734352 (including connections establishing) tps = 4754.449726 (excluding connections establishing) Now I grant that's a somewhat harsh test for postgres, but I don't think it's entirely unreasonable and the performance impact is quite stark. > It strikes me as rather unfortunate that the snapshot interface is > designed in such a way as to require so much data copying. It seems > we always take a snapshot by copying from PGXACT/PGPROC into > CurrentSnapshotData or SecondarySnapshotData, and then copying data a > second time from there to someplace more permanent. It would be nice > to avoid that, at least in common cases. Sounds doable. 
But let's do one thing at a time ;). That copy wasn't visible in the rather extreme workload from above btw... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add regression tests for SET xxx
On 26 May 2013 19:56, Robins Tharakan wrote: > Hi, > > Please find attached a patch to take code-coverage of SET (SESSION / SEED > / TRANSACTION / DATESTYLE / TIME ZONE) (src/backend/commands/variable.c) > from 65% to 82%. > > Any and all feedback is welcome. > -- > Robins Tharakan > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > Hi, the patch applies cleanly on code from trunk, however there are failing tests, diff attached. regards Szymon regression.diffs Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote: > On 15 June 2013 10:22, Dean Rasheed wrote: > > There seem to be 2 separate directions that this could go, which > > really meet different requirements: > > > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of > > the order in which they are scanned. A possible approach to this might > > be something like an aggregate > > > > md5_total(text/bytea) returns text > > > > that returns the sum of the md5 values of each input value, treating > > each md5 value as an unsigned 128-bit integer, and then producing the > > hexadecimal representation of the final sum. This should out-perform a > > solution based on numeric addition, and in typical cases, the result > > wouldn't be much longer than a regular md5 sum, and so would be easy > > to eyeball for differences. > > > > I've been playing around with the idea of an aggregate that computes > the sum of the md5 hashes of each of its inputs, which I've called > md5_total() for now, although I'm not particularly wedded to that > name. Comparing it with md5_agg() on a 100M row table (see attached > test script) produces interesting results: > > SELECT md5_agg(foo.*::text) > FROM (SELECT * FROM foo ORDER BY id) foo; > > 50bc42127fb9b028c9708248f835ed8f > > Time: 92960.021 ms > > SELECT md5_total(foo.*::text) FROM foo; > > 02faea7fafee4d253fc94cfae031afc43c03479c > > Time: 96190.343 ms > > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its > result is longer) but it seems like it would be very useful for > quickly comparing data in SQL, since its value is not dependent on the > row-order making it easier to use and better performing if there is no > usable index for ordering. > > Note, however, that if there is an index that can be used for > ordering, the performance is not necessarily better than md5_agg(), as > this example shows. There is a small additional overhead per row for > initialising the MD5 sums, and adding the results to the total, but I > think the biggest factor is that md5_total() is processing more data. > The reason is that MD5 works on 64-byte blocks, so the total amount of > data going through the core MD5 algorithm is each row's size is > rounded up to a multiple of 64. In this simple case it ends up > processing around 1.5 times as much data: > > SELECT sum(length(foo.*::text)) AS md5_agg, >sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo; > > md5_agg | md5_total > +- > 8103815438 | 12799909248 > > although of course that overhead won't be as large on wider tables, > and even in this case the overall performance is still on a par with > md5_agg(). > > ISTM that both aggregates are potentially useful in different > situations. I would probably typically use md5_total() because of its > simplicity/order-independence and consistent performance, but > md5_agg() might also be useful when comparing with external data. Few notes: - Index-scan over whole table is *very* bad for larger tables (few times bigger than available RAM). If you want to promote such use you should also warn against use on loaded server. - It's pointless to worry about overflow on 128-bit ints. Just let it happen. Adding complexity for that does not bring any advantage. - Using some faster 128-bit hash may be useful - check out CityHash or SpookyHash. You can get C implementation from pghashlib. 
-- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch for fail-back without fresh backup
On 17 June 2013 09:03, Pavan Deolasee wrote: > I agree. We should probably find a better name for this. Any suggestions ? err, I already made one... >> But that's not the whole story. I can see some utility in a patch that >> makes all WAL transfer synchronous, rather than just commits. Some >> name like synchronous_transfer might be appropriate. e.g. >> synchronous_transfer = all | commit (default). > Since commits are more foreground in nature and this feature > does not require us to wait during common foreground activities, we want a > configuration where master can wait for synchronous transfers at other than > commits. May we can solve that by having more granular control to the said > parameter ? > >> >> The idea of another slew of parameters that are very similar to >> synchronous replication but yet somehow different seems weird. I can't >> see a reason why we'd want a second lot of parameters. Why not just >> use the existing ones for sync rep? (I'm surprised the Parameter >> Police haven't visited you in the night...) Sure, we might want to >> expand the design for how we specify multi-node sync rep, but that is >> a different patch. > > > How would we then distinguish between synchronous and the new kind of > standby ? That's not the point. The point is "Why would we have a new kind of standby?" and therefore why do we need new parameters? > I am told, one of the very popular setups for DR is to have one > local sync standby and one async (may be cascaded by the local sync). Since > this new feature is more useful for DR because taking a fresh backup on a > slower link is even more challenging, IMHO we should support such setups. ...which still doesn't make sense to me. Lets look at that in detail. Take 3 servers, A, B, C with A and B being linked by sync rep, and C being safety standby at a distance. Either A or B is master, except in disaster. So if A is master, then B would be the failover target. If A fails, then you want to failover to B. Once B is the target, you want to failback to A as the master. C needs to follow the new master, whichever it is. If you set up sync rep between A and B and this new mode between A and C. When B becomes the master, you need to failback from B from A, but you can't because the new mode applied between A and C only, so you have to failback from C to A. So having the new mode not match with sync rep means you are forcing people to failback using the slow link in the common case. You might observe that having the two modes match causes problems if A and B fail, so you are forced to go to C as master and then eventually failback to A or B across a slow link. That case is less common and could be solved by extending sync transfer to more/multi nodes. It definitely doesn't make sense to have sync rep on anything other than a subset of sync transfer. So while it may be sensible in the future to make sync transfer a superset of sync rep nodes, it makes sense to make them the same config for now. Phew -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Review: Display number of changed rows since last analyze
This is a review of the patch in 5192d7d2.8020...@catalyst.net.nz. The patch applies cleanly (with the exception of catversion.h of course), compiles without warnings and passes the regression tests. It contains enough documentation, though I'd prefer "Estimated number of rows modified since the table was last analyzed" to "Estimated number of row changes (inserts + updates + deletes) since the last analyze". The patch works as it should, and I think that this is a useful addition. It only exposes a value that is already available internally, so there shouldn't be any penalties. I think that the column name is ok as it is, even if it is a bit long - I cannot come up with a more succinct idea. Perhaps "n_changed_since_analyze" could be shortened to "n_mod_since_analyze", but that's not much of an improvement. This is a very simple change, and I'll mark this patch "ready for committer". Yours, Laurenz Albe -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] refresh materialized view concurrently
On 14.06.2013 19:05, Kevin Grittner wrote: Attached is a patch for REFRESH MATERIALIZED VIEW CONCURRENTLY for 9.4 CF1. The goal of this patch is to allow a refresh without interfering with concurrent reads, using transactional semantics. It is my hope to get this committed during this CF to allow me to focus on incremental maintenance for the rest of the release cycle. I must say this seems a bit pointless on its own. But if it's a stepping stone to incremental maintenance, I have no objections. I didn't need to touch very much outside of matview-specific files for this. My biggest concern is that I needed two small functions which did *exactly* what some static functions in ri_triggers.c were doing and couldn't see where the best place to share them from was. For the moment I just duplicated them, but my hope would be that they could be put in a suitable location and called from both places, rather than duplicating the 30-some lines of code. The function signatures are:

void quoteOneName(char *buffer, const char *name)
void quoteRelationName(char *buffer, Relation rel)

I'd just use quote_identifier and quote_qualified_identifier instead. I didn't understand this error message:

+   if (!foundUniqueIndex)
+       ereport(ERROR,
+               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                errmsg("concurrent refresh requires a unique index on just columns for all rows of the materialized view")));
+

What does that mean? - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
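As a side note for anyone reading the patch, here is roughly what that substitution could look like. This is an illustrative backend-code sketch, not part of the posted patch; the function name is invented, and memory-context and error-handling details are omitted.

#include "postgres.h"

#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"

/*
 * Build a quoted, schema-qualified name for a relation using the
 * existing helpers, instead of a private quoteRelationName().
 */
static const char *
matview_quoted_name(Relation rel)
{
    return quote_qualified_identifier(
        get_namespace_name(RelationGetNamespace(rel)),
        RelationGetRelationName(rel));
}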
Re: [HACKERS] MD5 aggregate
On 15 June 2013 10:22, Dean Rasheed wrote: > There seem to be 2 separate directions that this could go, which > really meet different requirements: > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of > the order in which they are scanned. A possible approach to this might > be something like an aggregate > > md5_total(text/bytea) returns text > > that returns the sum of the md5 values of each input value, treating > each md5 value as an unsigned 128-bit integer, and then producing the > hexadecimal representation of the final sum. This should out-perform a > solution based on numeric addition, and in typical cases, the result > wouldn't be much longer than a regular md5 sum, and so would be easy > to eyeball for differences. > I've been playing around with the idea of an aggregate that computes the sum of the md5 hashes of each of its inputs, which I've called md5_total() for now, although I'm not particularly wedded to that name. Comparing it with md5_agg() on a 100M row table (see attached test script) produces interesting results:

SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

 50bc42127fb9b028c9708248f835ed8f

Time: 92960.021 ms

SELECT md5_total(foo.*::text) FROM foo;

 02faea7fafee4d253fc94cfae031afc43c03479c

Time: 96190.343 ms

Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its result is longer) but it seems like it would be very useful for quickly comparing data in SQL, since its value is not dependent on the row order, making it easier to use and better performing if there is no usable index for ordering. Note, however, that if there is an index that can be used for ordering, the performance is not necessarily better than md5_agg(), as this example shows. There is a small additional overhead per row for initialising the MD5 sums, and adding the results to the total, but I think the biggest factor is that md5_total() is processing more data. The reason is that MD5 works on 64-byte blocks, so the total amount of data going through the core MD5 algorithm is each row's size rounded up to a multiple of 64. In this simple case it ends up processing around 1.5 times as much data:

SELECT sum(length(foo.*::text)) AS md5_agg,
       sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;

  md5_agg   |  md5_total
------------+-------------
 8103815438 | 12799909248

although of course that overhead won't be as large on wider tables, and even in this case the overall performance is still on a par with md5_agg(). ISTM that both aggregates are potentially useful in different situations. I would probably typically use md5_total() because of its simplicity/order-independence and consistent performance, but md5_agg() might also be useful when comparing with external data. Regards, Dean md5_agg_v2.patch Description: Binary data md5-100m-row-test.sql Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add more regression tests for CREATE OPERATOR
On 23 May 2013 00:34, Robins Tharakan wrote: > Hi, > > Please find attached a patch to take code-coverage of CREATE OPERATOR > (src/backend/commands/operatorcmds.c) from 56% to 91%. > > Any and all feedback is welcome. > -- > Robins Tharakan > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > Hi, there is one commented-out test. I think it should either be run or deleted; there is no point in keeping commented-out SQL code that is never run. What do you think? regards, Szymon
Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib
Hi, Peter Eisentraut writes: > 2. confuse users > 3. produce broken external extension modules that take contrib as an example I agree that having both cases (sections) in the Makefile is a bad idea. Still, why should we keep the in-tree build instructions? Would it be possible instead to instruct PGXN to work with a non installed server source tree? And how much do we need that really? Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Peter Eisentraut writes: > I suspect that there are actually only about 5 or 6 common ways to do > archiving (say, local, NFS, scp, rsync, S3, ...). There's no reason why > we can't fully specify and/or script what to do in each of these cases. And provide either fully reliable contrib scripts or "internal" archive commands ready to use for those common cases. I can't think of other common use cases, by the way. +1 Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
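To make the "fully reliable" part concrete, the sketch below shows what even the simplest local-copy archive command has to do beyond a plain cp. It is an invented example, not an existing contrib program, and its error handling is deliberately simplified: copy the segment to a temporary name, fsync the file, rename it into place, and fsync the directory so the rename itself is durable.

/* archive_wal.c - minimal durable local WAL archiving sketch */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
archive_wal(const char *src, const char *dstdir, const char *fname)
{
    char    tmp[1024], dst[1024], buf[65536];
    int     in, out, dirfd;
    ssize_t n;

    snprintf(tmp, sizeof(tmp), "%s/%s.tmp", dstdir, fname);
    snprintf(dst, sizeof(dst), "%s/%s", dstdir, fname);

    if ((in = open(src, O_RDONLY)) < 0)
        return -1;
    if ((out = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0)
    {
        close(in);
        return -1;
    }

    while ((n = read(in, buf, sizeof(buf))) > 0)
    {
        if (write(out, buf, (size_t) n) != n)
        {
            n = -1;                  /* treat a short write as an error */
            break;
        }
    }
    if (n < 0 || fsync(out) < 0 || close(out) < 0 || close(in) < 0)
        return -1;

    if (rename(tmp, dst) < 0)        /* atomic within one filesystem */
        return -1;
    if ((dirfd = open(dstdir, O_RDONLY)) < 0)
        return -1;
    if (fsync(dirfd) < 0)            /* make the rename itself durable */
    {
        close(dirfd);
        return -1;
    }
    return close(dirfd);
}

int
main(int argc, char **argv)
{
    if (argc != 4)
    {
        fprintf(stderr, "usage: %s <wal-file> <archive-dir> <wal-name>\n", argv[0]);
        return 1;
    }
    return archive_wal(argv[1], argv[2], argv[3]) == 0 ? 0 : 1;
}

Wired up as something like archive_command = '/path/to/archive_wal %p /mnt/archive %f', it returns non-zero on any failure so the server keeps the segment and retries, which is the part the naive cp-based examples get wrong.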
Re: [HACKERS] SLRU
On Mon, Jun 17, 2013 at 1:22 PM, Soroosh Sardari wrote: > Hey > > I was reading the multi transaction log manager, multixact.c. > I didn't get what SLRU does. > > I want the goal of this module, and why we use it. > I'm kind of newbie, be patient with me ;) > > Did you look at src/backend/access/transam/slru.c ? The first para in that file is quite explanatory:

 * We use a simple least-recently-used scheme to manage a pool of page
 * buffers.  Under ordinary circumstances we expect that write
 * traffic will occur mostly to the latest page (and to the just-prior
 * page, soon after a page transition).  Read traffic will probably touch
 * a larger span of pages, but in any case a fairly small number of page
 * buffers should be sufficient.  So, we just search the buffers using plain
 * linear search; there's no need for a hashtable or anything fancy.
 * The management algorithm is straight LRU except that we will never swap
 * out the latest page (since we know it's going to be hit again eventually).

Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
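To see the same idea in miniature, here is a toy standalone C program (the names, sizes and output are invented for the illustration; it is not code from slru.c): it keeps a four-slot pool of page numbers, finds pages by plain linear search, and on a miss replaces the least recently used slot.

/* slru_toy.c - linear search + LRU replacement over a tiny page pool */
#include <stdio.h>

#define NUM_SLOTS 4

typedef struct
{
    int  pageno;        /* page held in this slot, -1 if empty */
    long last_used;     /* tick of the most recent access */
} Slot;

static Slot slots[NUM_SLOTS];
static long tick = 0;

/* Return the slot holding pageno, loading it if necessary. */
static int
read_page(int pageno)
{
    int i, victim = 0;

    tick++;

    /* plain linear search -- the pool is small, so nothing fancier is needed */
    for (i = 0; i < NUM_SLOTS; i++)
    {
        if (slots[i].pageno == pageno)
        {
            slots[i].last_used = tick;
            return i;                        /* hit */
        }
        if (slots[i].last_used < slots[victim].last_used)
            victim = i;                      /* remember the LRU slot */
    }

    /*
     * Miss: evict the least recently used slot.  (The real slru.c adds the
     * refinement of never evicting the latest page, omitted here.)
     */
    printf("miss: page %d goes into slot %d (evicting page %d)\n",
           pageno, victim, slots[victim].pageno);
    slots[victim].pageno = pageno;
    slots[victim].last_used = tick;
    return victim;
}

int
main(void)
{
    int accesses[] = {7, 7, 6, 7, 5, 4, 7, 3, 7};
    int i;

    for (i = 0; i < NUM_SLOTS; i++)
        slots[i].pageno = -1;
    for (i = 0; i < (int) (sizeof(accesses) / sizeof(accesses[0])); i++)
        read_page(accesses[i]);
    return 0;
}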
Re: [HACKERS] Patch for fail-back without fresh backup
On Sun, Jun 16, 2013 at 5:10 PM, Simon Riggs wrote: > > > My perspective is that if the master crashed, assuming that you know > everything about that and suddenly jumping back on seems like a recipe > for disaster. Attempting that is currently blocked by the technical > obstacles you've identified, but that doesn't mean they are the only > ones - we don't yet understand what all the problems lurking might be. > Personally, I won't be following you onto that minefield anytime soon. > > Would it be fair to say that a user will be willing to trust her crashed master in all scenarios where she would have done so in a single-instance setup? IOW, without the replication setup, AFAIU users have traditionally trusted WAL recovery to recover from failed instances. This would include some common failures such as power outages and hardware failures, but may not include others such as on-disk corruption. > So I strongly object to calling this patch anything to do with > "failback safe". You simply don't have enough data to make such a bold > claim. (Which is why we call it synchronous replication and not "zero > data loss", for example). > > I agree. We should probably find a better name for this. Any suggestions? > But that's not the whole story. I can see some utility in a patch that > makes all WAL transfer synchronous, rather than just commits. Some > name like synchronous_transfer might be appropriate. e.g. > synchronous_transfer = all | commit (default). > > It's an interesting idea, but I think there is some difference here. For example, the proposed feature makes a backend wait at points other than commit. Since commits are more foreground in nature and this feature does not require us to wait during common foreground activities, we want a configuration where the master can wait for synchronous transfers at points other than commits. Maybe we can solve that by having more granular control over the said parameter? > The idea of another slew of parameters that are very similar to > synchronous replication but yet somehow different seems weird. I can't > see a reason why we'd want a second lot of parameters. Why not just > use the existing ones for sync rep? (I'm surprised the Parameter > Police haven't visited you in the night...) Sure, we might want to > expand the design for how we specify multi-node sync rep, but that is > a different patch. > How would we then distinguish between a synchronous standby and the new kind of standby? I am told one of the very popular setups for DR is to have one local sync standby and one async (maybe cascaded from the local sync). Since this new feature is more useful for DR, because taking a fresh backup over a slower link is even more challenging, IMHO we should support such setups. > > I'm worried to see that adding this feature and yet turning it off > causes a measurable drop in performance. I don't think we want that > at all. That clearly needs more work and thought. > > I agree. We need to repeat those tests. I don't trust that merely having the feature turned off causes a 1-2% drop. In one of the tests, turning the feature on shows a better number than having it turned off. That's clearly noise, or it needs a concrete argument to be convincing. > I also think your performance results are somewhat bogus. Fast > transaction workloads were already mostly commit waits - But not in the case of an async standby, right? > measurements > of what happens to large loads, index builds etc would likely reveal > something quite different. > > I agree.
I also feel we need tests where FlushBuffer gets called more often by the normal backends, to see how much the added wait in that code path hurts performance. Another important thing to test would be how it works over slower/high-latency links. > I'm tempted by the thought that we should put the WaitForLSN inside > XLogFlush, rather than scatter additional calls everywhere and then > have us inevitably miss one. > > That indeed seems cleaner. Thanks, Pavan
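For anyone trying to picture where the extra wait goes, here is a control-flow sketch. Everything in it is invented for the illustration - the LSN type, the polling loop, the function names; the real patch would block on the walsender/sync-rep machinery rather than poll - but it shows the idea of a data-page write waiting for the standby's flush position to pass the page's LSN.

/* sync_transfer_sketch.c - illustrative only */
#include <stdio.h>
#include <stdint.h>

typedef uint64_t LsnSim;                 /* stand-in for XLogRecPtr */

static LsnSim standby_flush_lsn = 0;     /* as reported by the standby */

/* pretend the standby makes some progress every time we look */
static void
poll_standby(void)
{
    standby_flush_lsn += 100;
}

/* wait until WAL up to 'lsn' is known to be flushed on the standby */
static void
wait_for_standby_flush(LsnSim lsn)
{
    while (standby_flush_lsn < lsn)
        poll_standby();
}

/* what a buffer flush would do with the proposed "wait at other points" */
static void
flush_data_page(int blocknum, LsnSim page_lsn)
{
    wait_for_standby_flush(page_lsn);    /* the added wait */
    printf("block %d written (page LSN %llu, standby flushed to %llu)\n",
           blocknum,
           (unsigned long long) page_lsn,
           (unsigned long long) standby_flush_lsn);
}

int
main(void)
{
    flush_data_page(42, 350);
    flush_data_page(43, 360);
    return 0;
}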
[HACKERS] SLRU
Hey I was reading the multi-transaction log manager, multixact.c, and I didn't get what SLRU does. I want to understand the goal of this module and why we use it. I'm kind of a newbie, so be patient with me ;) Regards Soroosh
Re: [HACKERS] [RFC] Minmax indexes
On 17 June 2013 02:05, Josh Berkus wrote: > >>> I agree that the FSM behaviour shouldn't be linked to index existence. >>> IMHO that should be a separate table parameter, WITH (fsm_mode = append) >>> Index only scans would also benefit from that. >> >> -1 ... I cannot believe that such a parameter would ever get turned on >> in production by anyone. If your table has a significant update rate, >> the resulting table bloat would make such behavior completely >> infeasible. If you have few enough updates to make such a behavior >> practical, then you can live with the expensive index updates instead. > > I'm also thinking that if a table is really append-only, then there are > never any middle-of-the-table pages in the FSM, no? > > Where this falls down is the table which is > mostly-append-but-occasionally-needs-an-update-or-delete. I think the > answer there is to look for a way to make updating the index block range > faster, not ways to modify how we append to the heap. If we told users > "tables with Minmax indexes will be very expensive to update" then I > think they'd live with it; dropping and readding an index to enable fast > updates is something which is already familiar. This feature is using a similar technique to enhance SeqScans as we already use on VACUUM. We don't really care about whether we have 100% scan avoidance because we were likely to be using a WHERE clause that doesn't give perfect constraint elimination anyway. So on a large table we don't care about the fact that we still have to scan 1-5% of the table - we are still 20 times faster than a full seqscan. So there isn't a "fall down" thing here. We expect the recently loaded/updated data to be scanned and that's OK. Having the minmax index updated greedily is just adding extra work for fast diminishing returns. We can always add that later if really needed, but I doubt it will be needed - in just the same way as mat views aren't greedily updated. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
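The trade-off here is easy to see in miniature. The toy program below (all names and sizes are invented for the illustration; it is not the proposed index code) keeps only a minimum and maximum per range of values and skips any range whose summary excludes the search key - recently appended, not-yet-summarized data would simply be scanned, which is the "fast diminishing returns" point above.

/* minmax_toy.c - block-range min/max summaries used to skip ranges */
#include <stdio.h>

#define NVALUES    32
#define RANGE_SIZE  8            /* values summarized per "block range" */
#define NRANGES    (NVALUES / RANGE_SIZE)

int
main(void)
{
    int values[NVALUES];
    int range_min[NRANGES];
    int range_max[NRANGES];
    int key = 21;
    int i, r;

    /* mostly-increasing data, as in an append-only table */
    for (i = 0; i < NVALUES; i++)
        values[i] = i;

    /* build the per-range summaries */
    for (r = 0; r < NRANGES; r++)
    {
        range_min[r] = range_max[r] = values[r * RANGE_SIZE];
        for (i = r * RANGE_SIZE + 1; i < (r + 1) * RANGE_SIZE; i++)
        {
            if (values[i] < range_min[r])
                range_min[r] = values[i];
            if (values[i] > range_max[r])
                range_max[r] = values[i];
        }
    }

    /* the "index scan": skip ranges whose [min,max] cannot contain the key */
    for (r = 0; r < NRANGES; r++)
    {
        if (key < range_min[r] || key > range_max[r])
        {
            printf("range %d skipped\n", r);
            continue;
        }
        for (i = r * RANGE_SIZE; i < (r + 1) * RANGE_SIZE; i++)
            if (values[i] == key)
                printf("found %d at position %d (range %d)\n", key, i, r);
    }
    return 0;
}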
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
(2013/06/17 5:48), Andres Freund wrote: > On 2013-06-16 17:27:56 +0300, Heikki Linnakangas wrote: >> If we don't mind scanning the buffer cache several times, we don't >> necessarily even need to sort the writes for that. Just scan the buffer >> cache for all buffers belonging to relation A, then fsync it. Then scan the >> buffer cache again, for all buffers belonging to relation B, then fsync >> that, and so forth. > > That would end up with quite a lot of scans in a reasonably sized > machines. Not to talk of those that have a million+ relations. That > doesn't seem to be a good idea for bigger shared_buffers. C.f. the stuff > we did for 9.3 to make it cheaper to drop a bunch of relations at once > by only scanning shared_buffers once. As I wrote in my reply to Heikki, I think an exact buffer sort, which is expensive, is unnecessary. To solve this problem we only need a sort that is accurate enough for the I/O schedulers to optimize, and we normally have two optimizing layers: the OS I/O scheduler and the RAID controller. I think performance will improve if the sort is just accurate enough for those layers to work with. The computational complexity required is a single sequential scan of the buffer descriptors for a rough buffer sort. I will try to study this implementation, too. Best regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On Mon, Jun 17, 2013 at 2:18 AM, Andres Freund wrote: > On 2013-06-16 17:27:56 +0300, Heikki Linnakangas wrote: > > > A long time ago, Itagaki wrote a patch to sort the checkpoint writes: > www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp > . > > He posted very promising performance numbers, but it was dropped because > Tom > > couldn't reproduce the numbers, and because sorting requires allocating a > > large array, which has the risk of running out of memory, which would be > bad > > when you're trying to checkpoint. > > Hm. We could allocate the array early on since the number of buffers > doesn't change. Sure that would be pessimistic, but that seems fine. > > Alternatively I can very well imagine that it would still be beneficial > to sort the dirty buffers in shared buffers. I.e. scan till we found 50k > dirty pages, sort them and only then write them out. > > Without knowing that Itagaki had done something similar in the past, couple of months back I tried exactly the same thing i.e. sort the shared buffers in chunks and then write them out at once. But I did not get any significant performance gain except when the shared buffers are 3/4th (or some such number) or more than the available RAM. I will see if I can pull out the patch and the numbers. But if memory serves well, I concluded that the kernel is already utilising its buffer cache to achieve the same thing and it does not help beyond a point. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
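For readers following along, the mechanical part of "sort the checkpoint writes" is small; the hard parts are the memory allocation and whether it actually helps, as discussed above. The sketch below is a standalone illustration with invented relfilenode numbers - not the Itagaki patch and not Pavan's experiment - that sorts dirty-buffer tags by relation and block number before issuing the writes, so each relation is written roughly sequentially and can then be fsynced once.

/* ckpt_sort_toy.c - sort dirty-buffer tags before writing them out */
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    unsigned int relnode;    /* stand-in for the buffer's relfilenode */
    unsigned int blocknum;   /* block number within that relation */
} DirtyTag;

static int
tag_cmp(const void *a, const void *b)
{
    const DirtyTag *ta = a;
    const DirtyTag *tb = b;

    if (ta->relnode != tb->relnode)
        return ta->relnode < tb->relnode ? -1 : 1;
    if (ta->blocknum != tb->blocknum)
        return ta->blocknum < tb->blocknum ? -1 : 1;
    return 0;
}

int
main(void)
{
    DirtyTag dirty[] = {
        {16384, 7}, {16390, 2}, {16384, 1}, {16390, 0}, {16384, 2}
    };
    size_t   n = sizeof(dirty) / sizeof(dirty[0]);
    size_t   i;

    qsort(dirty, n, sizeof(DirtyTag), tag_cmp);

    for (i = 0; i < n; i++)
        printf("write rel %u block %u\n", dirty[i].relnode, dirty[i].blocknum);

    /* ...then fsync each relation once, instead of interleaving the writes */
    return 0;
}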
Re: [HACKERS] refresh materialized view concurrently
On 17 June 2013 00:43, Kevin Grittner wrote: >> Especially when one is known to be better than the other already. > > What is the hypothetical technique you're arguing is inferior? For > my own part, I haven't gotten beyond the phase of knowing that to > meet all requests for the feature, it would need to be available at > about the same point that AFTER EACH STATEMENT triggers fire, but > that it should not involve any user-written triggers. Have you > implemented something similar to what you think I might be > considering? Do you have benchmark results? Can you share > details? Recording the changeset required by replication is known to be more efficient using WAL based extraction than using triggers. WAL writes are effectively free and using WAL concentrates the reads to avoid random I/O in large databases. That would be the most suitable approach for continuously updated matviews, or frequently updates. Extraction using multiple snapshots is also possible, using a technique similar to "concurrently" mechanism. That would require re-scanning the whole table which might be overkill depending upon the number of changes. That would work for reasonably infrequent updates. >> Given that we also want to do concurrent CLUSTER and ALTER TABLE >> ... SET TABLESPACE using changeset extraction I think its time >> that discussion happened on hackers. > > No objections to that here; but please don't hijack this thread for > that discussion. There are multiple features all requiring efficient change set extraction. It seems extremely relevant to begin discussing what that mechanism might be in each case, so we don't develop 2 or even 3 different ones while everybody ignores each other. As you said, we should be helping each other and working together, and I agree. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
Thank you for the comments, and for reviewing my patch!

(2013/06/16 23:27), Heikki Linnakangas wrote: On 10.06.2013 13:51, KONDO Mitsumasa wrote: I created a patch which improves the checkpoint IO scheduler for stable transaction responses.

* Problem with checkpoint IO scheduling under a heavy transaction load

When the database is handling heavy transaction traffic, I think the PostgreSQL checkpoint scheduler has two problems, one at the start and one at the end of each checkpoint. The first problem is heavy IO at the start of a checkpoint. It is caused by full-page writes: the first modification of each page after the checkpoint starts forces a full page image into WAL, producing a burst of WAL IO. As a result, at the start of a checkpoint the WAL-based checkpoint scheduler wrongly judges that it is behind schedule because of the full-page writes, even though it is not actually late, and that hurts transaction response times. I think the WAL-based scheduling does not behave properly at the start of a checkpoint.

Yeah, the checkpoint scheduling logic doesn't take into account the heavy WAL activity caused by full page images. That's an interesting phenomenon, but did you actually see that causing a problem in your tests? I couldn't tell from the results you posted what the impact of that was. Could you repeat the tests separately with the two separate patches you posted later in this thread?

OK, I will test with the two separate patches. The results I posted earlier show high WAL throughput (write_size_per_sec) and a high transaction rate during checkpoints with my patches. Please see the following HTML reports (I added anchors; use the 'checkpoint highlight switch' button):

* With my patched PG
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/patchedPG-report.html#transaction_statistics
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/patchedPG-report.html#wal_statistics

* Plain PG
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/plainPG-report.html#transaction_statistics
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/plainPG-report.html#wal_statistics

In the WAL statistics, I think the high WAL throughput at the start of the checkpoint indicates that the checkpoint IO does not disturb the IO of other running transactions.

Rationalizing a bit, I could even argue to myself that it's a *good* thing. At the beginning of a checkpoint, the OS write cache should be relatively empty, as the checkpointer hasn't done any writes yet. So it might make sense to write a burst of pages at the beginning, to partially fill the write cache first, before starting to throttle. But this is just handwaving - I have no idea what the effect is in real life.

Yes, I think so. If we want to change the IO throttling today, we have to change OS parameters such as '/proc/sys/vm/dirty_background_ratio' or '/proc/sys/vm/dirty_ratio'. But those parameters affect every application on the system, so they are difficult to change and hard to set intuitively. And I think database tuning should be done with database parameters rather than OS parameters; it makes tuning a server clearer.

Another thought is that rather than trying to compensate for that effect in the checkpoint scheduler, could we avoid the sudden rush of full-page images in the first place? The current rule for when to write a full page image is conservative: you don't actually need to write a full page image when you modify a buffer that's sitting in the buffer cache, if that buffer hasn't been flushed to disk by the checkpointer yet, because the checkpointer will write and fsync it later. I'm not sure how much it would smoothen WAL write I/O, but it would be interesting to try.

That would be the ideal approach, but I don't have any idea how to implement it. It seems very difficult...

The second problem is the fsync freeze at the end of a checkpoint. Normally the checkpoint writes are absorbed in the background by the OS's IO scheduler, but when that does not work well, the fsyncs at the end of the checkpoint cause an IO freeze and slow transactions. Unexpectedly slow transactions can trigger monitoring errors in an HA cluster and degrade the user experience of the application. It is an especially serious problem for cloud and virtualized database servers, which do not have much IO capacity. However, postgresql.conf offers very little to deal with this. We would rather have fast transaction responses than a short checkpoint; in fact the checkpoint is short, and it is not a problem if it becomes a little longer. You might think checkpoint_segments and checkpoint_timeout could simply be set to larger values, but a large checkpoint_segments wastes file cache on WAL that is never read, and a large checkpoint_timeout leads to long crash recovery.

A long time ago, Itagaki wrote a patch to sort the checkpoint writes: www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp. He posted very promising performance numbers, but it was dropped because Tom couldn't reproduce the numbers, and because sorting requires allocating a large array, which has the risk of running out of memory, which would be bad when you're trying to checkpoint.