Re: [HACKERS] random_page_cost vs seq_page_cost
(replying just to you)

On 10/01/12 15:22, Greg Smith wrote:
> On 1/5/12 5:04 AM, Benedikt Grundmann wrote:
> That sort of thing is one reason why all attempts so far to set
> random_page_cost based on physical characteristics haven't gone
> anywhere useful. The setting is sort of overloaded right now; it's a
> fuzzy mix of true random seek cost blended with some notion of cache
> percentage. Trying to bring some measurements to bear on it is a less
> effective approach than what people actually do here: monitor the
> profile of query execution, change the value, and see what happens.
> Use that as feedback for what direction to keep going; repeat until
> you're just spinning with no improvements.

Thank you very much for the reply; it is very interesting. I'm excited to hear that documentation in that area will improve in 9.2. It's interesting: Postgres has remarkably good documentation, but it is a sufficiently complex system that to tune the knobs sensibly you have to understand quite a lot about what is going on. A colleague of mine likes to say that all abstractions leak, which seems very appropriate in this case.

> > We are not sure if the database used to choose differently before the
> > move to the new hardware and the hardware is performing worse for
> > random seeks. Or if the planner is now making different choices.
>
> I don't recommend ever deploying new hardware without first doing some
> low-level benchmarks to validate its performance. Once stuff goes into
> production, you can't do that anymore. See
> http://www.2ndquadrant.com/en/talks/ for my hardware benchmarking
> talks if you'd like some ideas on what to collect.

We had actually done lots of tests on the sequential read performance, but you are right, we could have done better (and I'll definitely read through your talks). Did you see my follow-up? Based on the feedback we did further tests, and it is now clear that neither the hardware nor the database version is at fault.

Both the new and the old database version choose a different plan when spun up on the database as it is right now. Our best guess is that the CLUSTER runs we did after we had moved to the new hardware (its larger disk space and faster sequential I/O made that possible) changed the planner's perception of how the joins will perform relative to each other.

Cheers,

Bene

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random_page_cost vs seq_page_cost
On 11/01/12 08:26, Benedikt Grundmann wrote:
> (replying just to you)

Clearly I didn't. Sigh. Getting myself a coffee now.
Re: [HACKERS] Sending notifications from the master to the standby
On Wed, Jan 11, 2012 at 4:33 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> BTW ... it occurs to me to ask whether we really have a solid use-case
> for having listeners attached to slave servers. I have personally never
> seen an application for LISTEN/NOTIFY in which the listeners were
> entirely read-only. Even if there are one or two cases out there, it's
> not clear to me that supporting it is worth the extra complexity that
> seems to be needed.

The idea is to support external caches that re-read the data when it changes. If we can do that from the standby, then we offload work from the master. Yes, there are other applications for LISTEN/NOTIFY, and we wouldn't be able to support them all with this.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] checkpoint writeback via sync_file_range
On Wed, Jan 11, 2012 at 4:38 AM, Greg Smith <g...@2ndquadrant.com> wrote:
> On 1/10/12 9:14 PM, Robert Haas wrote:
>> Based on that, I whipped up the attached patch, which, if
>> sync_file_range is available, simply iterates through everything that
>> will eventually be fsync'd before beginning the write phase and tells
>> the Linux kernel to put them all under write-out.
>
> I hadn't really thought of using it that way. The kernel expects that
> when this is called the normal way, you're going to track exactly which
> segments you want it to sync. And that data isn't really passed through
> the fsync absorption code yet; the list of things to fsync has already
> lost that level of detail. What you're doing here doesn't care, though,
> and I hadn't considered that SYNC_FILE_RANGE_WRITE could be used that
> way on my last pass through its docs. Used this way, it's basically
> fsync without the wait or guarantee; it just tries to push what's
> already dirty further ahead of the write queue than those writes would
> otherwise be.

I don't think this will help at all; I think it will just make things worse. The problem comes from hammering the fsyncs one after the other. What this patch does is initiate all of the fsyncs at the same time, so it will max out the disks even more, because it will hit all disks all at once.

It does open the door to various other uses, so I think this work will be useful.

> One idea I was thinking about here was building a little hash table
> inside of the fsync absorb code, tracking how many absorb operations
> have happened for whatever the most popular relation files are. The
> idea is that we might say "use sync_file_range every time N calls for a
> relation have come in", just to keep from ever accumulating too many
> writes to any one file before trying to nudge some of it out of there.
> The bat that keeps hitting me in the head here is that right now, a
> single fsync might have a full 1GB of writes to flush out, perhaps
> because it extended a table and then wrote more than that to it. And in
> everything but an SSD or giant SAN cache situation, 1GB of I/O is just
> too much to fsync at a time without the OS choking a little on it.

A better idea. Seems like it should be easy enough to keep a counter. I see some other uses around large writes also.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
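[For readers following along: used the way the patch uses it, SYNC_FILE_RANGE_WRITE starts write-out of a file's dirty pages without waiting for completion. A minimal Linux-only sketch of that call, with a hypothetical helper name (this is not code from the patch):]

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hint the kernel to begin write-out of a file's dirty pages without
 * waiting for completion.  Unlike fsync(), this gives no durability
 * guarantee; it only pushes already-dirty pages toward the front of
 * the write queue.  Returns 0 on success, -1 (with errno set) on error.
 */
int
hint_writeout(int fd)
{
	/* offset = 0, nbytes = 0 means "from the start to the end of file" */
	return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
}
```

[Iterating this over every file pending fsync before the write phase is, in essence, what the patch does.]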
Re: [HACKERS] checkpoint writeback via sync_file_range
* Greg Smith:

> One idea I was thinking about here was building a little hash table
> inside of the fsync absorb code, tracking how many absorb operations
> have happened for whatever the most popular relation files are. The
> idea is that we might say "use sync_file_range every time N calls for a
> relation have come in", just to keep from ever accumulating too many
> writes to any one file before trying to nudge some of it out of there.
> The bat that keeps hitting me in the head here is that right now, a
> single fsync might have a full 1GB of writes to flush out, perhaps
> because it extended a table and then wrote more than that to it. And in
> everything but an SSD or giant SAN cache situation, 1GB of I/O is just
> too much to fsync at a time without the OS choking a little on it.

Isn't this pretty much like tuning vm.dirty_bytes? We generally set it to pretty low values, and it seems to help smooth out the checkpoints.

--
Florian Weimer                <fwei...@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
Re: [HACKERS] log messages for archive recovery progress
2012/1/11 Euler Taveira de Oliveira <eu...@timbira.com>:
> On 08-01-2012 11:59, Satoshi Nagayasu / Uptime Technologies, LLC. wrote:
>> [2011-12-08 15:14:36 JST] 16758: LOG: restored log file 00080046 from archive
>> [2011-12-08 15:14:36 JST] 16758: LOG: recoverying 00080046
>> [2011-12-08 15:14:36 JST] 16758: LOG: restored log file 00080047 from archive
>> [2011-12-08 15:14:36 JST] 16758: LOG: recoverying 00080047
>> cp: cannot stat `/backups/archlog/00080048': No such file or directory
>> [2011-12-08 15:14:37 JST] 16758: LOG: could not restore file 00080048 from archive
>> [2011-12-08 15:14:37 JST] 16758: LOG: attempting to look into pg_xlog
>> [2011-12-08 15:14:37 JST] 16758: LOG: recoverying 00080048
>
> What about just 'restored log file 00080048 from pg_xlog' instead of
> the last two messages? If you can't read from pg_xlog, emit 'could not
> restore file 00080048 from pg_xlog'.

Yes, simple is better. We already have a message if the file is not present; we just need one if we switch to using pg_xlog. Please look at this.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e65962..055ae6d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2817,6 +2817,16 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
 			 xlogfname);
 	set_ps_display(activitymsg, false);
 
+	/*
+	 * If we're in archive recovery and the archive runs dry,
+	 * we switch to reading from pg_xlog directly. In that case
+	 * we want to output a message that shows what just happened.
+	 */
+	if (InArchiveRecovery && source == XLOG_FROM_PG_XLOG)
+		ereport(LOG,
+				(errmsg("restored log file \"%s\" from pg_xlog",
+						xlogfname)));
+
 	/* Track source of data in assorted state variables */
 	readSource = source;
 	XLogReceiptSource = source;
@@ -2867,10 +2877,7 @@ XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources)
 	{
 		fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_ARCHIVE, true);
 		if (fd != -1)
-		{
-			elog(DEBUG1, "got WAL segment from archive");
 			return fd;
-		}
 	}
 
 	if (sources & XLOG_FROM_PG_XLOG)
[HACKERS] Confusing EXPLAIN output in case of inherited tables
Hi,

After running the regression tests, I ran EXPLAIN on one of the queries in the regression suite (test create_misc) and got the following output:

regression=# explain verbose select * into table ramp from road where name ~ '.*Ramp';
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Result  (cost=0.00..154.00 rows=841 width=67)
   Output: public.road.name, public.road.thepath
   ->  Append  (cost=0.00..154.00 rows=841 width=67)
         ->  Seq Scan on public.road  (cost=0.00..135.05 rows=418 width=67)
               Output: public.road.name, public.road.thepath
               Filter: (public.road.name ~ '.*Ramp'::text)
         ->  Seq Scan on public.ihighway road  (cost=0.00..14.99 rows=367 width=67)
               Output: public.road.name, public.road.thepath
               Filter: (public.road.name ~ '.*Ramp'::text)
         ->  Seq Scan on public.shighway road  (cost=0.00..3.96 rows=56 width=67)
               Output: public.road.name, public.road.thepath
               Filter: (public.road.name ~ '.*Ramp'::text)
(12 rows)

regression=# \d+ road
                      Table "public.road"
 Column  | Type | Modifiers | Storage  | Stats target | Description
---------+------+-----------+----------+--------------+-------------
 name    | text |           | extended |              |
 thepath | path |           | extended |              |
Indexes:
    "rix" btree (name)
Child tables: ihighway, shighway
Has OIDs: no

Table road has children ihighway and shighway, as seen in the \d+ output above. The EXPLAIN output of the Seq Scan nodes on the children has "public.road" as the prefix for variables. "public.road" could imply the parent table road and thus cause confusion as to what is being referred to: the columns of the parent table or of the child table. In the EXPLAIN output, the child tables have "road" as an alias (as against "public.road"). The alias comes from RangeTblEntry->eref->aliasname. It might be better to have "road" as the prefix in the variable names rather than "public.road".

The reason why this happens is this code in get_variable():

3865         /* Exceptions occur only if the RTE is alias-less */
3866         if (rte->alias == NULL)
3867         {
3868             if (rte->rtekind == RTE_RELATION)
3869             {
3870                 /*
3871                  * It's possible that use of the bare refname would find another
3872                  * more-closely-nested RTE, or be ambiguous, in which case we need
3873                  * to specify the schemaname to avoid these errors.
3874                  */
3875                 if (find_rte_by_refname(rte->eref->aliasname, context) != rte)
3876                     schemaname = get_namespace_name(get_rel_namespace(rte->relid));
3877             }

If there is no alias, we look up the schema name and later add it to the prefix. In the inherited-table case, we are actually creating a kind of alias for the child table, and thus we should not look up the schema name and add it to the prefix. This case has been taken care of in get_from_clause_item():

6505     else if (rte->rtekind == RTE_RELATION &&
6506              strcmp(rte->eref->aliasname, get_relation_name(rte->relid)) != 0)
6507     {
6508         /*
6509          * Apparently the rel has been renamed since the rule was made.
6510          * Emit a fake alias clause so that variable references will still
6511          * work. This is not a 100% solution but should work in most
6512          * reasonable situations.
6513          */
6514         appendStringInfo(buf, " %s",
6515                          quote_identifier(rte->eref->aliasname));
6516         gavealias = true;
6517     }

I see similar code in ExplainTargetRel():

1778     if (objectname == NULL ||
1779         strcmp(rte->eref->aliasname, objectname) != 0)
1780         appendStringInfo(es->str, " %s",
1781                          quote_identifier(rte->eref->aliasname));

Based on this, here is a patch to not add the schema name to the prefix for a variable. I have run make check. All tests except inherit.sql passed; the expected-output change is included in the patch.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Enterprise Postgres Company

diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 9ad54c5..2e87183 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -3858,21 +3858,26 @@ get_variable(Var *var, int levelsup, bool showstar, deparse_context *context)
 		return NULL;
 	}
 
 	/* Identify names to use */
 	schemaname = NULL;			/* default assumptions */
 	refname = rte->eref->aliasname;
 
 	/* Exceptions occur only if the RTE is alias-less */
 	if (rte->alias == NULL)
 	{
-		if (rte->rtekind == RTE_RELATION)
+		/*
+		 * If the rel has been renamed since the rule was made, that's
+		 *
Re: [HACKERS] Confusing EXPLAIN output in case of inherited tables
On Wed, Jan 11, 2012 at 5:13 PM, Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote:
> [...]
> Based on this, here is a patch to not add the schema name to the prefix
> for a variable. I have run make check. All tests except inherit.sql
> passed; the expected-output change is included in the patch.

A table can inherit from one or more parent tables, so in that case qualifying the schema/table name helps in finding out where the column is coming from.

Regards,
Chetan

--
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Website: www.enterprisedb.com
Re: [HACKERS] [WIP] Double-write with Fast Checksums
On 10.01.2012 23:43, David Fetter wrote:
> Please find attached a new revision of the double-write patch. While
> this one still uses the checksums from VMware, it's been forward-ported
> to 9.2. I'd like to hold off on merging Simon's checksum patch into
> this one for now because there may be some independent issues.

Could you write this patch so that it doesn't depend on any of the checksum patches, please? That would make the patch smaller and easier to review, and it would allow benchmarking the performance impact of double-writes vs. full page writes independent of checksums.

At the moment, double-writes are done in one batch, fsyncing the double-write area first and the data files immediately after that. That's probably beneficial if you have a BBU, and/or a fairly large shared_buffers setting, so that pages don't get swapped between OS and PostgreSQL cache too much. But when those assumptions don't hold, it would be interesting to treat the double-write buffers more like a second WAL for full-page images: whenever a dirty page is evicted from shared_buffers, write it to the double-write area, but don't fsync it or write it back to the data file yet. Instead, let it sit in the double-write area, and grow the double-write file(s) as necessary, until the next checkpoint comes along.

In general, I must say that I'm pretty horrified by all the extra fsyncs this introduces. You really need a BBU to absorb them, and even then, you're fsyncing data files to disk much more frequently than you otherwise would.

Jignesh mentioned having run some performance tests with this. I would like to see those results, and some analysis and benchmarks of how settings like shared_buffers and the presence of a BBU affect this, compared to full_page_writes=on and off.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] [WIP] Double-write with Fast Checksums
On Wed, Jan 11, 2012 at 12:13 PM, Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
> At the moment, double-writes are done in one batch, fsyncing the
> double-write area first and the data files immediately after that.
> That's probably beneficial if you have a BBU, and/or a fairly large
> shared_buffers setting, so that pages don't get swapped between OS and
> PostgreSQL cache too much. But when those assumptions don't hold, it
> would be interesting to treat the double-write buffers more like a 2nd
> WAL for full-page images. Whenever a dirty page is evicted from
> shared_buffers, write it to the double-write area, but don't fsync it
> or write it back to the data file yet. Instead, let it sit in the
> double-write area, and grow the double-write file(s) as necessary,
> until the next checkpoint comes along.
>
> In general, I must say that I'm pretty horrified by all these extra
> fsync's this introduces. You really need a BBU to absorb them, and even
> then, you're fsyncing data files to disk much more frequently than you
> otherwise would.

Agreed. Almost exactly the design I've been mulling over while waiting for the patch to get tidied up.

Interestingly, you use the term "double write buffer", which is a concept that doesn't exist in the patch, and should. You don't say it, but presumably the bgwriter would flush double-write buffers as needed. Perhaps the checkpointer could do that otherwise, so we wouldn't need to send as many fsync messages.

Bottom line is that an increased number of fsyncs on main data files will throw the balance of performance out, so other performance tuning will go awry.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] checkpoint writeback via sync_file_range
On Wed, Jan 11, 2012 at 9:28 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> It does open the door to various other uses, so I think this work will
> be useful.

Yes, I think this would allow a better design for the checkpointer. The checkpoint scan will collect buffers to write for the checkpoint and sort them by file id, as Koichi/Itagaki already suggested. We then do all the writes for a particular file, then issue a background sync_file_range, then sleep a little. Loop. At the end of the loop, collect up and close out the sync_file_range calls with SYNC_FILE_RANGE_WAIT_AFTER. So we're interleaving the writes and fsyncs throughout the whole checkpoint, not bursting the fsyncs at the end. With that design we would have a continuous checkpoint, rather than one spread across just 0.5 or 0.9 of the checkpoint interval.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
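[The per-file loop described above could be sketched roughly like this; the helper name and flat block arrays are hypothetical, not the proposed checkpointer code, and note that sync_file_range flushes data pages only, never file metadata:]

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Sketch of one iteration of the interleaved scheme: write the sorted
 * dirty blocks of one file, start background write-out, and later close
 * the cycle with the WAIT_BEFORE|WRITE|WAIT_AFTER combination, which
 * the Linux man page describes as a write-for-data-integrity operation
 * on the range.  Returns 0 on success, -1 on error.
 */
int
checkpoint_one_file(int fd, const char *const *blocks,
					const off_t *offsets, int nblocks, size_t blocksize)
{
	for (int i = 0; i < nblocks; i++)
		if (pwrite(fd, blocks[i], blocksize, offsets[i]) != (ssize_t) blocksize)
			return -1;

	/* kick off background write-out of the whole file; don't wait */
	if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
		return -1;

	/* ... here the checkpointer would sleep and move to the next file ... */

	/* finally, wait for the write-out initiated above to complete */
	return sync_file_range(fd, 0, 0,
						   SYNC_FILE_RANGE_WAIT_BEFORE |
						   SYNC_FILE_RANGE_WRITE |
						   SYNC_FILE_RANGE_WAIT_AFTER);
}
```

[In the real design the final wait would happen well after the write phase for that file, which is exactly what spreads the fsync work through the checkpoint.]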
Re: [HACKERS] checkpoint writeback via sync_file_range
On Wednesday, January 11, 2012 03:14:31 AM Robert Haas wrote:
> Greg Smith muttered a while ago about wanting to do something with
> sync_file_range to improve checkpoint behavior on Linux. I thought he
> was talking about trying to sync only the range of blocks known to be
> dirty, which didn't seem like a very exciting idea, but after looking
> at the man page for sync_file_range, I think I understand what he was
> really going for: sync_file_range allows you to hint the Linux kernel
> that you'd like it to clean a certain set of pages. I further recall
> from Greg's previous comments that in the scenarios he's seen,
> checkpoint I/O spikes are caused not so much by the data written out by
> the checkpoint itself but by the other dirty data in the kernel buffer
> cache. Based on that, I whipped up the attached patch, which, if
> sync_file_range is available, simply iterates through everything that
> will eventually be fsync'd before beginning the write phase and tells
> the Linux kernel to put them all under write-out.

I played around with this before, and my problem was that sync_file_range is not really a hint. It actually starts writeback *directly* and only returns once the I/O has been placed in the queue (at least that's the way it was back then), which very quickly leads to it blocking all the time...

Andres
Re: [HACKERS] checkpoint writeback via sync_file_range
On Wednesday, January 11, 2012 10:28:11 AM Simon Riggs wrote:
> On Wed, Jan 11, 2012 at 4:38 AM, Greg Smith <g...@2ndquadrant.com> wrote:
> > [...]
> > Used this way, it's basically fsync without the wait or guarantee; it
> > just tries to push what's already dirty further ahead of the write
> > queue than those writes would otherwise be.
>
> I don't think this will help at all, I think it will just make things
> worse. The problem comes from hammering the fsyncs one after the other.
> What this patch does is initiate all of the fsyncs at the same time, so
> it will max out the disks even more because this will hit all disks all
> at once.

The advantage of using sync_file_range that way is that it starts writeout but doesn't cause queue drains/barriers/whatever to be issued, which can be quite a significant speed gain. In theory.

Andres
Re: [HACKERS] checkpoint writeback via sync_file_range
On Wednesday, January 11, 2012 10:33:47 AM Florian Weimer wrote:
> * Greg Smith:
> > [...]
>
> Isn't this pretty much like tuning vm.dirty_bytes? We generally set it
> to pretty low values, and it seems to help smooth out the checkpoints.

If done correctly (in a much more invasive way), you could issue sync_file_range only for the areas of the file where checkpointing needs to happen and leave out, e.g., hint-bit-only changes. That could help to reduce the cost of checkpoints.

Andres
Re: [HACKERS] JSON for PG 9.2
On 01/11/2012 01:18 AM, Pavel Stehule wrote:
> I like this patch and this feature.

I'm about to read the patch in detail - I certainly like the feature.

> I see only one issue - there is no functionality that helps generate
> JSON in pg. What do you think about functions: array_to_json(anyarray),
> row_to_json(any) and format_json(text, text, ...)

Actually, more than these, I (and at least one very interested client) want query_to_json, which would do something like:

# select q2json('select $$a$$ || x as b, y as c from generate_series(1,3) x, generate_series(4,5) y');
                                                 q2json
---------------------------------------------------------------------------------------------------------
 [{"b":"a1","c":4},{"b":"a1","c":5},{"b":"a2","c":4},{"b":"a2","c":5},{"b":"a3","c":4},{"b":"a3","c":5}]

No doubt several variants are possible, such as returning a setof json, one per row, instead of a single json value, and allowing query parameters as separate arguments (maybe just using variadic functions), but for a first go, probably something as simple as this would meet the case. Given the short time span available before patches must be in, I am prepared to work on this ASAP.

cheers

andrew
Re: [HACKERS] JSON for PG 9.2
2012/1/11 Andrew Dunstan <and...@dunslane.net>:
> On 01/11/2012 01:18 AM, Pavel Stehule wrote:
> [...]
> Actually, more than these, I (and at least one very interested client)
> want query_to_json, which would do something like:
>
> # select q2json('select $$a$$ || x as b, y as c from
> generate_series(1,3) x, generate_series(4,5) y');

We have query_to_xml, so there should be a similar query_to_json. But this is not enough for usage from stored procedures. What about two rich functions:

* query_to_json - by your proposal
* array_to_json - with the possibility to serialize an array of records

This can be a basic set.

Regards

Pavel
Re: [HACKERS] JSON for PG 9.2
On Wed, Jan 11, 2012 at 1:18 AM, Pavel Stehule <pavel.steh...@gmail.com> wrote:
> 2012/1/11 Robert Haas <robertmh...@gmail.com>:
>> On Tue, Dec 20, 2011 at 9:06 PM, David E. Wheeler <da...@kineticode.com> wrote:
>>> On Dec 20, 2011, at 10:39 AM, Claes Jakobsson wrote:
>>>> Are people explicitly asking for a) a *JSON* datatype or b) a type
>>>> that lets you store arbitrarily complex semi-untyped data structures?
>>> Yes. if b) then this might get a lot more interesting
>> JSON is the most popular/likely way to represent that, I think. On
>> that note, here's an updated version of the patch I posted upthread,
>> with some regression tests and minimal documentation.
> I like this patch and this feature. I see only one issue - there is no
> functionality that helps generate JSON in pg. What do you think about
> functions: array_to_json(anyarray), row_to_json(any) and
> format_json(text, text, ...)

I think we might want all of that stuff, but I doubt there is time to do it for 9.2. Actually, I think the next logical step would be to define equality (is there an official definition of that for JSON?) and build a btree opclass. I believe the code I've already written could be extended to construct an abstract syntax tree for those operations that need it. But we need to make some decisions first: a btree opclass requires a total ordering, so we have to arbitrarily define whether 1 < true, 1 < [1], 1 < "1", etc.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
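[To make the decision concrete: a comparator for such an opclass would typically impose a type rank first and compare content only within a type. A minimal sketch; the rank order below (null < bool < number < string) is purely an illustrative choice, not a proposal from this thread, and arrays/objects are elided for brevity:]

```c
#include <string.h>

/*
 * Sketch of an arbitrary total ordering over JSON scalar values.
 * Cross-type comparisons are decided purely by type rank; same-type
 * comparisons look at content.  Under these ranks, 1 > true and
 * 1 < "1" -- exactly the kind of arbitrary call a btree opclass forces.
 */
typedef enum { JS_NULL, JS_BOOL, JS_NUMBER, JS_STRING } JsType;

typedef struct JsValue
{
	JsType		type;
	int			boolval;		/* JS_BOOL */
	double		numval;			/* JS_NUMBER */
	const char *strval;			/* JS_STRING */
} JsValue;

int
js_compare(const JsValue *a, const JsValue *b)
{
	if (a->type != b->type)		/* cross-type: decided by rank alone */
		return (a->type < b->type) ? -1 : 1;

	switch (a->type)
	{
		case JS_BOOL:
			return a->boolval - b->boolval;
		case JS_NUMBER:
			return (a->numval > b->numval) - (a->numval < b->numval);
		case JS_STRING:
			return strcmp(a->strval, b->strval);
		default:				/* JS_NULL: all nulls compare equal */
			return 0;
	}
}
```

[Whatever ranks are picked, the point is that they are an arbitrary convention that must be fixed before the opclass can exist.]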
Re: [HACKERS] JSON for PG 9.2
2012/1/11 Robert Haas robertmh...@gmail.com: I think we might want all of that stuff, but I doubt there is time to do it for 9.2. Actually, I think the next logical step would be to define equality (is there an official definition of that for JSON?) and build a btree opclass. I believe the code I've already written could be extended to construct an abstract syntax tree for those operations that need it. But we need to make some decisions first. A btree opclass requires a total ordering, so we have to arbitrarily define whether 1 < true, 1 < [1], 1 < "1", etc. I don't understand why we have to do this. We don't support similar functionality for XML, so why for JSON? Pavel
Re: [HACKERS] checkpoint writeback via sync_file_range
On Tue, Jan 10, 2012 at 11:38 PM, Greg Smith g...@2ndquadrant.com wrote: What you're doing here doesn't care though, and I hadn't considered that SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its docs. Used this way, it's basically fsync without the wait or guarantee; it just tries to push what's already dirty further ahead of the write queue than those writes would otherwise be. Well, my goal was to make sure they got into the write queue rather than just sitting in memory while the kernel twiddles its thumbs. My hope is that the kernel is smart enough that, when you put something under write-out, it writes it out as quickly as it can without causing too much degradation in foreground activity. If that turns out to be an incorrect assumption, we'll need a different approach, but I thought it might be worth trying something simple first and seeing what happens. One idea I was thinking about here was building a little hash table inside of the fsync absorb code, tracking how many absorb operations have happened for whatever the most popular relation files are. The idea is that we might say "use sync_file_range every time N calls for a relation have come in", just to keep from ever accumulating too many writes to any one file before trying to nudge some of it out of there. The bat that keeps hitting me in the head here is that right now, a single fsync might have a full 1GB of writes to flush out, perhaps because it extended a table and then wrote more than that to it. And in everything but an SSD or giant SAN cache situation, 1GB of I/O is just too much to fsync at a time without the OS choking a little on it. That's not a bad idea, but there's definitely some potential down side: you might end up reducing write-combining quite significantly if you keep pushing things out to files when it isn't really needed yet. I was aiming to only push things out when we're 100% sure that they're going to have to be fsync'd, and certainly any already-written buffers that are in the OS cache at the start of a checkpoint fall into that category. That having been said, experimental evidence is king. I'll put this into my testing queue after the upcoming CF starts. Thanks! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] log messages for archive recovery progress
2012/01/11 19:56, Simon Riggs wrote: 2012/1/11 Euler Taveira de Oliveira eu...@timbira.com: On 08-01-2012 11:59, Satoshi Nagayasu / Uptime Technologies, LLC. wrote: [2011-12-08 15:14:36 JST] 16758: LOG: restored log file 00080046 from archive [2011-12-08 15:14:36 JST] 16758: LOG: recoverying 00080046 [2011-12-08 15:14:36 JST] 16758: LOG: restored log file 00080047 from archive [2011-12-08 15:14:36 JST] 16758: LOG: recoverying 00080047 cp: cannot stat `/backups/archlog/00080048': No such file or directory [2011-12-08 15:14:37 JST] 16758: LOG: could not restore file 00080048 from archive [2011-12-08 15:14:37 JST] 16758: LOG: attempting to look into pg_xlog [2011-12-08 15:14:37 JST] 16758: LOG: recoverying 00080048 What about just 'restored log file 00080048 from pg_xlog' instead of the last two messages? If you can't read from pg_xlog, emit 'could not restore file 00080048 from pg_xlog'. Yes, simple is better. We already have a message if the file is not present; we just need one if we switch to using pg_xlog. Please look at this. Thanks for the patch. I agree that simple is better. However, I'm a bit afraid that it will confuse DBAs if we use "restored" in the pg_xlog replay context, because we have already used "restored" to mean a WAL file was successfully copied (not replayed) from the archive directory into the pg_xlog directory in the archive recovery context. So, to indicate the status of copying WAL files from the archive directory, I think we can use "restored", or "could not restore" on failure. And to indicate the status of replaying WAL files in the pg_xlog directory (even if a WAL file was copied from the archive), we have to use "recover" or "replay". I'll try to revise my proposed log messages again. Thanks, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] JSON for PG 9.2
On Wed, Jan 11, 2012 at 8:38 AM, Pavel Stehule pavel.steh...@gmail.com wrote: I don't understand why we have to do this. We don't support similar functionality for XML, so why for JSON? Hrm. Well, that's an interesting point. Maybe we don't. I assumed that people would eventually want to optimize queries of the form SELECT whatever FROM tab WHERE jsoncol = 'constant'. If that's a sufficiently marginal use case that we don't care, then fine. One difference between JSON and XML is that XML really has no well-defined comparison semantics. For example, consider: <foo><bar>1.0</bar></foo> versus <foo><bar>1.0</bar> </foo> If the XML is being used as a transport mechanism, then the extra space is semantically insignificant, but if this is markup, then it might matter a lot. Also, consider: <foo><bar>1.00</bar></foo> That one might be equal if we think 1.0 is intended to be a number, but if it's intended as a string then it's not. We could perhaps do comparisons in XML relative to some DTD or schema if those provide details about what the values mean, but in a vacuum it's not well-defined. On the other hand, in JSON, it's pretty clear that [ 1, 2, 3 ] is the same value as [1,2,3], but "1,2,3" is different from "1, 2, 3". There are some borderline cases that might need some sweat, like whether 1 = 1.0 = 1.00 = 1e0, but in general the level of ambiguity seems to me to be much less, making it more feasible here than it would be for XML. That having been said, uncertainties about whether we want this at all (and if so in what form) are exactly why I didn't include this kind of stuff in the patch to begin with, and I think that if we get this much committed for 9.2 we'll be doing pretty well. If we can agree on and do more, great; if not, we'll at least have this much, which IMHO would be an improvement over what we have now.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers.
On Sat, Dec 31, 2011 at 10:34 PM, Simon Riggs si...@2ndquadrant.com wrote: Send new protocol keepalive messages to standby servers. Allows streaming replication users to calculate transfer latency and apply delay via internal functions. No external functions yet. pq_flush_if_writable() needs to be called just after WalSndKeepalive(). Otherwise, the keepalive packet is not sent for a while. +static void +ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime) walEnd is not used in ProcessWalSndrMessage() at all. Can't we remove it? If yes, the walEnd field in WalSndrMessage is also not used anywhere, so ISTM we can remove it too. + elog(DEBUG2, "sendtime %s receipttime %s replication apply delay %d transfer latency %d", + timestamptz_to_str(sendTime), + timestamptz_to_str(lastMsgReceiptTime), + GetReplicationApplyDelay(), + GetReplicationTransferLatency()); The units of replication apply delay and transfer latency should be included in the log message. GetReplicationApplyDelay() and GetReplicationTransferLatency() are called whenever the standby receives a message from the master, which might degrade the performance of replication a bit. So we should skip the above elog when log_min_messages > DEBUG2? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] checkpoint writeback via sync_file_range
On 1/11/12 4:33 AM, Florian Weimer wrote: Isn't this pretty much like tuning vm.dirty_bytes? We generally set it to pretty low values, and it seems to help smooth out the checkpoints. When I experimented with dropping the actual size of the cache, checkpoint spikes improved, but things like VACUUM ran terribly slow. On a typical medium to large server nowadays (let's say 16GB+), PostgreSQL needs to have gigabytes of write cache for good performance. What we're aiming for here is to keep the benefits of having that much write cache, while allowing checkpoint-related work to send increasingly strong suggestions about ordering what it needs written soon. There are basically three primary states on Linux to be concerned about here:

  Dirty: in the cache via a standard write
    |  pdflush does writeback at 5 or 10% dirty,
    v  or a sync_file_range push
  Writeback
    |  write happens in the background,
    v  or on an fsync call
  Stored on disk

The systems with bad checkpoint problems will typically have gigabytes Dirty, which is necessary for good performance. The kernel is very lazy about pushing things toward Writeback though. Getting the oldest portions of the outstanding writes into the Writeback queue more aggressively should make the eventual fsync less likely to block. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] checkpoint writeback via sync_file_range
On 1/11/12 7:46 AM, Andres Freund wrote: I played around with this before and my problem was that sync_file_range is not really a hint. It actually starts writeback *directly* and only returns when the I/O is placed inside the queue (at least that's the way it was back then). Which very quickly leads to it blocking all the time... Right, you're answering one of Robert's questions here: yes, once something is pushed toward writeback, it moves toward an actual write extremely fast. And the writeback queue can fill itself. But we don't really care if this blocks. There's a checkpointer process, it will be doing this work, and it has no other responsibilities anymore (as of 9.2, which is why some of these approaches suddenly become practical). It's going to get blocked waiting for things sometimes, the way it already does rarely when it writes, and often when it calls fsync. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] checkpoint writeback via sync_file_range
On Wednesday, January 11, 2012 03:20:09 PM Greg Smith wrote: On 1/11/12 7:46 AM, Andres Freund wrote: I played around with this before and my problem was that sync_file_range is not really a hint. It actually starts writeback *directly* and only returns when the I/O is placed inside the queue (at least that's the way it was back then). Which very quickly leads to it blocking all the time... Right, you're answering one of Robert's questions here: yes, once something is pushed toward writeback, it moves toward an actual write extremely fast. And the writeback queue can fill itself. But we don't really care if this blocks. There's a checkpointer process, it will be doing this work, and it has no other responsibilities anymore (as of 9.2, which is why some of these approaches suddenly become practical). It's going to get blocked waiting for things sometimes, the way it already does rarely when it writes, and often when it calls fsync. We do care, imo. The heavy pressure of putting it directly in the writeback queue leads to less efficient I/O, because quite often it won't reorder sensibly with other I/O anymore, and the like. At least that was my experience using it in another application. Lots of that changed with Linux 3.2 (a near-complete rewrite of the writeback mechanism), so a bit of that might be moot anyway. I definitely agree that 9.2 opens new possibilities there. Andres
Re: [HACKERS] log messages for archive recovery progress
On Wed, Jan 11, 2012 at 1:54 PM, Satoshi Nagayasu sn...@uptime.jp wrote: However, I'm a bit afraid that it will confuse DBAs if we use "restored" in the pg_xlog replay context, because we have already used "restored" to mean a WAL file was successfully copied (not replayed) from the archive directory into the pg_xlog directory in the archive recovery context. So, to indicate the status of copying WAL files from the archive directory, I think we can use "restored", or "could not restore" on failure. And to indicate the status of replaying WAL files in the pg_xlog directory (even if a WAL file was copied from the archive), we have to use "recover" or "replay". Agreed. I can change "restored" to "using", so we have two message types: LOG: restored log file 00080047 from archive LOG: using pre-existing log file 00080047 from pg_xlog -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] JSON for PG 9.2
2012/1/11 Robert Haas robertmh...@gmail.com: That having been said, uncertainties about whether we want this at all (and if so in what form) are exactly why I didn't include this kind of stuff in the patch to begin with, and I think that if we get this much committed for 9.2 we'll be doing pretty well. If we can agree on and do more, great; if not, we'll at least have this much, which IMHO would be an improvement over what we have now. I understand it now. My opinion is that some operators and index search can be in 9.2 - so use JSON just as a communication format now.
* we need to build JSON
* we need to check if something is valid JSON
* we need to store JSON
Other steps should be (9.2):
* basic operators eq, neq
* some tool like XQuery - a simple query language on JSON documents, available from SQL, that can be used for functional indexes.
Pavel
Re: [HACKERS] [WIP] Double-write with Fast Checksums
On Wed, Jan 11, 2012 at 7:13 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: At the moment, double-writes are done in one batch, fsyncing the double-write area first and the data files immediately after that. That's probably beneficial if you have a BBU, and/or a fairly large shared_buffers setting, so that pages don't get swapped between OS and PostgreSQL cache too much. But when those assumptions don't hold, it would be interesting to treat the double-write buffers more like a 2nd WAL for full-page images. Whenever a dirty page is evicted from shared_buffers, write it to the double-write area, but don't fsync it or write it back to the data file yet. Instead, let it sit in the double-write area, and grow the double-write file(s) as necessary, until the next checkpoint comes along. OK, but for correctness, you need to *fsync* the double-write buffer (WAL) before you can issue the write on the normal data file at all. All the double-write can do is move the FPW from the WAL stream (done at commit time) to some other double-buffer space (which can be done at write time). It still has to fsync the write-ahead part of the double-write before it can write any of the normal part, or you leave the torn-page possibility. And you still need to keep all the write-ahead part of the double-write around until all the normal writes have been fsynced (checkpoint time) so you can redo them all on crash recovery. So, I think that the work on double-writes has merit, but if it's done correctly, it isn't this magic bullet that suddenly gives us atomic, durable writes for free. It has major advantages, including (but not limited to): 1) Moving the FPW out of normal WAL/commit processing 2) Allowing fine control of (possibly separate) FPW locations on a per tablespace/relation basis It does this by moving the FPW/IO penalty from the commit time of the backend that first dirties the buffer, to the eviction time of the backend that evicts a dirty buffer. And if you're lucky enough that the background writer is the only one writing dirty buffers, you'll see lots of improvement in your performance (equivalent to running with current FPW off). But I have a feeling that many of us see backends having to write dirty buffers often enough that the reduction in commit/WAL latency will be offset (hopefully not as much) by increased query processing time as backends double-write dirty buffers. a. -- Aidan Van Dyk Create like a god, ai...@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Re: [HACKERS] JSON for PG 9.2
On Wed, Jan 11, 2012 at 9:41 AM, Pavel Stehule pavel.steh...@gmail.com wrote: I understand it now. My opinion is that some operators and index search can be in 9.2 - so use JSON just as a communication format now. * we need to build JSON * we need to check if something is valid JSON * we need to store JSON Other steps should be (9.2): * basic operators eq, neq * some tool like XQuery - a simple query language on JSON documents, available from SQL, that can be used for functional indexes. That would be nice, but let's not let the perfect be the enemy of the good. We don't have a lot of time here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [WIP] Double-write with Fast Checksums
On Wed, Jan 11, 2012 at 9:47 AM, Aidan Van Dyk ai...@highrise.ca wrote: It does this by moving the FPW/IO penalty from the commit time of the backend that first dirties the buffer, to the eviction time of the backend that evicts a dirty buffer. And if you're lucky enough that the background writer is the only one writing dirty buffers, you'll see lots of improvement in your performance (equivalent to running with current FPW off). But I have a feeling that many of us see backends having to write dirty buffers often enough that the reduction in commit/WAL latency will be offset (hopefully not as much) by increased query processing time as backends double-write dirty buffers. I have that feeling, too. Someone needs to devote some time to performance testing this stuff. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers.
On Wed, Jan 11, 2012 at 2:05 PM, Fujii Masao masao.fu...@gmail.com wrote: On Sat, Dec 31, 2011 at 10:34 PM, Simon Riggs si...@2ndquadrant.com wrote: Send new protocol keepalive messages to standby servers. Allows streaming replication users to calculate transfer latency and apply delay via internal functions. No external functions yet. Thanks for further review. pq_flush_if_writable() needs to be called just after WalSndKeepalive(). Otherwise, the keepalive packet is not sent for a while. It will get sent though, won't it? Maybe not immediately. I guess we may as well flush though, since we're not doing anything else - by definition. Will add. +static void +ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime) walEnd is not used in ProcessWalSndrMessage() at all. Can't we remove it? If yes, the walEnd field in WalSndrMessage is also not used anywhere, so ISTM we can remove it. It's there to allow extension of the message processing to be more complex than it currently is. Changing the protocol is much harder than changing a function call. I'd like to keep it since it doesn't have any negative effects. + elog(DEBUG2, "sendtime %s receipttime %s replication apply delay %d transfer latency %d", + timestamptz_to_str(sendTime), + timestamptz_to_str(lastMsgReceiptTime), + GetReplicationApplyDelay(), + GetReplicationTransferLatency()); The units of replication apply delay and transfer latency should be included in the log message. OK, will do. GetReplicationApplyDelay() and GetReplicationTransferLatency() are called whenever the standby receives a message from the master, which might degrade the performance of replication a bit. So we should skip the above elog when log_min_messages > DEBUG2? OK, will put in a specific test for you. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] JSON for PG 9.2
2012/1/11 Robert Haas robertmh...@gmail.com: On Wed, Jan 11, 2012 at 9:41 AM, Pavel Stehule pavel.steh...@gmail.com wrote: I understand it now. My opinion is that some operators and index search can be in 9.2 - so use JSON just as a communication format now. * we need to build JSON * we need to check if something is valid JSON * we need to store JSON Other steps should be (9.2): * basic operators eq, neq * some tool like XQuery - a simple query language on JSON documents, available from SQL, that can be used for functional indexes. That would be nice, but let's not let the perfect be the enemy of the good. We don't have a lot of time here. Sorry - replace 9.2 by 9.3 above. I am able to write the array_to_json function, and Andrew can write query_to_json. Pavel -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] JSON for PG 9.2
I am able to write the array_to_json function, and Andrew can write query_to_json. +1 Thanks guys... We are using a lot of JSON as a communication protocol... having core support for JSON, and those functions, will be a real life saver... Many thanks, Misa Sent from my Windows Phone From: Pavel Stehule Sent: 11/01/2012 16:22 To: Robert Haas Cc: David E. Wheeler; Claes Jakobsson; Dimitri Fontaine; Merlin Moncure; Andrew Dunstan; Magnus Hagander; Jan Urbański; Simon Riggs; Joey Adams; Bruce Momjian; PostgreSQL-development Hackers; Jan Wieck Subject: Re: [HACKERS] JSON for PG 9.2
Re: [HACKERS] Sending notifications from the master to the standby
Tom, BTW ... it occurs to me to ask whether we really have a solid use-case for having listeners attached to slave servers. I have personally never seen an application for LISTEN/NOTIFY in which the listeners were entirely read-only. Even if there are one or two cases out there, it's not clear to me that supporting it is worth the extra complexity that seems to be needed. Actually, I've seen requests for it from my clients and on IRC. Not sure how serious those are, but users have brought it up. Certainly users intuitively think they should be able to LISTEN on a standby, and are surprised when they find out they can't. The basic idea is that if we can replicate LISTENs, then you can use replication as a simple distributed (and lossy) queueing system. This is especially useful if the replica is geographically distant, and there are a lot of listeners. The obvious first use case for this is for cache invalidation. For example, we have one application where we're using Redis to queue cache invalidation messages; if LISTEN/NOTIFY were replicated, we could use it instead and simplify our infrastructure. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
[HACKERS] PL/Python result metadata
There is currently no reliable way to retrieve from a result object in PL/Python the number, name, or type of the result columns. You can get the number and name if the query returned more than zero rows by looking at the row dicts, but that is unreliable. The type information isn't available at all. I propose to add two functions to the result object:

.colnames() returns a list of column names (strings)
.coltypes() returns a list of type OIDs (integers)

I just made that up because there is no guidance in the other standard PLs for this sort of thing, AFAICT.

Patch attached. Comments welcome.

diff --git i/doc/src/sgml/plpython.sgml w/doc/src/sgml/plpython.sgml
index 618f8d0..69c9c90 100644
--- i/doc/src/sgml/plpython.sgml
+++ w/doc/src/sgml/plpython.sgml
@@ -886,9 +886,11 @@ $$ LANGUAGE plpythonu;
    list or dictionary object.  The result object can be accessed by
    row number and column name.  It has these additional methods:
    <function>nrows</function> which returns the number of rows
-   returned by the query, and <function>status</function> which is the
-   <function>SPI_execute()</function> return value.  The result object
-   can be modified.
+   returned by the query, <function>status</function> which is the
+   <function>SPI_execute()</function> return value,
+   <function>colnames</function> which is the list of column names, and
+   <function>coltypes</function> which is the list of column type OIDs.  The
+   result object can be modified.
   </para>

   <para>
diff --git i/src/pl/plpython/expected/plpython_spi.out w/src/pl/plpython/expected/plpython_spi.out
index 3b4d7a3..cd76147 100644
--- i/src/pl/plpython/expected/plpython_spi.out
+++ w/src/pl/plpython/expected/plpython_spi.out
@@ -117,10 +117,12 @@ SELECT join_sequences(sequences) FROM sequences
 --
 CREATE FUNCTION result_nrows_test() RETURNS int AS $$
-plan = plpy.prepare("SELECT 1 UNION SELECT 2")
+plan = plpy.prepare("SELECT 1 AS foo, '11'::text AS bar UNION SELECT 2, '22'")
 plpy.info(plan.status()) # not really documented or useful
 result = plpy.execute(plan)
 if result.status() > 0:
+    plpy.info(result.colnames())
+    plpy.info(result.coltypes())
     return result.nrows()
 else:
     return None
@@ -128,6 +130,10 @@ $$ LANGUAGE plpythonu;
 SELECT result_nrows_test();
 INFO:  True
 CONTEXT:  PL/Python function "result_nrows_test"
+INFO:  ['foo', 'bar']
+CONTEXT:  PL/Python function "result_nrows_test"
+INFO:  [23, 25]
+CONTEXT:  PL/Python function "result_nrows_test"
  result_nrows_test
---
                  2
diff --git i/src/pl/plpython/plpy_resultobject.c w/src/pl/plpython/plpy_resultobject.c
index bf46a16..e7d14d4 100644
--- i/src/pl/plpython/plpy_resultobject.c
+++ w/src/pl/plpython/plpy_resultobject.c
@@ -12,6 +12,8 @@
 static void PLy_result_dealloc(PyObject *arg);
+static PyObject *PLy_result_colnames(PyObject *self, PyObject *unused);
+static PyObject *PLy_result_coltypes(PyObject *self, PyObject *unused);
 static PyObject *PLy_result_nrows(PyObject *self, PyObject *args);
 static PyObject *PLy_result_status(PyObject *self, PyObject *args);
 static Py_ssize_t PLy_result_length(PyObject *arg);
@@ -35,6 +37,8 @@ static PySequenceMethods PLy_result_as_sequence = {
 };

 static PyMethodDef PLy_result_methods[] = {
+	{"colnames", PLy_result_colnames, METH_NOARGS, NULL},
+	{"coltypes", PLy_result_coltypes, METH_NOARGS, NULL},
 	{"nrows", PLy_result_nrows, METH_VARARGS, NULL},
 	{"status", PLy_result_status, METH_VARARGS, NULL},
 	{NULL, NULL, 0, NULL}
@@ -96,6 +100,7 @@ PLy_result_new(void)
 	ob->status = Py_None;
 	ob->nrows = PyInt_FromLong(-1);
 	ob->rows = PyList_New(0);
+	ob->tupdesc = NULL;

 	return (PyObject *) ob;
 }

@@ -108,11 +113,44 @@ PLy_result_dealloc(PyObject *arg)
 	Py_XDECREF(ob->nrows);
 	Py_XDECREF(ob->rows);
 	Py_XDECREF(ob->status);
+	if (ob->tupdesc)
+	{
+		FreeTupleDesc(ob->tupdesc);
+		ob->tupdesc = NULL;
+	}

 	arg->ob_type->tp_free(arg);
 }

 static PyObject *
+PLy_result_colnames(PyObject *self, PyObject *unused)
+{
+	PLyResultObject *ob = (PLyResultObject *) self;
+	PyObject   *list;
+	int			i;
+
+	list = PyList_New(ob->tupdesc->natts);
+	for (i = 0; i < ob->tupdesc->natts; i++)
+		PyList_SET_ITEM(list, i, PyString_FromString(NameStr(ob->tupdesc->attrs[i]->attname)));
+
+	return list;
+}
+
+static PyObject *
+PLy_result_coltypes(PyObject *self, PyObject *unused)
+{
+	PLyResultObject *ob = (PLyResultObject *) self;
+	PyObject   *list;
+	int			i;
+
+	list = PyList_New(ob->tupdesc->natts);
+	for (i = 0; i < ob->tupdesc->natts; i++)
+		PyList_SET_ITEM(list, i, PyInt_FromLong(ob->tupdesc->attrs[i]->atttypid));
+
+	return list;
+}
+
+static PyObject *
 PLy_result_nrows(PyObject *self, PyObject *args)
 {
 	PLyResultObject *ob = (PLyResultObject *) self;
diff --git i/src/pl/plpython/plpy_resultobject.h w/src/pl/plpython/plpy_resultobject.h
index 719828a..1b37d1d 100644
--- i/src/pl/plpython/plpy_resultobject.h
+++ w/src/pl/plpython/plpy_resultobject.h
@@ -5,6 +5,9 @@
 #ifndef PLPY_RESULTOBJECT_H
 #define PLPY_RESULTOBJECT_H

+#include
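For readers who want to see the shape of the proposed API without a server at hand, here is a small stand-in in plain Python. The `MockResult` class is purely illustrative; only `plpy` inside the backend provides the real result object. The OIDs below (23 = int4, 25 = text) match the expected regression-test output in the patch.

```python
# Illustrative sketch only: a stand-in for the proposed result-object API,
# so the intended behavior of .colnames()/.coltypes() can be seen outside
# the server.

class MockResult:
    """Mimics the PL/Python result object with the proposed additions."""

    def __init__(self, rows, colnames, coltypes):
        self._rows = rows          # list of row dicts, as PL/Python returns
        self._colnames = colnames  # column names in result order
        self._coltypes = coltypes  # pg_type OIDs in result order

    def nrows(self):
        return len(self._rows)

    def colnames(self):
        # Works even for zero-row results, unlike inspecting the row dicts.
        return list(self._colnames)

    def coltypes(self):
        return list(self._coltypes)

# The query from the regression test:
#   SELECT 1 AS foo, '11'::text AS bar UNION SELECT 2, '22'
result = MockResult(
    rows=[{"foo": 1, "bar": "11"}, {"foo": 2, "bar": "22"}],
    colnames=["foo", "bar"],
    coltypes=[23, 25],
)

print(result.colnames())  # ['foo', 'bar']
print(result.coltypes())  # [23, 25]
print(result.nrows())     # 2
```

The point of the proposal is visible here: the metadata lives on the result object itself, so it is available even when the query returns no rows.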
Re: [HACKERS] PL/Python result metadata
Excerpts from Peter Eisentraut's message of Wed Jan 11 17:05:34 -0300 2012:

> There is currently no reliable way to retrieve from a result object in
> PL/Python the number, name, or type of the result columns. [...]
> I propose to add two functions to the result object:
>
> .colnames() returns a list of column names (strings)
> .coltypes() returns a list of type OIDs (integers)

No typmods?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] psql case preserving completion
In psql, the tab completion always converts key words to upper case. In practice, I and I think most users type in lower case. So then you end up with commands looking like this:

=> alter TABLE foo add CONSTRAINT bar check (a > 0);

To address this, I have implemented a slightly different completion mode that looks at the word being completed and converts the completed word to the case of the original word. (Well, it looks at the first letter.) In fact, since almost all completions in psql are of this nature, I made this the default mode for COMPLETE_WITH_CONST and COMPLETE_WITH_LIST and added a new macro COMPLETE_WITH_LIST_CS that uses the old case-sensitive behavior. The latter is used mainly for completing backslash commands.

After playing with this a little, I find the behavior more pleasing. Less yelling. ;-)

Patch attached.

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index a27ef69..5d042f0 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -132,6 +132,7 @@
 static const char *const * completion_charpp;	/* to pass a list of strings */
 static const char *completion_info_charp;	/* to pass a second string */
 static const char *completion_info_charp2;	/* to pass a third string */
 static const SchemaQuery *completion_squery;	/* to pass a SchemaQuery */
+static bool completion_case_sensitive;	/* completion is case sensitive */

 /*
  * A few macros to ease typing.  You can use these to complete the given
@@ -155,15 +156,24 @@
 do { \
 	matches = completion_matches(text, complete_from_schema_query); \
 } while (0)

+#define COMPLETE_WITH_LIST_CS(list) \
+do { \
+	completion_charpp = list; \
+	completion_case_sensitive = true; \
+	matches = completion_matches(text, complete_from_list); \
+} while (0)
+
 #define COMPLETE_WITH_LIST(list) \
 do { \
 	completion_charpp = list; \
+	completion_case_sensitive = false; \
 	matches = completion_matches(text, complete_from_list); \
 } while (0)

 #define COMPLETE_WITH_CONST(string) \
 do { \
 	completion_charp = string; \
+	completion_case_sensitive = false; \
 	matches = completion_matches(text, complete_from_const); \
 } while (0)

@@ -771,7 +781,7 @@ psql_completion(char *text, int start, int end)
 	/* If a backslash command was started, continue */
 	if (text[0] == '\\')
-		COMPLETE_WITH_LIST(backslash_commands);
+		COMPLETE_WITH_LIST_CS(backslash_commands);

 	/* Variable interpolation */
 	else if (text[0] == ':' && text[1] != ':')
@@ -2864,7 +2874,7 @@ psql_completion(char *text, int start, int end)
 			"null", "fieldsep", "tuples_only", "title", "tableattr",
 			"linestyle", "pager", "recordsep", NULL};

-			COMPLETE_WITH_LIST(my_list);
+			COMPLETE_WITH_LIST_CS(my_list);
 		}
 		else if (strcmp(prev2_wd, "\\pset") == 0)
 		{
@@ -2874,14 +2884,14 @@ psql_completion(char *text, int start, int end)
 			{"unaligned", "aligned", "wrapped", "html", "latex",
 			"troff-ms", NULL};

-			COMPLETE_WITH_LIST(my_list);
+			COMPLETE_WITH_LIST_CS(my_list);
 		}
 		else if (strcmp(prev_wd, "linestyle") == 0)
 		{
 			static const char *const my_list[] =
 			{"ascii", "old-ascii", "unicode", NULL};

-			COMPLETE_WITH_LIST(my_list);
+			COMPLETE_WITH_LIST_CS(my_list);
 		}
 	}
 	else if (strcmp(prev_wd, "\\set") == 0)
@@ -3234,6 +3244,31 @@ _complete_from_query(int is_schema_query, const char *text, int state)

 /*
+ * Make a pg_strdup copy of s and convert it to the same case as ref.
+ */
+static char *
+pg_strdup_same_case(const char *s, const char *ref)
+{
+	char	   *ret,
+			   *p;
+	unsigned char first = ref[0];
+
+	if (isalpha(first))
+	{
+		ret = pg_strdup(s);
+		if (islower(first))
+			for (p = ret; *p; p++)
+				*p = pg_tolower((unsigned char) *p);
+		else
+			for (p = ret; *p; p++)
+				*p = pg_toupper((unsigned char) *p);
+		return ret;
+	}
+	else
+		return pg_strdup(s);
+}
+
+
+/*
  * This function returns in order one of a fixed, NULL pointer terminated list
  * of strings (if matching). This can be used if there are only a fixed number
  * SQL words that can appear at certain spot.
@@ -3255,7 +3290,7 @@ complete_from_list(const char *text, int state)
 	{
 		list_index = 0;
 		string_length = strlen(text);
-		casesensitive = true;
+		casesensitive = completion_case_sensitive;
 		matches = 0;
 	}
@@ -3270,7 +3305,14 @@
 		/* Second pass is case insensitive, don't bother counting matches */
 		if (!casesensitive && pg_strncasecmp(text, item, string_length) == 0)
-			return pg_strdup(item);
+		{
+			if (completion_case_sensitive)
+				return pg_strdup(item);
+			else
+				/* If case insensitive matching was requested initially, return
+				 * it in the case of what was already entered. */
+				return pg_strdup_same_case(item, text);
+		}
 	}

 /*
@@ -3300,12 +3342,16 @@
 static char *
 complete_from_const(const char *text, int state)
 {
-	(void) text;	/* We don't care about what was entered
-					 * already. */
-
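The case-preserving logic is small enough to sketch in plain Python. This is an illustration of the idea only, not the C implementation in tab-complete.c: match candidates case-insensitively, then return the completion in the case of what the user already typed, keyed off the first letter.

```python
# Sketch of the patch's case-preserving completion idea (illustrative;
# the real code is C in psql's tab-complete.c).

def same_case(word, ref):
    """Convert word to the case of ref's first letter (cf. pg_strdup_same_case)."""
    if not ref or not ref[0].isalpha():
        return word
    return word.lower() if ref[0].islower() else word.upper()

def complete(typed, candidates, case_sensitive=False):
    """Return completions for typed, preserving the user's case unless
    case_sensitive (the COMPLETE_WITH_LIST_CS behavior, used for
    backslash commands)."""
    out = []
    for cand in candidates:
        if case_sensitive:
            if cand.startswith(typed):
                out.append(cand)
        elif cand.lower().startswith(typed.lower()):
            out.append(same_case(cand, typed))
    return out

keywords = ["CONSTRAINT", "CONSTRAINTS"]
print(complete("cons", keywords))  # ['constraint', 'constraints'] -- no yelling
print(complete("CONS", keywords))  # ['CONSTRAINT', 'CONSTRAINTS']
print(complete("\\co", ["\\copy", "\\conninfo"], case_sensitive=True))
```

A user typing `cons<TAB>` would thus keep lowercase, while `CONS<TAB>` still yields the traditional uppercase keywords.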
Re: [HACKERS] psql case preserving completion
2012/1/11 Peter Eisentraut pete...@gmx.net:
> In psql, the tab completion always converts key words to upper case. In
> practice, I and I think most users type in lower case. [...] After
> playing with this a little, I find the behavior more pleasing. Less
> yelling. ;-)
>
> Patch attached.

+1

Pavel
[HACKERS] [PATCH] renaming constraints
Here is a patch to add a command ALTER TABLE ... RENAME CONSTRAINT ...

Currently, it only supports table constraints. I have an almost finished patch for renaming domain constraints, but it is easier to keep it separate.

diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 1976f6d..cd9ee97 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -25,6 +25,8 @@
 ALTER TABLE [ ONLY ] <replaceable class="PARAMETER">name</replaceable> [ * ]
     <replaceable class="PARAMETER">action</replaceable> [, ... ]
 ALTER TABLE [ ONLY ] <replaceable class="PARAMETER">name</replaceable> [ * ]
     RENAME [ COLUMN ] <replaceable class="PARAMETER">column</replaceable> TO <replaceable class="PARAMETER">new_column</replaceable>
+ALTER TABLE [ ONLY ] <replaceable class="PARAMETER">name</replaceable> [ * ]
+    RENAME CONSTRAINT <replaceable class="PARAMETER">constraint_name</replaceable> TO <replaceable class="PARAMETER">new_constraint_name</replaceable>
 ALTER TABLE <replaceable class="PARAMETER">name</replaceable>
     RENAME TO <replaceable class="PARAMETER">new_name</replaceable>
 ALTER TABLE <replaceable class="PARAMETER">name</replaceable>
@@ -555,8 +557,8 @@
    <listitem>
     <para>
      The <literal>RENAME</literal> forms change the name of a table
-     (or an index, sequence, or view) or the name of an individual column in
-     a table.  There is no effect on the stored data.
+     (or an index, sequence, or view), the name of an individual column in
+     a table, or the name of a constraint of the table.  There is no effect on the stored data.
     </para>
    </listitem>
   </varlistentry>
@@ -869,7 +871,8 @@
   <para>
    If a table has any descendant tables, it is not permitted to add,
-   rename, or change the type of a column in the parent table without doing
+   rename, or change the type of a column, or rename an inherited constraint
+   in the parent table without doing
    the same to the descendants.  That is, <command>ALTER TABLE ONLY</command>
    will be rejected.  This ensures that the descendants always have
    columns matching the parent.
@@ -969,6 +972,13 @@ ALTER TABLE distributors RENAME TO suppliers;
  </para>

  <para>
+  To rename an existing constraint:
+<programlisting>
+ALTER TABLE distributors RENAME CONSTRAINT zipchk TO zip_check;
+</programlisting>
+ </para>
+
+ <para>
   To add a not-null constraint to a column:
<programlisting>
ALTER TABLE distributors ALTER COLUMN street SET NOT NULL;
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 7c658c0..ef8c5fb 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -57,6 +57,10 @@ ExecRenameStmt(RenameStmt *stmt)
 			RenameCollation(stmt->object, stmt->newname);
 			break;

+		case OBJECT_CONSTRAINT:
+			RenameConstraint(stmt);
+			break;
+
 		case OBJECT_CONVERSION:
 			RenameConversion(stmt->object, stmt->newname);
 			break;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c373016..33ba611 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2320,6 +2320,108 @@ renameatt(RenameStmt *stmt)
 			  stmt->behavior);
 }

+
+/*
+ * same logic as renameatt_internal
+ */
+static void
+rename_constraint_internal(Oid myrelid,
+						   const char *oldconname,
+						   const char *newconname,
+						   bool recurse,
+						   bool recursing,
+						   int expected_parents)
+{
+	Relation	targetrelation;
+	Oid			constraintOid;
+	HeapTuple	tuple;
+	Form_pg_constraint con;
+
+	targetrelation = relation_open(myrelid, AccessExclusiveLock);
+	/* don't tell it whether we're recursing; we allow changing typed tables here */
+	renameatt_check(myrelid, RelationGetForm(targetrelation), false);
+
+	constraintOid = get_constraint_oid(myrelid, oldconname, false);
+
+	tuple = SearchSysCache1(CONSTROID, ObjectIdGetDatum(constraintOid));
+	if (!HeapTupleIsValid(tuple))
+		elog(ERROR, "cache lookup failed for constraint %u",
+			 constraintOid);
+	con = (Form_pg_constraint) GETSTRUCT(tuple);
+
+	if (!con->conisonly)
+	{
+		if (recurse)
+		{
+			List	   *child_oids,
+					   *child_numparents;
+			ListCell   *lo,
+					   *li;
+
+			child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
+											 &child_numparents);
+
+			forboth(lo, child_oids, li, child_numparents)
+			{
+				Oid			childrelid = lfirst_oid(lo);
+				int			numparents = lfirst_int(li);
+
+				if (childrelid == myrelid)
+					continue;
+
+				rename_constraint_internal(childrelid, oldconname, newconname, false, true, numparents);
+			}
+		}
+		else
+		{
+			if (expected_parents == 0 &&
+				find_inheritance_children(myrelid, NoLock) != NIL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+						 errmsg("inherited constraint \"%s\" must be renamed in child tables too",
+								oldconname)));
+		}
+
+		if (con->coninhcount > expected_parents)
+
Re: [HACKERS] [WIP] Double-write with Fast Checksums
Thanks for all the comments and suggestions on the double-write patch. We are working on generating performance results for the 9.2 patch, but there is enough difference between 9.0 and 9.2 that it will take some time.

One thing in 9.2 that may be causing problems with the current patch is the fact that the checkpointer and bgwriter are separated and can run at the same time (I think), and therefore will contend on the double-write file. Is there any thought that the bgwriter might be paused while the checkpointer is doing a checkpoint, since the checkpointer is doing some of the cleaning that the bgwriter wants to do anyways?

The current patch (as mentioned) also may not do well if there are a lot of dirty-page evictions by backends, because of the extra fsyncing just to write individual buffers. I think Heikki's (and Simon's) idea of a growing shared double-write buffer (only doing double-writes when it gets to a certain size) instead is a great idea that could deal with the dirty-page eviction issue with less performance hit. It could also deal with the checkpointer/bgwriter contention, if we can't avoid that. I will think about that approach and any issues that might arise. But for now, we will work on getting performance numbers for the current patch.

With respect to all the extra fsyncs, I agree they are expensive if done on individual buffers by backends. For the checkpointer, there will be extra fsyncs, but the batching helps greatly, and the fsyncs per batch are traded off against the often large unpredictable fsyncs at the end of checkpoints. In our performance runs on 9.0, the configuration was such that there were not a lot of dirty evictions, and the checkpointer/bgwriter was able to finish the checkpoint on time, even with the double writes.

And just wanted to reiterate one other benefit of double writes -- it greatly reduces the size of the WAL logs.
Thanks, Dan

----- Original Message -----
From: Heikki Linnakangas heikki.linnakan...@enterprisedb.com
To: David Fetter da...@fetter.org
Cc: PG Hackers pgsql-hackers@postgresql.org, jks...@gmail.com
Sent: Wednesday, January 11, 2012 4:13:01 AM
Subject: Re: [HACKERS] [WIP] Double-write with Fast Checksums

On 10.01.2012 23:43, David Fetter wrote:
> Please find attached a new revision of the double-write patch. While
> this one still uses the checksums from VMware, it's been forward-ported
> to 9.2. I'd like to hold off on merging Simon's checksum patch into this
> one for now because there may be some independent issues.

Could you write this patch so that it doesn't depend on any of the checksum patches, please? That would make the patch smaller and easier to review, and it would allow benchmarking the performance impact of double-writes vs full page writes independent of checksums.

At the moment, double-writes are done in one batch, fsyncing the double-write area first and the data files immediately after that. That's probably beneficial if you have a BBU, and/or a fairly large shared_buffers setting, so that pages don't get swapped between OS and PostgreSQL cache too much. But when those assumptions don't hold, it would be interesting to treat the double-write buffers more like a 2nd WAL for full-page images. Whenever a dirty page is evicted from shared_buffers, write it to the double-write area, but don't fsync it or write it back to the data file yet. Instead, let it sit in the double-write area, and grow the double-write file(s) as necessary, until the next checkpoint comes along.

In general, I must say that I'm pretty horrified by all these extra fsync's this introduces. You really need a BBU to absorb them, and even then, you're fsyncing data files to disk much more frequently than you otherwise would. Jignesh mentioned having run some performance tests with this.
I would like to see those results, and some analysis and benchmarks of how settings like shared_buffers and the presence of BBU affect this, compared to full_page_writes=on and off.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
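For readers following the thread, the batched double-write scheme under discussion can be sketched like this. This is an illustrative simulation in Python, not the patch itself (the function and file names are made up): pages are first written and fsynced to a double-write file, and only then written to their home location, so a torn page in the data file can always be repaired from the double-write file after a crash.

```python
# Minimal sketch of the double-write idea (illustrative only): dirty pages
# are appended to a double-write file and fsynced before being written to
# the data file, which is why full-page images need not go to the WAL.
import os
import tempfile

PAGE_SIZE = 8192

def flush_batch(dirty, dw_path, data_path):
    """Durably write a batch of dirty pages using a double-write file."""
    # Phase 1: write all pages (tagged with their page number) to the
    # double-write file and fsync it.
    with open(dw_path, "wb") as dw:
        for pageno, page in dirty.items():
            dw.write(pageno.to_bytes(8, "little") + page)
        dw.flush()
        os.fsync(dw.fileno())
    # Phase 2: only now write the pages to their home locations and fsync.
    # A crash mid-phase-2 leaves a recoverable copy in the DW file.
    with open(data_path, "r+b") as f:
        for pageno, page in dirty.items():
            f.seek(pageno * PAGE_SIZE)
            f.write(page)
        f.flush()
        os.fsync(f.fileno())

with tempfile.TemporaryDirectory() as d:
    data_path = os.path.join(d, "base")
    dw_path = os.path.join(d, "double_write")
    with open(data_path, "wb") as f:
        f.write(b"\x00" * (PAGE_SIZE * 4))  # 4 zeroed pages
    dirty = {1: b"A" * PAGE_SIZE, 3: b"B" * PAGE_SIZE}
    flush_batch(dirty, dw_path, data_path)
    with open(data_path, "rb") as f:
        content = f.read()

print(content[PAGE_SIZE], content[3 * PAGE_SIZE])  # 65 66
```

Heikki's alternative is visible in this shape too: phase 2 could simply be deferred, letting the double-write file grow until the next checkpoint, which trades fsync frequency for double-write file size.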
[HACKERS] pg_basebackup is not checking IDENTIFY_SYSTEM number of columns
Hi,

I just noticed $SUBJECT, and this could lead to a segmentation fault if by accident we point to a system with a different number of columns in IDENTIFY_SYSTEM; at least, I pointed pg_receivexlog from current HEAD at a 9.0 installation and got one.

Any reason for not checking the number of columns?

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Re: [HACKERS] PL/Python result metadata
Peter Eisentraut pete...@gmx.net writes:
> .colnames() returns a list of column names (strings)
> .coltypes() returns a list of type OIDs (integers)
>
> I just made that up because there is no guidance in the other standard
> PLs for this sort of thing, AFAICT.

What about having the same or comparable API as in psycopg or DB API? http://initd.org/psycopg/docs/cursor.html

You could expose a py.description structure?

-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] pg_basebackup is not checking IDENTIFY_SYSTEM number of columns
On Wed, Jan 11, 2012 at 22:31, Jaime Casanova ja...@2ndquadrant.com wrote:
> I just notice $SUBJECT and this could lead us to a segmentation fault
> if by accident we point to a system with a different number of columns
> in IDENTIFY_SYSTEM, at least i point pg_receivexlog from current head
> to a 9.0 instalation and got that.
>
> Any reason for not checking number of columns?

Hmm. I could've bet it did that. I must've taken that code out at some point during refactoring. No, no reason. Adding such a check would be a good idea.

-- 
Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
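The fix being agreed on amounts to a defensive check before indexing into the query result. Sketched in plain Python (illustrative only; the real code is C using libpq, where the counts come from PQntuples/PQnfields, and the function below is hypothetical):

```python
# Sketch of the missing defensive check: validate the shape of the
# IDENTIFY_SYSTEM reply before touching its fields, instead of assuming
# the server speaks the same protocol vintage.

def parse_identify_system(result):
    """result is (rows, ncols), each row a tuple.

    Expect exactly 1 row and at least 3 columns (systemid, timeline,
    current xlog position); raise instead of crashing otherwise."""
    rows, ncols = result
    if len(rows) != 1 or ncols < 3:
        raise RuntimeError(
            "unexpected IDENTIFY_SYSTEM reply: got %d rows, %d fields"
            % (len(rows), ncols))
    sysid, timeline, xlogpos = rows[0][:3]
    return sysid, int(timeline), xlogpos

# A 9.1-style three-column reply parses fine:
print(parse_identify_system(([("6213457809401", "1", "0/2000060")], 3)))

# A reply with fewer columns (e.g. from an older server) now fails
# cleanly instead of reading past the end of the result:
try:
    parse_identify_system(([("6213457809401",)], 1))
except RuntimeError as e:
    print("error:", e)
```

The same pattern applies to any replication-protocol response whose shape has changed across server versions.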
Re: [HACKERS] 16-bit page checksums for 9.2
On Sun, Jan 8, 2012 at 2:03 PM, Simon Riggs si...@2ndquadrant.com wrote:
> On Sat, Jan 7, 2012 at 11:09 AM, Simon Riggs si...@2ndquadrant.com wrote:
>> On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs si...@2ndquadrant.com wrote:
>>> So there isn't any problem with there being incorrect checksums on
>>> blocks and you can turn the parameter on and off as often and as
>>> easily as you want. I think it can be USERSET but I wouldn't want to
>>> encourage users to see turning it off as a performance tuning
>>> feature. If the admin turns it on for the server, it's on, so it's
>>> SIGHUP. Any holes in that I haven't noticed?
>>
>> And of course, as soon as I wrote that I thought of the problem. We
>> mustn't make a write that hasn't been covered by a FPW, so we must
>> know ahead of time whether to WAL log hints or not. We can't simply
>> turn it on/off any longer, now that we have to WAL log hint bits also.
>> So thanks for making me think of that.
>>
>> We *could* make it turn on/off at each checkpoint, but it's easier
>> just to say that it can be turned on/off at server start.
>
> Attached patch v6 now handles hint bits and checksums correctly,
> following Heikki's comments.
>
> In recovery, setting a hint doesn't dirty a block if it wasn't already
> dirty. So we can write some hints, and we can set others but not write
> them. Lots of comments in the code.

v7

* Fixes merge conflict
* Minor patch cleanups
* Adds checksum of complete page including hole
* Calcs checksum in mdwrite() so we pickup all non-shared buffer writes also

Robert mentioned to me there were outstanding concerns on this patch. I know of none, and have double checked the thread to confirm all concerns are fully addressed. Adding to CF.
-- 
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0cc3296..3cb8d2a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1701,6 +1701,47 @@ SET ENABLE_SEQSCAN TO OFF;
      </listitem>
     </varlistentry>

+     <varlistentry id="guc-page-checksums" xreflabel="page_checksums">
+      <indexterm>
+       <primary><varname>page_checksums</> configuration parameter</primary>
+      </indexterm>
+      <term><varname>page_checksums</varname> (<type>boolean</type>)</term>
+      <listitem>
+       <para>
+        When this parameter is on, the <productname>PostgreSQL</> server
+        calculates checksums when it writes main database pages to disk,
+        flagging the page as checksum protected.  When this parameter is off,
+        no checksum is written, only a standard watermark in the page header.
+        The database may thus contain a mix of pages with checksums and pages
+        without checksums.
+       </para>
+
+       <para>
+        When pages are read into shared buffers any page flagged with a
+        checksum has the checksum re-calculated and compared against the
+        stored value to provide greatly improved validation of page contents.
+       </para>
+
+       <para>
+        Writes via temp_buffers are not checksummed.
+       </para>
+
+       <para>
+        Turning this parameter off speeds normal operation, but
+        might allow data corruption to go unnoticed. The checksum uses
+        16-bit checksums, using the fast Fletcher 16 algorithm. With this
+        parameter enabled there is still a non-zero probability that an error
+        could go undetected, as well as a non-zero probability of false
+        positives.
+       </para>
+
+       <para>
+        This parameter can only be set at server start.
+        The default is <literal>off</>.
+       </para>
+      </listitem>
+     </varlistentry>
+
     <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
      <term><varname>wal_buffers</varname> (<type>integer</type>)</term>
      <indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 19ef66b..7c7b20e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -709,6 +709,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	bool		updrqst;
 	bool		doPageWrites;
 	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+	bool		IsHint = (rmid == RM_SMGR_ID && info == XLOG_SMGR_HINT);
 	uint8		info_orig = info;

 	/* cross-check on whether we should be here or not */
@@ -955,6 +956,18 @@ begin:;
 	}

 	/*
+	 * If this is a hint record and we don't need a backup block then
+	 * we have no more work to do and can exit quickly without inserting
+	 * a WAL record at all. In that case return InvalidXLogRecPtr.
+	 */
+	if (IsHint && !(info & XLR_BKP_BLOCK_MASK))
+	{
+		LWLockRelease(WALInsertLock);
+		END_CRIT_SECTION();
+		return InvalidXLogRecPtr;
+	}
+
+	/*
 	 * If there isn't enough space on the current XLOG page for a record
 	 * header, advance to the next page (leaving the unused space as zeroes).
 	 */
@@ -3650,6 +3663,13 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
 			BLCKSZ - (bkpb.hole_offset +
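The documentation above names a 16-bit Fletcher checksum. For illustration, here is a straightforward (unoptimized) Fletcher-16 over a page-sized buffer; which page bytes the patch actually feeds into it, and any slot skipped for the checksum itself, are defined by the patch rather than this sketch:

```python
# Plain Fletcher-16 (two running 8-bit sums modulo 255), shown only to
# illustrate the algorithm family the patch's docs refer to.

def fletcher16(data: bytes) -> int:
    sum1, sum2 = 0, 0
    for byte in data:
        sum1 = (sum1 + byte) % 255   # simple sum of bytes
        sum2 = (sum2 + sum1) % 255   # sum of sums -> position sensitivity
    return (sum2 << 8) | sum1

page = bytes(range(256)) * 32        # a fake 8192-byte page
chk = fletcher16(page)
print(hex(chk))

# Detection: a single-byte change alters the checksum.
corrupted = bytearray(page)
corrupted[100] = (corrupted[100] + 1) % 256
print(fletcher16(bytes(corrupted)) != chk)  # True
```

The appeal mentioned in the docs is speed: the inner loop is two additions per byte with no table lookups, at the cost of weaker guarantees than a CRC (hence the documented non-zero probability of undetected errors).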
Re: [HACKERS] PL/Python result metadata
On 11/01/12 22:52, Dimitri Fontaine wrote:
> Peter Eisentraut pete...@gmx.net writes:
>> .colnames() returns a list of column names (strings)
>> .coltypes() returns a list of type OIDs (integers)
>>
>> I just made that up because there is no guidance in the other standard
>> PLs for this sort of thing, AFAICT.
>
> What about having the same or comparable API as in psycopg or DB API?
> http://initd.org/psycopg/docs/cursor.html
>
> You could expose a py.description structure?

+1 for providing a read-only result.description. Not sure if it's worth it to follow DB-API there, but maybe yes.

Perhaps we could have a result.description_ex structure that's PG-specific or just not present in PEP 249, like the typmod, collation and so on.

J
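A DB-API-style `description` could be built directly from the two proposed lists. A hypothetical sketch in plain Python (nothing here is committed API; PEP 249 specifies 7-item column descriptions of name, type_code, display_size, internal_size, precision, scale, null_ok):

```python
# Sketch of Dimitri's suggestion: zip the proposed colnames()/coltypes()
# results into a read-only, PEP 249-shaped `description`. Purely
# illustrative -- the field layout follows the DB-API spec, the mapping
# itself is hypothetical.
from collections import namedtuple

Column = namedtuple(
    "Column",
    "name type_code display_size internal_size precision scale null_ok")

def make_description(colnames, coltypes):
    """Build PEP 249-style column descriptions; only name and type_code
    are known here, the rest stay None as the spec allows."""
    return tuple(
        Column(name, oid, None, None, None, None, None)
        for name, oid in zip(colnames, coltypes))

desc = make_description(["foo", "bar"], [23, 25])
print(desc[0].name, desc[0].type_code)  # foo 23
print(len(desc[1]))                     # 7
```

An extended, PG-specific variant (the `description_ex` idea) would simply carry extra fields such as typmod and collation alongside these seven.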
Re: [HACKERS] [WIP] Double-write with Fast Checksums
On 1/11/12 1:25 PM, Dan Scales wrote:
> And just wanted to reiterate one other benefit of double writes -- it
> greatly reduces the size of the WAL logs.

Even if you're replicating?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
[HACKERS] PGCon 2012 Call for Papers
PGCon 2012 will be held 17-18 May 2012, in Ottawa at the University of Ottawa. It will be preceded by two days of tutorials on 15-16 May 2012.

We are now accepting proposals for talks. Proposals can be quite simple. We do not require academic-style papers. If you are doing something interesting with PostgreSQL, please submit a proposal. You might be one of the backend hackers or work on a PostgreSQL related project and want to share your know-how with others. You might be developing an interesting system using PostgreSQL as the foundation. Perhaps you migrated from another database to PostgreSQL and would like to share details. These, and other stories are welcome. Both users and developers are encouraged to share their experiences.

Here are some ideas to jump start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- the latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software

The schedule is:

8 Jan 2012   Proposal acceptance begins
29 Jan 2012  Proposal acceptance ends
19 Feb 2012  Confirmation of accepted proposals

See also http://www.pgcon.org/2012/papers.php

Instructions for submitting a proposal to PGCon 2012 are available from: http://www.pgcon.org/2012/submissions.php

-- 
Dan Langille - http://langille.org
Re: [HACKERS] [WIP] Double-write with Fast Checksums
On Wed, Jan 11, 2012 at 11:07 PM, Josh Berkus j...@agliodbs.com wrote:
> On 1/11/12 1:25 PM, Dan Scales wrote:
>> And just wanted to reiterate one other benefit of double writes -- it
>> greatly reduces the size of the WAL logs.
>
> Even if you're replicating?

Yes, but it will increase random I/O on the standby when we replay if we don't have FPWs.

-- 
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] LWLOCK_STATS
On Jan 10, 2012, at 3:16 AM, Simon Riggs wrote:
> On Tue, Jan 10, 2012 at 12:24 AM, Jim Nasby j...@nasby.net wrote:
>> IIRC, pg_bench is *extremely* write-heavy. There's probably not that
>> many systems that operate that way. I suspect that most OLTP systems
>> read more than they write, and some probably have as much as a 10-1
>> ratio.
>
> IMHO the main PostgreSQL design objective is doing a flexible, general
> purpose 100% write workload. Which is why Hot Standby and LISTEN/NOTIFY
> are so important as mechanisms for offloading read traffic to other
> places, so we can scale the total solution beyond 1 node without giving
> up the power of SQL.

There's a problem with that theory though... in an actual OLTP system it can be extremely difficult to effectively split read and write workloads unless you've got some really easy way to know that you're not reading data that was just modified. I realize that there are caching and some other tricks that can help here, but AFAICT they all have some pretty significant drawbacks that can easily limit where they can be used.

-- 
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell)                http://jim.nasby.net
[HACKERS] Remembering bug #6123
Going back through the patches we had to make to 9.0 to move to PostgreSQL triggers, I noticed that I let the issues raised as bug #6123 lie untouched during the 9.2 development cycle.

In my view, the best suggestion for a solution was proposed by Florian here: http://archives.postgresql.org/pgsql-hackers/2011-08/msg00388.php

As pointed out in a brief exchange after that post, there would be ways to do the more exotic things that might be desired, while preventing apparently straightforward code from doing surprising things. I'm not sure whether that discussion fully satisfies the concerns raised by Robert, though.

Because I let this lapse, it only seems feasible to go forward with a patch for 9.2 if there is consensus around Florian's proposal. If there is any dissent, I guess the thing to do is for me to gather the issues and see about getting something into 9.3, once 9.2 work has died down -- in five months or so. Wisconsin Courts can continue to deal with the issues using my more simple-minded patch, but others still are getting surprised by it -- bug #6226 is apparently another manifestation.

Comments?

-Kevin
Re: [HACKERS] Sending notifications from the master to the standby
Josh Berkus j...@agliodbs.com writes:
>> BTW ... it occurs to me to ask whether we really have a solid use-case
>> for having listeners attached to slave servers. I have personally
>> never seen an application for LISTEN/NOTIFY in which the listeners
>> were entirely read-only. Even if there are one or two cases out there,
>> it's not clear to me that supporting it is worth the extra complexity
>> that seems to be needed.
>
> The basic idea is that if we can replicate LISTENs, then you can use
> replication as a simple distributed (and lossy) queueing system.

Well, this is exactly what I don't believe. A queueing system requires that recipients be able to remove things from the queue. You can't do that on a slave server, because you can't make any change in the database that would be visible to other users.

> The obvious first use case for this is for cache invalidation.

Yeah, upthread Simon pointed out that propagating notifies would be useful for flushing caches in applications that watch the database in a read-only fashion. I grant that such a use-case is technically possible within the limitations of a slave server; I'm just dubious that it's a sufficiently attractive use-case to justify the complexity and future maintenance costs of the sort of designs we are talking about.

Or in other words: so far, cache invalidation is not the first use-case, it's the ONLY POSSIBLE use-case. That's not useful enough.

			regards, tom lane
Re: [HACKERS] Sending notifications from the master to the standby
Yeah, upthread Simon pointed out that propagating notifies would be useful for flushing caches in applications that watch the database in a read-only fashion. I grant that such a use-case is technically possible within the limitations of a slave server; I'm just dubious that it's a sufficiently attractive use-case to justify the complexity and future maintenance costs of the sort of designs we are talking about. Or in other words: so far, cache invalidation is not the first use-case, it's the ONLY POSSIBLE use-case. That's not useful enough. Well, cache invalidation is a pretty common task; probably more than 50% of all database applications need to do it. Note that we're not just talking about memcached for web applications here. For example, one of the companies quoted for PostgreSQL 9.0 release uses LISTEN/NOTIFY to inform remote devices (POS systems) that there's new data available for them. That's a form of cache invalidation. It's certainly a more common design pattern than using XML in the database. However, there's the question of whether or not this patch actually allows a master-slave replication system to support more Listeners more efficiently than having them all simply listen to the master. And what impact it has on the performance of LISTEN/NOTIFY on standalone systems. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
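The invalidation pattern Josh describes (writers notify a channel after changing data; read-only listeners drop their cached copy and re-read on next access) can be sketched without a database at all. The broker, channel, and class names in this Python sketch are purely illustrative; a real application would of course use LISTEN/NOTIFY over a database connection rather than this in-process toy.

```python
# Minimal in-process sketch of the cache-invalidation pattern: writers
# notify a channel after changing data; listeners drop their cached copy
# (lossy: no payload queueing) and reload lazily on the next read.
# All names here are illustrative, not a real API.

class Broker:
    def __init__(self):
        self.listeners = {}          # channel -> list of callbacks

    def listen(self, channel, callback):
        self.listeners.setdefault(channel, []).append(callback)

    def notify(self, channel, payload=None):
        for cb in self.listeners.get(channel, []):
            cb(payload)

class CachedTable:
    def __init__(self, broker, channel, loader):
        self._cache = None
        self._loader = loader        # re-reads from the source of truth
        broker.listen(channel, self._invalidate)

    def _invalidate(self, _payload):
        self._cache = None           # lossy: we only drop the cache

    def get(self):
        if self._cache is None:
            self._cache = self._loader()
        return self._cache

rows = [1, 2, 3]                     # stand-in for the master's data
broker = Broker()
cache = CachedTable(broker, "prices_changed", lambda: list(rows))
first = cache.get()                  # loads the current data
rows.append(4)                       # a "write" on the master
broker.notify("prices_changed")      # invalidation fans out to listeners
second = cache.get()                 # reloads and sees the new row
```

Note that the listener never removes anything from a queue; it only discards state it can rebuild, which is exactly why this use-case fits a read-only slave.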
Re: [HACKERS] Sending notifications from the master to the standby
On 11 January 2012 23:51, Josh Berkus j...@agliodbs.com wrote: Yeah, upthread Simon pointed out that propagating notifies would be useful for flushing caches in applications that watch the database in a read-only fashion. I grant that such a use-case is technically possible within the limitations of a slave server; I'm just dubious that it's a sufficiently attractive use-case to justify the complexity and future maintenance costs of the sort of designs we are talking about. Or in other words: so far, cache invalidation is not the first use-case, it's the ONLY POSSIBLE use-case. That's not useful enough. Well, cache invalidation is a pretty common task; probably more than 50% of all database applications need to do it. I agree that it would be nice to support this type of cache invalidation - without commenting on the implementation, I think that the concept is very useful, and of immediate benefit to a significant number of people. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [WIP] Double-write with Fast Checksums
Simon Riggs si...@2ndquadrant.com writes: On Wed, Jan 11, 2012 at 11:07 PM, Josh Berkus j...@agliodbs.com wrote: On 1/11/12 1:25 PM, Dan Scales wrote: And just wanted to reiterate one other benefit of double writes -- it greatly reduces the size of the WAL logs. Even if you're replicating? Yes, but it will increase random I/O on the standby when we replay if we don't have FPWs. The question is how you prevent torn pages when a slave server crashes during replay. Right now, the presence of FPIs in the WAL stream, together with the requirement that replay restart from a checkpoint, is sufficient to guarantee that any torn pages will be fixed up. If you remove FPIs from WAL and don't transmit some substitute information, ISTM you've lost protection against slave server crashes. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] WIP -- renaming implicit sequences
Hi, Here is an unfinished patch to implement something which appears on the TODO list under ALTER: automatic renaming of sequences created with serial when the table and column names change. I've often wanted this feature and it seemed like a good starter project. I'd be grateful for any feedback and advice on how I could get it into acceptable shape. Example: hack=# create table foo (id serial primary key); NOTICE: CREATE TABLE will create implicit sequence foo_id_seq for serial column foo.id NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index foo_pkey for table foo CREATE TABLE hack=# alter table foo rename to bar; NOTICE: ALTER TABLE will rename implicit sequence foo_id_seq to bar_id_seq ALTER TABLE hack=# alter table bar rename id to snacks; NOTICE: ALTER TABLE will rename implicit sequence bar_id_seq to bar_snacks_seq ALTER TABLE Sequences are considered to be renameable if they are owned by the table, and have a name conforming to the name pattern used by CREATE TABLE (table_column_seq with optional trailing numbers). If you've manually renamed a SEQUENCE so that it doesn't conform, it won't touch it. If you've created a SEQUENCE and declared it to be OWNED BY the table, then it will be renamed only if it happens to conform. I'm not sure what to do about permissions. I guess it should silently skip renaming sequences if the user doesn't have appropriate privileges. Useful? Why would anyone not want this behaviour? Have I used inappropriate locking levels? What should I read to understand the rules of locking? Have I failed to handle errors? Have I made memory ownership mistakes? Thanks! 
Thomas Munro

diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index 335bdc6..cd6318e 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -876,6 +876,160 @@ transformTableLikeClause(CreateStmtContext *cxt, TableLikeClause *table_like_cla
 	heap_close(relation, NoLock);
 }
 
+/*
+ * Check if name appears to be a sequence name as generated by CREATE TABLE for
+ * serial columns.  column_name may be NULL to mean any column.
+ */
+static bool
+matchesGeneratedSequenceForm(const char *name,
+							 const char *table_name,
+							 const char *column_name)
+{
+	/* Must start with the table name and an underscore. */
+	size_t		table_name_len = strlen(table_name);
+
+	if (strncmp(name, table_name, table_name_len) != 0)
+		return false;
+	name += table_name_len;
+	if (name[0] != '_')
+		return false;
+	name += 1;
+	if (column_name != NULL)
+	{
+		/* Must match a specific column name. */
+		size_t		column_name_len = strlen(column_name);
+
+		if (strncmp(name, column_name, column_name_len) != 0)
+			return false;
+		name += column_name_len;
+	}
+	else
+	{
+		/* Step over any column name. */
+		name = strrchr(name, '_');
+		if (name == NULL)
+			return false;
+	}
+	/* Must have a trailing '_seq'. */
+	if (strncmp(name, "_seq", 4) != 0)
+		return false;
+	name += 4;
+	/* We tolerate any number of digits at the end. */
+	while (*name)
+	{
+		if (*name < '0' || *name > '9')
+			return false;
+		++name;
+	}
+	return true;
+}
+
+/*
+ * Given a sequence name as generated for serial columns, attempt to extract
+ * the column name as a newly allocated string.  If it can't be done, return
+ * NULL.
+ */
+static char *
+extractColumnName(const char *seq_name, const char *table_name)
+{
+	const char *column_begin = seq_name + strlen(table_name) + 1;
+	const char *column_end = strrchr(column_begin, '_');
+
+	if (column_end == NULL)
+		return NULL;
+	else
+	{
+		size_t		column_len = column_end - column_begin;
+		char	   *column = palloc(column_len + 1);
+
+		memcpy(column, column_begin, column_len);
+		column[column_len] = 0;
+		return column;
+	}
+}
+
+/*
+ * Expand a RenameStmt into a list of statements.  If the statement renames a
+ * table or a column, then this builds statements to rename any owned sequences
+ * that have a name apparently created implicitly.
+ */
+List *
+transformRenameStmt(RenameStmt *stmt)
+{
+	List	   *result = NIL;
+
+	if (stmt->renameType == OBJECT_TABLE || stmt->renameType == OBJECT_COLUMN)
+	{
+		Oid			table_relid;
+		Relation	table_rel;
+		List	   *sequences;
+		ListCell   *cell;
+		char	   *table_name;
+
+		/* Get the table's (current) name. */
+		table_relid = RangeVarGetRelid(stmt->relation, NoLock, false);
+		table_rel = relation_open(table_relid, AccessExclusiveLock);
+		table_name = RelationGetRelationName(table_rel);
+		relation_close(table_rel, NoLock);
+
+		/*
+		 * Find all owned sequences, and consider renaming them if they appear
+		 * to conform to the naming rule used for implicit sequence generation
+		 * during table creation.  We skip other owned sequences.
+		 */
+		sequences = getOwnedSequences(table_relid);
+		foreach(cell, sequences)
+		{
+			Oid			seq_relid = lfirst_oid(cell);
+			Relation	seq_rel = relation_open(seq_relid, AccessExclusiveLock);
+			char	   *seq_name = RelationGetRelationName(seq_rel);	/*
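The conformance rule the patch applies (a sequence is renameable only if its name has the form table_column_seq with an optional run of trailing digits) can be restated as a regular expression. The Python sketch below merely mirrors the C helper above for illustration; it is not part of the patch.

```python
# Illustrative restatement of the patch's naming rule: a sequence name
# conforms if it is <table>_<column>_seq, optionally followed by digits
# (the digits appear when CREATE TABLE had to uniquify the name).
# Pass column_name=None to accept any column, as the C helper does.
import re

def matches_generated_form(seq_name, table_name, column_name=None):
    column = re.escape(column_name) if column_name else r".+"
    pattern = r"^%s_%s_seq[0-9]*$" % (re.escape(table_name), column)
    return re.match(pattern, seq_name) is not None
```

So `foo_id_seq` and `bar_snacks_seq2` conform, while a manually renamed sequence such as `myseq` is left alone.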
Re: [HACKERS] Remembering bug #6123
Kevin Grittner kevin.gritt...@wicourts.gov writes: Going back through the patches we had to make to 9.0 to move to PostgreSQL triggers, I noticed that I let the issues raised as bug #6123 lie untouched during the 9.2 development cycle. In my view, the best suggestion for a solution was proposed by Florian here: http://archives.postgresql.org/pgsql-hackers/2011-08/msg00388.php Do you mean this: After every BEFORE trigger invocation, if the trigger returned non-NULL, check if latest row version is still the same as when the trigger started. If not, complain. While that sounds relatively safe, if possibly performance-impacting, it's not apparent to me how it fixes the problem you complained of. The triggers you were using were modifying rows other than the one being targeted by the triggering action, so a test like the above would not notice that they'd done anything. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] JSON for PG 9.2
On 01/11/2012 10:21 AM, Pavel Stehule wrote: 2012/1/11 Robert Haas robertmh...@gmail.com: On Wed, Jan 11, 2012 at 9:41 AM, Pavel Stehule pavel.steh...@gmail.com wrote: I understand it now. My opinion is that some operators and index search could be in 9.2 - so use JSON just as a communication format now. * we need to build JSON * we need to check if something is valid JSON * we need to store JSON other steps should be (9.2) * basic operators eq, neq * some tool like XQuery - simple queries on a JSON document available from SQL that can be used for functional indexes. That would be nice, but let's not let the perfect be the enemy of the good. We don't have a lot of time here. sorry - replace 9.2 by 9.3 - I am sorry I am able to write an array_to_json function and Andrew can write query_to_json For those who want to play along, see https://bitbucket.org/adunstan/pgdevel which has Robert's patch and my additions to it. I'm actually half way through writing an array_to_json function, since it is necessary anyway for query_to_json. I hope to have a fairly complete working function in about 24 hours. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
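As a rough preview of the mapping such a function performs, a possibly nested array becomes a JSON array, with element types translated to their JSON equivalents (e.g. NULL to null). The Python one-liner below only illustrates that mapping; the behaviour of the function actually being written may differ in details such as NULL handling and formatting.

```python
# Illustrative only: map a nested array to a JSON array, as an
# array_to_json-style conversion conceptually does.  The real function's
# exact output format may differ.
import json

def array_to_json(arr):
    return json.dumps(arr, separators=(",", ":"))

result = array_to_json([1, 2, [3, None]])  # -> '[1,2,[3,null]]'
```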
[HACKERS] order of operations for pg_restore
I'm working on a tool that runs pg_restore with -j 4. I notice that after COPYing in the data, pg_restore does two indexes and a cluster command in parallel. The first CREATE INDEX is running, the CLUSTER command is waiting on it and the second CREATE INDEX is waiting on the CLUSTER. This seems sub-optimal. Would it make sense to run the CLUSTER command first? I'm pretty sure I can replicate the behavior if necessary. Running 9.1.2. Andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP -- renaming implicit sequences
Thomas Munro mu...@ip9.org writes: Here is an unfinished patch to implement something which appears on the TODO list under ALTER: automatic renaming of sequences created with serial when the table and column names change. I've often wanted this feature and it seemed like a good starter project. Hmm ... this seems a bit inconsistent with the fact that we got rid of automatic renaming of indexes a year or three back. Won't renaming of serials have all the same problems that caused us to give up on renaming indexes? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [WIP] Double-write with Fast Checksums
On Wed, Jan 11, 2012 at 7:09 PM, Tom Lane t...@sss.pgh.pa.us wrote: The question is how you prevent torn pages when a slave server crashes during replay. Right now, the presence of FPIs in the WAL stream, together with the requirement that replay restart from a checkpoint, is sufficient to guarantee that any torn pages will be fixed up. If you remove FPIs from WAL and don't transmit some substitute information, ISTM you've lost protection against slave server crashes. This double-write strategy is all an attempt to make writes durable. You remove the FPW from the WAL stream only because your writes are made durable using some other strategy, like the double-write. Any standby will need to be using some strategy to make sure its writes are durable, namely, the same double-write. So on a standby crash, it will replay whatever FPWs it has accumulated in its double-write buffer to make sure its writes were consistent. Exactly as the master would do. a. -- Aidan Van Dyk Create like a god, ai...@highrise.ca command like a king, http://www.highrise.ca/ work like a slave. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
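The recovery sequence Aidan outlines can be shown with a toy double-write buffer: each page is first written and fsync'd to a scratch file, and only then written in place; after a crash, any scratch copy whose checksum is intact is replayed over the possibly torn page. The Python sketch below demonstrates the concept only (toy page format, a single-slot buffer); it is in no way PostgreSQL's implementation or the proposed patch.

```python
# Toy double-write buffer: durable copy first, then the in-place write.
# Page format (illustrative): 4-byte page number, 4-byte CRC, payload.
import os
import struct
import tempfile
import zlib

PAGE = 64  # toy page size

def write_page(datafile, dwfile, pageno, payload):
    page = payload.ljust(PAGE - 8, b"\0")
    rec = struct.pack(">II", pageno, zlib.crc32(page)) + page
    with open(dwfile, "wb") as f:      # 1. durable scratch copy first
        f.write(rec)
        f.flush()
        os.fsync(f.fileno())
    with open(datafile, "r+b") as f:   # 2. then the real location
        f.seek(pageno * PAGE)
        f.write(rec)

def recover(datafile, dwfile):
    """Replay an intact page from the double-write file over a torn page."""
    with open(dwfile, "rb") as f:
        rec = f.read(PAGE)
    if len(rec) < PAGE:
        return                         # scratch copy itself was torn; the
                                       # in-place page was never touched
    pageno, crc = struct.unpack(">II", rec[:8])
    if zlib.crc32(rec[8:]) == crc:     # scratch copy intact: safe to replay
        with open(datafile, "r+b") as f:
            f.seek(pageno * PAGE)
            f.write(rec)

# demo: tear the in-place write of page 2, then recover from the buffer
tmp = tempfile.mkdtemp()
data, dw = os.path.join(tmp, "data"), os.path.join(tmp, "dw")
with open(data, "wb") as f:
    f.write(b"\0" * (4 * PAGE))
write_page(data, dw, 2, b"hello")
with open(data, "r+b") as f:           # simulate a torn in-place write
    f.seek(2 * PAGE)
    f.write(b"\xff" * 10)
recover(data, dw)
with open(data, "rb") as f:
    f.seek(2 * PAGE)
    recovered = f.read(PAGE)
```

Either copy of the page may be torn by a crash, but never both by the same crash, which is why the master and the standby can each rely on their own buffer instead of FPWs in the WAL stream.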
[HACKERS] measuring spinning
I've had cause, a few times this development cycle, to want to measure the amount of spinning on each lwlock in the system. To that end, I've found the attached patch useful. Note that if you don't define LWLOCK_STATS, this changes nothing except that the return value from s_lock becomes int rather than void. If you do define LWLOCK_STATS, then LWLockAcquire() counts the number of pg_usleep() calls that are required to acquire each LWLock, in addition to the other statistics. Since this has come up for me a few times now, I'd like to propose including it in core. Just to whet your appetite, here are the top spinners on a 32-client SELECT-only test on a 32-core Itanium server. All the locks not shown below have two orders of magnitude less of a problem than these do.

lwlock 48: shacq 6042357 exacq 34590 blk 53 spin 1288
lwlock 42: shacq 5014729 exacq 34942 blk 58 spin 1321
lwlock 43: shacq 5448771 exacq 34725 blk 44 spin 1608
lwlock 44: shacq 6420992 exacq 34980 blk 67 spin 1713
lwlock 35: shacq 6353111 exacq 34256 blk 59 spin 1784
lwlock 38: shacq 6052801 exacq 34913 blk 70 spin 1801
lwlock 46: shacq 6401413 exacq 34698 blk 58 spin 1879
lwlock 36: shacq 6051887 exacq 35309 blk 73 spin 1915
lwlock 45: shacq 6812817 exacq 35170 blk 59 spin 1924
lwlock 33: shacq 6793666 exacq 35009 blk 59 spin 1955
lwlock 34: shacq 6395994 exacq 34941 blk 58 spin 2019
lwlock 40: shacq 6388136 exacq 34579 blk 51 spin 2054
lwlock 37: shacq 7250574 exacq 35242 blk 73 spin 2409
lwlock 39: shacq 7109729 exacq 34892 blk 65 spin 2632
lwlock 47: shacq 8243963 exacq 35256 blk 88 spin 3018
lwlock 328936: shacq 33992167 exacq 0 blk 0 spin 59816
lwlock 4: shacq 33994583 exacq 168 blk 40 spin 74018
lwlock 41: shacq 40098590 exacq 35001 blk 440 spin 81324
grand total: shacq 361775238 exacq 172965995 blk 2315 spin 245342

This is a 5-minute test run.
If I had to take a shot in the dark, the buffer mapping lock in the #1 spot is the one protecting the root page of pgbench_accounts_pkey, and the high-numbered lock in position #3 is the content lock on the buffer itself. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
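When comparing dumps like the one above, normalizing spins per acquisition makes the outliers obvious, since raw spin counts are dominated by traffic. The small Python helper below parses the line format shown in this message; it is only an analysis convenience, not part of the proposed patch.

```python
# Parse an LWLOCK_STATS line of the form shown above and derive a
# spins-per-million-acquisitions figure, so locks with very different
# traffic can be compared fairly.  Analysis convenience only.
def parse_stats(line):
    # e.g. "lwlock 41: shacq 40098590 exacq 35001 blk 440 spin 81324"
    name, rest = line.split(":")
    fields = rest.split()
    stats = dict(zip(fields[0::2], (int(v) for v in fields[1::2])))
    acq = stats["shacq"] + stats["exacq"]
    stats["spin_per_million_acq"] = 10**6 * stats["spin"] // max(acq, 1)
    return name.strip(), stats

name, s = parse_stats("lwlock 41: shacq 40098590 exacq 35001 blk 440 spin 81324")
```

On the figures quoted above, lwlock 41 spins roughly two thousand times per million acquisitions, an order of magnitude above the locks at the top of the list.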
Re: [HACKERS] measuring spinning
On 12 January 2012 01:48, Robert Haas robertmh...@gmail.com wrote: I've had cause, a few times this development cycle, to want to measure the amount of spinning on each lwlock in the system. To that end, I've found the attached patch useful. [...] Patch missing. -- Thom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] order of operations for pg_restore
On 01/11/2012 07:57 PM, Andrew Hammond wrote: I'm working on a tool that runs pg_restore with -j 4. I notice that after COPYing in the data, pg_restore does two indexes and a cluster command in parallel. The first CREATE INDEX is running, the CLUSTER command is waiting on it and the second CREATE INDEX is waiting on the CLUSTER. This seems sub-optimal. Would it make sense to run the CLUSTER command first? I'm pretty sure I can replicate the behavior if necessary. Running 9.1.2. Well, we don't actually run CLUSTER. We run a command to mark a table as clustered on the index. The nasty part is that it's not a separate TOC member, it's in the same TOC entry as the index creation. But ALTER TABLE has different locking requirements from CREATE INDEX. If the clustered index is not one created from a constraint, we could have the dependencies wrong. It looks like this is something we all missed when parallel restore was implemented. I think we might need to split the ALTER TABLE ... CLUSTER from its parent statement. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers.
On Thu, Jan 12, 2012 at 12:20 AM, Simon Riggs si...@2ndquadrant.com wrote: +static void +ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime) walEnd is not used in ProcessWalSndrMessage() at all. Can't we remove it? If yes, walEnd field in WalSndrMessage is also not used anywhere, so ISTM we can remove it. It's there to allow extension of the message processing to be more complex than it currently is. Changing the protocol is much harder than changing a function call. I'd like to keep it since it doesn't have any negative effects. OK. Another problem about walEnd is that WalDataMessageHeader.walEnd is not the same kind of location as WalSndrMessage.walEnd. The former indicates the location that WAL has already been flushed (maybe not sent yet), i.e., send request location. OTOH, the latter indicates the location that WAL has already been sent. Is this inconsistency intentional? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Confusing EXPLAIN output in case of inherited tables
On Wed, Jan 11, 2012 at 5:25 PM, Chetan Suttraway chetan.suttra...@enterprisedb.com wrote: On Wed, Jan 11, 2012 at 5:13 PM, Ashutosh Bapat ashutosh.ba...@enterprisedb.com wrote: Hi, After running regression, I ran EXPLAIN on one of the queries in regression (test create_misc) and got the following output:

regression=# explain verbose select * into table ramp from road where name ~ '.*Ramp';
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Result  (cost=0.00..154.00 rows=841 width=67)
   Output: public.road.name, public.road.thepath
   ->  Append  (cost=0.00..154.00 rows=841 width=67)
         ->  Seq Scan on public.road  (cost=0.00..135.05 rows=418 width=67)
               Output: public.road.name, public.road.thepath
               Filter: (public.road.name ~ '.*Ramp'::text)
         ->  Seq Scan on public.ihighway road  (cost=0.00..14.99 rows=367 width=67)
               Output: public.road.name, public.road.thepath
               Filter: (public.road.name ~ '.*Ramp'::text)
         ->  Seq Scan on public.shighway road  (cost=0.00..3.96 rows=56 width=67)
               Output: public.road.name, public.road.thepath
               Filter: (public.road.name ~ '.*Ramp'::text)
(12 rows)

regression=# \d+ road
                          Table "public.road"
 Column  | Type | Modifiers | Storage  | Stats target | Description
---------+------+-----------+----------+--------------+-------------
 name    | text |           | extended |              |
 thepath | path |           | extended |              |
Indexes:
    "rix" btree (name)
Child tables: ihighway, shighway
Has OIDs: no

Table road has children ihighway and shighway, as seen in the \d+ output above. The EXPLAIN output for the Seq Scan nodes on the children uses public.road as the prefix for variables. public.road could imply the parent table road, and thus can cause confusion as to what is being referred to: the columns of the parent table or of the child table. In the EXPLAIN output the child tables have road as their alias (as against public.road). The alias comes from RangeTblEntry->eref->aliasname. It might be better to have road as the prefix in the variable names rather than public.road.

The reason why this happens is this code in get_variable():

    /* Exceptions occur only if the RTE is alias-less */
    if (rte->alias == NULL)
    {
        if (rte->rtekind == RTE_RELATION)
        {
            /*
             * It's possible that use of the bare refname would find another
             * more-closely-nested RTE, or be ambiguous, in which case we need
             * to specify the schemaname to avoid these errors.
             */
            if (find_rte_by_refname(rte->eref->aliasname, context) != rte)
                schemaname = get_namespace_name(get_rel_namespace(rte->relid));
        }

If there is no alias, we find out the schema name and later add it to the prefix. In the inherited-table case, we are actually creating a kind of alias for the child tables, and thus we should not find out the schema name and add it to the prefix. This case has been taken care of in get_from_clause_item():

    else if (rte->rtekind == RTE_RELATION &&
             strcmp(rte->eref->aliasname, get_relation_name(rte->relid)) != 0)
    {
        /*
         * Apparently the rel has been renamed since the rule was made.
         * Emit a fake alias clause so that variable references will still
         * work.  This is not a 100% solution but should work in most
         * reasonable situations.
         */
        appendStringInfo(buf, " %s",
                         quote_identifier(rte->eref->aliasname));
        gavealias = true;
    }

I see similar code in ExplainTargetRel():

    if (objectname == NULL ||
        strcmp(rte->eref->aliasname, objectname) != 0)
        appendStringInfo(es->str, " %s",
                         quote_identifier(rte->eref->aliasname));

Based on this, here is a patch to not add the schemaname to the prefix for a variable. I have run make check. All tests except inherit.sql passed. The expected output change is included in the patch.
-- Best Wishes, Ashutosh Bapat EntepriseDB Corporation The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers A table can inherit from one or more parent tables. So in that case, qualifying the schema/table name helps in finding out where the column is coming from. Do you have any example