Re: [HACKERS] incorrect handling of the timeout in pg_receivexlog
On Tue, Feb 7, 2012 at 2:58 PM, Fujii Masao wrote: > Hi, > > When I compiled HEAD with --disable-integer-datetimes and tested > pg_receivexlog, I encountered unexpected replication timeout. As > far as I read the pg_receivexlog code, the cause of this problem is > that pg_receivexlog handles the standby message timeout incorrectly > in --disable-integer-datetimes. The attached patch fixes this problem. > Comments? receivelog.c --- timeout.tv_sec = last_status + standby_message_timeout - now - 1; if (timeout.tv_sec <= 0) --- Umm.. the above code also handles the timestamp incorrectly. ISTM that the root cause of these problems is that receivelog.c uses TimestampTz. What about changing receivelog.c so that it uses time_t instead of TimestampTz? Which would make the code simpler, I think. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
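To make the suggestion above concrete, here is a rough sketch of what the sleep computation could look like if receivelog.c kept its timestamps as time_t; the function and variable names are invented for illustration and are not taken from the actual code. standby_message_timeout is in seconds, as in pg_receivexlog.

#include <time.h>
#include <sys/time.h>

/*
 * Sketch only: with time_t there is no integer/float timestamp split, so
 * the "is a status message due?" test and the select() timeout can use
 * plain second arithmetic.
 */
static void
compute_status_sleep(time_t last_status, int standby_message_timeout,
                     struct timeval *timeout)
{
    time_t      now = time(NULL);

    timeout->tv_sec = last_status + standby_message_timeout - now;
    if (timeout->tv_sec <= 0)
        timeout->tv_sec = 1;        /* message already due; poll again soon */
    timeout->tv_usec = 0;
}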
Re: [HACKERS] patch for implementing SPI_gettypemod()
On Thu, Feb 2, 2012 at 8:11 PM, Robert Haas wrote: > On Wed, Feb 1, 2012 at 5:23 AM, Chetan Suttraway > wrote: > > Hi All, > > > > This is regarding the TODO item : > > "Add SPI_gettypmod() to return a field's typemod from a TupleDesc" > > > > The related message is: > > http://archives.postgresql.org/pgsql-hackers/2005-11/msg00250.php > > > > This basically talks about having an SPI_gettypemod() which returns the > > typmod of a field of tupdesc > > > > Please refer the attached patch based on the suggested implementation. > > Please add this to the next CommitFest: > > https://commitfest.postgresql.org/action/commitfest_view/open > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > At the given link, I am able to choose only "System administration" under commitfest topic. I think there has to be "server features" or "Miscellaneous". Regards, Chetan -- EnterpriseDB Corporation The Enterprise PostgreSQL Company Website: www.enterprisedb.com EnterpriseDB Blog : http://blogs.enterprisedb.com Follow us on Twitter : http://www.twitter.com/enterprisedb
[HACKERS] incorrect handling of the timeout in pg_receivexlog
Hi, When I compiled HEAD with --disable-integer-datetimes and tested pg_receivexlog, I encountered unexpected replication timeout. As far as I read the pg_receivexlog code, the cause of this problem is that pg_receivexlog handles the standby message timeout incorrectly in --disable-integer-datetimes. The attached patch fixes this problem. Comments? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c index 8ca3882..af4add0 100644 --- a/src/bin/pg_basebackup/receivelog.c +++ b/src/bin/pg_basebackup/receivelog.c @@ -62,7 +62,7 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir, char *namebu f = open(fn, O_WRONLY | O_CREAT | PG_BINARY, S_IRUSR | S_IWUSR); if (f == -1) { - fprintf(stderr, _("%s: Could not open WAL segment %s: %s\n"), + fprintf(stderr, _("%s: could not open WAL segment %s: %s\n"), progname, fn, strerror(errno)); return -1; } @@ -190,6 +190,24 @@ localGetCurrentTimestamp(void) } /* + * Local version of TimestampDifferenceExceeds(), since we are not + * linked with backend code. + */ +static bool +localTimestampDifferenceExceeds(TimestampTz start_time, + TimestampTz stop_time, + int msec) +{ + TimestampTz diff = stop_time - start_time; + +#ifdef HAVE_INT64_TIMESTAMP + return (diff >= msec * INT64CONST(1000)); +#else + return (diff * 1000.0 >= msec); +#endif +} + +/* * Receive a log stream starting at the specified position. * * If sysidentifier is specified, validate that both the system @@ -301,7 +319,8 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline, char *sysi */ now = localGetCurrentTimestamp(); if (standby_message_timeout > 0 && - last_status < now - standby_message_timeout * 100) + localTimestampDifferenceExceeds(last_status, now, + standby_message_timeout * 1000)) { /* Time to send feedback! */ char replybuf[sizeof(StandbyReplyMessage) + 1]; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] double writes using "double-write buffer" approach [WIP]
I don't know a lot about base backup, but it sounds like full_page_writes must be turned on for base backup, in order to deal with the inconsistent reads of pages (which you might call torn pages) that can happen when you backup the data files while the database is running. The relevant parts of the WAL log are then copied separately (and consistently) once the backup of the data files is done, and used to "recover" the database into a consistent state later. So, yes, good point -- double writes cannot replace the functionality of full_page_writes for base backup. If double writes were in use, they might be automatically switched over to full page writes for the duration of the base backup. And the double write file should not be part of the base backup. Dan - Original Message - From: "Fujii Masao" To: "Dan Scales" Cc: "PG Hackers" Sent: Monday, February 6, 2012 3:08:15 AM Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP] On Sat, Jan 28, 2012 at 7:31 AM, Dan Scales wrote: > Let me know if you have any thoughts/comments, etc. The patch is > enclosed, and the README.doublewrites is updated a fair bit. ISTM that the double-write can prevent torn-pages in neither double-write file nor data file in *base backup*. Because both double-write file and data file can be backed up while being written. Is this right? To avoid the torn-page problem, we should write FPI to WAL during online backup even if the double-write has been committed? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
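If it helps to picture the fallback being proposed, a minimal sketch follows; both names here are placeholders (this is not the real backend API), and the real decision would presumably live next to the existing full_page_writes logic.

/* Placeholders for illustration only -- not the actual backend API. */
extern bool double_writes_enabled;
extern bool BaseBackupInProgress(void);

/*
 * Sketch: while a base backup is running, stop relying on the
 * double-write buffer and let full-page images go to WAL instead, so
 * torn pages read during the backup can still be repaired from WAL.
 */
static bool
use_double_write(void)
{
    if (BaseBackupInProgress())
        return false;
    return double_writes_enabled;
}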
Re: [HACKERS] When do we lose column names?
On 11/16/2011 10:38 PM, Tom Lane wrote: I wrote: PFC, a patch that does this. Upon further review, this patch would need some more work even for the RowExpr case, because there are several places that build RowExprs without bothering to build a valid colnames list. It's clearly soluble if anyone cares to put in the work, but I'm not personally excited enough to pursue it ... The patch itself causes a core dump with the current regression tests. This test: SELECT array_to_json(array_agg(q),false) FROM ( SELECT $$a$$ || x AS b, y AS c, ARRAY[ROW(x.*,ARRAY[1,2,3]), ROW(y.*,ARRAY[4,5,6])] AS z FROM generate_series(1,2) x, generate_series(4,5) y) q; causes a failure at line 967 of execTuples.c: Assert(list_length(exprList) == list_length(namesList)); I'm investigating further. I've been looking at the other places that build RowExprs. Most of them look OK and none seem clearly in need of fixing at first glance. Which do you think need attention? cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
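For reference, the invariant that Assert enforces is simply that a RowExpr's colnames list stays the same length as its args list. Below is a sketch of a well-formed constructor; it is illustrative only (the helper name is invented), not code from the patch.

#include "postgres.h"
#include "catalog/pg_type.h"
#include "nodes/makefuncs.h"

/*
 * Sketch only: build a RowExpr with a colnames entry for every argument,
 * which is what the Assert at execTuples.c:967 expects of its callers.
 */
static RowExpr *
make_row_expr(List *args, List *colnames)
{
    RowExpr    *rowexpr = makeNode(RowExpr);

    Assert(list_length(args) == list_length(colnames));

    rowexpr->args = args;                   /* per-column expressions */
    rowexpr->row_typeid = RECORDOID;        /* anonymous record type */
    rowexpr->row_format = COERCE_IMPLICIT_CAST;
    rowexpr->colnames = colnames;           /* one name per argument */
    rowexpr->location = -1;

    return rowexpr;
}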
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Mon, Feb 06, 2012 at 06:43:04PM -0500, Bruce Momjian wrote: > On Mon, Feb 06, 2012 at 10:49:10PM +, Peter Geoghegan wrote: > > On 6 February 2012 21:19, Bruce Momjian wrote: > > > Peter Geoghegan obviously has done some serious work in improving > > > sorting, and worked well with the community process. > > > > Thank you for acknowledging that. > > > > It's unfortunate that C does not support expressing these kinds of > > ideas in a more natural way. > > Yes, it is a problem, and a benefit. We have avoided C++ because these > types of trade-offs that we are discussing are often done invisibly, so > we can't make the decision ourselves. Let me add that while it is fine for languages like C++ to make these decisions for application code automatically, operating systems and high-performance databases developers prefer to make such decisions explicitly, which is what we are discussing now. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Mon, Feb 06, 2012 at 10:49:10PM +, Peter Geoghegan wrote: > On 6 February 2012 21:19, Bruce Momjian wrote: > > Peter Geoghegan obviously has done some serious work in improving > > sorting, and worked well with the community process. > > Thank you for acknowledging that. > > It's unfortunate that C does not support expressing these kinds of > ideas in a more natural way. Yes, it is a problem, and a benefit. We have avoided C++ because these types of trade-offs that we are discussing are often done invisibly, so we can't make the decision ourselves. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] semi-PoC: kNN-gist for cubes
I have a rough proof-of-concept for getting nearest-neighbor searches working with cubes. When I say "rough", I mean "I have no idea what I'm doing and I haven't written C for 15 years but I hear it got standardized please don't hurt me". It seems to be about 400x faster for a 3D cube with 1 million rows, more like 10-30x for a 6D cube with 10 million rows. The patch adds operator <-> (which is just the existing cube_distance function) and support function 8, distance (which is just g_cube_distance, a wrapper around cube_distance). The code is in no way production-quality; it is in fact right around "look! it compiles!", complete with pasted-in, commented-out code from something I was mimicking. I thought I'd share at this early stage in the hopes I might get some pointers, such as: - What unintended consequences should I be looking for? - What benchmarks should I do? - What kind of edge cases might I consider? - I'm just wrapping cube_distance and calling it through DirectFunctionCall; it's probably more proper to extract out the "real" function and call it from both cube_distance and g_cube_distance. Right? - What else don't I know? (Besides C, funny man.) The patch, such as it is, is at: https://github.com/jaylevitt/postgres/commit/9cae4ea6bd4b2e582b95d7e1452de0a7aec12857 with an even-messier test at https://github.com/jaylevitt/postgres/commit/daa33e30acaa2c99fe554d88a99dd7d78ff6c784 I initially thought this patch made inserting and indexing slower, but then I realized the fast version was doing 1 million rows, and the slow one did 10 million rows. Which means: dinnertime. Jay Levitt -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
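For readers following along, the shape of such a distance support function is roughly the following. This is a from-memory sketch, not Jay's actual code; it glosses over detoasting and the cube module's helper macros, and the cube_distance declaration is supplied locally for illustration.

#include "postgres.h"
#include "fmgr.h"
#include "access/gist.h"

extern Datum cube_distance(PG_FUNCTION_ARGS);   /* from contrib/cube */

Datum g_cube_distance(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(g_cube_distance);

/*
 * Sketch of GiST support function 8 (distance) for cube: delegate to
 * cube_distance() via DirectFunctionCall2, which is exactly the
 * "wrapper" idea described above.
 */
Datum
g_cube_distance(PG_FUNCTION_ARGS)
{
    GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
    Datum       query = PG_GETARG_DATUM(1);
    /* third argument is the strategy number; ignored in this sketch */

    PG_RETURN_DATUM(DirectFunctionCall2(cube_distance, entry->key, query));
}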
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On 6 February 2012 21:19, Bruce Momjian wrote: > Peter Geoghegan obviously has done some serious work in improving > sorting, and worked well with the community process. Thank you for acknowledging that. It's unfortunate that C does not support expressing these kinds of ideas in a more natural way. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Mon, Feb 06, 2012 at 04:19:07PM -0500, Bruce Momjian wrote: > Peter Geoghegan obviously has done some serious work in improving > sorting, and worked well with the community process. He has done enough > analysis that I am hard-pressed to see how we would get similar > improvement using a different method, so I think it comes down to > whether we want the 28% speedup by adding 55k (1%) to the binary. > > I think Peter has shown us how to get that, and what it will cost --- we > just need to decide now whether it is worth it. What I am saying is > there probably isn't a cheaper way to get that speedup, either now or in > the next few years. (COPY might need similar help for speedups.) > > I believe this is a big win and well worth the increased binary size > because the speed up is significant, and because it is of general > usefulness for a wide range of queries. Either of these would be enough > to justify the additional 1% size, but both make it an easy decision for > me. > > FYI, I believe COPY needs similar optimizations; we have gotten repeated > complaints about its performance and this method of optmization might > also be our only option. Sorry, that was wordy. What I am saying is that years ago we did hot-spot optimization for storage and tuple access using macros. We have looked for cheap optimizations for sort (and COPY) for years, but haven't found it. If we want optimizations in these areas, we are probably going to need to do the same sort of hot-spot optimization we did earlier, and Peter's work is an excellent candidate for that. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Assertion failure in AtCleanup_Portals
Pavan Deolasee writes: > If I run the following sequence of commands, I get an assertion > failure in current HEAD. > postgres=# BEGIN; > BEGIN > postgres=# SELECT 1/0; > ERROR: division by zero > postgres=# ROLLBACK TO A; > ERROR: no such savepoint > postgres=# \q > The process fails when the session is closed and aborted transaction > gets cleaned at the proc_exit time. The stack trace is as below. Hmm. It works fine if you issue an actual ROLLBACK command there, so my gut reaction is that AbortOutOfAnyTransaction isn't sufficiently duplicating the full-fledged ROLLBACK code path. No time to dig further right now though. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Fri, Jan 27, 2012 at 09:37:37AM -0500, Robert Haas wrote: > On Fri, Jan 27, 2012 at 9:27 AM, Peter Geoghegan > wrote: > > Well, I don't think it's all that subjective - it's more the case that > > it is just difficult, or it gets that way as you consider more > > specialisations. > > Sure it's subjective. Two well-meaning people could have different > opinions without either of them being "wrong". If you do a lot of > small, in-memory sorts, more of this stuff is going to seem worthwhile > than if you don't. > > > As for what types/specialisations may not make the cut, I'm > > increasingly convinced that floats (in the following order: float4, > > float8) should be the first to go. Aside from the fact that we cannot > > use their specialisations for anything like dates and timestamps, > > floats are just way less useful than integers in the context of > > database applications, or at least those that I've been involved with. > > As important as floats are in the broad context of computing, it's > > usually only acceptable to store data in a database as floats within > > scientific applications, and only then when their limitations are > > well-understood and acceptable. I think we've all heard anecdotes at > > one time or another, involving their limitations not being well > > understood. > > While we're waiting for anyone else to weigh in with an opinion on the > right place to draw the line here, do you want to post an updated > patch with the changes previously discussed? Well, I think we have to ask not only how many people are using float4/8, but how many people are sorting or creating indexes on them. I think it would be few and perhaps should be eliminated. Peter Geoghegan obviously has done some serious work in improving sorting, and worked well with the community process. He has done enough analysis that I am hard-pressed to see how we would get similar improvement using a different method, so I think it comes down to whether we want the 28% speedup by adding 55k (1%) to the binary. I think Peter has shown us how to get that, and what it will cost --- we just need to decide now whether it is worth it. What I am saying is there probably isn't a cheaper way to get that speedup, either now or in the next few years. (COPY might need similar help for speedups.) I believe this is a big win and well worth the increased binary size because the speed up is significant, and because it is of general usefulness for a wide range of queries. Either of these would be enough to justify the additional 1% size, but both make it an easy decision for me. FYI, I believe COPY needs similar optimizations; we have gotten repeated complaints about its performance and this method of optmization might also be our only option. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
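As an aside for readers who have not followed the whole thread: the "hot-spot" idea is essentially to trade the generic, fmgr-indirected comparator for type-specific code the compiler can inline. A toy illustration follows; it is not Peter's patch, just the flavor of specialization being weighed against binary size.

#include "postgres.h"

/*
 * Toy illustration only.  The generic qsort path reaches the datatype's
 * comparison routine through fmgr indirection on every compare; a
 * specialized sort can use something like this and let the compiler
 * inline it, which is where the reported 28% speedup comes from.
 */
static inline int
int4_cmp_inline(Datum a, Datum b)
{
    int32       av = DatumGetInt32(a);
    int32       bv = DatumGetInt32(b);

    if (av < bv)
        return -1;
    if (av > bv)
        return 1;
    return 0;
}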
Re: [HACKERS] freezing multixacts
Excerpts from Robert Haas's message of lun feb 06 13:19:14 -0300 2012: > > On Mon, Feb 6, 2012 at 9:31 AM, Alvaro Herrera > wrote: > >> Suppose you have a tuple A which is locked by a series of transactions > >> T0, T1, T2, ...; AIUI, each new locker is going to have to create a > >> new mxid with all the existing entries plus a new one for itself. > >> But, unless I'm confused, as it's doing so, it can discard any entries > >> for locks taken by transactions which are no longer running. > > > > That's correct. But the problem is a tuple that is locked or updated by > > a very old transaction that doesn't commit or rollback, and the tuple is > > never locked again. Eventually the Xid could remain live while the mxid > > is in wraparound danger. > > Ah, I see. I think we should probably handle that the same way we do > for XIDs: try to force autovac when things get tight, then start > issuing warnings, and finally just refuse to assign any more MXIDs. Agreed. > Another thing that might make sense, for both XIDs and MXIDs, is to > start killing transactions that are preventing vacuum/autovacuum from > doing their thing. This could mean either killing the people who are > holding back RecentGlobalXmin, so that we can actually freeze the old > stuff; or killing people who are holding a conflicting lock, using the > recovery-conflict stuff or some adaptation of it. Yeah -- right now we only emit some innocuous-looking messages, which I've seen most people to ignore until they get bitten by a forced anti-wraparound vacuum. It'd be nice to get more agressive about that as the situation gets more critical. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
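For illustration, the XID-style guard being agreed on here could look roughly like the following. The limit variables, function name, and message wording are invented, modeled on the xidWarnLimit/xidStopLimit checks in GetNewTransactionId(); a wraparound-aware comparison (MultiXactIdPrecedes) is assumed to be available.

#include "postgres.h"
#include "access/multixact.h"

/* Hypothetical limits, maintained the same way vacuum maintains xid limits. */
static MultiXactId multiWarnLimit;
static MultiXactId multiStopLimit;

/*
 * Sketch only: warn as the next MultiXactId approaches the oldest live
 * one, and eventually refuse to hand out new ones.
 */
static void
check_multixact_limits(MultiXactId nextMXact)
{
    if (!MultiXactIdPrecedes(nextMXact, multiStopLimit))
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("database is not accepting commands that generate MultiXactIds, to avoid wraparound data loss"),
                 errhint("Commit or roll back old prepared and open transactions, then VACUUM.")));
    else if (!MultiXactIdPrecedes(nextMXact, multiWarnLimit))
        ereport(WARNING,
                (errmsg("MultiXactId wraparound is approaching"),
                 errhint("Close long-running transactions so old MultiXactIds can be cleaned up.")));
}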
Re: [HACKERS] ecpglib use PQconnectdbParams
On Fri, Feb 03, 2012 at 01:15:30PM +0100, Christian Ullrich wrote: > Shouldn't these be [7]? You can have up to 6 items, plus the terminator. I take it only keywords have to be [7], right? Committed, thanks for spotting this. There seems to be one more problem that I haven't had time to tackle yet. With this patch the connection regression test (make checktcp) doesn't run through anymore because one test connection cannot be made. It spits out the error message: FATAL: invalid command-line arguments for server process HINT: Try "postgres --help" for more information. This connection used to work without the patch and besides the error message is not really telling in this context. So if anyone is interested in looking at this fine, if not I will as soon as I find some spare time. Michael -- Michael Meskes Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org) Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org Jabber: michael.meskes at googlemail dot com VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Report: race conditions in WAL replay routines
On Mon, Feb 6, 2012 at 6:32 PM, Tom Lane wrote: > Simon Riggs writes: >> The existing errmsg of "tablespace is not empty" doesn't cover all >> reasons why tablespace was not removed. > > Yeah, in fact that particular statement is really pretty bogus for the > replay case, because as the comment says we know that the tablespace > *is* empty so far as full-fledged database objects are concerned. > >> The final message should have >> errmsg "tablespace not fully removed" >> errhint "you should resolve this manually if it causes further problems" > > Planning to go with this: > > errmsg("directories for tablespace %u could not be > removed", > xlrec->ts_id), > errhint("You can remove the directories manually if > necessary."))); > > I thought about an errdetail, but the preceding LOG entries from > destroy_tablespace_directories should provide the details reasonably > well. Looks good. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Mon, Feb 06, 2012 at 12:59:59PM -0500, Bruce Momjian wrote: > > > I'm also not very comfortable with the idea of having flags on the page > > > indicating whether it has a checksum or not. It's not hard to imagine a > > > software of firmware bug or hardware failure that would cause pd_flags > > > field > > > to be zeroed out altogether. It would be more robust if the expected > > > bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with > > > that > > > at upgrade somehow. And it still feels a bit whacky anyway. > > > Good idea. Lets use > > > > 0-0-0 to represent upgraded from previous version, needs a bit set > > 0-0-1 to represent new version number of page, no checksum > > 1-1-1 to represent new version number of page, with checksum > > > > So we have 1 bit dedicated to the page version, 2 bits to the checksum > > indicator > > Interesting point that we would not be guarding against a bit flip from > 1 to 0 for the checksum bit; I agree using two bits is the way to go. I > don't see how upgrade figures into this. > > However, I am unclear how Simon's idea above actually works. We need > two bits for redundancy, both 1, to mark a page as having a checksum. I > don't think mixing the idea of a new page version and checksum enabled > really makes sense, especially since we have to plan for future page > version changes. > > I think we dedicate 2 bits to say we have computed a checksum, and 3 > bits to mark up to 8 possible page versions, so the logic is, in > pd_flags, we use bits 0x8 and 0x16 to indicate that a checksum is stored > on the page, and we use 0x32 and later for the page version number. We > can assume all the remaining bits are for the page version number until > we need to define new bits, and we can start storing them at the end > first, and work forward. If all the version bits are zero, it means the > page version number is still stored in pd_pagesize_version. A simpler solution would be to place two bits for checksum after the existing page bit usage, and place the page version on the last four bits of the 16-bit field --- that still leaves us with 7 unused bits. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Report: race conditions in WAL replay routines
Simon Riggs writes: > The existing errmsg of "tablespace is not empty" doesn't cover all > reasons why tablespace was not removed. Yeah, in fact that particular statement is really pretty bogus for the replay case, because as the comment says we know that the tablespace *is* empty so far as full-fledged database objects are concerned. > The final message should have > errmsg "tablespace not fully removed" > errhint "you should resolve this manually if it causes further problems" Planning to go with this: errmsg("directories for tablespace %u could not be removed", xlrec->ts_id), errhint("You can remove the directories manually if necessary."))); I thought about an errdetail, but the preceding LOG entries from destroy_tablespace_directories should provide the details reasonably well. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Mon, Feb 06, 2012 at 08:51:34AM +0100, Heikki Linnakangas wrote: > >I wonder if we should just dedicate 3 page header bits, call that the > >page version number, and set this new version number to 1, and assume > >all previous versions were zero, and have them look in the old page > >version location if the new version number is zero. I am basically > >thinking of how we can plan ahead to move the version number to a new > >location and have a defined way of finding the page version number using > >old and new schemes. > > Three bits seems short-sighted, but yeah, something like 6-8 bits > should be enough. On the whole, though. I think we should bite the > bullet and invent a way to extend the page header at upgrade. I just emailed a possible layout that allows for future page version number expansion. I don't think there is any magic bullet that will allow for page header extension by pg_upgrade. If it is going to be done, it would have to happen in the backend while the system is running. Anything that requires pg_upgrade to do lots of reads or writes basically makes pg_upgrade useless. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Support for foreign keys with arrays
Hi guys, Please find attached version 3 of our patch. We thoroughly followed your suggestions and were able to implement "EACH foreign key constraints" with multi-column syntax as well. "EACH foreign key constraints" represent PostgreSQL implementation of what are also known as Foreign Key Arrays. Some limitations occur in this release, but as previously agreed these can be refined and defined in future release implementations. This patch adds: * support for EACH REFERENCES column constraint on array types - e.g. c1 INT[] EACH REFERENCES t1 * support for EACH foreign key table constraints - e.g. FOREIGN KEY (EACH c1) REFERENCES t1 * support for EACH foreign keys in multi-column foreign key table constraints - e.g. FOREIGN KEY (c1, EACH c2) REFERENCES t1 (u1, u2) * support for two new referential actions on update/delete operations for single-column only EACH foreign keys: ** EACH CASCADE (deletes or updates elements inside the referencing array) ** EACH SET NULL (sets to NULL referencing element inside the foreign array) * support for array_replace and array_remove functions as required by the above actions As previously said, we preferred to keep this patch simple for 9.2 and forbid EACH CASCADE and EACH SET NULL on multi-column foreign keys. After all, majority of use cases is represented by EACH foreign key column constraints (single-column foreign key arrays), and more complicated use cases can be discussed for 9.3 - should this patch make it. :) We can use multi-dimensional arrays as well as referencing columns. In that case though, ON DELETE EACH CASCADE will behave like ON DELETE EACH SET NULL. This is a safe way of implementing the action. We have some ideas on how to implement this, but we feel it is better to limit the behaviour for now. As far as documentation is concerned, we: * added actions and constraint info in the catalog * added an entire section on "EACH foreign key constraints" in the data definition language chapter (we've simplified the second example, greatly following Noah's advice - let us know if this is ok with you) * added array_remove (currently limited to single-dimensional arrays) and array_replace in the array functions chapter * modified REFERENCES/FOREIGN KEY section in the CREATE TABLE command's documentation and added a special section on the EACH REFERENCES clause (using square braces as suggested) Here follows a short list of notes for Noah: * You proposed these changes: ARRAY CASCADE -> EACH CASCADE and ARRAY SET NULL -> EACH SET NULL. We stack with EACH CASCADE and decided to prepend the "EACH" keyword to standard's CASCADE and SET NULL. Grammar is simpler and the emphasis is on the EACH keyword. * Multi-dimensional arrays: ON DELETE EACH CASCADE -> ON DELETE EACH SET NULL. We cannot determine the array's number of dimensions at definition time as it depends on the actual values. As anticipated above, we have some ideas on multi-dimensional element removal, but not for this patch for the aforementioned reasons. * Support of EACH CASCADE/SET NULL in ConvertTriggerToFK(): we decided to leave it. Regards, Marco -- Marco Nenciarini - 2ndQuadrant Italy PostgreSQL Training, Services and Support marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it EACH-foreign-key-constraints-aka-foreign-key-arrays.v3.patch.bz2 Description: application/bzip -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SKIP LOCKED DATA
On Sat, Feb 4, 2012 at 5:38 AM, Thomas Munro wrote: > On 16 January 2012 21:30, Josh Berkus wrote: > >> Useful, yes. Harder than it looks, probably. I tried to mock up a >> version of this years ago for a project where I needed it, and ran into >> all kinds of race conditions. > > Can you remember any details about those race conditions? > >> Anyway, if it could be made to work, this is extremely useful for any >> application which needs queueing behavior (with is a large plurality, if >> not a majority, of applications). > > Ok, based on this feedback I decided to push further and try > implementating this. See POC/WIP patch attached. It seems to work > for simple examples but I haven't yet tried to break it or see how it > interacts with more complicated queries or high concurrency levels. > It probably contains at least a few rookie mistakes! Any feedback > gratefully received. > > The approach is described in my original email. Short version: > heap_lock_tuple now takes an enum called wait_policy instead of a > boolean called nowait, with the following values: > > LockWaitBlock: wait for lock (like nowait = false before), > > LockWaitError: error if not immediately lockable (like nowait = true > before) > > LockWaitSkip: give up and return HeapTupleWouldBlock if not > immediately lockable (this is a new policy) > > The rest of the patch is about getting the appropriate value down to > that function call, following the example of the existing nowait > support, and skipping rows if you said SKIP LOCKED DATA and you got > HeapTupleWouldBlock. > > Compared to one very popular commercial database's implementation, I > think this is a little bit friendlier for the user who wants to > distribute work. Let's say you want to lock one row without lock > contention, which this patch allows with FETCH FIRST 1 ROW ONLY FOR > UPDATE SKIP LOCKED DATA in an SQL query. In that other system, the > mechanism for limiting the number of rows fetched is done in the WHERE > clause, and therefore the N rows are counted *before* checking if the > lock can be obtained, so users sometimes have to resort to stored > procedures so they can control the FETCH from a cursor imperatively. > In another popular commercial database from Redmond, you can ask for > the top (first) N rows while using the equivalent of SKIP LOCKED DATA > and it has the same effect as this patch as far as I can tell, and > another large blue system is the same. > > As discussed in another branch of this thread, you can probably get > the same effect with transactional advisory locks. But I personally > like row skipping better as an explicit feature because: > > (1) I think there might be an order-of-evaluation problem with a WHERE > clause containing both lock testing and row filtering expressions (ie > it is undefined right?) which you might need subselects to work around > (ie to be sure to avoid false positive locks, not sure about this). > > (2) The advisory technique requires you to introduce an integer > identifier if you didn't already have one and then lock that despite > having already said which rows you want to try to lock in standard row > filtering expressions. > > (3) The advisory technique requires all users of the table to > participate in the advisory lock protocol even if they don't want to > use the option to skip lock. > > (4) It complements NOWAIT (which could also have been done with > transactional advisory locks, with a function like try_lock_or_fail). 
> > (5) I like the idea of directly porting applications from other > databases with this feature (that is what led me here). Yeah -- also your stuff is also going to have much better interactions with LIMIT. Your enhancements will beat an advisory lock implementation all day long. the real competition is not advisory locks, but the mvcc tricks played with PGQ. It's all about implementing producer consumer queues (arguments that databases should not do that are 100% bogus) and I can't help but wonder if that's a better system. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
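A condensed sketch of the interface change Thomas describes above; the enum names are quoted from his description, while the surrounding declaration details are abbreviated here.

/*
 * Sketch of the patch's wait policies: heap_lock_tuple() takes one of
 * these instead of a bool nowait, and with LockWaitSkip it returns
 * HeapTupleWouldBlock instead of waiting, so a FOR UPDATE ... SKIP
 * LOCKED DATA scan can simply move on to the next row.
 */
typedef enum LockWaitPolicy
{
    LockWaitBlock,              /* wait for the lock (old nowait = false) */
    LockWaitError,              /* error immediately (old nowait = true) */
    LockWaitSkip                /* give up without waiting (new) */
} LockWaitPolicy;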
Re: [HACKERS] 16-bit page checksums for 9.2
On Mon, Feb 06, 2012 at 09:05:19AM +, Simon Riggs wrote: > > In any case, I think it's a very bad idea to remove the page version field. > > How are we supposed to ever be able to change the page format again if we > > don't even have a version number on the page? I strongly oppose removing it. > > Nobody is removing the version field, nor is anybody suggesting not > being able to tell which page version we are looking at. Agreed. I thought the idea was that we have a 16-bit page _version_ number and 8+ free page _flag_ bits, which are all currently zero. The idea was to move the version number from 16-bit field into the unused flag bits, and use the 16-bit field for the checksum. I would like to have some logic that would allow tools inspecting the page to tell if they should look for the version number in the bits at the beginning of the page or at the end. Specifically, this becomes the checksum: uint16 pd_pagesize_version; and this holds the page version, if we have updated the page to the new format: uint16 pd_flags; /* flag bits, see below */ Of the 16 bits of pd_flags, these are the only ones used: #define PD_HAS_FREE_LINES 0x0001 /* are there any unused line pointers? */ #define PD_PAGE_FULL0x0002 /* not enough free space for new * tuple? */ #define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to * everyone */ #define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */ > > I'm also not very comfortable with the idea of having flags on the page > > indicating whether it has a checksum or not. It's not hard to imagine a > > software of firmware bug or hardware failure that would cause pd_flags field > > to be zeroed out altogether. It would be more robust if the expected > > bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that > > at upgrade somehow. And it still feels a bit whacky anyway. > Good idea. Lets use > > 0-0-0 to represent upgraded from previous version, needs a bit set > 0-0-1 to represent new version number of page, no checksum > 1-1-1 to represent new version number of page, with checksum > > So we have 1 bit dedicated to the page version, 2 bits to the checksum > indicator Interesting point that we would not be guarding against a bit flip from 1 to 0 for the checksum bit; I agree using two bits is the way to go. I don't see how upgrade figures into this. However, I am unclear how Simon's idea above actually works. We need two bits for redundancy, both 1, to mark a page as having a checksum. I don't think mixing the idea of a new page version and checksum enabled really makes sense, especially since we have to plan for future page version changes. I think we dedicate 2 bits to say we have computed a checksum, and 3 bits to mark up to 8 possible page versions, so the logic is, in pd_flags, we use bits 0x8 and 0x16 to indicate that a checksum is stored on the page, and we use 0x32 and later for the page version number. We can assume all the remaining bits are for the page version number until we need to define new bits, and we can start storing them at the end first, and work forward. If all the version bits are zero, it means the page version number is still stored in pd_pagesize_version. > >> I wonder if we should just dedicate 3 page header bits, call that the > >> page version number, and set this new version number to 1, and assume > >> all previous versions were zero, and have them look in the old page > >> version location if the new version number is zero. 
I am basically > >> thinking of how we can plan ahead to move the version number to a new > >> location and have a defined way of finding the page version number using > >> old and new schemes. > > > > > > Three bits seems short-sighted, but yeah, something like 6-8 bits should be > > enough. On the whole, though. I think we should bite the bullet and invent a > > way to extend the page header at upgrade. > > There are currently many spare bits. I don't see any need to allocate > them to this specific use ahead of time - especially since that is the > exact decision we took last time when we reserved 16 bits for the > version. Right, but I am thinking we should set things up so we can grow the page version number into the unused bit, rather than box it between bits we are already using. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
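To make the bit layout under discussion easier to compare, here is one possible rendering. The existing pd_flags bits quoted above stay where they are; the new defines below are purely illustrative values, not a settled on-disk format.

#include "postgres.h"
#include "storage/bufpage.h"    /* existing PD_* bits, PageHeader, pd_flags */

/*
 * Sketch only: two redundant checksum-indicator bits, so that a single
 * bit flip cannot make a non-checksummed page look checksummed.
 */
#define PD_CHECKSUM1        0x0008
#define PD_CHECKSUM2        0x0010

/*
 * Sketch only: version bits stored from the top of the word, growing
 * downward as more are needed; 0 means the version still lives in
 * pd_pagesize_version, as on pages from older releases.
 */
#define PD_VERSION_MASK     0xE000

/* A page is treated as checksummed only if both indicator bits are set. */
#define PageHasChecksum(page) \
    ((((PageHeader) (page))->pd_flags & (PD_CHECKSUM1 | PD_CHECKSUM2)) == \
     (PD_CHECKSUM1 | PD_CHECKSUM2))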
Re: [HACKERS] [GENERAL] pg_dump -s dumps data?!
On 01/31/2012 11:10 PM, Andrew Dunstan wrote: Here's a possible patch for the exclude-table-data problem along the lines you suggest. Should I apply this? cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] about type cast
Robert Haas writes: > 2012/2/6 zoulx1982 : >> my question is why int can be cast to bit, but not to bit varying? > Off the top of my head, I'm guessing that it's just a case of nobody > having implemented it. I have some vague recollection that we didn't want to be squishy about the length of the resulting bitstring. When you cast to bit you're more or less forced to specify what you want, since the spec-mandated default length is only 1. varbit would leave us having to make a decision in the code. Anyway, try searching the pgsql-hackers archives if you want some history. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] about type cast
2012/2/6 zoulx1982 : > Hi, > there is a problem with a type cast that I don't understand; the following is my > test. > > postgres=# select 10::bit(3); > bit > - > 010 > (1 row) > postgres=# select 10::bit varying(3); > ERROR: cannot cast type integer to bit varying > LINE 1: select 10::bit varying(3); > ^ > postgres=# > > my question is why int can be cast to bit, but not to bit varying? > I want to know the reason. > Thank you for your time. Off the top of my head, I'm guessing that it's just a case of nobody having implemented it. You can do this: rhaas=# select 10::bit(3)::varbit; varbit 010 (1 row) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ecpglib use PQconnectdbParams
On Fri, Feb 03, 2012 at 01:15:30PM +0100, Christian Ullrich wrote: > * Peter Eisentraut wrote: > >I noticed ecpglib still uses PQconnectdb() with a craftily assembled > >connection string. Here is a patch to use PQconnectdbParams() instead. > > + const char *conn_keywords[6]; > + const char *conn_values[6]; > > Shouldn't these be [7]? You can have up to 6 items, plus the terminator. This bug is now in GIT. -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
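To spell out the off-by-one: PQconnectdbParams() expects NULL-terminated keyword/value arrays, so six possible items need seven slots. A standalone sketch follows; the particular option set is illustrative (roughly what ecpglib passes, from memory), and the function name is invented.

#include <stddef.h>
#include <libpq-fe.h>

/*
 * Sketch: up to six keyword/value pairs may be filled in, and
 * PQconnectdbParams() requires a NULL terminator after the last one,
 * hence arrays of seven entries.
 */
static PGconn *
connect_sketch(const char *host, const char *port, const char *dbname,
               const char *user, const char *passwd, const char *options)
{
    const char *conn_keywords[7];
    const char *conn_values[7];
    int         i = 0;

    if (host)    { conn_keywords[i] = "host";     conn_values[i++] = host; }
    if (port)    { conn_keywords[i] = "port";     conn_values[i++] = port; }
    if (dbname)  { conn_keywords[i] = "dbname";   conn_values[i++] = dbname; }
    if (user)    { conn_keywords[i] = "user";     conn_values[i++] = user; }
    if (passwd)  { conn_keywords[i] = "password"; conn_values[i++] = passwd; }
    if (options) { conn_keywords[i] = "options";  conn_values[i++] = options; }
    conn_keywords[i] = NULL;            /* the terminator uses the 7th slot */
    conn_values[i] = NULL;

    return PQconnectdbParams(conn_keywords, conn_values, 0);
}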
Re: [HACKERS] freezing multixacts
On Mon, Feb 6, 2012 at 9:31 AM, Alvaro Herrera wrote: >> Suppose you have a tuple A which is locked by a series of transactions >> T0, T1, T2, ...; AIUI, each new locker is going to have to create a >> new mxid with all the existing entries plus a new one for itself. >> But, unless I'm confused, as it's doing so, it can discard any entries >> for locks taken by transactions which are no longer running. > > That's correct. But the problem is a tuple that is locked or updated by > a very old transaction that doesn't commit or rollback, and the tuple is > never locked again. Eventually the Xid could remain live while the mxid > is in wraparound danger. Ah, I see. I think we should probably handle that the same way we do for XIDs: try to force autovac when things get tight, then start issuing warnings, and finally just refuse to assign any more MXIDs. Another thing that might make sense, for both XIDs and MXIDs, is to start killing transactions that are preventing vacuum/autovacuum from doing their thing. This could mean either killing the people who are holding back RecentGlobalXmin, so that we can actually freeze the old stuff; or killing people who are holding a conflicting lock, using the recovery-conflict stuff or some adaptation of it. We've made it fairly difficult to avoid having autovacuum run at all with the xidVacLimit/xidStopLimit stuff, but there's still no real defense against autovacuum running but failing to mitigate the problem, either because there's a long-running transaction holding a snapshot open, or because someone is sitting on a relation or buffer lock. This of course is off-topic from your patch here... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] freezing multixacts
On Thu, Feb 2, 2012 at 4:33 AM, Alvaro Herrera wrote: > However, there are cases where not even that is possible -- consider > tuple freezing during WAL recovery. Recovery is going to need to > replace those multis with other multis, but it cannot create new multis > itself. The only solution here appears to be that when multis are > frozen in the master, replacement multis have to be logged too. So the > heap_freeze_tuple Xlog record will have a map of old multi to new. That > way, recovery can just determine the new multi to use for any particular > old multi; since multixact creation is also logged, we're certain that > the replacement value has already been defined. Multixacts are ignored during recovery. Why do anything at all? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Dry-run mode for pg_archivecleanup
Ciao Alvaro, On 06/02/12 16:02, Alvaro Herrera wrote: FWIW I committed this last week, though I changed the debug message wording slightly -- hope that's OK; it can be adjusted of course. http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b2e431a4db81a735d1474c4d1565a20b835878c9 Beautiful! :) Thank you very much. Ciao, Gabriele -- Gabriele Bartolini - 2ndQuadrant Italia PostgreSQL Training, Services and Support gabriele.bartol...@2ndquadrant.it | www.2ndQuadrant.it -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Dry-run mode for pg_archivecleanup
Excerpts from Gabriele Bartolini's message of Tue Jan 31 12:29:51 -0300 2012: > Hi Robert, > > sorry for the delay. > > On 27/01/12 15:47, Robert Haas wrote: > > This email thread seems to have trailed off without reaching a > > conclusion. The patch is marked as Waiting on Author in the CommitFest > > application, but I'm not sure that's accurate. Can we try to nail this > > down? > Here is my final version which embeds comments from Josh. I have also > added debug information to be printed in case '-d' is given. FWIW I committed this last week, though I changed the debug message wording slightly -- hope that's OK; it can be adjusted of course. http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b2e431a4db81a735d1474c4d1565a20b835878c9 -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches
On Sat, Feb 4, 2012 at 2:13 PM, Jeff Janes wrote: > We really need to nail that down. Could you post the scripts (on the > wiki) you use for running the benchmark and making the graph? I'd > like to see how much work it would be for me to change it to detect > checkpoints and do something like color the markers blue during > checkpoints and red elsewhen. They're pretty crude - I've attached them here. > Also, I'm not sure how bad that graph really is. The overall > throughput is more variable, and there are a few latency spikes but > they are few. The dominant feature is simply that the long-term > average is less than the initial burst.Of course the goal is to have > a high level of throughput with a smooth latency under sustained > conditions. But to expect that that long-sustained smooth level of > throughput be identical to the "initial burst throughput" sounds like > more of a fantasy than a goal. That's probably true, but the drop-off is currently quite extreme. The fact that disabling full_page_writes causes throughput to increase by >4x is dismaying, at least to me. > If we want to accept the lowered > throughput and work on the what variability/spikes are there, I think > a good approach would be to take the long term TPS average, and dial > the number of clients back until the initial burst TPS matches that > long term average. Then see if the spikes still exist over the long > term using that dialed back number of clients. Hmm, I might be able to do that. > I don't think the full-page-writes are leading to WALInsert > contention, for example, because that would probably lead to smooth > throughput decline, but not those latency spikes in which those entire > seconds go by without transactions. Right. > I doubt it is leading to general > IO compaction, as the IO at that point should be pretty much > sequential (the checkpoint has not yet reached the sync stage, the WAL > is sequential). So I bet that that is caused by fsyncs occurring at > xlog segment switches, and the locking that that entails. That's definitely possible. > If I > recall, we can have a segment which is completely written to OS and in > the process of being fsynced, and we can have another segment which is > in some state of partially in wal_buffers and partly written out to OS > cache, but that we can't start reusing the wal_buffers that were > already written to OS for that segment (and therefore are > theoretically available for reuse by the upcoming 3rd segment) until > the previous segments fsync has completed. So all WALInsert's freeze. > Or something like that. This code has changed a bit since last time > I studied it. Yeah, I need to better-characterize where the pauses are coming from, but I'm reluctant to invest too much effort in until Heikki's xlog scaling patch goes in, because I think that's going to change things enough that any work done now will mostly be wasted. It might be worth trying a run with wal_buffers=32MB or something like that, just to see whether that mitigates any of the locking pile-ups. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company runtestl Description: Binary data makeplot Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] freezing multixacts
Excerpts from Robert Haas's message of jue feb 02 11:24:08 -0300 2012: > On Wed, Feb 1, 2012 at 11:33 PM, Alvaro Herrera > wrote: > > If there's only one remaining member, the problem is easy: replace it > > with that transaction's xid, and set the appropriate hint bits. But if > > there's more than one, the only way out is to create a new multi. This > > increases multixactid consumption, but I don't see any other option. > > Why do we need to freeze anything if the transactions are still > running? We certainly don't freeze regular transaction IDs while the > transactions are still running; it would give wrong answers. It's > probably possible to do it for mxids, but why would you need to? Well, I was thinking that we could continue generating the mxids continuously and if we didn't freeze the old running ones, we could overflow. So one way to deal with the problem would be rewriting the old ones into new ones. But it has occurred to me that instead of doing that we could simply disallow creation of new ones until the oldest ones have been closed and removed from tables -- which is more in line with what we do for Xids anyway. > Suppose you have a tuple A which is locked by a series of transactions > T0, T1, T2, ...; AIUI, each new locker is going to have to create a > new mxid with all the existing entries plus a new one for itself. > But, unless I'm confused, as it's doing so, it can discard any entries > for locks taken by transactions which are no longer running. That's correct. But the problem is a tuple that is locked or updated by a very old transaction that doesn't commit or rollback, and the tuple is never locked again. Eventually the Xid could remain live while the mxid is in wraparound danger. > So given > an mxid with living members, any dead member in that mxid must have > been living at the time the newest member was added. Surely we can't > be consuming mxids anywhere near fast enough for that to be a problem. Well, the problem is that while it should be rare to consume mxids as fast as necessary for this problem to show up, it *is* possible -- unless we add some protection that they are not created until the old ones are frozen (which now means "removed"). > > However, there are cases where not even that is possible -- consider > > tuple freezing during WAL recovery. Recovery is going to need to > > replace those multis with other multis, but it cannot create new multis > > itself. The only solution here appears to be that when multis are > > frozen in the master, replacement multis have to be logged too. So the > > heap_freeze_tuple Xlog record will have a map of old multi to new. That > > way, recovery can just determine the new multi to use for any particular > > old multi; since multixact creation is also logged, we're certain that > > the replacement value has already been defined. > > This doesn't sound right. Why would recovery need to create a multi > that didn't exist on the master? Any multi it applies to a record > should be one that it was told to apply by the master; and the master > should have already WAL-logged the creation of that multi. I don't > think that "replacement" mxids have to be logged; I think that *all* > mxids have to be logged. Am I all wet? Well, yeah, all mxids are logged, in particular those that would have been used for replacement. However I think I've discarded the idea of replacement altogether now, because it makes simpler both on master and slave. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. 
PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] libpq parallel build
src/interfaces/libpq/fe-misc.c #include-s pg_config_paths.h, but in the makefile that dependency is not declared. This breaks the build with 'make -jN' (with a "big" N) ... sometimes (non-deterministically). Patch attached. Made from 9.1.1, but from browsing the git master on the web interface, the facts above seem to still be true. -- Lionel

diff --recursive -u misc/build/postgresql-9.1.1/src/interfaces/libpq/Makefile misc/build/postgresql-9.1.1.patch/src/interfaces/libpq/Makefile
--- misc/build/postgresql-9.1.1/src/interfaces/libpq/Makefile 2012-02-06 15:11:19.0 +0100
+++ misc/build/postgresql-9.1.1.patch/src/interfaces/libpq/Makefile 2012-02-06 15:02:51.0 +0100
@@ -109,6 +109,7 @@ libpq.rc: $(top_builddir)/src/Makefile.global
 fe-connect.o: fe-connect.c $(top_builddir)/src/port/pg_config_paths.h
+fe-misc.o: fe-misc.c $(top_builddir)/src/port/pg_config_paths.h
 $(top_builddir)/src/port/pg_config_paths.h:
 	$(MAKE) -C $(top_builddir)/src/port pg_config_paths.h

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] double writes using "double-write buffer" approach [WIP]
On Sat, Jan 28, 2012 at 7:31 AM, Dan Scales wrote: > Let me know if you have any thoughts/comments, etc. The patch is > enclosed, and the README.doublewrites is updated a fair bit. ISTM that the double-write can prevent torn-pages in neither double-write file nor data file in *base backup*. Because both double-write file and data file can be backed up while being written. Is this right? To avoid the torn-page problem, we should write FPI to WAL during online backup even if the double-write has been committed? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On 06.02.2012 11:25, Simon Riggs wrote: On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas wrote: Good idea. However, the followup idea to that discussion was to not only avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound altogether, by allowing clog to grow indefinitely. You do want to freeze at some point of course, to truncate the clog, but it would be nice to not have a hard limit. The way to do that is to store an xid "epoch" in the page header, so that Xids are effectively 64-bits wide, even though the xid fields on the tuple header are only 32-bits wide. That does require a new field in the page header. We wouldn't need to do that would we? Huh? Do you mean that we wouldn't need to implement that feature? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas wrote: > On 06.02.2012 10:05, Simon Riggs wrote: >> >> On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas >> wrote: >>> >>> On 06.02.2012 05:48, Bruce Momjian wrote: Thanks. Clearly we don't need 16 bits to represent our page version number because we rarely change it. The other good thing is I don't remember anyone wanting additional per-page storage in the past few years except for a checksum. >>> >>> >>> There's this idea that I'd like to see implemented: >>> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php >> >> >> As you'll note, adding that field will change the page format and is >> therefore ruled out for 9.2. >> >> ISTM there is a different way to handle that anyway. All we need to do >> is to record the LSN of the last wraparound in shared memory/control >> file. Any block with page LSN older than that has all-frozen rows. >> That doesn't use any space nor does it require another field to be >> set. > > > Good idea. However, the followup idea to that discussion was to not only > avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound > altogether, by allowing clog to grow indefinitely. You do want to freeze at > some point of course, to truncate the clog, but it would be nice to not have > a hard limit. The way to do that is to store an xid "epoch" in the page > header, so that Xids are effectively 64-bits wide, even though the xid > fields on the tuple header are only 32-bits wide. That does require a new > field in the page header. We wouldn't need to do that would we? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On 06.02.2012 10:05, Simon Riggs wrote: On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas wrote: On 06.02.2012 05:48, Bruce Momjian wrote: Thanks. Clearly we don't need 16 bits to represent our page version number because we rarely change it. The other good thing is I don't remember anyone wanting additional per-page storage in the past few years except for a checksum. There's this idea that I'd like to see implemented: http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php As you'll note, adding that field will change the page format and is therefore ruled out for 9.2. ISTM there is a different way to handle that anyway. All we need to do is to record the LSN of the last wraparound in shared memory/control file. Any block with page LSN older than that has all-frozen rows. That doesn't use any space nor does it require another field to be set. Good idea. However, the followup idea to that discussion was to not only avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound altogether, by allowing clog to grow indefinitely. You do want to freeze at some point of course, to truncate the clog, but it would be nice to not have a hard limit. The way to do that is to store an xid "epoch" in the page header, so that Xids are effectively 64-bits wide, even though the xid fields on the tuple header are only 32-bits wide. That does require a new field in the page header. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
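To make the epoch idea above concrete, here is a minimal, hedged sketch. It assumes the per-page epoch simply counts full 32-bit xid wraparounds; the real bookkeeping (for example, for tuples written just before a wraparound) would be subtler, and none of these names come from an actual patch.

/* Illustrative sketch only: widening a 32-bit tuple xid to an effective
 * 64-bit value using a per-page epoch.  Assumes the epoch counts full
 * wraparounds. */
#include <stdint.h>

typedef uint32_t TransactionId;

static uint64_t
effective_xid(uint32_t page_epoch, TransactionId xid)
{
    return ((uint64_t) page_epoch << 32) | xid;
}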
Re: [HACKERS] Report: race conditions in WAL replay routines
On Sun, Feb 5, 2012 at 11:14 PM, Tom Lane wrote: > Simon Riggs writes: >> Please post the patch rather than fixing directly. There's some subtle >> stuff there and it would be best to discuss first. > > And here's a proposed patch for not throwing ERROR during replay of DROP > TABLESPACE. I had originally thought this would be a one-liner > s/ERROR/LOG/, but on inspection destroy_tablespace_directories() really > needs to be changed too, so that it doesn't throw error for unremovable > directories. Looks good. The existing errmsg of "tablespace is not empty" doesn't cover all the reasons why the tablespace was not removed. The final message should have errmsg "tablespace not fully removed" and errhint "you should resolve this manually if it causes further problems". The errdetail is good. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
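For illustration only, the suggested wording might look roughly like this in the usual ereport() style; the severity level, the tablespace variable name, and the errdetail text are placeholders rather than the actual patch.

/* Sketch of the suggested message wording only; not the committed code.
 * "tablespacename" and the errdetail text are placeholders. */
ereport(LOG,
        (errmsg("tablespace \"%s\" was not fully removed", tablespacename),
         errdetail("Some directories could not be removed."),   /* placeholder */
         errhint("You should resolve this manually if it causes further problems.")));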
Re: [HACKERS] Report: race conditions in WAL replay routines
On Sun, Feb 5, 2012 at 10:23 PM, Tom Lane wrote: > Simon Riggs writes: >> Please post the patch rather than fixing directly. There's some subtle >> stuff there and it would be best to discuss first. > > Here's a proposed patch for the issues around unlocked updates of > shared-memory state. After going through this I believe that there is > little risk of any real problems in the current state of the code; this > is more in the nature of future-proofing against foreseeable changes. > (One such is that we'd discussed fixing the age() function to work > during Hot Standby.) So I suggest applying this to HEAD but not > back-patching. All looks very good to me. Agreed. > Except for one thing. I realized while looking at the NEXTOID replay > code that it is completely broken: it only advances > ShmemVariableCache->nextOid when that's less than the value in the WAL > record. So that comparison fails if the OID counter wraps around during > replay. I've fixed this in the attached patch by just forcibly > assigning the new value instead of trying to be smart, and I think > probably that aspect of it needs to be back-patched. Ouch! Well spotted. Suggest fixing that as a separate patch; looks like backpatch to 8.0. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
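A paraphrased before/after sketch of the NEXTOID replay issue described above (not the actual patch hunk); "nextOid" stands for the value carried in the XLOG_NEXTOID record.

/* Before (paraphrased): only moves the counter forward, which misbehaves
 * if the OID counter wrapped around during replay. */
if (ShmemVariableCache->nextOid < nextOid)
    ShmemVariableCache->nextOid = nextOid;

/* After: simply adopt the value from the WAL record. */
ShmemVariableCache->nextOid = nextOid;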
Re: [HACKERS] 16-bit page checksums for 9.2
On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas wrote: > On 06.02.2012 05:48, Bruce Momjian wrote: >> >> On Sun, Feb 05, 2012 at 08:40:09PM +, Simon Riggs wrote: >>> >>> In the proposed scheme there are two flag bits set on the page to >>> indicate whether the field is used as a checksum rather than a version >>> number. So it's possible the checksum could look like a valid page >>> version number, but we'd still be able to tell the difference. >> >> >> Thanks. Clearly we don't need 16 bits to represent our page version >> number because we rarely change it. The other good thing is I don't >> remember anyone wanting additional per-page storage in the past few >> years except for a checksum. > > > There's this idea that I'd like to see implemented: > http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php As you'll note, adding that field will change the page format and is therefore ruled out for 9.2. ISTM there is a different way to handle that anyway. All we need to do is to record the LSN of the last wraparound in shared memory/control file. Any block with page LSN older than that has all-frozen rows. That doesn't use any space nor does it require another field to be set. That leaves the same problem that if someone writes to the page you need to freeze the rows first. > In any case, I think it's a very bad idea to remove the page version field. > How are we supposed to ever be able to change the page format again if we > don't even have a version number on the page? I strongly oppose removing it. Nobody is removing the version field, nor is anybody suggesting not being able to tell which page version we are looking at. > I'm also not very comfortable with the idea of having flags on the page > indicating whether it has a checksum or not. It's not hard to imagine a > software or firmware bug or hardware failure that would cause pd_flags field > to be zeroed out altogether. It would be more robust if the expected > bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that > at upgrade somehow. And it still feels a bit whacky anyway. Good idea. Let's use 0-0-0 to represent a page upgraded from the previous version that still needs a bit set; 0-0-1 to represent the new page version number, no checksum; and 1-1-1 to represent the new page version number, with checksum. So we have 1 bit dedicated to the page version and 2 bits to the checksum indicator. >> I wonder if we should just dedicate 3 page header bits, call that the >> page version number, and set this new version number to 1, and assume >> all previous versions were zero, and have them look in the old page >> version location if the new version number is zero. I am basically >> thinking of how we can plan ahead to move the version number to a new >> location and have a defined way of finding the page version number using >> old and new schemes. > > > Three bits seems short-sighted, but yeah, something like 6-8 bits should be > enough. On the whole, though, I think we should bite the bullet and invent a > way to extend the page header at upgrade. There are currently many spare bits. I don't see any need to allocate them to this specific use ahead of time - especially since that is the exact decision we took last time when we reserved 16 bits for the version. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
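Purely as an illustration of the 0-0-0 / 0-0-1 / 1-1-1 scheme above, a hedged sketch follows; the macro names and the particular pd_flags bit positions are invented for this example and are not part of the proposal or of bufpage.h.

/* Hypothetical illustration only; names and bit positions are invented. */
#include <stdbool.h>
#include <stdint.h>

#define PD_VERSION_MASK     0x0700   /* three pd_flags bits assumed spare */
#define PD_LAYOUT_UPGRADED  0x0000   /* 0-0-0: upgraded page, bit not yet set */
#define PD_LAYOUT_CURRENT   0x0100   /* 0-0-1: new page version, no checksum */
#define PD_LAYOUT_CHECKSUM  0x0700   /* 1-1-1: new page version, with checksum */

static bool
version_field_holds_checksum(uint16_t pd_flags)
{
    return (pd_flags & PD_VERSION_MASK) == PD_LAYOUT_CHECKSUM;
}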
Re: [HACKERS] 16-bit page checksums for 9.2
On Mon, Feb 6, 2012 at 4:48 AM, Bruce Momjian wrote: > On Sun, Feb 05, 2012 at 08:40:09PM +, Simon Riggs wrote: >> On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian wrote: >> > On Sat, Dec 24, 2011 at 03:56:58PM +, Simon Riggs wrote: >> >> > Also, as far as I can see this patch usurps the page version field, >> >> > which I find unacceptably short-sighted. Do you really think this is >> >> > the last page layout change we'll ever make? >> >> >> >> No, I don't. I hope and expect the next page layout change to >> >> reintroduce such a field. >> >> >> >> But since we're agreed now that upgrading is important, changing page >> >> format isn't likely to be happening until we get an online upgrade >> >> process. So future changes are much less likely. If they do happen, we >> >> have some flag bits spare that can be used to indicate later versions. >> >> It's not the prettiest thing in the world, but it's a small ugliness >> >> in return for an important feature. If there was a way without that, I >> >> would have chosen it. >> > >> > Have you considered the CRC might match a valuid page version number? >> > Is that safe? >> >> In the proposed scheme there are two flag bits set on the page to >> indicate whether the field is used as a checksum rather than a version >> number. So its possible the checksum could look like a valid page >> version number, but we'd still be able to tell the difference. > > Thanks. Clearly we don't need 16 bits to represent our page version > number because we rarely change it. The other good thing is I don't > remember anyone wanting additional per-page storage in the past few > years except for a checksum. > > I wonder if we should just dedicate 3 page header bits, call that the > page version number, and set this new version number to 1, and assume > all previous versions were zero, and have them look in the old page > version location if the new version number is zero. I am basically > thinking of how we can plan ahead to move the version number to a new > location and have a defined way of finding the page version number using > old and new schemes. > > I don't want to get to a point where we need a bit per page number > version. Agreed, though I think we need at least 2 bits set if we are using the checksum, so we should start at version 2 to ensure that. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] about type cast
Hi, there is a problem with type casts that I don't understand; below is my test. postgres=# select 10::bit(3); bit - 010 (1 row) postgres=# select 10::bit varying(3); ERROR: cannot cast type integer to bit varying LINE 1: select 10::bit varying(3); ^ postgres=# My question is: why can int be cast to bit, but not to bit varying? I want to know the reason. Thank you for your time.
Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server
Thanks for the comments! (2012/02/06 5:08), Kohei KaiGai wrote: > Yes. In my opinion, one significant benefit of pgsql_fdw is to execute > qualifiers on the distributed nodes; that enables to utilize multiple > CPU resources efficiently. > Duplicate checks are reliable way to keep invisible tuples being filtered > out, indeed. But it shall degrade one competitive characteristics of the > pgsql_fdw. > > https://github.com/kaigai/pg_strom/blob/master/plan.c#L693 > In my module, qualifiers being executable on device side are detached > from the baserel->baserestrictinfo, and remaining qualifiers are chained > to the list. > The is_device_executable_qual() is equivalent to is_foreign_expr() in > the pgsql_fdw module. Agreed, I too think that pushed-down qualifiers should not be evaluated on the local side again, from the viewpoint of performance. > Of course, it is your decision, and I might miss something. > > BTW, what is the undesirable behavior on your previous implementation? In early development, maybe during testing PREPARE/EXECUTE or DECLARE, I saw that iterated execution of a foreign scan produced wrong results which included rows that did NOT match the pushed-down qualifiers. And, at last, I recalled what happened at that time. It was just a trivial bug I made. Perhaps I had removed the pushed-down qualifiers in the Path generation phase, so the generated plan node lost the qualifiers permanently. In short, I'll remove remote qualifiers from baserestrictinfo, like pg_strom. >>> [Design comment] >>> I'm not sure the reason why store_result() uses MessageContext to save >>> the Tuplestorestate within PgsqlFdwExecutionState. >>> The source code comment says it is used to avoid memory leaks in error >>> cases. I also have a similar experience on implementation of my fdw module, >>> so, I could understand per-scan context is already cleared at the timing of >>> resource-release-callback, thus, handlers to external resource have to be >>> saved on separated memory context. >>> In my personal opinion, the right design is to declare a memory context for >>> pgsql_fdw itself, instead of the abuse of existing memory context. >>> (More wise design is to define sub-memory-context for each foreign-scan, >>> then, remove the sub-memory-context after release handlers.) >> >> I simply chose built-in context which has enough lifespan, but now I >> think that using MessageContext directly is not recommended way. As you >> say, creating new context as child of MessageContext for each scan in >> BeginForeignScan (or first IterateForeignScan) would be better. Please >> see attached patch. >> >> One other option is getting rid of tuplestore by holding result rows as >> PGresult, and track it for error cases which might happen. >> ResourceOwner callback can be used to release PGresult on error, similar >> to PGconn. >> > If we could have set of results on per-query memory context (thus, > no need to explicit release on error timing), it is more ideal design. > It it possible to implement based on the libpq APIs? Currently no, so I used a tuplestore even though it needs to copy the results. However, Kyotaro Horiguchi's patch might make it possible. I'm reading his patch to determine whether it suits pgsql_fdw. http://archives.postgresql.org/message-id/20120202143057.ga12...@gmail.com > Please note that per-query memory context is already released on > ResourceOwner callback is launched, so, it is unavailable to implement > if libpq requires to release some resource. I see. 
We need to use a context which has a longer lifetime if we want to track malloc'ed PGresults. I already use CacheContext for connection pooling, so linking PGresults to their source connection would be a solution. >>> [Design comment] >>> When "BEGIN" should be issued on the remote-side? >>> The connect_pg_server() is an only chance to issue "BEGIN" command >>> at the remote-side on connection being opened. However, the connection >>> shall be kept unless an error is not raised. Thus, the remote-side will >>> continue to work within a single transaction block, even if local-side >>> iterates >>> a pair of "BEGIN" and "COMMIT". >>> I'd like to suggest to close the transaction block at the timing of either >>> end of the scan, transaction or sub-transaction. >> >> Indeed, remote transactions should be terminated at some timing. >> Terminating at the end of a scan seems troublesome because a connection >> might be shared by multiple scans in a query. I'd prefer aborting >> remote transaction at the end of local query. Please see >> abort_remote_tx in attached patch. >> > It seems to me abort_remote_tx in ReleaseConnection() is reasonable. > However, isn't it needed to have ABORT in GetConnection() at first time? Hm, forcing the overhead of aborting a transaction onto all local queries seems unreasonable. A redundant BEGIN doesn't cause an error but just generates a WARNING, so I'll remove the abort_remote_tx preceding begin_remote_tx. >>> [Comment to improve] >>> It seems to me the design of