[HACKERS] Single pass vacuum - take 2
Hi All, Here is a revised patch based on our earlier discussion. I implemented Robert's idea of tracking the vacuum generation number in the line pointer itself. For LP_DEAD line pointers, lp_off/lp_len are unused (and always set to 0 for heap tuples). We use those 30 bits to store the generation number of the vacuum which would have potentially removed the corresponding index pointers, if the vacuum finished successfully. The pg_class information is used to know the status of the vacuum, whether it failed or succeeded. 30-bit numbers are large enough that we can ignore any wrap-around related issues. With this change, we don't need any additional header or special space in the page, which was one of the main objections to the previous version. Other than this major change, I have added code commentary at relevant places and also fixed the item.h comments to reflect the change. I think the patch is ready for a serious review now. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c index 20bca0d..6213631 100644 --- a/contrib/pageinspect/heapfuncs.c +++ b/contrib/pageinspect/heapfuncs.c @@ -155,6 +155,7 @@ heap_page_items(PG_FUNCTION_ARGS) * many other ways, but at least we won't crash. 
*/ if (ItemIdHasStorage(id) && + !ItemIdIsDead(id) && lp_len >= sizeof(HeapTupleHeader) && lp_offset == MAXALIGN(lp_offset) && lp_offset + lp_len <= raw_page_size) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 06db65d..cf65c05 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -3984,7 +3984,8 @@ log_heap_clean(Relation reln, Buffer buffer, OffsetNumber *redirected, int nredirected, OffsetNumber *nowdead, int ndead, OffsetNumber *nowunused, int nunused, - TransactionId latestRemovedXid) + TransactionId latestRemovedXid, + uint32 vacgen) { xl_heap_clean xlrec; uint8 info; @@ -3999,6 +4000,7 @@ log_heap_clean(Relation reln, Buffer buffer, xlrec.latestRemovedXid = latestRemovedXid; xlrec.nredirected = nredirected; xlrec.ndead = ndead; + xlrec.vacgen = vacgen; rdata[0].data = (char *) &xlrec; rdata[0].len = SizeOfHeapClean; @@ -4300,6 +4302,7 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record) int ndead; int nunused; Size freespace; + uint32 vacgen; /* * We're about to remove tuples. 
In Hot Standby mode, ensure that there's @@ -4332,6 +4335,7 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record) nredirected = xlrec->nredirected; ndead = xlrec->ndead; + vacgen = xlrec->vacgen; end = (OffsetNumber *) ((char *) xlrec + record->xl_len); redirected = (OffsetNumber *) ((char *) xlrec + SizeOfHeapClean); nowdead = redirected + (nredirected * 2); @@ -4343,7 +4347,8 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record) heap_page_prune_execute(buffer, redirected, nredirected, nowdead, ndead, - nowunused, nunused); + nowunused, nunused, + vacgen); freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */ diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c index 0cfa866..00ac676 100644 --- a/src/backend/access/heap/pruneheap.c +++ b/src/backend/access/heap/pruneheap.c @@ -31,9 +31,12 @@ typedef struct TransactionId new_prune_xid; /* new prune hint value for page */ TransactionId latestRemovedXid; /* latest xid to be removed by this * prune */ + int already_dead; /* number of already dead line pointers */ + int nredirected; /* numbers of entries in arrays below */ int ndead; int nunused; + /* arrays that accumulate indexes of items to be changed */ OffsetNumber redirected[MaxHeapTuplesPerPage * 2]; OffsetNumber nowdead[MaxHeapTuplesPerPage]; @@ -125,8 +128,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer, TransactionId OldestXmin) TransactionId ignore = InvalidTransactionId; /* return value not * needed */ - /* OK to prune */ - (void) heap_page_prune(relation, buffer, OldestXmin, true, ignore); + /* OK to prune - pass invalid vacuum generation number */ + (void) heap_page_prune(relation, buffer, OldestXmin, true, ignore, 0); } /* And release buffer lock */ @@ -153,13 +156,15 @@ heap_page_prune_opt(Relation relation, Buffer buffer, TransactionId OldestXmin) */ int heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin, -bool report_stats, TransactionId *latestRemovedXid) +bool 
report_stats, TransactionId *latestRemovedXid, +uint32 current_vacgen) { int ndeleted = 0; Page page = BufferGetPage(buffer); OffsetNumber offnum, maxoff; PruneState prstate; + uint32 last_finished_vacgen = RelationGetLastVacGen(relation); /* * Our strategy is to scan the page and make lists of items to change, @@ -175,6 +180,7 @@
[HACKERS] tab stop in README
hi, i know that postgresql uses ts=4 for C source code. but how about documentation? src/backend/access/transam/README seems to have both ts=4 and ts=8 mixed. YAMAMOTO Takashi -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
On Fri, Aug 12, 2011 at 04:19:37PM -0700, daveg wrote: This seems to be bug month for my client. Now they are seeing periods where all new connections fail immediately with the error: FATAL: lock AccessShareLock on object 0/1260/0 is already held This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has been up for months. It started happening sporadically a few days ago. It will do this for a period of several minutes to an hour and then go back to normal for hours or days. One complete failing session out of several hundred around that time: - 2011-08-09 00:01:04.446 8823 [unknown] [unknown] LOG: connection received: host=op05.xxx port=34067 2011-08-09 00:01:04.446 8823 c77 apps LOG: connection authorized: user=apps database=c77 2011-08-09 00:01:04.449 8823 c77 apps FATAL: lock AccessShareLock on object 0/1260/0 is already held -- This is to add additional information to the original report: For a while this was happening on many different databases in one postgresql 8.4.7 instance on a single large host ('U2' 512GB 64cpu) running RH 5. That has been quiet for several days and the newest batches of errors have happened only on a single database 'c23', in a postgresql 9.0.4 instance on a smaller host ('A', 64GB 8cpu) running SuSE 10.2. No memory errors or other misbehaviour have been seen on either of these hosts in recent months. The original error was: lock AccessShareLock on object 0/1260/0 is already held which is for pg_database. The recent errors are: lock AccessShareLock on object 16403/2615/0 is already held which is for pg_namespace in database c23. All of the original and most of the recent batches of errors came immediately after connecting to a database and being authorized, that is, before any statements were attempted. However, some of the most recent are on the first query statement. That is, after logging in and doing things like set transaction ... the first select would hit this error. 
It seems to come in clusters, sometimes, which suggests something shared by multiple processes. For example, here are the times for the errors on c23 in the afternoon of August 20: 20 07:14:12.722 20 16:05:07.798 20 16:05:07.808 20 16:05:10.519 20 16:07:07.726 20 16:07:08.722 20 16:07:09.734 20 16:07:10.656 20 16:07:25.436 20 16:22:23.983 20 16:22:24.014 20 16:22:24.335 20 16:22:24.409 20 16:22:24.477 20 16:22:24.499 20 16:22:24.516 20 16:30:58.210 20 16:31:15.261 20 16:31:15.296 20 16:31:15.324 20 16:31:15.348 20 18:06:16.515 20 18:06:49.198 20 18:06:49.204 20 18:06:51.444 20 21:03:05.940 So far I've got: - affects system tables - happens very soon after process startup - in 8.4.7 and 9.0.4 - not likely to be hardware or OS related - happens in clusters for periods of a few seconds to many minutes I'll work on printing the LOCK and LOCALLOCK when it happens, but it's hard to get downtime to pick up new builds. Any other ideas on getting to the bottom of this? Thanks -dg -- David Gould da...@sonic.net 510 536 1443 / 510 282 0869 If simplicity worked, the world would be overrun with insects.
Re: [HACKERS] Question: CREATE EXTENSION and create schema permission?
The attached patch adds a permission check for the scenario that I explained below. Unlike CreateSchemaCommand(), we don't have check_is_member_of_role() here because the extowner is obviously the same as the current user in this code path. I hope this patch will also be back-ported to the v9.1 tree, not only to v9.2 development. Thanks, 2011/8/21 Dimitri Fontaine dimi...@2ndquadrant.fr: Kohei KaiGai kai...@kaigai.gr.jp writes: The current implementation sets the current user as owner of the new schema. The default permission check of schema allows the owner to create several kinds of underlying objects. As a result, we may consider a scenario in which a user without permissions to create new objects possibly gets a schema created by CREATE EXTENSION that allows him to create new objects (such as table, function, ...). I don't think it is a desirable behavior. :-( Agreed, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- KaiGai Kohei kai...@kaigai.gr.jp pgsql-create-extension-permission-checks.patch Description: Binary data
Re: [HACKERS] WIP: Fast GiST index build
On Wed, Aug 17, 2011 at 11:11 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Tue, Aug 16, 2011 at 11:15 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 16.08.2011 22:10, Heikki Linnakangas wrote: Here's a version of the patch with a bunch of minor changes: And here it really is, this time with an attachment... Thanks a lot. I'm going to start rerunning the tests now. The first bunch of test results will be available soon (test runs and results processing take some time). Meanwhile, here is a patch with a few small bugfixes. -- With best regards, Alexander Korotkov. gist_fast_build-0.14.2.patch.gz Description: GNU Zip compressed data
[HACKERS] UUID datatype GiST index support
Hi, Hopefully someone can help me and point me in the right direction :) I have been looking for a GiST support extension for the UUID datatype... since I could not find it... I wanted to write it myself. I need it more for an EXCLUSION constraint - than to use a GiST index just on a UUID column... i.e.: CREATE TABLE test_exclude ( id serial NOT NULL, guid uuid NOT NULL, valid_period period NOT NULL, CONSTRAINT test_excludepk PRIMARY KEY (id), EXCLUDE USING gist (guid WITH =, valid_period WITH &&) --for the same guid, period must not overlap... ) I have taken a look at the btree_gist contrib source code... there are GiST support functions for many datatypes, so I wanted to take the same pattern and use it... however, a problem happened in the first line of code :) (though I am coming from a totally different world - .Net) The pattern is: typedef struct { ADTdataType lower; ADTdataType upper; } datatypeKEY; i.e. for Date: typedef struct { DateADT lower; DateADT upper; } dateKEY; So I guessed for uuid it would be: typedef struct { pg_uuid_t lower; pg_uuid_t upper; } uuidKEY; because pg uuid.h says: *In C, we use the name pg_uuid_t, * to avoid conflicts with any uuid_t type that might be defined by the system headers... and there is: /* opaque struct; defined in uuid.c */ typedef struct pg_uuid_t pg_uuid_t; But the compiler shows the error: field lower (and upper) has incomplete type I succeeded in avoiding that error by adding: struct pg_uuid_t { unsigned char data[UUID_LEN]; } but then I get errors in the compare functions: i.e. static int m4_uuidkey_cmp(const void *a, const void *b) { uuidKEY *ia = (uuidKEY *) (((Usrt *) a)->t); uuidKEY *ib = (uuidKEY *) (((Usrt *) b)->t); int res; res = DatumGetInt32(DirectFunctionCall2(uuid_cmp, UUIDPGetDatum(ia->upper), UUIDPGetDatum(ia->upper))); if (res == 0) return DatumGetInt32(DirectFunctionCall2(uuid_cmp, UUIDPGetDatum(ia->upper), UUIDPGetDatum(ib->upper))); return res; } Getting the error: aggregate value used where an integer was expected! 
It would be much appreciated if anyone could help me and suggest the best way to make GiST support for the UUID datatype... Many thanks, Misa
Re: [HACKERS] UUID datatype GiST index support
Hi! On Mon, Aug 22, 2011 at 2:54 PM, Misa Simic misa.si...@gmail.com wrote: static int m4_uuidkey_cmp(const void *a, const void *b) { uuidKEY *ia = (uuidKEY *) (((Usrt *) a)->t); uuidKEY *ib = (uuidKEY *) (((Usrt *) b)->t); int res; res = DatumGetInt32(DirectFunctionCall2(uuid_cmp, UUIDPGetDatum(ia->upper), UUIDPGetDatum(ia->upper))); if (res == 0) return DatumGetInt32(DirectFunctionCall2(uuid_cmp, UUIDPGetDatum(ia->upper), UUIDPGetDatum(ib->upper))); return res; } Getting the error: aggregate value used where an integer was expected! Seems that you need the address-of operator before ia->upper and ia->lower (likely one of the operands should be ia->lower). UUIDPGetDatum expects a pointer as an argument, i.e. UUIDPGetDatum(&ia->upper). It would be much appreciated if anyone could help me and suggest the best way to make GiST support for the UUID datatype... I think you're on the right way. btree_gist is an extension which provides GiST indexing of scalar datatypes. UUID is one of them. So, the module you are writing should be quite similar. -- With best regards, Alexander Korotkov.
Re: [HACKERS] UUID datatype GiST index support
Thanks Alexander, 'Scalar datatype' - that gave me a hint... Looking further in the btree_gist source, at the inet datatype, which seems a bit more complex than uuid... a (char, char, char[16]) structure for inet, compared to just char[16] for uuid. The GiST pattern works with the double datatype... and there is a method convert_network_to_scalar(Datum, Oid), which converts an inet value to a scalar value... and then all index compare functions are based on the doubles, which leads me to the conclusion (maybe it is wrong) that if I can convert a UUID value to a double it would make the job a lot easier... and pretty straightforward... Any suggestion how to convert a UUID (char[16]) to a scalar? Looking into the inet-to-scalar conversion, what confuses me is that even though there is char[16] for an IP address... the code is: if (ip_family(ip) == PGSQL_AF_INET) len = 4; else len = 5; res = ip_family(ip); for (i = 0; i < len; i++) { res *= 256; res += ip_addr(ip)[i]; } return res; It takes just the first 4 or 5 values from the IP address even though it is declared as char[16]. Many thanks, Misa 2011/8/22 Alexander Korotkov aekorot...@gmail.com Hi! On Mon, Aug 22, 2011 at 2:54 PM, Misa Simic misa.si...@gmail.com wrote: static int m4_uuidkey_cmp(const void *a, const void *b) { uuidKEY *ia = (uuidKEY *) (((Usrt *) a)->t); uuidKEY *ib = (uuidKEY *) (((Usrt *) b)->t); int res; res = DatumGetInt32(DirectFunctionCall2(uuid_cmp, UUIDPGetDatum(ia->upper), UUIDPGetDatum(ia->upper))); if (res == 0) return DatumGetInt32(DirectFunctionCall2(uuid_cmp, UUIDPGetDatum(ia->upper), UUIDPGetDatum(ib->upper))); return res; } Getting the error: aggregate value used where an integer was expected! 
Seems that you need the address-of operator before ia->upper and ia->lower (likely one of the operands should be ia->lower). UUIDPGetDatum expects a pointer as an argument, i.e. UUIDPGetDatum(&ia->upper). It would be much appreciated if anyone could help me and suggest the best way to make GiST support for the UUID datatype... I think you're on the right way. btree_gist is an extension which provides GiST indexing of scalar datatypes. UUID is one of them. So, the module you are writing should be quite similar. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Rethinking sinval callback hook API
On Sun, Aug 21, 2011 at 6:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Fri, Aug 19, 2011 at 2:13 PM, Tom Lane t...@sss.pgh.pa.us wrote: 1. Robert was speculating the other day about wanting to be able to snoop the inval traffic. Right now, callbacks can only snoop a fairly small subset of it. Is that true? It appears to me that the events that aren't exposed at all are smgr and relmap invalidations, which don't seem terribly important, and presumably not a majority of the traffic. Well, important is in the eye of the beholder here --- if you did need to see one of those, you're flat outta luck. It's also the case that you can only snoop one catcache per registration, so there aren't enough slots available in the fixed-size list to watch all the catcache traffic. Yeah, I'm not opposed to making this more generic; regardless of whether we have an immediate use case for it, it seems like a pretty good idea. I was just surprised that you described the available portion as a small subset. The one-catcache-per-registration limitation is an interesting point. I doubt that we want to move the is this the relevant catcache? test inside all the callbacks, but we might want to have a special value that means call me back when there's a change that affects ANY catcache... or even more generally call me back when there's a change that affects ANY system catalog, regardless of whether there is an associated catcache or not. sepgsql, for example, really wants to be able to get a callback when pg_seclabel or pg_shseclabel is updated, precisely because it wants to maintain its own special-purpose cache on a catalog on which we DON'T want to add a catcache. Exposing SharedInvalidationMessage could be going too far in the other direction, but I'm thinking we really ought to do something. I think the best option might be to expose it as an opaque struct. 
In other words, the header file available to other backends would have something like: struct SharedInvalidationMessage; typedef struct SharedInvalidationMessage SharedInvalidationMessage; typedef enum { SIM_CATCACHE, SIM_CATALOG, SIM_RELCACHE, SIM_SMGR, SIM_RELMAP } SIMType; SIMType SIMGetType(SharedInvalidationMessage *); Oid SIMGetDatabase(SharedInvalidationMessage *); BackendId SIMGetBackendId(SharedInvalidationMessage *); /* etc. */ That allows us to do things like change the number of bits we use to store the backend ID (e.g. from the current 24 to 32 or 16) without needing to change the callers. In fact, you could probably even add whole new message types and most callers wouldn't need to care, since the typical caller is going to do something like ... if (SIMGetType(msg) != SIM_SOMETHING) return right off the bat. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PushActiveSnapshot(GetTransactionSnapshot())
Excerpts from Simon Riggs's message of dom ago 21 16:23:39 -0300 2011: In common cases of snapshot use we run GetSnapshotData() into a statically allocated snapshot, then immediately copy the static struct into a dynamically allocated copy. The static allocation was designed to remove the overhead of dynamic allocation, but then we do it anyway. The snapmgr code does this explicitly, but the reason isn't documented, it just says we must do this. IIRC the active snapshot is scribbled onto by some operations, which is why the copy is mandatory. Maybe there's some way to optimize things so that the copy is done only when necessary. IIRC the copying of the ActiveSnapshot was only introduced because some subtle bugs were detected in the code without copy. When I introduced the mandatory copy, I don't remember thinking about the statically allocated struct. The fact that PushActiveSnapshot and GetTransactionSnapshot are in two completely separate modules complicates optimization. Note I'm not saying it's impossible -- I just didn't look into it. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] 9.1rc1: TRAP: FailedAssertion("!(item_width > 0)", File: "costsize.c", Line: 3274)
I am developing against 9.1beta3 and encountered this problem. I repeated it against rc1 (or actually, REL9_1_STABLE taken from git today). The below SQL causes: TRAP: FailedAssertion("!(item_width > 0)", File: "costsize.c", Line: 3274) and the session crashes. select * from ( select col1 , col2 , reviewed from (values ( , 'aweyu', 3) , ( , 'bwetu', 2) , ( , 'vwxyz', 1) ) as r(col1,col2,col3) left join ( select 1 as reviewed, c2 from ( values ('abcde') ) as t1(c2) union all select 0 as reviewed, c2 from ( values ('vwxyz') ) as t2(c2) ) as up_all on up_all.c2 = r.col2 order by col3 ) as f ; To simplify I used 'values' for the actual tables that I am using, but the above still causes the FailedAssertion. Removing the outer select avoids the error, and removing the order by also avoids the error. Erik Rijkers
Re: [HACKERS] Single pass vacuum - take 2
On Aug 22, 2011, at 1:22 AM, Pavan Deolasee wrote: Hi All, Here is a revised patch based on our earlier discussion. I implemented Robert's idea of tracking the vacuum generation number in the line pointer itself. For LP_DEAD line pointers, the lp_off/lp_len is unused (and always set to 0 for heap tuples). We use those 30 bits to store the generation number of the vacuum which would have potentially removed the corresponding index pointers, if the vacuum finished successfully. The pg_class information is used to know the status of the vacuum, whether it failed or succeeded. 30-bit numbers are large enough that we can ignore any wrap-around related issues. With this +* Note: We don't worry about the wrap-around issues here since it would +* take a 1 Billion vacuums on the same relation for the vacuum generation +* to wrap-around. That would take ages to happen and even if it happens, +* the chances that we might have dead-vacuumed line pointers still +* stamped with the old (failed) vacuum are infinitely small since some +* other vacuum cycle would have taken care of them. It would be good if some comment explained how we're safe in the case of an aborted vacuum. I'm guessing that when vacuum finds any line pointers that don't match the last successful vacuum exactly it will go and re-examine them from scratch? I'm thinking that there should be a single comment somewhere that explains exactly how the 2-pass algorithm works. The comment in vacuum_log_cleanup_info seems to have the most info, but there's a few pieces still missing. Also, found a typo: + * pass anyways). But this gives us two lareg benefits: -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] cheaper snapshots redux
On Wed, Jul 27, 2011 at 10:51 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane t...@sss.pgh.pa.us wrote: I wonder whether we could do something involving WAL properties --- the current tuple visibility logic was designed before WAL existed, so it's not exploiting that resource at all. I'm imagining that the kernel of a snapshot is just a WAL position, ie the end of WAL as of the time you take the snapshot (easy to get in O(1) time). Visibility tests then reduce to did this transaction commit with a WAL record located before the specified position?. You'd need some index datastructure that made it reasonably cheap to find out the commit locations of recently committed transactions, where recent means back to recentGlobalXmin. That seems possibly do-able, though I don't have a concrete design in mind. [discussion of why I don't think an LSN will work] But having said that an LSN can't work, I don't see why we can't just use a 64-bit counter. In fact, the predicate locking code already does something much like this, using an SLRU, for serializable transactions only. In more detail, what I'm imagining is an array with 4 billion entries, one per XID, probably broken up into files of say 16MB each with 2 million entries per file. Each entry is a 64-bit value. It is 0 if the XID has not yet started, is still running, or has aborted. Otherwise, it is the commit sequence number of the transaction. I've been giving this quite a bit more thought, and have decided to abandon the scheme described above, at least for now. It has the advantage of avoiding virtually all locking, but it's extremely inefficient in its use of memory in the presence of long-running transactions. For example, if there's an open transaction that's been sitting around for 10 million transactions or so and has an XID assigned, any new snapshot is going to need to probe into the big array for any XID in that range. 
At 8 bytes per entry, that means we're randomly accessing about 80MB of memory-mapped data. That seems problematic both in terms of blowing out the cache and (on small machines) possibly even blowing out RAM. Nor is that the worst case scenario: a transaction could sit open for 100 million transactions. Heikki has made the suggestion a few times (and a few other people have since made somewhat similar suggestions in different words) of keeping an up-to-date snapshot in shared memory such that transactions that need a snapshot can simply copy it. I've since noted that in Hot Standby mode, that's more or less what the KnownAssignedXids stuff already does. I objected that, first, the overhead of updating the snapshot for every commit would be too great, and second, it didn't seem to do a whole lot to reduce the size of the critical section, and therefore probably wouldn't improve performance that much. But I'm coming around to the view that these might be solvable problems rather than reasons to give up on the idea altogether. With respect to the first problem, what I'm imagining is that we not do a complete rewrite of the snapshot in shared memory on every commit. Instead, when a transaction ends, we'll decide whether to (a) write a new snapshot or (b) just record the XIDs that ended. If we do (b), then any backend that wants a snapshot will need to copy from shared memory both the most recently written snapshot and the XIDs that have subsequently ended. From there, it can figure out which XIDs are still running. Of course, if the list of recently-ended XIDs gets too long, then taking a snapshot will start to get expensive, so we'll need to periodically do (a) instead. There are other ways that this could be done as well; for example, the KnownAssignedXids stuff just flags XIDs that should be ignored and then periodically compacts away the ignored entries. I think the real trick is figuring out a design that can improve concurrency. 
If you keep a snapshot in shared memory and periodically overwrite it in place, I don't think you're going to gain much. Everyone who wants a snapshot still needs a share-lock and everyone who wants to commit still needs an exclusive-lock, and while you might be able to make the critical section a bit shorter, I think it's still going to be hard to make big gains that way. What I'm thinking about instead is using a ring buffer with three pointers: a start pointer, a stop pointer, and a write pointer. When a transaction ends, we advance the write pointer, write the XIDs or a whole new snapshot into the buffer, and then advance the stop pointer. If we wrote a whole new snapshot, we advance the start pointer to the beginning of the data we just wrote. Someone who wants to take a snapshot must read the data between the start and stop pointers, and must then check that the write pointer hasn't advanced so far in the meantime that the data they read might have been overwritten before they finished reading it. Obviously,
Re: [HACKERS] How to define global variable in postgresql
On Aug 19, 2011, at 4:15 PM, Valentine Gogichashvili wrote: Hello. How can we define a global variable in postgresql? you can also use global structure in plpython for example: http://www.postgresql.org/docs/9.0/static/plpython-sharing.html Same thing with plperl. BTW, if you want something that's global but also transactional then you'll want to use a temp table. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Displaying accumulated autovacuum cost
On Aug 18, 2011, at 10:41 AM, Greg Smith wrote: that was all they got. I'm going to add directly computing the write MB/s figure from the dirty data written too, since that ends up being the thing that I keep deriving by hand anyway. I know folks have talked about progress, but I haven't seen anything specific... could you add info about what table/index vacuum is working on, and how far along it is? I realize that's not very close to an actual % completion, but it's far better than what we have right now. FWIW, the number I end up caring about isn't so much write traffic as read. Thanks to a good amount of battery-backed write cache (and possibly some iSCSI misconfiguration), our writes are generally much cheaper than our reads. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] 9.1rc1: TRAP: FailedAssertion("!(item_width > 0)", File: "costsize.c", Line: 3274)
On Mon, Aug 22, 2011 at 4:16 PM, Erik Rijkers e...@xs4all.nl wrote: I am developing against 9.1beta3 and encountered this problem. I repeated it against rc1 (or actually, REL9_1_STABLE taken from git today). The below SQL causes: TRAP: FailedAssertion("!(item_width > 0)", File: "costsize.c", Line: 3274) and the session crashes. select * from ( select col1 , col2 , reviewed from (values ( , 'aweyu', 3) , ( , 'bwetu', 2) , ( , 'vwxyz', 1) ) as r(col1,col2,col3) left join ( select 1 as reviewed, c2 from ( values ('abcde') ) as t1(c2) union all select 0 as reviewed, c2 from ( values ('vwxyz') ) as t2(c2) ) as up_all on up_all.c2 = r.col2 order by col3 ) as f ; To simplify I used 'values' for the actual tables that I am using, but the above still causes the FailedAssertion. Removing the outer select avoids the error, and removing the order by also avoids the error. I can reproduce this on master but not on 9.0. I suspect the problem was introduced by this commit: commit 0f61d4dd1b4f95832dcd81c9688dac56fd6b5687 Author: Tom Lane t...@sss.pgh.pa.us Date: Fri Nov 19 17:31:50 2010 -0500 Improve relation width estimation for subqueries. ...but I haven't confirmed that yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] cheaper snapshots redux
On Aug 22, 2011, at 4:25 PM, Robert Haas wrote: What I'm thinking about instead is using a ring buffer with three pointers: a start pointer, a stop pointer, and a write pointer. When a transaction ends, we advance the write pointer, write the XIDs or a whole new snapshot into the buffer, and then advance the stop pointer. If we wrote a whole new snapshot, we advance the start pointer to the beginning of the data we just wrote. Someone who wants to take a snapshot must read the data between the start and stop pointers, and must then check that the write pointer hasn't advanced so far in the meantime that the data they read might have been overwritten before they finished reading it. Obviously, that's a little risky, since we'll have to do the whole thing over if a wraparound occurs, but if the ring buffer is large enough it shouldn't happen very often. And a typical snapshot is pretty small unless massive numbers of subxids are in use, so it seems like it might not be too bad. Of course, it's pretty hard to know for sure without coding it up and testing it. Something that would be really nice to fix is our reliance on a fixed size of shared memory, and I'm wondering if this could be an opportunity to start in a new direction. My thought is that we could maintain two distinct shared memory snapshots and alternate between them. That would allow us to actually resize them as needed. We would still need something like what you suggest to allow for adding to the list without locking, but with this scheme we wouldn't need to worry about extra locking when taking a snapshot since we'd be doing that in a new segment that no one is using yet. The downside is such a scheme does add non-trivial complexity on top of what you proposed. I suspect it would be much better if we had a separate mechanism for dealing with shared memory requirements (shalloc?). 
But if it's just not practical to make a generic shared memory manager, it would be good to start thinking about ways we can work around fixed shared memory size issues.

--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell)   http://jim.nasby.net
Re: [HACKERS] cheaper snapshots redux
On Mon, Aug 22, 2011 at 6:45 PM, Jim Nasby j...@nasby.net wrote:

> Something that would be really nice to fix is our reliance on a fixed size of shared memory, and I'm wondering if this could be an opportunity to start in a new direction. My thought is that we could maintain two distinct shared memory snapshots and alternate between them. That would allow us to actually resize them as needed. We would still need something like what you suggest to allow for adding to the list without locking, but with this scheme we wouldn't need to worry about extra locking when taking a snapshot since we'd be doing that in a new segment that no one is using yet. The downside is such a scheme does add non-trivial complexity on top of what you proposed. I suspect it would be much better if we had a separate mechanism for dealing with shared memory requirements (shalloc?). But if it's just not practical to make a generic shared memory manager it would be good to start thinking about ways we can work around fixed shared memory size issues.

Well, the system I'm proposing is actually BETTER than having two distinct shared memory snapshots. For example, right now we cache up to 64 subxids per backend. I'm imagining that going away and using that memory for the ring buffer. Out of the box, that would imply a ring buffer of 64 * 103 = 6592 slots. If the average snapshot lists 100 XIDs, you could rewrite the snapshot dozens of times before the buffer wraps around, which is obviously a lot more than two. Even if subtransactions are being heavily used and each snapshot lists 1000 XIDs, you still have enough space to rewrite the snapshot several times over before wraparound occurs.

Of course, at some point the snapshot gets too big and you have to switch to retaining only the toplevel XIDs, which is more or less the equivalent of what happens under the current implementation when any single transaction's subxid cache overflows.
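For the sake of discussion, the three-pointer scheme described above can be sketched in C. This is only a single-threaded toy, not the proposed implementation: the names (snap_ring, ring_read_snapshot, and so on) are invented, the three pointers are modeled as monotonically increasing 64-bit counters indexed modulo the slot count, and all of the memory-barrier and concurrency details a real version would need are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* 64 subxid slots per backend times roughly 103 backends, per the sizing
 * arithmetic above.  Name and layout are invented for illustration. */
#define RING_SLOTS 6592

typedef struct
{
    uint64_t start;             /* first slot of the latest full snapshot */
    uint64_t stop;              /* end of valid data; read [start, stop) */
    uint64_t write;             /* bumped by writers before copying data */
    uint32_t slots[RING_SLOTS];
} snap_ring;

/*
 * Copy the current snapshot into xids[].  Returns false if the write
 * pointer advanced far enough that the region we copied may have been
 * overwritten mid-read, in which case the caller must retry.
 */
static bool
ring_read_snapshot(snap_ring *ring, uint32_t *xids, int *nxids)
{
    uint64_t    start = ring->start;
    uint64_t    stop = ring->stop;
    uint64_t    i;

    for (i = start; i < stop; i++)
        xids[i - start] = ring->slots[i % RING_SLOTS];

    /* wraparound check: has a writer lapped the region we just read? */
    if (ring->write > start + RING_SLOTS)
        return false;

    *nxids = (int) (stop - start);
    return true;
}

/* Writer: advance write, copy the new snapshot in, then publish it by
 * advancing stop (and start, since this is a whole new snapshot). */
static void
ring_write_snapshot(snap_ring *ring, const uint32_t *xids, int nxids)
{
    uint64_t    base = ring->write;
    int         i;

    ring->write = base + nxids;
    for (i = 0; i < nxids; i++)
        ring->slots[(base + i) % RING_SLOTS] = xids[i];
    ring->start = base;
    ring->stop = base + nxids;
}
```

The retry-on-wraparound check is the essential point: readers never block writers, they just detect after the fact that their copy might be stale and loop.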
With respect to a general-purpose shared memory allocator, I think that there are cases where that would be useful to have, but I don't think there are as many of them as many people seem to think. I wouldn't choose to implement this using a general-purpose allocator even if we had it, both because it's undesirable to allow this or any subsystem to consume an arbitrary amount of memory (nor can it fail... especially in the abort path) and because a ring buffer is almost certainly faster than a general-purpose allocator. We have enough trouble with palloc overhead already.

That having been said, I do think there are cases where it would be nice to have... and it wouldn't surprise me if I end up working on something along those lines in the next year or so. It turns out that memory management is a major issue in lock-free programming; you can't assume that it's safe to recycle an object once the last pointer to it has been removed from shared memory, because someone may have fetched the pointer just before you removed it and still be using it to examine the object. An allocator with some built-in capabilities for handling such problems seems like it might be very useful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
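The recycling hazard described at the end of that mail can be made concrete with a minimal hazard-pointer-style sketch. This is a standard technique from the lock-free literature, not anything in PostgreSQL; all names are invented, and this toy handles only a single reader. The reader publishes the pointer it is about to dereference, and the writer refuses to free a node that is still published.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct node { int value; } node;

static _Atomic(node *) shared;      /* the published shared pointer */
static _Atomic(node *) hazard;      /* the reader's "in use" marker */

/* Reader: publish the pointer as hazardous *before* dereferencing it,
 * and re-check that it is still the current one (otherwise a writer
 * may already have unlinked and freed it). */
static int
reader_peek(void)
{
    node   *p;
    int     v;

    do {
        p = atomic_load(&shared);
        atomic_store(&hazard, p);
    } while (atomic_load(&shared) != p);    /* changed under us: retry */

    v = p->value;
    atomic_store(&hazard, NULL);            /* done with the object */
    return v;
}

/* Writer: unlink the old node, but free it only if no reader has it
 * marked hazardous; otherwise hand it back for deferred reclamation.
 * Returns true if the old node could be freed immediately. */
static bool
writer_replace(node *newnode, node **deferred)
{
    node   *old = atomic_exchange(&shared, newnode);

    if (atomic_load(&hazard) == old)
    {
        *deferred = old;                    /* still in use: don't free */
        return false;
    }
    free(old);
    return true;
}
```

The "deferred" path is exactly where a smarter allocator could help: something has to hold unlinked-but-still-visible objects until every reader has moved on.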
Re: [HACKERS] SSL-mode error reporting in libpq
On Sun, Jul 24, 2011 at 11:48 AM, Tom Lane t...@sss.pgh.pa.us wrote:

> We could perhaps go a bit further and make pqsecure_write responsible for the error message in non-SSL mode too, but it looks to me like pqSendSome has to have a switch on the errno anyway to decide whether to keep trying or not, so moving that responsibility would just lead to duplicative coding. Any objections?

No objection. Some users of ours have been confused by this message. At first I thought it was an SSL renegotiation problem, but now I realize that the message is a lie, so anything can be the problem. A better message may well decrease the support burden for us, and also help us find problems...

--
fdr
Re: [HACKERS] 9.1rc1: TRAP: FailedAssertion(!(item_width > 0), File: costsize.c, Line: 3274)
Robert Haas robertmh...@gmail.com writes:

> On Mon, Aug 22, 2011 at 4:16 PM, Erik Rijkers e...@xs4all.nl wrote:
>> The below SQL causes: TRAP: FailedAssertion(!(item_width > 0), File: costsize.c, Line: 3274)
> I can reproduce this on master but not on 9.0. I suspect the problem was introduced by this commit: commit 0f61d4dd1b4f95832dcd81c9688dac56fd6b5687

Well, that's a pretty safe bet considering the Assert is in code that didn't exist before that commit ;-). The question is why the per-column width estimate hasn't been set for that column. Looking at the coredump, I see that the width *did* get set for the subquery's other output column, which makes it even weirder. Too tired to dig further right now though.

regards, tom lane
Re: [HACKERS] Displaying accumulated autovacuum cost
On 08/22/2011 05:54 PM, Jim Nasby wrote:

> I know folks have talked about progress, but I haven't seen anything specific... could you add info about what table/index vacuum is working on, and how far along it is? I realize that's not very close to an actual % completion, but it's far better than what we have right now.

Due to complaints about the mechanism the first version used to inform the user of the progress, I'm yanking that from the next patch altogether. The goal for now is to get a good report into the logs, and then maybe that gets extended later with a progress report. (All of the proposed alternate mechanisms are way more complicated than anything I have time to do right now.)

> FWIW, the number I end up caring about isn't so much write traffic as read. Thanks to a good amount of battery-backed write cache (and possibly some iSCSI misconfiguration), our writes are generally much cheaper than our reads.

VACUUM can't really know its true read rate from what's inside the database. I can add a summary of the accumulated read amounts into the logs, in more useful figures than what is provided so far, which is better than nothing. But those will be kind of deceptive, which is one reason I wasn't so focused on them yet. If the relation is largely in the OS cache, but not the PostgreSQL one, the summary can show a read rate even when that isn't actually doing any reads at all. That was exactly the case in the sample data I posted: VACUUM thought it was reading anywhere from 2.5 to 6MB/s, but at the OS level it was actually reading zero bytes, since the whole thing was in cache already.

What you actually want is a count of the accumulated read counters at the OS level. I've recently figured out how to track those, too, but that code is something that lives outside the database. If this is something useful to you, I think you're about to sign up to be my next beta tester for that program.
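For what it's worth, on Linux the OS-level read counters alluded to above are exposed per process in /proc/[pid]/io. The sketch below is my own illustration of reading the read_bytes field (the counter of bytes actually fetched from the storage layer, which OS-cache hits don't inflate), not Greg's actual tool; it is Linux-specific and returns -1 if the field isn't available.

```c
#include <stdio.h>

/*
 * Return this process's cumulative read_bytes from /proc/self/io,
 * or -1 if the file or field is unavailable (non-Linux kernels, or
 * kernels built without task I/O accounting).
 */
static long long
proc_read_bytes(void)
{
    FILE       *f = fopen("/proc/self/io", "r");
    char        line[128];
    long long   val = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL)
    {
        /* the field of interest looks like "read_bytes: 12345678" */
        if (sscanf(line, "read_bytes: %lld", &val) == 1)
            break;
    }
    fclose(f);
    return val;
}
```

Sampling this counter before and after a VACUUM (from a wrapper process, or via /proc/[pid]/io for the backend's pid) gives the "true" read volume that the in-database buffer-access counts can't.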
--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.us