Re: [HACKERS] small patch to crypt.c
* Joshua D. Drake (j...@commandprompt.com) wrote:
> Well I was more referring to the default, which is:
>   check if null, if true return ok
>   check if valuntil < today, if true return error
>   else return ok
> To me we don't need the null check. However, when I tested it, without
> the null check you can't log in. So now I am curious about what is going on.

Erm, but what is valuntil set to when it's null? I'd expect it to be zero, because it hasn't been changed since:

  TimestampTz vuntil = 0;

Using your pseudo-code, you end up with:

  check if 0 < today, if true return error
  else return ok

which is why you always get an error once you remove the explicit isnull check. Looking at it too quickly, I had assumed that the test was inverted and that your patch worked most of the time but didn't account for GetCurrentTimestamp() going negative.

Regardless, setting vuntil to some magic value that really means "it's actually NULL", which is what you'd need to do in order to get rid of that explicit null check, doesn't strike me as a good idea. When a value is null, we shouldn't be looking at the data at all.

Thanks,

Stephen
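[Editorial note: the control flow under discussion can be shown in a small standalone C sketch. These functions are simplified stand-ins written for illustration, not the actual crypt.c code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampTz;

/*
 * Mirrors the structure of the check discussed above: the isnull flag must
 * be consulted before vuntil is compared, because vuntil stays 0 when the
 * column is NULL.
 */
static bool
check_valuntil(bool isnull, TimestampTz vuntil, TimestampTz now)
{
    if (isnull)
        return true;            /* no expiry set: OK */
    if (vuntil < now)
        return false;           /* account expired */
    return true;
}

/*
 * The broken variant from the thread: dropping the null check means a NULL
 * valuntil (left at vuntil == 0) always compares below "now", so every
 * login fails.
 */
static bool
check_valuntil_broken(TimestampTz vuntil, TimestampTz now)
{
    return !(vuntil < now);
}
```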
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/09/2013 08:32 AM, MauMau wrote:
> - Failure of a disk containing data directory or tablespace
> If checkpoint can't write buffers to disk because of disk failure,
> checkpoint cannot complete, thus WAL files accumulate in pg_xlog/.
> This means that one disk failure will lead to postgres shutdown.

... which is why tablespaces aren't disposable, and why creating a tablespace on a RAM disk is such an awful idea.

I'd rather like to be able to recover from this by treating the tablespace as dead, so that any attempt to get a lock on any table within it fails with an error, and already-in-WAL writes to it just get discarded. It's the sort of thing that'd only be reasonable to do as a recovery option (like zero_damaged_pages), since if applied by default it'd lead to potentially severe and unexpected data loss.

I've seen a couple of people bitten by the misunderstanding that tablespaces are a way to split up your data based on different reliability requirements, and I really need to write a docs patch for

http://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html
http://www.postgresql.org/docs/9.2/static/manage-ag-tablespaces.html

that adds a prominent warning like:

  WARNING: Every tablespace must be present before the database can be
  started. There is no easy way to recover the database if a tablespace is
  lost to disk failure, deletion, use of volatile storage, etc.
  *Do not put a tablespace on a RAM disk*; instead just use UNLOGGED tables.

(Opinions on the above?)

-- 
Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/09/2013 03:02 AM, Jeff Janes wrote:
> It would be nice to have the ability to specify multiple log destinations
> with different log_min_messages for each one. I'm sure syslog already
> must implement some kind of method for doing that, but I've been happy
> enough with the text logs that I've never bothered to look into it much.

Smarter syslog flavours like rsyslog certainly do this.

No alert system triggered by events within Pg will ever be fully sufficient. "Oops, the postmaster crashed with stack corruption; I'll just exec whatever's in this on_panic_exec GUC (if I can still read it and it's still valid) to hopefully tell the sysadmin about my death." Hmm, sounds as reliable and safe as a bicycle powered by a home-made rocket engine. External monitoring is IMO always necessary.

Something like Icinga with check_postgres can trivially poke Pg to make sure it's alive. It can also efficiently check the 'pg_error.log' from rsyslog that contains only severe errors, and raise alerts if it doesn't like what it sees.

If I'm already doing external monitoring (which is necessary, as outlined above) then I see much less point in having Pg able to raise alerts for problems, and I'm more interested in better built-in functions and views for exposing Pg's state: easier monitoring of WAL build-up, ways to slow the master if async replicas or archiving are getting too far behind, etc.

-- 
Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bad error message on valuntil
On 06/08/2013 04:07 AM, Joshua D. Drake wrote:
> FATAL: Authentication failed: Check server log for specifics
> And then we make sure we log proper info?

  FATAL: Authentication using method 'password' failed; possible reasons
  are no/wrong password sent, account expired; see server log for details

? We can tell them what they would already know (they tried the 'password' method) and a little about what might be wrong, as well as where to go for more info.

-- 
Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Proposed patch: remove hard-coded limit MAX_ALLOCATED_DESCS
On Saturday 8 June 2013 23:27:16, Tom Lane wrote:
> Cédric Villemain ced...@2ndquadrant.com writes:
> > I'm not sure of the expected value of max_safe_fds. Your patch now
> > initializes with 5 slots instead of 10; if max_safe_fds is large, maybe
> > it is better to double the size each time we need more, instead of
> > jumping directly to the largest size?
>
> I don't see the point particularly. At the default value of
> max_files_per_process (1000), the patch can allocate at most 500 array
> elements, which on a 64-bit machine would probably be 16 bytes each, or
> 8KB total.

The point was only whether the original comment was still relevant. It seems it is just not accurate anymore. With the patch I can union 492 CSV tables, instead of 32 with beta1.

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation
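[Editorial note: the two growth strategies under discussion can be sketched in standalone C. The function names are invented for illustration and are not fd.c's actual code; the 8KB figure follows Tom's arithmetic of at most max_files_per_process/2 elements of about 16 bytes each.]

```c
#include <assert.h>
#include <stddef.h>

/* Doubling growth, clamped to the cap derived from max_safe_fds. */
static int
next_size_doubling(int cur, int cap)
{
    int next = cur * 2;

    return (next > cap) ? cap : next;
}

/*
 * Worst-case memory for the one-shot approach: at the default
 * max_files_per_process of 1000, at most 500 elements of ~16 bytes each,
 * i.e. about 8KB, which is why a single jump to the largest size is cheap.
 */
static size_t
worst_case_bytes(int max_files_per_process, size_t elem_size)
{
    return (size_t) (max_files_per_process / 2) * elem_size;
}
```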
Re: [HACKERS] pgbench: introduce a new automatic variable 'client_number'
On 06.06.2013 06:53, Gurjeet Singh wrote:
> Please find attached a patch for pgbench that introduces a new
> auto-variable 'client_number'. Following in the footsteps of the 'scale'
> auto-variable, this is not declared if the user has specified this
> variable using the -D switch. Since 'clientid' is a very common name a
> user can use for their own script's variable, I chose to call this
> auto-variable client_number, just to avoid conflicts.

Hmm, I'm not sure I care too much about that, to be honest. We have 'scale' as an auto-variable as well, which is also a common word. Also, if there's an existing script out there that does \set client_id ..., it will override the automatic value and work as it used to. Another reason to name it client_id is that in the pgbench -l log format, the documentation calls the first column client_id. It makes sense to call the auto-variable the same.

I think you forgot to compile with the patch, because there's a semicolon missing ;-). I moved the code around a bit, setting the variable next to where :scale is set; that's more readable. In the docs, I split the descriptions of :scale and :client_id into a table.

I'll commit the attached as soon as the tree opens for 9.4 development.

- Heikki

commit 85ebed80396313f0b7f943a228421f75c68db2ab
Author: Heikki Linnakangas heikki.linnakan...@iki.fi
Date:   Sun Jun 9 11:33:16 2013 +0300

    Add :client_id automatic variable for custom pgbench scripts.

    This makes it easier to write custom scripts that have different logic
    for each client.

    Gurjeet Singh, with some changes by me.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 8c202bf..1303217 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -2544,6 +2544,20 @@ main(int argc, char **argv)
 		}
 	}
 
+	/*
+	 * Define a :client_id variable that is unique per connection. But don't
+	 * override an explicit -D switch.
+	 */
+	if (getVariable(&state[0], "client_id") == NULL)
+	{
+		for (i = 0; i < nclients; i++)
+		{
+			snprintf(val, sizeof(val), "%d", i);
+			if (!putVariable(&state[i], "startup", "client_id", val))
+				exit(1);
+		}
+	}
+
 	if (!is_no_vacuum)
 	{
 		fprintf(stderr, "starting vacuum...");
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index e9900d3..8775606 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -600,13 +600,39 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
    Variables can be set by the command-line <option>-D</> option,
    explained above, or by the meta commands explained below.
    In addition to any variables preset by <option>-D</> command-line options,
-   the variable <literal>scale</> is preset to the current scale factor.
+   there are a few variables that are preset automatically, listed in
+   <xref linkend="pgbench-automatic-variables">. A value specified for these
+   variables using <option>-D</> takes precedence over the automatic presets.
    Once set, a variable's value can be inserted into a SQL command by
    writing <literal>:</><replaceable>variablename</>.  When running more
    than one client session, each session has its own set of variables.
   </para>
 
+   <table id="pgbench-automatic-variables">
+    <title>Automatic variables</title>
+    <tgroup cols="2">
+     <thead>
+      <row>
+       <entry>Variable</entry>
+       <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry> <literal>scale</literal> </entry>
+       <entry>current scale factor</entry>
+      </row>
+
+      <row>
+       <entry> <literal>client_id</literal> </entry>
+       <entry>unique number identifying the client session (starts from zero)</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
   <para>
    Script file meta commands begin with a backslash (<literal>\</>).
    Arguments to a meta command are separated by white space.
Re: [HACKERS] Batch API for After Triggers
On 8 June 2013 22:25, Kevin Grittner kgri...@ymail.com wrote:
> Simon Riggs si...@2ndquadrant.com wrote:
> > Comments please.
>
> How much of this problem space do you think could be addressed by
> providing OLD and NEW *relations* to AFTER EACH STATEMENT triggers?

It's a reasonable question, because those two things sound a little like they might be related.

Providing the proposed additional info costs almost nothing, since the work to calculate that info is already performed. I've written this patch since it was trivial to do so, while inspecting the code to see if it was possible. As it now turns out, I'll be putting most effort into the WHEN clause approach for FKs, but there's no reason why others like Slony or pgmemcache wouldn't benefit here also; hence posting the patch. The proposed API changes don't conflict in any way with the feature you propose.

Providing the whole OLD and NEW sets as relations to a trigger would require significant resources, and wouldn't be done for performance reasons AFAICS.

There are also difficulties in semantics: when we have OLD and NEW at row level, we know we are discussing the same row. With sets of OLD and NEW we'd need to be able to link the relations back together somehow, which couldn't be done by PK, since that could change. So we'd need to invent some semantics for a linking identifier of some description. Once we've done that, it would be used by people to join the relations back together again, which is what we already had in the first place. So my take is that it sounds expressive, but definitely not performant.

Since my objective is performance, not standards, I don't see a reason to work on that yet. I might have time to work on it later; let's see.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Batch API for After Triggers
On 9 June 2013 05:08, Stephen Frost sfr...@snowman.net wrote:
> * Simon Riggs (si...@2ndquadrant.com) wrote:
> > While fiddling with FK tuning, Noah suggested batching trigger
> > executions together to avoid execution overhead.
>
> I like the general idea, but I'd prefer a way to avoid having to queue up
> tons of trigger events to be executed in the first place.

There's already a thread on that exact topic, for FKs, which is what spawned this thread.

> Aggregates do this by providing a way to store up information to be
> processed by an eventual 'final' function.

As I mentioned in my post, I did consider that and then chose not to do it. Having a final func is a major modification in the way that we specify trigger functions. We'd also need to cope with recursive trigger execution, which would mean the final func could get called many times, so there's no way of knowing whether a given call of the final func is actually the last call needed. That sounded complex and confusing to me. The proposed API allows you to do exactly that anyway, more easily, by just waiting until tg_event_num == tg_tot_num_events.

> Another option, as Kevin asked about, would be statement level triggers
> which are able to see both the OLD and the NEW versions of the relation.
> The per-row trigger option with a way to be called immediately ...

... it already exists, and is known as the WHEN clause. This is the mechanism I expect to use to tune FKs.

> ... and then store what it cares about for a later 'final' function
> strikes me as very generalized and able to do things that the
> statement-level option couldn't, but I'm not sure if there's a use-case
> it could solve which the OLD/NEW statement trigger capability couldn't.

I think the two things are quite different, as I explained in a separate post to Kevin.
-- 
Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] how to find out whether a view is updatable
On 6 June 2013 08:09, Dean Rasheed dean.a.rash...@gmail.com wrote:
> On 5 June 2013 08:59, Dean Rasheed dean.a.rash...@gmail.com wrote:
> > I'm still not happy with pg_view_is_updatable() et al. and the
> > information_schema views. I accept that the information_schema views
> > have to be the way they are because that's what's defined in the
> > standard, but as it stands, the distinction between updatable and
> > trigger-updatable makes it impossible in general to answer the simple
> > question "does foo support UPDATEs?".
> > I'm thinking what we really need is a single function with a slightly
> > different signature, that can be used to support both the information
> > schema views and psql's \d+ (and potentially other client apps).
> > Perhaps something like:
> >
> >   pg_relation_is_updatable(include_triggers boolean) returns int
>
> OK, here's what it looks like using this approach:
>
>   FUNCTION pg_relation_is_updatable(reloid oid,
>                                     include_triggers boolean)
>     RETURNS integer
>   FUNCTION pg_column_is_updatable(reloid oid, attnum integer,
>                                   include_triggers boolean)
>     RETURNS boolean
>
> These replace pg_view_is_updatable() and pg_view_is_insertable(). I think
> I definitely prefer this over the old API, because it gives much greater
> flexibility. The information schema views all pass include_triggers =
> false, for compatibility with the standard. The return value from
> pg_relation_is_updatable() is now an integer bitmask reflecting whether
> or not the relation is insertable, updatable and/or deletable. psql and
> other clients can more usefully pass include_triggers = true to
> determine whether a relation actually supports INSERT, UPDATE and
> DELETE, including checks for INSTEAD OF triggers on the specified
> relation or any underlying base relations.
>
> I thought about having pg_relation_is_updatable() return text, like the
> GRANT support functions, but I thought that would make the information
> schema views harder to write: they use a single call to check for
> updatable+deletable, and integer bit operations make that easy.
>
> There is a backwards-incompatible change to the information schema,
> reflected in the regression tests: if a view is updatable but not
> deletable, the relevant rows in information_schema.columns now say 'YES';
> the columns are updatable, even though the relation as a whole isn't.
>
> I've initially defined matching FDW callback functions:
>
>   int IsForeignRelUpdatable (Oid foreigntableid, bool include_triggers);
>   bool IsForeignColUpdatable (Oid foreigntableid, int attnum,
>                               bool include_triggers);
>
> but I'm now having second thoughts about whether we should bother passing
> include_triggers to the FDW. If we regard the foreign table as a black
> box, we only care about whether it is updatable, not *how* that update is
> performed.

Here's a more complete patch along those lines. It defines the following pair of functions to test for updatability from SQL:

  FUNCTION pg_catalog.pg_relation_is_updatable(reloid oid,
                                               include_triggers boolean)
    RETURNS integer
  FUNCTION pg_catalog.pg_column_is_updatable(reloid oid, attnum smallint,
                                             include_triggers boolean)
    RETURNS boolean

and the following FDW functions:

  int IsForeignRelUpdatable (Oid foreigntableid);
  bool IsForeignColUpdatable (Oid foreigntableid, AttrNumber attnum);

As an initial implementation of this API in postgres_fdw, I've added a new option "updatable" (true by default), which can be specified as a server option or as a per-table option, to give the user control over whether individual foreign tables are read-only or updatable. I called it "updatable" rather than "writable" or "read-only" because it might perhaps be extended in the future with separate options for "insertable" and "deletable". It could also be extended to give column-level control over updatability, or something like "use_remote_updatability" could be added, but that all feels like 9.4 material.
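[Editorial note: the bitmask return value makes checks like the information schema's "updatable and deletable" test a single bit operation. Here is a standalone C sketch; the enum names and bit positions are invented for the example and are not necessarily the patch's actual encoding.]

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit assignments: one bit per supported operation. */
enum
{
    REL_INSERTABLE = 1 << 0,
    REL_UPDATABLE  = 1 << 1,
    REL_DELETABLE  = 1 << 2
};

/*
 * An information_schema-style check becomes a single mask test, which is
 * why an integer bitmask is easier to consume there than a text result.
 */
static bool
is_updatable_and_deletable(int events)
{
    int needed = REL_UPDATABLE | REL_DELETABLE;

    return (events & needed) == needed;
}
```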
Regards,
Dean

pg_relation_is_updatable.patch
Description: Binary data
Re: [HACKERS] [PATCH] add --throttle to pgbench (submission 3)
On 05/31/2013 03:41 PM, Fabien COELHO wrote:
> However I'm not sure that pg_stat_replication currently has the
> necessary information on either side to measure the lag (in time or
> transactions; but how do I know when a transaction was committed? or the
> number of transactions?).
> The BDR codebase now has a handy function to report when a transaction
> was committed, pg_get_transaction_committime(xid).
> This looks handy for monitoring a replication setup. It should really be
> in core... Any plans? Or is there another way to get this kind of
> information in core?

Yes, it's my understanding that the idea is to eventually get all the BDR functionality merged, piece by piece, including the commit-time tracking feature.

pg_get_transaction_committime isn't trivial to just add to core, because it requires a commit time to be recorded with commit records in the transaction logs, among other changes. I don't know if Andres or any of the others involved are planning on trying to get this particular feature merged in 9.4, but I wouldn't be too surprised, since (AFAIK) it's fairly self-contained and would be useful for monitoring streaming replication setups as well.

-- 
Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Optimising Foreign Key checks
On 9 June 2013 02:12, Noah Misch n...@leadboat.com wrote:
> On Sat, Jun 08, 2013 at 08:20:42PM -0400, Robert Haas wrote:
> > On Sat, Jun 8, 2013 at 5:41 PM, Noah Misch n...@leadboat.com wrote:
> > > Likewise; I don't see why we couldn't perform an optimistic check
> > > ASAP and schedule a final after-statement check when an early check
> > > fails. That changes performance characteristics without changing
> > > semantics.
> >
> > ...this seems like it might have some promise; but what if the action
> > we're performing isn't idempotent? And how do we know?
>
> The action discussed so far is RI_FKey_check_ins(). It acquires a KEY
> SHARE lock (idempotent by nature) on a row that it finds using B-tree
> equality (presumed IMMUTABLE, thus idempotent). RI_FKey_check_upd() is
> nearly the same action, so the same argument holds. Before treating any
> other operation in the same way, one would need to conduct similar
> analysis.

As long as we are talking about FKs only, this approach can work. All we are doing is applying the locks slightly earlier than before. Once locked, they will prevent any later violations, so we are safe from anybody except *ourselves* making changes that would invalidate the earlier check. Trouble is, there are various ways I can see that happening, so making a check early doesn't allow you to avoid making the check later as well.

AFAICS there are weird cases where changing the way FKs execute will change the way complex trigger applications execute. I don't see a way to avoid that, other than doing nothing. Currently, we execute the checks following the normal order-of-execution rules for triggers. Every idea we've had so far changes that in some way; variously in major or minor ways, but changed nonetheless. Even the approach of deferring checks so they can be applied in a batch means we might change the way applications execute in detail. However, since the only possible change there would be to decrease the number of self-induced failures, that seems OK.

So the question is how much change we want to introduce. I'll guess "not much", rather than "lots" or "none".

Proposal: have a WHEN clause that accumulates values to be checked in a hash table up to work_mem in size, allowing us to eliminate the most common duplicates (just not *all* duplicates). If the value isn't a duplicate (or is at least the first seen tuple with that value), we queue up a check for later. That approach gives us *exactly* what we have now, and works with the two common cases: (i) few, mostly duplicated values; (ii) many values, but clustered together. Then apply the changes in batches at end of statement.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
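[Editorial note: the accumulate-and-dedup idea can be sketched in standalone C. Everything here is invented for illustration: the names, the integer keys, and the fixed-size open-addressing table standing in for the work_mem budget. The key property is that dedup is only an optimization; when the table fills, we conservatively queue the check anyway.]

```c
#include <assert.h>
#include <stdbool.h>

#define TABLE_SIZE 1024         /* stand-in for the work_mem cap */

typedef struct
{
    int     keys[TABLE_SIZE];
    bool    used[TABLE_SIZE];
} SeenValues;

/*
 * Returns true if the FK value was already seen, meaning no new
 * after-statement check needs to be queued for it. Returns false for the
 * first occurrence of a value (remember it, queue a check), and also when
 * the table is full (queue a possibly-redundant check; correctness never
 * depends on the dedup succeeding).
 */
static bool
seen_before(SeenValues *t, int fkval)
{
    unsigned h = (unsigned) fkval * 2654435761u % TABLE_SIZE;

    for (int i = 0; i < TABLE_SIZE; i++)
    {
        unsigned slot = (h + i) % TABLE_SIZE;

        if (!t->used[slot])
        {
            /* First time this value appears: remember it. */
            t->used[slot] = true;
            t->keys[slot] = fkval;
            return false;
        }
        if (t->keys[slot] == fkval)
            return true;        /* known duplicate: skip queuing a check */
    }
    return false;               /* table full: queue the check anyway */
}
```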
Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture
On 06/07/2013 04:38 AM, Jeff Janes wrote:
> On Mon, Jun 3, 2013 at 5:03 AM, Craig Ringer cr...@2ndquadrant.com wrote:
> > "My database is slow" - "This autovacuum thing is using up lots of I/O
> > and CPU, I'll increase this delay setting here"
>
> Do you think this was the correct diagnosis but with the wrong action
> taken, or was the diagnosis incorrect in the first place (i.e. it may be
> using some IO and CPU, but that isn't what was causing the initial
> problem)? And if the diagnosis was correct, was it causing problems
> under default settings, or only because they had already turned off the
> cost delay?

The problem is that vacuum running too slowly tends to result in table and index bloat, which results in less efficient cache use, slower scans, and generally worsening performance. I've repeatedly seen the user attribute the resulting high I/O to autovacuum (which is, after all, always working away trying to keep up) and "solve" the problem by further slowing autovacuum down. It is very counter-intuitive that to fix the problem the user needs to let the background process that's doing all the I/O take up *more* resources, so that other queries take *even less*.

> > - "I'll whack in some manual VACUUM cron jobs during low-load
> > maintenance hours and hope that keeps the worst of the problem away,
> > that's what random forum posts on the Internet say to do"
> > - "oh my, why did my DB just do an emergency shutdown?"
>
> This one doesn't make much sense to me, unless they mucked around with
> autovacuum_freeze_max_age as well as turning autovacuum itself off
> (common practice?).

Unfortunately, yes, as an extension of the reasoning described above that people seem to apply around autovacuum. The now horrifyingly bloated DB is being kept vaguely functional by regular cron'd vacuum runs, but then autovacuum kicks back in and starts thrashing the system. The system is already performing really badly because of all the bloat, so this is more than it can take, and performance tanks critically. Particularly since it probably has 1000 or more backends thrashing away, if it's anything like many of the systems I've been seeing in the wild.

The operator's response: panic, and find out how to make it stop. Once autovacuum quits doing its thing, the system returns to staggering along, and they go back to planning a hardware upgrade "someday"; then suddenly it's emergency wraparound-prevention time.

I suspect vacuum, autovacuum, autovacuum tuning, table and index bloat, etc. are just too complicated for a lot of people running Pg installs to really understand. I'd really, really love to see some feedback-based auto-tuning of vacuum.

-- 
Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Batch API for After Triggers
On Sun, Jun 09, 2013 at 10:15:09AM +0100, Simon Riggs wrote:
> As I mentioned in my post, I did consider that and then chose not to do
> it. Having a final func is a major modification in the way that we
> specify trigger functions. We'd also need to cope with recursive trigger
> execution, which would mean the final func could get called many times,
> so there's no way of knowing whether a given call of the final func is
> actually the last call needed. That sounded complex and confusing to me.
> The proposed API allows you to do exactly that anyway, more easily, by
> just waiting until tg_event_num == tg_tot_num_events.

Can you signal partial completion? For example, if a trigger knows that blocks of 10,000 are optimal, and it sees tg_tot_num_events == 1,000,000, it could do work every 10,000 entries, as in, when:

  (tg_event_num % 10000) == 0 || tg_event_num == tg_tot_num_events

Have a nice day,

-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he
> does not attach much importance to his own thoughts.
  -- Arthur Schopenhauer
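[Editorial note: the flush condition above can be modelled in standalone C. tg_event_num and tg_tot_num_events are fields *proposed* in this thread, not an existing PostgreSQL trigger API, so this sketch treats them as plain parameters; events are numbered from 1.]

```c
#include <assert.h>
#include <stdbool.h>

/* The "do work every block_size entries, and at the very end" predicate. */
static bool
should_flush(long tg_event_num, long tg_tot_num_events, long block_size)
{
    return (tg_event_num % block_size) == 0 ||
           tg_event_num == tg_tot_num_events;
}

/*
 * Count how many flushes a batch-aware trigger would perform over a
 * statement affecting ntotal rows: one per full block, plus a final
 * partial block if ntotal isn't a multiple of block_size.
 */
static long
count_flushes(long ntotal, long block_size)
{
    long flushes = 0;

    for (long n = 1; n <= ntotal; n++)
        if (should_flush(n, ntotal, block_size))
            flushes++;
    return flushes;
}
```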
Re: [HACKERS] Batch API for After Triggers
On 06/09/2013 04:58 PM, Simon Riggs wrote:
> There are also difficulties in semantics, since when we have OLD and NEW
> at row level we know we are discussing the same row. With sets of OLD
> and NEW we'd need to be able to link the relations back together
> somehow, which couldn't be done by PK since that could change.

We don't currently have OLD and NEW relations, so we're free to define how this works pretty freely. Rather than having OLD and NEW as separate relations, we could just have one OLD_AND_NEW relation. In that relation we exploit Pg's composite types to nest the old and new tuples in a single outer change record. OLD_AND_NEW would look to PL/PgSQL as if it were:

  CREATE TEMPORARY TABLE OLD_AND_NEW (
      OLD tabletype NOT NULL,
      NEW tabletype NOT NULL
  );

... though presumably without the ability to create indexes on it, and the other things you can do to a real temp table. Though I can see cases where that'd be awfully handy too.

For DELETE and INSERT we'd either provide different relations named OLD and NEW respectively, or we'd use OLD_AND_NEW with one field or the other blank. I'm not sure which would be best.

Alternately, we could break the usual rules for relations and define OLD and NEW as ordered, so lock-step iteration would always return matching pairs of rows. That's useless in SQL, since there's no way to achieve lock-step iteration, but what if we provide a for_each_changed_row('some_function'::regproc) that scans them in lock-step and invokes some_function for each pair?

(I haven't yet done enough in the core to have any idea if this approach is completely and absurdly impossible, or just ugly. Figured I'd throw it out there anyway.)

-- 
Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Optimising Foreign Key checks
On 2013-06-01 09:41:13 +0100, Simon Riggs wrote:
> FK checks can be expensive, especially when loading large volumes of
> data into an existing table or partition. A couple of ideas for
> improving performance are discussed here:

Another idea would be to optimize away the row-level locks if we already hold a table-level lock that provides more protection than the requested row-level lock. In bulk data-loading scenarios that's not uncommon, and it can matter quite a bit, especially when loading in parallel, since both the WAL traffic and the dirtying of pages are considerably reduced. Should be implementable relatively easily in heap_lock_tuple(), without being visible to the outside.

Greetings,

Andres Freund

-- 
Andres Freund   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] ALTER TABLE ... ALTER CONSTRAINT
On 2013-06-08 21:45:24 +0100, Simon Riggs wrote:
> While fiddling with FK tuning, it was useful to be able to enable and
> disable the DEFERRED mode of constraints. That is not currently possible
> in SQL, so I wrote this patch. Without this you have to drop and then
> re-add a constraint, which is impractical for large tables.
>
> e.g.
>   CREATE TABLE fktable (id integer, fk integer REFERENCES pktable (id));
>   ALTER TABLE fktable ALTER CONSTRAINT fktable_fk_fkey
>       DEFERRABLE INITIALLY IMMEDIATE;
>
> Includes docs and tests. Currently works for FKs only. Potentially other
> constraints can be supported in future.

I haven't looked at the patch in detail, but I am very, very much in favor of the feature in general… I have wished for this more than once, and working around its absence has certainly cost me more time than it would have cost to implement it.

Thanks,

Andres Freund

-- 
Andres Freund   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2013-06-08 13:26:56 -0700, Joshua D. Drake wrote:
> > At the points where the XLogInsert()s happen, we're in critical
> > sections out of which we *cannot* ERROR out, because we may already
> > have made modifications that cannot be allowed to be performed
> > partially/unlogged. That's why we throw a PANIC, which forces a
> > cluster-wide restart, including *NOT* writing any further buffers from
> > s_b out.
>
> Does this preclude (sorry, I don't know this part of the code very well)
> my suggestion of "on log create"?

Well, yes. We create log segments some layers below XLogInsert() if necessary, and as I said above, we're in a critical section at that point, so just rolling back isn't one of the options.

Greetings,

Andres Freund

-- 
Andres Freund   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] on existing update construct
On 05/16/2013 02:44 AM, Dev Kumkar wrote:
> Hello, is there an alternative to Sybase's "on existing update" construct
> in pgsql?

No. See:

http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/

-- 
Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MVCC catalog access
On 2013-06-06 12:49:14 -0400, Robert Haas wrote: On Thu, Jun 6, 2013 at 5:30 AM, Andres Freund and...@2ndquadrant.com wrote: + * XXX: Now that we have MVCC catalog access, the reasoning above is no longer + * true. Are there other good reasons to hard-code this, or should we revisit + * that decision? We could just implement the function by looking in the shared relmapper. Everything that can be mapped via it is shared. I suspect there are several possible sources for this information, but it's hard to beat a hard-coded list for efficiency, so I wasn't sure if we should tinker with this or not. I can tell from experience that it makes adding a new shared relation more of a pain than it already is, but we're hopefully not doing that all that often. I just don't think that the mvcc angle has much to do with the decision. --- a/src/backend/commands/cluster.c +++ b/src/backend/commands/cluster.c @@ -480,6 +480,11 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, bool recheck, LOCKMOD * against concurrent SnapshotNow scans of pg_index. Therefore this is unsafe * to execute with less than full exclusive lock on the parent table; * otherwise concurrent executions of RelationGetIndexList could miss indexes. + * + * XXX: Now that we have MVCC catalog access, SnapshotNow scans of pg_index + * shouldn't be common enough to worry about. The above comment needs + * to be updated, and it may be possible to simplify the logic here in other + * ways also. */ You're right, the comment needs to be changed, but I don't think the effect can. A non-inplace upgrade changes the xmin of the row which is relevant for indcheckxmin. OK. (In fact, isn't this update possibly causing problems like delaying the use of such an index already?) Well, maybe. In general, the ephemeral snapshot taken for a catalog scan can't be any older than the primary snapshot already held. 
But there could be some corner case where that's not true, if we use this technique somewhere that such a snapshot hasn't already been acquired. I wasn't talking about catalog scans or this patch, I wonder whether the update we do there won't cause the index not being used for concurrent (normal) scans since now the xmin is newer while it might be far in the past before. I.e. we might need to unset indcheckxmin if xmin is far enough in the past. Hm. Looks like this should also change the description of SecondarySnapshot: /* * CurrentSnapshot points to the only snapshot taken in transaction-snapshot * mode, and to the latest one taken in a read-committed transaction. * SecondarySnapshot is a snapshot that's always up-to-date as of the current * instant, even in transaction-snapshot mode. It should only be used for * special-purpose code (say, RI checking.) * I think that's still more or less true, though we could add catalog scans as another example. I guess my feeling is that once catalog scans use it, it's not so much special purpose anymore ;). But I admit that the frequency of usage doesn't say much about its specificity... I actually wonder if we shouldn't just abolish GetLatestSnapshot(). None of the callers seem to rely on its behaviour from a quick look and it seems rather confusing to have both. I assume Tom had some reason for making GetLatestSnapshot() behave the way it does, so I refrained from doing that. I might be wrong. At least I don't see any on a quick look - which doesn't say very much. I think I just dislike having *instant and *latest functions in there, seems to be confusing to me. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] add --throttle to pgbench (submission 3)
On 2013-06-09 17:50:13 +0800, Craig Ringer wrote: On 05/31/2013 03:41 PM, Fabien COELHO wrote: However I'm not sure that pg_stat_replication currently has the necessary information on either side to measure the lag (in time, but how do I know when a transaction was committed? or in number of transactions?). The BDR codebase now has a handy function to report when a transaction was committed, pg_get_transaction_committime(xid). This looks handy for monitoring a replication setup. It should really be in core... Any plans? Or are there other ways to get this kind of information in core? pg_get_transaction_committime isn't trivial to just add to core because it requires a commit time to be recorded with commit records in the transaction logs, among other changes. The commit records actually already have that information available (c.f. xl_xact_commit(_compact) in xact.h), the problem is having a datastructure which collects all that. That's why the committs patch (written by Alvaro) added an SLRU mapping xids to timestamps. And yes, we want to submit that sometime. The pg_xlog_wait_remote_apply(), pg_xlog_wait_remote_receive() functions however don't need any additional infrastructure, so I think those are easier and less controversial to add. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding
On 06/09/2013 12:38 AM, Noah Misch wrote: On Sat, Jun 08, 2013 at 11:50:53PM -0400, Andrew Dunstan wrote: On 06/08/2013 10:52 PM, Noah Misch wrote: Let's return to the drawing board on this one. I would be inclined to keep the current bad behavior until we implement the i18n-aware case folding required by SQL. If I'm alone in thinking that, perhaps switch to downcasing only ASCII characters regardless of the encoding. That at least gives consistent application behavior. I apologize for not noticing in time to comment on this week's thread. The behaviour which this fixes is an unambiguous bug. Calling tolower() on the individual bytes of a multi-byte character can't possibly produce any sort of correct result. A database that contains such corrupted names, probably not valid in any encoding at all, is almost certainly not restorable, and I'm not sure if it's dumpable either. I agree with each of those points. However, since any change here breaks compatibility, we should fix it right the first time. A second compatibility break would be all the more onerous once this intermediate step helps more users to start using unquoted, non-ASCII object names. It's already produced several complaints in recent months, so ISTM that returning to it for any period of time is unthinkable. PostgreSQL has lived with this wrong behavior since ... the beginning? It's a problem, certainly, but a bandage fix brings its own trouble. If you have a better fix I am all ears. I can recall at least one discussion of this area (concerning Turkish I quite a few years ago) where we failed to come up with anything. I have a fairly hard time believing in your "relies on this and somehow works" scenario. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Batch API for After Triggers
Simon Riggs si...@2ndquadrant.com wrote: On 8 June 2013 22:25, Kevin Grittner kgri...@ymail.com wrote: Simon Riggs si...@2ndquadrant.com wrote: There are also difficulties in semantics, since when we have OLD and NEW at row level we know we are discussing the same row. With sets of OLD and NEW we'd need to be able to link the relations back together somehow, which couldn't be done by PK since that could change. So we'd need to invent some semantics for a linking identifier of some description. Which once we've done it would be used by people to join them back together again, which is what we already had in the first place. So my take is that it sounds expressive, but definitely not performant. I have used a feature like this in other database products, and can say from experience that these relations in a statement trigger can be very useful without the linkage you propose. I can see how the linkage could potentially be useful, but if that is the only hang-up, we would be adding a powerful feature without it. Since my objective is performance, not standards, I don't see a reason to work on that, yet. I might have time to work on it later, let's see. This seems like it has some overlap with the delta relations I will need to generate for incremental maintenance of materialized views, so we should coordinate on those efforts if they happen to occur around the same time. -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Optimising Foreign Key checks
On Sun, Jun 9, 2013 at 10:51 AM, Simon Riggs si...@2ndquadrant.com wrote: AFAICS there are weird cases where changing the way FKs execute will change the way complex trigger applications will execute. I don't see a way to avoid that other than do nothing. Currently, we execute the checks following the normal order of execution rules for triggers. Every idea we've had so far changes that in some way; variously in major or minor ways, but changed nonetheless. The obvious case to handle would be if someone has a trigger to automatically create any missing references. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture
Craig Ringer cr...@2ndquadrant.com wrote: On 06/07/2013 04:38 AM, Jeff Janes wrote: Craig Ringer cr...@2ndquadrant.com The problem is that vacuum running too slow tends to result in table and index bloat. Which results in less efficient cache use, slower scans, and generally worsening performance. I've repeatedly seen the user attribute the resulting high I/O to autovacuum (which is, after all, always working away trying to keep up) - and solving the problem by further slowing autovacuum. It is very counter-intuitive that to fix the problem the user needs to make the background process that's doing the I/O take up *more* resources, so that other queries take *even less*. Exactly. It can be very hard to convince someone to make autovacuum more aggressive when they associate its default configuration with slowness. - I'll whack in some manual VACUUM cron jobs during low load maintenance hours and hope that keeps the worst of the problem away, that's what random forum posts on the Internet say to do. - oh my, why did my DB just do an emergency shutdown? This one doesn't make much sense to me, unless they mucked around with autovacuum_freeze_max_age as well as turning autovacuum itself off (common practice?). Unfortunately, yes, as an extension of the above reasoning people seem to apply around autovacuum. The now horrifyingly bloated DB is being kept vaguely functional by regular cron'd vacuum runs, but then autovacuum kicks back in and starts thrashing the system. It's already performing really badly because of all the bloat so this is more than it can take and performance tanks critically. Particularly since it probably has 1000 or more backends thrashing away if it's anything like many of the systems I've been seeing in the wild. The operator's response: Panic and find out how to make it stop. 
Once autovacuum quits doing its thing the system returns to staggering along and they go back to planning a hardware upgrade someday, then suddenly it's emergency wraparound prevention time. I have seen exactly this pattern multiple times. They sometimes completely ignore all advice about turning on and tuning autovacuum and instead want to know the exact formula for when the wraparound prevention autovacuum will trigger, so they can run a vacuum just in time to prevent it -- since they believe this will minimize disk access and thus give them best performance. They often take this opportunity to run VACUUM FULL on the table and don't see the point of following that with any other form of VACUUM, so they wipe out their visibility map in the process. I suspect vacuum, autovacuum, autovacuum tuning, table and index bloat, etc. is just too complicated for a lot of people running Pg installs to really understand. The ones who suffer most are those who learn just enough to think they know how to tune better than the defaults, but not enough to really understand the full impact of the changes they are making. I have no particular ideas on what to do about that observation, unfortunately. I'd really, really love to see some feedback-based auto-tuning of vacuum. +1 -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ALTER TABLE ... ALTER CONSTRAINT
Andres Freund and...@2ndquadrant.com wrote: On 2013-06-08 21:45:24 +0100, Simon Riggs wrote: ALTER TABLE foo ALTER CONSTRAINT fktable_fk_fkey DEFERRED INITIALLY IMMEDIATE; I haven't looked at the patch in detail, but I am very, very much in favor of the feature in general… I have wished for this more than once +1 -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding
Andrew Dunstan and...@dunslane.net writes: On 06/09/2013 12:38 AM, Noah Misch wrote: PostgreSQL has lived with this wrong behavior since ... the beginning? It's a problem, certainly, but a bandage fix brings its own trouble. I don't see this as particularly bandage-y. It's a subset of the spec-required folding behavior, sure, but at least now it's a proper subset of that behavior. It preserves all cases in which the previous coding did the right thing, while removing some cases in which it didn't. If you have a better fix I am all ears. I can recall at least one discussion of this area (concerning Turkish I quite a few years ago) where we failed to come up with anything. Yeah, Turkish handling of i/I messes up any attempt to use the locale's case-folding rules straightforwardly. However, I think we've already fixed that with the rule that ASCII characters are folded manually. The resistance to moving this code to use towlower() for non-ASCII mainly comes from worries about speed, I think; although there was also something about downcasing conversions that change the string's byte length being problematic for some callers. I have a fairly hard time believing in your relies on this and somehow works scenario. The key point for me is that if tolower() actually does anything in the previous state of the code, it's more than likely going to produce invalidly encoded data. The consequences of that can't be good. You can argue that there might be people out there for whom the transformation accidentally produced a validly-encoded string, but how likely is that really? It seems much more likely that the only reason we've not had more complaints is that on most popular platforms, the code accidentally fails to fire on any UTF8 characters (or any common ones, anyway). On those platforms, there will be no change of behavior. 
regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
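The ASCII-only folding discussed above can be sketched in a few lines. This is an illustration of the principle, not PostgreSQL's actual downcase_truncate_identifier(): only bytes in 'A'..'Z' are folded, so high-bit-set bytes — which in a multi-byte encoding belong to multi-byte characters — can never be corrupted into an invalid sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Fold only ASCII uppercase letters to lowercase, leaving every other byte
 * (including all bytes of multi-byte UTF-8 characters) untouched.
 */
static void
downcase_ascii_only(char *ident, size_t len)
{
    size_t      i;

    for (i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ident[i] = (char) (ch - 'A' + 'a');
        /* bytes >= 0x80 fall through unchanged */
    }
}
```

For input "SÜSS" (UTF-8 bytes `S \xC3\x9C S S`) this yields `s \xC3\x9C s s` — the Ü is left uppercase, which is exactly the "proper subset of the spec behavior" trade-off Tom describes: incomplete folding, but never invalid encoding.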
Re: [HACKERS] small patch to crypt.c
Stephen Frost sfr...@snowman.net writes: Regardless, setting vuntil to some magic value that really means it's actually NULL, which is what you'd need to do in order to get rid of that explicit check for null, doesn't strike me as a good idea. When a value is null, we shouldn't be looking at the data at all. Even aside from that, the proposed change seems like a bad idea because it introduces an unnecessary call of GetCurrentTimestamp() in the common case where there's no valuntil limit. On some platforms that call is pretty slow. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] small patch to crypt.c
On 06/09/2013 09:28 AM, Tom Lane wrote: Even aside from that, the proposed change seems like a bad idea because it introduces an unnecessary call of GetCurrentTimestamp() in the common case where there's no valuntil limit. On some platforms that call is pretty slow. And that would explain why we don't do something like this: index f01d904..4935c9f 100644 --- a/src/backend/libpq/crypt.c +++ b/src/backend/libpq/crypt.c @@ -145,12 +145,10 @@ md5_crypt_verify(const Port *port, const char *role, char *client_pass) /* * Password OK, now check to be sure we are not past rolvaliduntil */ - if (isnull) + if (isnull || vuntil >= GetCurrentTimestamp()) retval = STATUS_OK; - else if (vuntil < GetCurrentTimestamp()) - retval = STATUS_ERROR; else - retval = STATUS_OK; + retval = STATUS_ERROR; } Right. Ty for the feedback, I know it was just a little bit of code but it just looked off and I appreciate the explanation. JD -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
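The folded form in the diff above still skips the timestamp call in the common case where rolvaliduntil is NULL, because `||` short-circuits on `isnull`. A standalone sketch of that evaluation order — the names `current_timestamp` and `check_valuntil`, and the call counter, are stand-ins invented for this illustration, not the real backend API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampTz;

#define STATUS_OK     0
#define STATUS_ERROR  (-1)

/* Counts evaluations of the "expensive" clock call, to show when it runs. */
static int timestamp_calls = 0;

/* Stand-in for PostgreSQL's GetCurrentTimestamp(); returns a fixed "now". */
static TimestampTz
current_timestamp(void)
{
    timestamp_calls++;
    return 1000;
}

/*
 * Mirror of the folded check: because || short-circuits, the clock is never
 * read when the valuntil column was NULL.
 */
static int
check_valuntil(bool isnull, TimestampTz vuntil)
{
    if (isnull || vuntil >= current_timestamp())
        return STATUS_OK;
    return STATUS_ERROR;
}
```

With a NULL valuntil the counter stays at zero, which is Tom's point: the explicit-isnull and folded forms are equally cheap, and neither needs a magic sentinel value standing in for NULL.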
Re: [HACKERS] Optimising Foreign Key checks
On 9 June 2013 14:59, Greg Stark st...@mit.edu wrote: On Sun, Jun 9, 2013 at 10:51 AM, Simon Riggs si...@2ndquadrant.com wrote: AFAICS there are weird cases where changing the way FKs execute will change the way complex trigger applications will execute. I don't see a way to avoid that other than do nothing. Currently, we execute the checks following the normal order of execution rules for triggers. Every idea we've had so far changes that in some way; variously in major or minor ways, but changed nonetheless. The obvious case to handle would be if someone has a trigger to automatically create any missing references. Exactly my thoughts. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] postgres_fdw regression tests order dependency
It looks like the postgres_fdw's regression tests expect data back from the following statement in a given order, which presumably isn't guaranteed: UPDATE ft2 SET c2 = c2 + 600 WHERE c1 % 10 = 8 RETURNING *; See http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=frogmouth&dt=2013-06-08%2018%3A30%3A00 Maybe we need to wrap this in a CTE and then order the results for consistency? cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Valgrind Memcheck support
Valgrind's Memcheck tool[1] is handy for finding bugs, but our use of a custom allocator limits its ability to detect problems in unmodified PostgreSQL. During the 9.1 beta cycle, I found some bugs[2] with a rough patch adding instrumentation to aset.c and mcxt.c such that Memcheck understood our allocator. I've passed that patch around to a few people over time, and I've now removed the roughness such that it's ready for upstream. In hopes of making things clearer in the commit history, I've split out a preliminary refactoring patch from the main patch and attached each separately. Besides the aset.c/mcxt.c instrumentation, this patch adds explicit checks for undefined memory to PageAddItem() and printtup(); this has caught C-language functions that fabricate a Datum without initializing all bits. The inclusion of all this is controlled by a pg_config_manual.h setting. The patch also adds a suppression file that directs Valgrind to silence nominal errors we don't plan to fix. To smoke-test the instrumentation, I used make installcheck runs on x86_64 GNU/Linux and ppc64 GNU/Linux. This turned up various new and newly-detected memory bugs, which I will discuss in a separate thread. With those fixed, make installcheck has a clean report (in my one particular configuration). I designed the suppression file to work across platforms; I specifically anticipated eventual use on x86_64 Darwin and x86_64 FreeBSD. Valgrind 3.8.1 quickly crashed when running PostgreSQL on Darwin; I did not dig further. Since aset.c and mcxt.c contain the hottest code paths in the backend, I verified that a !USE_VALGRIND, non-assert build produces the same code before and after the patch. Testing that revealed the need to move up the AllocSizeIsValid() check in repalloc(), though I don't understand why GCC reacted that way. Peter Geoghegan and Korry Douglas provided valuable feedback on earlier versions of this code. Possible future directions: - Test make installcheck-world. 
When I last did this in past years, contrib did trigger some errors. - Test recovery, such as by running a streaming replica under Memcheck while the primary runs make installcheck-world. - Test newer compilers and higher optimization levels. I used GCC 4.2 at -O1. A brief look at -O2 results showed a new error that I have not studied. GCC 4.8 at -O3 might show still more due to increasingly-aggressive assumptions. - A buildfarm member running its installcheck steps this way. - Memcheck has support for detecting leaks. I have not explored that side at all, always passing --leak-check=no. We could add support for freeing everything at process exit, thereby making the leak detection meaningful. Brief notes for folks reproducing my approach: I typically start the Memcheck-hosted postgres like this: valgrind --leak-check=no --gen-suppressions=all \ --suppressions=src/tools/valgrind.supp --time-stamp=yes \ --log-file=$HOME/pg-valgrind/%p.log postgres If that detected an error on which I desired more detail, I would rerun a smaller test case with --track-origins=yes --read-var-info=yes. That slows things noticeably but gives more-specific messaging. When even that left the situation unclear, I would temporarily hack allocChunkLimit so every palloc() turned into a malloc(). I strongly advise installing the latest-available Valgrind, particularly because recent releases suffer far less of a performance drop processing the instrumentation added by this patch. A make installcheck run takes 273 minutes under Valgrind 3.6.0 but just 27 minutes under Valgrind 3.8.1. 
Thanks, nm [1] http://valgrind.org/docs/manual/mc-manual.html [2] http://www.postgresql.org/message-id/20110312133224.ga7...@tornado.gateway.2wire.net -- Noah Misch EnterpriseDB http://www.enterprisedb.com diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index 6a111c7..de64377 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -308,6 +308,37 @@ AllocSetFreeIndex(Size size) return idx; } +#ifdef CLOBBER_FREED_MEMORY + +/* Wipe freed memory for debugging purposes */ +static void +wipe_mem(void *ptr, size_t size) +{ + memset(ptr, 0x7F, size); +} +#endif + +#ifdef MEMORY_CONTEXT_CHECKING +static void +set_sentinel(void *base, Size offset) +{ + char *ptr = (char *) base + offset; + + *ptr = 0x7E; +} + +static bool +sentinel_ok(const void *base, Size offset) +{ + const char *ptr = (const char *) base + offset; + bool ret; + + ret = *ptr == 0x7E; + + return ret; +} +#endif + #ifdef RANDOMIZE_ALLOCATED_MEMORY /* @@ -492,8 +523,7 @@ AllocSetReset(MemoryContext context) char *datastart = ((char *) block) + ALLOC_BLOCKHDRSZ; #ifdef CLOBBER_FREED_MEMORY - /* Wipe freed memory for debugging purposes */ - memset(datastart, 0x7F,
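The sentinel helpers factored out in the refactoring patch above can be exercised standalone. This is a simplified sketch (plain size_t instead of PostgreSQL's Size, and outside any #ifdef): a 0x7E byte is planted just past the requested end of an allocation, and a later check detects whether anything wrote over it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Plant a sentinel byte at base + offset, i.e. just past the caller's
 * requested allocation size. */
static void
set_sentinel(void *base, size_t offset)
{
    char   *ptr = (char *) base + offset;

    *ptr = 0x7E;
}

/* Return true if the sentinel byte is intact; false means something wrote
 * past the end of its request. */
static bool
sentinel_ok(const void *base, size_t offset)
{
    const char *ptr = (const char *) base + offset;

    return *ptr == 0x7E;
}
```

The allocator can afford the extra byte because chunk sizes are rounded up anyway; the sentinel turns silent overruns of the *requested* size into assertion failures, complementing what Memcheck catches on the *rounded* size.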
[HACKERS] 9.3 crop of memory errors
My make installcheck runs while completing the just-posted Valgrind Memcheck patch revealed seven new and newly-detected (due to tighter checking) memory errors. Proposed patches attached. * SP-GiST moveLeafs() and doPickSplit() read past the end of a palloc These functions construct arrays of OffsetNumber in palloc'd chunks, which they then place in WAL records. The rdata entry is done with a MAXALIGN'd size, but the associated palloc request may not extend to the next MAXALIGN boundary. This use of MAXALIGN is wasted space anyway; the arrays only need SHORTALIGN, which the code maintains without assistance (so long as each xlrec structure needs at least SHORTALIGN, which seems inevitable). I've just removed the explicit alignment bumps. This behavior dates to the initial SP-GiST commit. Since every palloc chunk has MAXALIGN'd size under the hood, the excess read cannot cause a SIGSEGV or other concrete bad outcome. Therefore, I propose to change 9.4 only. * GinFormTuple() leaves pad bytes uninitialized When it repalloc()s an IndexTuple, this function does not zero the new space. The increase has up to two regions of alignment padding, one byte after the null category (if any) and up to six bytes at the end. This is not, to my knowledge, a concrete problem. However, I've observed no other place in the system where we send uninitialized data to PageAddItem(). This looks like an oversight, and we should clear affected memory to bring consistency with the other core bufpage consumers. This behavior goes back to 8.4 or earlier. Since it's a mere point of hygiene, I intend this for 9.4 only. * Uses of a NULL-terminated string as a Name datum A Name is expected to have NAMEDATALEN bytes. Presenting a shorter cstring as a Name causes reading past the end of the string's allocation. Switch to the usual idiom. New in 9.3 (765cbfdc9263bf7c90b9d1f1044c6950b8b7088c, 3855968f328918b6cd1401dd11d109d471a54d40). 
* HaveVirtualXIDsDelayingChkpt() wrongly assumes a terminated array This function looks for a !VirtualTransactionIdIsValid() terminator, but GetVirtualXIDsDelayingChkpt() does not add one. Patch makes the function's looping logic more like its 9.2 version; I cannot find discussion of or discern a benefit of that aspect of the change. It also reverts the addition of an unused variable, presumably a debugging relic, by the same commit. New in 9.3 (f21bb9cfb5646e1793dcc9c0ea697bab99afa523). * Passing oidvector by value oidvector ends with a flexible array, so this is almost always wrong. New in 9.3 (7ac5760fa283bc090c25e4ea495a0d2bb41db7b5). * hba.c tokenize_file can read backward past the beginning of a stack variable This arises when a file like pg_hba.conf contains an empty line. New in 9.3 (7f49a67f954db3e92fd96963169fb8302959576e). * json parser can read one byte past the datum end I swapped order of tests like if (has_some_property(*p) && p < end). Perhaps this was intended as a micro-optimization, putting the more-selective check first. If that is important, we might arrange to have a known byte after the end to make it safe. New in 9.3 (a570c98d7fa0841f17bbf51d62d02d9e493c7fcc). 
Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c index 5f6bcdd..1199916 100644 --- a/src/backend/access/spgist/spgdoinsert.c +++ b/src/backend/access/spgist/spgdoinsert.c @@ -535,9 +535,9 @@ moveLeafs(Relation index, SpGistState *state, { XLogRecPtr recptr; - ACCEPT_RDATA_DATA(xlrec, MAXALIGN(sizeof(xlrec)), 0); - ACCEPT_RDATA_DATA(toDelete, MAXALIGN(sizeof(OffsetNumber) * nDelete), 1); - ACCEPT_RDATA_DATA(toInsert, MAXALIGN(sizeof(OffsetNumber) * nInsert), 2); + ACCEPT_RDATA_DATA(xlrec, sizeof(xlrec), 0); + ACCEPT_RDATA_DATA(toDelete, sizeof(OffsetNumber) * nDelete, 1); + ACCEPT_RDATA_DATA(toInsert, sizeof(OffsetNumber) * nInsert, 2); ACCEPT_RDATA_DATA(leafdata, leafptr - leafdata, 3); ACCEPT_RDATA_BUFFER(current->buffer, 4); ACCEPT_RDATA_BUFFER(nbuf, 5); @@ -1118,7 +1118,7 @@ doPickSplit(Relation index, SpGistState *state, leafdata = leafptr = (char *) palloc(totalLeafSizes); - ACCEPT_RDATA_DATA(xlrec, MAXALIGN(sizeof(xlrec)), 0); + ACCEPT_RDATA_DATA(xlrec, sizeof(xlrec), 0); ACCEPT_RDATA_DATA(innerTuple, innerTuple->size, 1); nRdata = 2; @@ -1154,7 +1154,7 @@ doPickSplit(Relation index, SpGistState *state, { xlrec.nDelete = nToDelete; ACCEPT_RDATA_DATA(toDelete, - MAXALIGN(sizeof(OffsetNumber) * nToDelete), + sizeof(OffsetNumber) * nToDelete,
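The SP-GiST over-read described above is pure arithmetic. A sketch, assuming MAXIMUM_ALIGNOF is 8 as on typical 64-bit platforms (the real MAXALIGN macro lives in c.h and is platform-dependent): with three OffsetNumbers the palloc request is 6 bytes, but a MAXALIGN'd rdata length rounds up to 8, so WAL assembly reads 2 bytes past the requested size.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t OffsetNumber;  /* 2 bytes, as in PostgreSQL */

/*
 * MAXALIGN as it expands when MAXIMUM_ALIGNOF == 8: round LEN up to the
 * next multiple of 8.
 */
#define MAXALIGN(LEN)  (((uintptr_t) (LEN) + 7) & ~(uintptr_t) 7)
```

Because palloc chunks are themselves rounded up to MAXALIGN'd sizes internally, the extra bytes are readable and the bug is invisible without instrumentation — exactly the class of error the Memcheck integration exists to surface, and why the fix (dropping the needless MAXALIGN on the rdata lengths) is proposed for 9.4 only.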
Re: [HACKERS] Valgrind Memcheck support
On 2013-06-09 17:25:59 -0400, Noah Misch wrote: Valgrind's Memcheck tool[1] is handy for finding bugs, but our use of a custom allocator limits its ability to detect problems in unmodified PostgreSQL. During the 9.1 beta cycle, I found some bugs[2] with a rough patch adding instrumentation to aset.c and mcxt.c such that Memcheck understood our allocator. I've passed that patch around to a few people over time, and I've now removed the roughness such that it's ready for upstream. In hopes of making things clearer in the commit history, I've split out a preliminary refactoring patch from the main patch and attached each separately. Besides the aset.c/mcxt.c instrumentation, this patch adds explicit checks for undefined memory to PageAddItem() and printtup(); this has caught C-language functions that fabricate a Datum without initializing all bits. The inclusion of all this is controlled by a pg_config_manual.h setting. The patch also adds a suppression file that directs Valgrind to silence nominal errors we don't plan to fix. Very nice work. I've started to do this quite some time back to smoke out some bugs in code of mine, but never got remotely to a point where it was submittable. But I already found some bugs with it. So I'd be very happy if this could get committed. Will take a look. I strongly advise installing the latest-available Valgrind, particularly because recent releases suffer far less of a performance drop processing the instrumentation added by this patch. A make installcheck run takes 273 minutes under Vaglrind 3.6.0 but just 27 minutes under Valgrind 3.8.1. At least on linux amd64 I'd strongly suggest using something newer than (afair) 3.8.1, i.e. the svn version. Up to that version it corrupts the stack alignment inside signal handlers which doesn't get fixed up even after a fork(). This leads to mysterious segfaults, e.g. during elog()s due to the usage of SSE registers which have stronger alignment requirements. 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
From: Craig Ringer cr...@2ndquadrant.com On 06/09/2013 08:32 AM, MauMau wrote: - Failure of a disk containing data directory or tablespace If checkpoint can't write buffers to disk because of disk failure, checkpoint cannot complete, thus WAL files accumulate in pg_xlog/. This means that one disk failure will lead to postgres shutdown. I've seen a couple of people bitten by the misunderstanding that tablespaces are a way to split up your data based on different reliability requirements, and I really need to write a docs patch for http://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html http://www.postgresql.org/docs/9.2/static/manage-ag-tablespaces.html that adds a prominent warning like: WARNING: Every tablespace must be present before the database can be started. There is no easy way to recover the database if a tablespace is lost to disk failure, deletion, use of volatile storage, etc. Do not put a tablespace on a RAM disk; instead just use UNLOGGED tables. (Opinions on the above?) Yes, I'm sure this is useful for DBAs to know how postgres behaves and take some preparations. However, this does not apply to my case, because I'm using tablespaces for I/O distribution across multiple disks and simply for database capacity. The problem is that the reliability of the database system decreases with more disks, because failure of any one of those disks would result in a database PANIC shutdown. I'd rather like to be able to recover from this by treating the tablespace as dead, so any attempt to get a lock on any table within it fails with an error and already-in-WAL writes to it just get discarded. It's the sort of thing that'd only be reasonable to do as a recovery option (like zero_damaged_pages) since if applied by default it'd lead to potentially severe and unexpected data loss. I'm in favor of taking a tablespace offline when I/O failure is encountered, and continue running the database server. 
But WAL must not be discarded, because committed transactions must be preserved for durability of ACID. Postgres needs to take these steps when it encounters an I/O error:

1. Take the tablespace offline, so that subsequent read/write against it returns an error without actually issuing read/write against data files.
2. Discard shared buffers containing data in the tablespace.

WAL is not affected by the offlining of tablespaces. WAL records already written on the WAL buffer will be written to pg_xlog/ and archived as usual. Those WAL records will be used to recover committed transactions during archive recovery.

Regards
MauMau

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] JSON and unicode surrogate pairs
On 06/06/2013 12:53 PM, Robert Haas wrote: On Wed, Jun 5, 2013 at 10:46 AM, Andrew Dunstan and...@dunslane.net wrote:

In 9.2, the JSON parser didn't check the validity of the use of unicode escapes other than that it required 4 hex digits to follow '\u'. In 9.3, that is still the case. However, the JSON accessor functions and operators also try to turn JSON strings into text in the server encoding, and this includes de-escaping \u sequences. This works fine except when there is a pair of sequences representing a UTF-16 type surrogate pair, something that is explicitly permitted in the JSON spec.

The attached patch is an attempt to remedy that, and a surrogate pair is turned into the correct code point before converting it to whatever the server encoding is.

Note that this would mean we can still put JSON with incorrect use of surrogates into the database, as now (9.2 and later), and they will cause almost all the accessor functions to raise an error, as now (9.3). All this does is allow JSON that uses surrogates correctly not to fail when applying the accessor functions and operators. That's a possible violation of POLA, and at least worthy of a note in the docs, but I'm not sure what else we can do now - adding this check to the input lexer would possibly cause restores to fail, which users might not thank us for.

I think the approach you've proposed here is a good one.

I did that, but it's evident from the buildfarm that there's more work to do. The problem is that we do the de-escaping as we lex the json to construct the look ahead token, and at that stage we don't know whether or not it's really going to be needed. That means we can cause errors to be raised in far too many places. It's failing on this line:

    converted = pg_any_to_server(utf8str, utf8len, PG_UTF8);

even though the operator in use (->) doesn't even use the de-escaped value. The real solution is going to be to delay the de-escaping of the string until it is known to be wanted.
That's unfortunately going to be a bit invasive, but I can't see a better solution. I'll work on it ASAP. Getting it to work well without a small API change might be pretty hard, though.

cheers

andrew
Re: [HACKERS] postgres_fdw regression tests order dependency
Andrew Dunstan and...@dunslane.net writes: It looks like the postgres_fdw's regression tests expect data back from the following statement in a given order, which presumably isn't guaranteed: UPDATE ft2 SET c2 = c2 + 600 WHERE c1 % 10 = 8 RETURNING *; See http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=frogmouthdt=2013-06-08%2018%3A30%3A00

I think what happened here is that autovacuum ran while the test was in process, and freed up some space near the start of the table that was then used by the two INSERTs just above this. I was able to duplicate that diff by adding a VACUUM "S 1"."T 1" command just ahead of the INSERTs.

Maybe we need to wrap this in a CTE and then order the results for consistency?

In principle, we'd have to do that to every update in the test, which doesn't seem either painless or conducive to thorough testing. What I'm a bit inclined to do instead is to modify the test case so that the rows inserted by the two INSERTs aren't touched by this UPDATE. It's probably easier to guarantee that no rows are updated twice in this test sequence than to make the query results totally order-independent.

I'm going to go experiment with adding more VACUUMs to the test script just to see if any other results change ...

regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/10/2013 06:39 AM, MauMau wrote: The problem is that the reliability of the database system decreases with more disks, because failure of any one of those disks would result in a database PANIC shutdown.

More specifically, with more independent sets of disks / file systems.

I'd rather like to be able to recover from this by treating the tablespace as dead, so any attempt to get a lock on any table within it fails with an error and already-in-WAL writes to it just get discarded. It's the sort of thing that'd only be reasonable to do as a recovery option (like zero_damaged_pages) since if applied by default it'd lead to potentially severe and unexpected data loss.

I'm in favor of taking a tablespace offline when I/O failure is encountered, and continue running the database server. But WAL must not be discarded because committed transactions must be preserved for durability of ACID. [snip] WAL is not affected by the offlining of tablespaces. WAL records already written on the WAL buffer will be written to pg_xlog/ and archived as usual. Those WAL records will be used to recover committed transactions during archive recovery.

(I'm still learning the details of Pg's WAL, WAL replay and recovery, so the below's just my understanding):

The problem is that WAL for all tablespaces is mixed together in the archives. If you lose your tablespace then you have to keep *all* WAL around and replay *all* of it again when the tablespace comes back online. This would be very inefficient, and would require a lot of tricks to cope with applying WAL to a database that has an on-disk state in the future as far as the archives are concerned. It's not as simple as just replaying all WAL all over again - as I understand it, things like CLUSTER or TRUNCATE will result in relfilenodes not being where they're expected to be as far as old WAL archives are concerned.
Selective replay would be required, and that leaves the door open to all sorts of new and exciting bugs in areas that'd hardly ever get tested.

To solve the massive disk space explosion problem I imagine we'd have to have per-tablespace WAL. That'd cause a *huge* increase in fsync costs and loss of the rather nice property that WAL writes are nice sequential writes. It'd be complicated and probably cause nightmares during recovery, for archive-based replication, etc.

The only other thing I can think of is: when a tablespace is offline, write WAL records to a separate tablespace recovery log as they're encountered. Replay this log when the tablespace is restored, before applying any other new WAL to the tablespace. This wouldn't affect archive-based recovery since it'd already have the records from the original WAL.

None of these options seem exactly simple or pretty, especially given the additional complexities that'd be involved in allowing WAL records to be applied out-of-order, something that AFAIK _never_ happens at the moment.

The key problem, of course, is that this all sounds like a lot of complicated work for a case that's not really supposed to happen. Right now, the answer is your database is unrecoverable, switch to your streaming warm standby and re-seed it from the standby. Not pretty, but at least there's the option of using a sync standby and avoiding data loss.

How would you approach this?

--
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Revisit items marked 'NO' in sql_features.txt
Hi,

While reviewing sql_features.txt, found a few items marked NO ('Not supported') whereas, at the outset, they seemed to be supported. Apologies, if this is already considered and / or still marked 'NO' for a reason, but a list of such items mentioned below:

F202	TRUNCATE TABLE: identity column restart option	NO
F263	Comma-separated predicates in simple CASE expression	NO
T041	Basic LOB data type support	01	BLOB data type	NO
T051	Row types	NO
T174	Identity columns	NO
T176	Sequence generator support	NO
T177	Sequence generator support: simple restart option	NO
T522	Default values for IN parameters of SQL-invoked procedures	NO
T571	Array-returning external SQL-invoked functions	NO

--
Robins Tharakan
Re: [HACKERS] JSON and unicode surrogate pairs
Andrew Dunstan and...@dunslane.net writes: I did that, but it's evident from the buildfarm that there's more work to do. The problem is that we do the de-escaping as we lex the json to construct the look ahead token, and at that stage we don't know whether or not it's really going to be needed. That means we can cause errors to be raised in far too many places. It's failing on this line: converted = pg_any_to_server(utf8str, utf8len, PG_UTF8); even though the operator in use (->) doesn't even use the de-escaped value. The real solution is going to be to delay the de-escaping of the string until it is known to be wanted. That's unfortunately going to be a bit invasive, but I can't see a better solution. I'll work on it ASAP.

Not sure that this idea isn't a dead end. IIUC, you're proposing to jump through hoops in order to avoid complaining about illegal JSON data, essentially just for backwards compatibility with 9.2's failure to complain about it. If we switch over to a pre-parsed (binary) storage format for JSON values, won't we be forced to throw these errors anyway? If so, maybe we should just take the compatibility hit now while there's still a relatively small amount of stored JSON data in the wild.

regards, tom lane
[HACKERS] Server side lo-funcs name
Recently we got a complaint about server side large object function names described in the doc: http://www.postgresql.org/message-id/51b2413f.8010...@gmail.com

In the doc: http://www.postgresql.org/docs/9.3/static/lo-funcs.html

There are server-side functions callable from SQL that correspond to each of the client-side functions described above; indeed, for the most part the client-side functions are simply interfaces to the equivalent server-side functions

From the description it is hard for users to find out the server side functions loread and lowrite because they are looking for lo_read and lo_write. So I think his complaint is fair. Included patches attempt to fix the problem.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

diff --git a/doc/src/sgml/lobj.sgml b/doc/src/sgml/lobj.sgml
index db5bc10..6dbc84c 100644
--- a/doc/src/sgml/lobj.sgml
+++ b/doc/src/sgml/lobj.sgml
@@ -527,10 +527,14 @@ int lo_unlink(PGconn *conn, Oid lobjId);
  <para>
   There are server-side functions callable from SQL that correspond to
-  each of the client-side functions described above; indeed, for the
-  most part the client-side functions are simply interfaces to the
-  equivalent server-side functions.  The ones that are actually useful
-  to call via SQL commands are
+  each of the client-side functions described above(please note
+  that the server side function
+  for <function>lo_read</function> is <function>loread</function> and
+  the server side function for <function>lo_write</function>
+  is <function>lowrite</function>); indeed, for the most part the
+  client-side functions are simply interfaces to the equivalent
+  server-side functions.  The ones that are actually useful to call
+  via SQL commands are
   <function>lo_creat</function><indexterm><primary>lo_creat</></>,
   <function>lo_create</function><indexterm><primary>lo_create</></>,
   <function>lo_unlink</function><indexterm><primary>lo_unlink</></>,
Re: [HACKERS] pg_dump with postgis extension dumps rules separately
On 06/03/2013 07:57 PM, Tom Lane wrote: I'd have put the getRules call where getEventTriggers is now, or at least adjacent to getTriggers in one direction or the other. I'm not sure there is anything functionally wrong with what you have here; but IMO rules and table-level triggers are pretty much the same kind of critters so far as pg_dump is concerned, to wit, they're table properties not free-standing objects (which is exactly the point of this change). So it seems to me to make sense to process them together.

OK, done this way and committed.

BTW, don't forget that the getRules move needs to be back-patched.

This too.

Thanks,

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, 24x7 Support
Re: [HACKERS] Placing hints in line pointers
On Sat, 2013-06-01 at 15:45 +0100, Simon Riggs wrote: Doing this would have two purposes: * We wouldn't need to follow the pointer if the row is marked aborted. This would save a random memory access for that tuple

That's quite similar to LP_DEAD, right? You could potentially set this new hint sooner, which may be an advantage.

We wouldn't need to do a FPW when a hint changes, we would only need to take a copy of the ItemId array, which is much smaller. And it could be protected by its own checksum.

I like the direction of this idea. I don't see exactly how you want to make this safe, though. During replay, we can't just replace the ItemId array, because the other contents on the page might be newer (e.g. some tuples may have been cleaned out, leaving the item pointers pointing to the wrong place).

(In addition, if we wanted, this could be used to extend block size to 64kB if we used 8-byte alignment for tuples)

I think that supporting 64KB blocks has merit. Interesting observation about the extra bits available. That would be an on-disk format change, so perhaps this is something to add to the list of things waiting for a format change?

Regards,
Jeff Davis
[HACKERS] pg_filedump 9.3: checksums (and a few other fixes)
Attached is a first draft of an update to pg_filedump for 9.3. I know pg_filedump is a pgfoundry project, but that seems like it's just there to host the download; so please excuse the slightly off-topic post here on -hackers.

I made a few changes to support 9.3, which were mostly fixes related to two things:

* new htup_details.h and changes related to FK concurrency improvements
* XLogRecPtr is now a uint64

And, of course, I added support for checksums. They are always displayed and calculated, but it only throws an error if you pass -k. Only the user knows whether checksums are enabled, because we removed page-level bits indicating the presence of a checksum.

The patch is a bit ugly: I had to copy some code, and copy the entire checksum.c file (minus some Asserts, which don't work in an external program). Suggestions welcome.

Regards,
Jeff Davis

diff -Nc pg_filedump-9.2.0/checksum.c pg_filedump-9.3.0j/checksum.c
*** pg_filedump-9.2.0/checksum.c	1969-12-31 16:00:00.0 -0800
--- pg_filedump-9.3.0j/checksum.c	2013-06-09 21:20:34.036176831 -0700
***************
*** 0 ****
--- 1,157 ----
+ /*-------------------------------------------------------------------------
+  *
+  * checksum.c
+  *	  Checksum implementation for data pages.
+  *
+  * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/storage/page/checksum.c
+  *
+  *-------------------------------------------------------------------------
+  *
+  * Checksum algorithm
+  *
+  * The algorithm used to checksum pages is chosen for very fast calculation.
+  * Workloads where the database working set fits into OS file cache but not
+  * into shared buffers can read in pages at a very fast pace and the checksum
+  * algorithm itself can become the largest bottleneck.
+  *
+  * The checksum algorithm itself is based on the FNV-1a hash (FNV is shorthand
+  * for Fowler/Noll/Vo) The primitive of a plain FNV-1a hash folds in data 1
+  * byte at a time according to the formula:
+  *
+  *     hash = (hash ^ value) * FNV_PRIME
+  *
+  * FNV-1a algorithm is described at http://www.isthe.com/chongo/tech/comp/fnv/
+  *
+  * PostgreSQL doesn't use FNV-1a hash directly because it has bad mixing of
+  * high bits - high order bits in input data only affect high order bits in
+  * output data. To resolve this we xor in the value prior to multiplication
+  * shifted right by 17 bits. The number 17 was chosen because it doesn't
+  * have common denominator with set bit positions in FNV_PRIME and empirically
+  * provides the fastest mixing for high order bits of final iterations quickly
+  * avalanche into lower positions. For performance reasons we choose to combine
+  * 4 bytes at a time. The actual hash formula used as the basis is:
+  *
+  *     hash = (hash ^ value) * FNV_PRIME ^ ((hash ^ value) >> 17)
+  *
+  * The main bottleneck in this calculation is the multiplication latency. To
+  * hide the latency and to make use of SIMD parallelism multiple hash values
+  * are calculated in parallel. The page is treated as a 32 column two
+  * dimensional array of 32 bit values. Each column is aggregated separately
+  * into a partial checksum. Each partial checksum uses a different initial
+  * value (offset basis in FNV terminology). The initial values actually used
+  * were chosen randomly, as the values themselves don't matter as much as that
+  * they are different and don't match anything in real data. After initializing
+  * partial checksums each value in the column is aggregated according to the
+  * above formula. Finally two more iterations of the formula are performed with
+  * value 0 to mix the bits of the last value added.
+  *
+  * The partial checksums are then folded together using xor to form a single
+  * 32-bit checksum. The caller can safely reduce the value to 16 bits
+  * using modulo 2^16-1. That will cause a very slight bias towards lower
+  * values but this is not significant for the performance of the
+  * checksum.
+  *
+  * The algorithm choice was based on what instructions are available in SIMD
+  * instruction sets. This meant that a fast and good algorithm needed to use
+  * multiplication as the main mixing operator. The simplest multiplication
+  * based checksum primitive is the one used by FNV. The prime used is chosen
+  * for good dispersion of values. It has no known simple patterns that result
+  * in collisions. Test of 5-bit differentials of the primitive over 64bit keys
+  * reveals no differentials with 3 or more values out of 10 random keys
+  * colliding. Avalanche test shows that only high order bits of the last word
+  * have a bias. Tests of 1-4 uncorrelated bit errors, stray 0 and 0xFF bytes,
+  * overwriting page from random position to end with 0 bytes, and overwriting
+  * random segments of page with 0x00, 0xFF and random data all show optimal
+  * 2e-16
Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)
Jeff Davis wrote:

--- 1000,1015 ----
      strcat (flagString, "HASEXTERNAL|");
    if (infoMask & HEAP_HASOID)
      strcat (flagString, "HASOID|");
+   if (infoMask & HEAP_XMAX_KEYSHR_LOCK)
+     strcat (flagString, "XMAX_KEYSHR_LOCK|");
    if (infoMask & HEAP_COMBOCID)
      strcat (flagString, "COMBOCID|");
    if (infoMask & HEAP_XMAX_EXCL_LOCK)
      strcat (flagString, "XMAX_EXCL_LOCK|");
!   if (infoMask & HEAP_XMAX_SHR_LOCK)
!     strcat (flagString, "XMAX_SHR_LOCK|");
!   if (infoMask & HEAP_XMAX_LOCK_ONLY)
!     strcat (flagString, "XMAX_LOCK_ONLY|");
    if (infoMask & HEAP_XMIN_COMMITTED)
      strcat (flagString, "XMIN_COMMITTED|");
    if (infoMask & HEAP_XMIN_INVALID)

Hm, note that XMAX_SHR_LOCK is two bits, so when that flag is present you will get the three lock modes displayed with the above code, which is probably going to be misleading. htup_details.h does this:

/*
 * Use these to test whether a particular lock is applied to a tuple
 */
#define HEAP_XMAX_IS_SHR_LOCKED(infomask) \
	(((infomask) & HEAP_LOCK_MASK) == HEAP_XMAX_SHR_LOCK)
#define HEAP_XMAX_IS_EXCL_LOCKED(infomask) \
	(((infomask) & HEAP_LOCK_MASK) == HEAP_XMAX_EXCL_LOCK)
#define HEAP_XMAX_IS_KEYSHR_LOCKED(infomask) \
	(((infomask) & HEAP_LOCK_MASK) == HEAP_XMAX_KEYSHR_LOCK)

Presumably it'd be better to do something similar.

--
Álvaro Herrera
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services