Re: [HACKERS] GSoC - proposal - Materialized Views in PostgreSQL
On 11.04.10 20:47, Robert Haas wrote:
>>>>> Heikki Linnakangas wrote:
>>>>> 1. Keep the materialized view up-to-date when the base tables change. This can be further divided into many steps: you can begin by supporting automatic updates only on very simple views with e.g. a single table and a WHERE clause. Then extend that to support joins, aggregates, subqueries etc. Keeping it really limited, you could even require the user to write the required triggers himself.
>>>> 2010/4/10 Andrew Dunstan <and...@dunslane.net>:
>>>> That last bit doesn't strike me as much of an advance. Isn't the whole point of this to automate it? Creating greedy materialized views is usually not terribly difficult now, but you do have to write the triggers.
>>> Robert Haas wrote:
>>> Yeah, I agree. It doesn't accomplish anything interesting on its own.
>> On Sun, Apr 11, 2010 at 10:26 AM, Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
>> But if you do the planner changes to automatically use the materialized view to satisfy queries (item 2. in my previous email), it's useful.
> But you can't do that with a snapshot view, only a continuously updated one.

If continuous updates prove to be too hard initially, you could instead update the view on SELECT if it's outdated. Such a materialized view would be a kind of inter-session cache for subselects.

The hard part would probably be to figure out how to decide whether the view is outdated or not, and to deal with two concurrent transactions trying to use an outdated view (and both trying to refresh it). What makes the second problem hard is that you wouldn't want one of the transactions to wait for the other to complete, because this is not how SELECTs traditionally behave.

best regards,
Florian Pflug
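To illustrate the kind of hand-written trigger maintenance Andrew alludes to, here is a minimal sketch. All table and column names are made up, and ON CONFLICT is a much later PostgreSQL addition used here only for brevity - at the time of this thread one would have written an UPDATE-then-INSERT instead:

```sql
-- Hypothetical example: keeping a per-customer aggregate "view" fresh by hand.
CREATE TABLE orders (customer_id int, amount numeric);
CREATE TABLE order_totals (customer_id int PRIMARY KEY, total numeric NOT NULL);

CREATE FUNCTION order_totals_sync() RETURNS trigger AS $body$
BEGIN
    IF TG_OP IN ('UPDATE', 'DELETE') THEN
        -- back out the old row's contribution
        UPDATE order_totals SET total = total - OLD.amount
        WHERE customer_id = OLD.customer_id;
    END IF;
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        -- add the new row's contribution
        INSERT INTO order_totals (customer_id, total)
        VALUES (NEW.customer_id, NEW.amount)
        ON CONFLICT (customer_id)
        DO UPDATE SET total = order_totals.total + EXCLUDED.total;
    END IF;
    RETURN NULL;
END;
$body$ LANGUAGE plpgsql;

CREATE TRIGGER orders_sync AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE PROCEDURE order_totals_sync();
```

Writing (and getting right) one such function per materialized view is exactly the boilerplate that automatic matview maintenance would remove.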
Re: [HACKERS] Serializable implementation
On 28.12.09 18:54, Kevin Grittner wrote:
> To give some idea of the scope of development, Michael Cahill added SSI to InnoDB by modifying 250 lines of code and adding 450 lines of code; however, InnoDB already had the S2PL option and the prototype implementation isn't as sophisticated as I feel is necessary for real production use (particularly regarding the granularity of SIREAD locks). I'm assuming it would take more to reach real production quality in PostgreSQL. My SWAG would be to multiply by two or three.

I believe the hard part of implementing true serializability is not the actual SSI or S2PL algorithm, but rather the necessary predicate locking strategy. So I think checking how InnoDB tackles that, and how much of its code is involved, might give a more realistic estimate of the effort required.

best regards,
Florian Pflug
[HACKERS] [WIP] Inspection of row types in pl/pgsql and pl/sql
Hi,

I've completed a (first) working version of an extension that allows easier introspection of composite types from SQL and pl/pgsql. The original proposal and ensuing discussion can be found here: http://archives.postgresql.org/pgsql-hackers/2009-11/msg00695.php

The extension can be found at: http://github.com/fgp/pg_record_inspect

This is what the extension currently provides (all in schema record_inspect):

* fieldinfo [composite type]
  Used by fieldinfos() to describe a record's fields. Contains the fields fieldname (name), fieldtype (regclass), fieldtypemod (varchar).

* fieldinfo[] fieldinfos(record)
  Returns an array of fieldinfos describing the record's fields.

* anyelement fieldvalue(record, field name, defval anyelement, coerce boolean)
  Returns the value of the field "field", or defval should the value be null. If coerce is true, the value is coerced to defval's type if possible; otherwise an error is raised if the field's type and defval's type differ.

* anyelement fieldvalues(record, defval anyelement, coerce boolean)
  Returns an array containing the values of the record's fields. NULL values are replaced by defval. If coerce is false, only the fields with the same type as defval are considered. Otherwise, the fields' values are coerced if possible, or an error is raised if not.

The most hacky part of the code is probably coerceDatum() - needed to coerce a field's value to the requested output type. I wanted to avoid creating and parsing an actual SQL statement for every cast, and instead chose to use coerce_to_target_type() to create the expression trees representing casts. I use the node type CoerceToDomainValue to inject the source value into the cast plan upon execution - see makeCastPlan() and execCastPlan() for details. If anyone has a better idea, please speak up.

I personally would like to see this becoming a contrib module one day, but that of course depends on how much interest there is in such a feature.
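A hypothetical usage session, going by the signatures above (the table and column names are made up, and the exact call syntax may differ from the actual extension):

```sql
-- Inspect the fields of an ad-hoc record:
SELECT (f).fieldname, (f).fieldtype
FROM unnest(record_inspect.fieldinfos(ROW(1, 'foo'::text))) AS f;

-- Inside a plpgsql trigger function: fetch NEW's hypothetical "id"
-- field as an int, coercing if necessary:
--   v_id := record_inspect.fieldvalue(NEW, 'id', NULL::int, true);
```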
best regards,
Florian Pflug
[HACKERS] Compiling HEAD with -Werror in 64-bit mode
Hi,

HEAD fails to compile in 64-bit mode on Mac OS X 10.6 with gcc 4.2 and -Werror. What happens is that INT64_FORMAT gets defined as %ld (which is correct - long and unsigned long are 64 bits wide on x86_64), but the check for a working 64-bit int fails, causing INT64_IS_BUSTED to get defined and int64 to actually become a 32-bit type. This in turn causes warnings when these pseudo int64s are passed to printf with the format specifier INT64_FORMAT, which get turned into errors by -Werror.

configure fails to recognize long as a working 64-bit type because the does_int64_work configure test produces warnings due to a missing return value declaration for main() and a missing prototype for does_int64_work(). (Again, those warnings are turned into errors by -Werror.)

I use the following envvar settings (when running ./configure) to force 64-bit mode and -Werror:

CC=gcc-4.2
CFLAGS="-arch x86_64 -Werror"
LDFLAGS="-arch x86_64"

The following patch fixed the problem for me - though I didn't yet try it on any platform other than Mac OS X 10.6 with gcc 4.2 in 64-bit mode.

--
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 9ac2c30..c6bd523 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -35,7 +35,7 @@ AC_CACHE_CHECK([whether $1 is 64 bits], [Ac_cachevar],
 ac_int64 a = 2001;
 ac_int64 b = 4005;
-int does_int64_work()
+static int does_int64_work()
 {
   ac_int64 c,d;
@@ -49,8 +49,8 @@ int does_int64_work()
     return 0;
   return 1;
 }
-main() {
-  exit(! does_int64_work());
+int main() {
+  return(! does_int64_work());
 }],
 [Ac_cachevar=yes],
 [Ac_cachevar=no],
--

best regards,
Florian Pflug
Re: [HACKERS] Compiling HEAD with -Werror in 64-bit mode
On 15.12.09 16:02, Tom Lane wrote:
> Florian G. Pflug <f...@phlo.org> writes:
>> configure fails to recognize long as a working 64-bit type because the does_int64_work configure test produces warnings due to a missing return value declaration for main() and a missing prototype for does_int64_work(). (Again, those warnings are turned into errors by -Werror.)
> autoconf's test programs tend to be sufficiently sloppy that I would expect -Werror to break a whole lot of things, not just this. We can possibly neaten up the particular test case but there are many tests whose expansion we don't have much control over.

Yeah, I expected all hell to break loose - only to be pleasantly surprised by this being the only issue I encountered. So I figured fixing this might be worthwhile - even if this surely does not fix -Werror builds on all platforms and/or compilers.

Alternatively - is there a way to use -Werror only for building the actual sources, not the configure tests? I didn't find one, but my autoconf-fu is pretty limited...

best regards,
Florian Pflug
Re: [HACKERS] Range types
On 15.12.09 15:52, Tom Lane wrote:
> to...@tuxteam.de writes:
>> (and as Andrew Dunstan pointed out off-list: I was wrong with my bold assertion that one can squeeze infinitely many (arbitrary length) strings between two given. This is not always the case).
> Really? If the string length is unbounded I think you were right.

One example is "a" and "aa" (assuming "a" is the minimal character in your alphabet). The general case is the strings "A" and "Aaa...a" I think - it doesn't get any more exciting than this.

This *is* a bit surprising, since one usually assumes that the ordering of strings and reals is fairly similar, since both are lexical. But note that the mapping of strings into the reals this intuition is based on (simply prefix the string with "0." and interpret it as a real, or something similar if the alphabet isn't {0,1}) isn't one-to-one - the strings 1, 10, 100, ... are all mapped to the *same* real number 0.1. So for the reals, the statement is reduced to the trivial fact that for every x there is no y with x < y < x. Which is of course true...

best regards,
Florian Pflug
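The "a"/"aa" claim is easy to check by brute force; a small Python illustration (not part of the thread):

```python
# Brute-force check: over the alphabet {'a','b'} (with 'a' minimal),
# no string of length <= 4 sorts strictly between "a" and "aa",
# while many strings sort strictly between e.g. "a" and "b".
from itertools import product

def all_strings(alphabet, maxlen):
    for n in range(1, maxlen + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

between_a_aa = [s for s in all_strings("ab", 4) if "a" < s < "aa"]
between_a_b = [s for s in all_strings("ab", 4) if "a" < s < "b"]

print(between_a_aa)      # nothing fits between "a" and "aa"
print(len(between_a_b))  # but plenty fit between "a" and "b"
```

Any s with "a" < s < "aa" would have to start with "a" and continue with a nonempty suffix that sorts below "a" - impossible when "a" is the minimal character.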
Re: [HACKERS] Compiling HEAD with -Werror in 64-bit mode
On 15.12.09 23:38, Tom Lane wrote:
> Peter Eisentraut <pete...@gmx.net> writes:
>> So to summarize, this is just a bad idea. Creating a less obscure way to use -Werror might be worthwhile, though.
> I suppose we could add --with-Werror but it seems pretty specialized to me. A more appropriate solution would allow the user to provide flags that get added to CFLAGS only after we do all the configure tests (implying that it's on the user's head that these flags are right and don't break anything, but then again that's pretty much true of up-front CFLAGS too). And that basically describes COPT ... the only thing lacking is documentation.

For what it's worth, I agree. Though we might want to arrange for configure to store the value of COPT somewhere, so that "COPT=-Werror ./configure && make" works, which it currently doesn't seem to.

best regards,
Florian Pflug
Re: [HACKERS] Application name patch - v4
Tom Lane wrote:
>> One possibility would be to make it possible to issue SETs that behave as if set in a startup packet - imho it's an implementation detail that SET currently is used.
> I think there's a good deal of merit in this, and it wouldn't be hard at all to implement, seeing that we already have SET LOCAL and SET SESSION. We could add a third keyword, say SET DEFAULT, that would have the behavior of setting the value in a fashion that would persist across resets. I'm not sure that DEFAULT is exactly le mot juste here, but agreeing on a keyword would probably be the hardest part of making it happen.

Hm, but without a way to prevent the users of a connection pool from issuing SET DEFAULT, that leaves a connection pool with no way to revert a connection to a known state.

How about SET CONNECTION, with an additional GUC called connection_setup which can only be set to true, never back to false? Once connection_setup is set to true, further SET CONNECTION attempts would fail. In a way, this mimics startup-packet SETs without actually doing things in the startup packet.

best regards,
Florian Pflug
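To illustrate the proposal, a pool's setup phase might look like this. All of the syntax below is hypothetical - neither SET CONNECTION nor a connection_setup GUC exist:

```sql
-- Hypothetical syntax, sketching the proposal above.
SET CONNECTION application_name = 'pool_backend_7';
SET CONNECTION statement_timeout = '30s';
SET connection_setup = true;  -- one-way switch; from here on, further
                              -- SET CONNECTION attempts would fail
-- The connection is now handed to pool clients; a later RESET ALL or
-- DISCARD ALL would revert to the SET CONNECTION values, not to the
-- cluster-wide defaults.
```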
Re: [HACKERS] File IO - why does PG do things in pages?
Dan Eloff wrote:
> At the lower levels in PG, reading from the disk into cache, and writing from the cache to the disk is always done in pages. Why does PG work this way? Is it any slower to write whole pages rather than just the region of the page that changed? Conversely, is it faster? From what I think I know of operating systems, reading should bring the whole page into the OS buffers anyway, so reading the whole page instead of just part of it isn't much more expensive. Perhaps writing works similarly?

First, data fetched from the disk is (except for data in temporary tables, I believe) not stored in private memory of the backend process doing the read(), but instead in a shared memory segment accessible by all backend processes. This allows two different backend processes to modify the data concurrently without stepping on each other's toes. Note that immediately writing back any changes is *not* an option, since WAL logging mandates that all changes go to the WAL *first*. Hence, if you were to write out each changed tuple immediately, you'd have to first write the changes to the WAL *and* fsync the WAL to guarantee they hit the disk first.

Sharing the data between backend processes requires a fair amount of infrastructure. You need a way to locate a given chunk of on-disk data in the shared memory buffer cache, and to be able to acquire and release locks on those buffers to prevent two backends from wreaking havoc when they try to update the same piece of information. Organizing data in fixed-size chunks (which is what pages are) helps keep the complexity of that infrastructure manageable, and the overhead reasonably low. There are also things like tracking the free space in a data file, which also gets easier if you only have to track it page-wise ("Is there free space on this page or not?"), instead of having to track arbitrary ranges of free space.
Finally, since data writes happen in units of blocks (and not bytes), you need to guarantee that you do your I/O in some multiple of that unit anyway; otherwise you'd have a very hard time guaranteeing data consistency after a crash. Google for "torn page writes", that should give you more details about this problem.

Note, however, that a postgres page (usually 8K) is usually larger than the filesystem's block size (usually 512 bytes), so always reading in full pages induces *some* I/O overhead. Just not that much - especially since the blocks comprising a page are extremely likely to be arranged consecutively on disk, so there is no extra seeking involved.

These, at least, are what I believe to be the main reasons for doing things in units of pages - hope this helps at least somewhat.

best regards,
Florian Pflug
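The "locate a chunk of on-disk data in the buffer cache" part boils down to something like the following toy sketch (illustrative Python, not PostgreSQL code; the real cache lives in shared memory and each buffer carries locks and a dirty flag):

```python
# Toy sketch of a page-granular buffer cache: all I/O happens in
# fixed-size pages, located by a (file name, page number) key.
PAGE_SIZE = 8192  # postgres' usual page size

class BufferCache:
    def __init__(self):
        self.buffers = {}  # (filename, pageno) -> bytearray of PAGE_SIZE

    def read_page(self, f, filename, pageno):
        key = (filename, pageno)
        if key not in self.buffers:
            # Cache miss: read the whole page, never just a byte range.
            f.seek(pageno * PAGE_SIZE)
            data = bytearray(f.read(PAGE_SIZE))
            data.extend(b"\x00" * (PAGE_SIZE - len(data)))  # zero-fill short reads
            self.buffers[key] = data
        return self.buffers[key]

    def write_page(self, f, filename, pageno):
        # Write back the full page, keeping on-disk I/O page-aligned.
        f.seek(pageno * PAGE_SIZE)
        f.write(self.buffers[(filename, pageno)])
```

The fixed page size is what makes the lookup key, the locking granularity, and the on-disk layout all line up.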
[HACKERS] ALTER TABLE, find_composite_type_dependencies and locking
Hi,

I'm currently investigating how much work it'd be to implement arrays of domains, since I have a client who might be interested in sponsoring that work.

The comments around the code handling ALTER DOMAIN ADD CONSTRAINT are pretty clear about the lack of proper locking in that code - altering a domain while simultaneously adding a column with that domain as its type might result in inconsistencies between the data in that column and the domain's constraints after both transactions committed.

I do, however, suspect that ALTER TABLE is plagued by similar problems. Currently, during the rewrite phase of ALTER TABLE, find_composite_type_dependencies is used to verify that the table's row type (or any type directly or indirectly depending on that type) is not used as a column's type anywhere in the database. But since this code does not take any permanent locks on the visited types, it seems that adding such a column concurrently is not prevented. If the original ALTER TABLE changed a column's type, data inserted into the newly added column before the original ALTER TABLE committed will have a type different from what the catalog says after the original ALTER TABLE commits. Or at least so I think - I haven't yet tested that theory...

I am aware that since a commit fest is currently running, now might not be the best time to bring up this topic. Since I feared forgetting this altogether, I decided to still post now, though. I figured people still have the option to ignore this for now if they're busy with getting those patches committed.

best regards,
Florian Pflug
Re: [HACKERS] ALTER TABLE, find_composite_type_dependencies and locking (Confirmed)
Florian G. Pflug wrote:
> I do, however, suspect that ALTER TABLE is plagued by similar problems. Currently, during the rewrite phase of ALTER TABLE, find_composite_type_dependencies is used to verify that the table's row type (or any type directly or indirectly depending on that type) is not used as a column's type anywhere in the database. But since this code does not take any permanent locks on the visited types, it seems that adding such a column concurrently is not prevented. If the original ALTER TABLE changed a column's type, data inserted into the newly added column before the original ALTER TABLE committed will have a type different from what the catalog says after the original ALTER TABLE commits. Or at least so I think - I haven't yet tested that theory...

I was able to confirm that this is an actual bug in 8.5. I did, however, need to use an array-of-composite type. With only nested composite types, it seems that CheckAttributeType() protects against the race, because it follows the dependency chain and opens each type's relation in AccessShareLock mode. This blocks once the traversal hits the type which is being altered, hence forcing the table creation to wait for the concurrent ALTER TABLE to complete.
Create two types in session 1:
  session 1: create table t1 (t1_i int);
  session 1: create type t2 as (t2_t1 t1);

Warm the type cache in session 2 (a simple "select array[row(row(-1))::t2]" would probably suffice):
  session 2: create table bug (bug_t2s t2[]);
  session 2: insert into bug (bug_t2s) values (array[row(row(-1))::t2]);
  session 2: select bug.bug_t2s[1].t2_t1.t1_i from bug;
  [select correctly returns one row containing -1]
  session 2: drop table bug;

Alter the type of t1_i in session 1:
  session 1: alter table t1 alter column t1_i type varchar;
  [Pause session 1 using gdb *right* after the call to find_composite_type_dependencies in ATRewriteTable returned]

Create the bug table in session 2, and insert a record:
  session 2: create table bug (bug_t2s t2[]);
  session 2: insert into bug (bug_t2s) values (array[row(row(-1))::t2]);
  session 2: select bug.bug_t2s[1].t2_t1.t1_i from bug;
  [select correctly returns one row containing -1]

Complete the ALTER TABLE in session 1:
  [Resume session 1 using gdb]
  session 1: select bug.bug_t2s[1].t2_t1.t1_i from bug;
  [select returns a bogus string. On my 8.5 debug+cassert build, it's a long chain of \x7F\x7F\x7F\x...]

Don't have any good idea how to fix this yet. If CheckAttributeType() really *does* offer sufficient protection in the non-array case, extending that to the general case might work. But OTOH it might equally well be that a more sophisticated race exists even in the non-array case, and I simply didn't manage to trigger it...

best regards,
Florian Pflug
Re: [HACKERS] DEFAULT of domain ignored in plpgsql (8.4.1)
Tom Lane wrote:
> Josh Berkus <j...@agliodbs.com> writes:
>> (2) this change, while very useful, does change what had been a simple rule ("All variables are NULL unless specifically set otherwise") into a conditional one ("All variables are NULL unless set otherwise OR unless they are declared as domain types with defaults"). Do people feel that the new behavior would be sufficiently intuitive to avoid user confusion?
> I'm inclined to leave it alone. It complicates the mental model, and frankly attaching defaults to domains was not one of the SQL committee's better ideas anyway. It's *fundamentally* non-orthogonal.

I've always thought of domains as being a kind of subtype of their base type. In this picture, DEFAULTs for domains correspond to overriding the default constructor of the type (thinking C++ now), and seem like a natural thing to have. But maybe that's more a C++ programmer's than a database designer's point of view...

I've just checked how row types behave, and while the "set to null unless specifically set otherwise" rule kind of holds for them, their NULL value seems to be special-cased enough to blur the line quite a bit:

create or replace function myt() returns t as $body$
declare
  r t;
begin
  raise notice 'r: %, r is null: %', r, (r is null);
  return r;
end;
$body$ language plpgsql immutable;

select myt(), myt() is null;

gives:

NOTICE: r: (,), r is null: t
NOTICE: r: (,), r is null: t
 myt | ?column?
-----+----------
 (,) | f

Strange I think...
And at least half of an exception to the simple "always null unless specifically set otherwise" rule.

It also seems that while domain DEFAULTs are ignored, the resulting (null-initialized) variable is still checked against the domain's constraints, including a potential NOT NULL constraint:

create domain myint as int not null;
create or replace function myint() returns myint as $body$
declare
  i myint;
begin
  return i;
end;
$body$ language plpgsql immutable;

raises:

ERROR: domain myint does not allow null values
CONTEXT: PL/pgSQL function "myint" line 3 during statement block local variable initialization

This has the potential to cause some headache I think, if you use domains to prohibit NULL values because they make no semantic sense for your application, and depend on DEFAULT to fill in some other value (like an empty string or an empty array).

best regards,
Florian Pflug
Re: [HACKERS] DEFAULT of domain ignored in plpgsql (8.4.1)
Gurjeet Singh wrote:
> On Sat, Nov 21, 2009 at 7:26 AM, Josh Berkus <j...@agliodbs.com> wrote:
>> However, there are some other issues to be resolved: (1) what should be the interaction of DEFAULT parameters and domains with defaults?
> The function's DEFAULT parameter should take precedence over the default of the domain.

I think Josh was pondering whether

create domain myint as int default 0;
create function f(i myint) ...;

should behave like

create function f(i myint default 0) ...;

and hence call f(0) if you do "select f();", or instead raise an error because no f with zero parameters is defined (as it does now).

I'd say no, because "no default" should be treated the same as "default null", so for consistency we'd then have to also support

create function g(i int) ...;
select g();

And of course throw an error if there was another function defined as

create function g() ...;

This way leads to madness...

If one really wanted to do that, there'd have to be an OPTIONAL clause for function parameters that works like DEFAULT, but doesn't take a default value and instead uses the type's default (NULL, except for domains with a DEFAULT clause). But I wouldn't go that far, personally...

best regards,
Florian Pflug
Re: [HACKERS] DEFAULT of domain ignored in plpgsql (8.4.1)
Robert Haas wrote:
> On Thu, Nov 19, 2009 at 9:06 PM, Florian G. Pflug <f...@phlo.org> wrote:
>> I've tried to create a patch, but didn't see how I'd convert the result from get_typedefault() (a Node*, presumably the parse tree corresponding to the default expression?) into a plan that I could store in a PLpgSQL_expr. I guess I'd need something like SPI_prepare_plan that takes a parse tree instead of a query string. Or am I on a completely wrong track there? While trying to cook up a patch I've also stumbled over what I perceive as a bug relating to DOMAINs and column DEFAULTs. I'll write that up in a second e-mail to avoid confusion.
> I suggest adding this to the open CommitFest (2010-01) at https://commitfest.postgresql.org/action/commitfest_view/open

Hm, but I don't (yet) have a patch to add...

best regards,
Florian Pflug
Re: [HACKERS] DEFAULT of domain ignored in plpgsql (8.4.1)
Tom Lane wrote:
> Florian G. Pflug <f...@phlo.org> writes:
>> It seems that pl/pgsql ignores the DEFAULT value of domains for local variables.
> The plpgsql documentation seems entirely clear on this: "The DEFAULT clause, if given, specifies the initial value assigned to the variable when the block is entered. If the DEFAULT clause is not given then the variable is initialized to the SQL null value."

Hm, must have missed that paragraph :-(. Sorry for that.

Would a patch that changes that have any chance of being accepted? Or is the gain (not having to repeat the DEFAULT clause, and being able to maintain it in one place instead of many) considered too small compared to the risk of breaking existing code?

best regards,
Florian Pflug
Re: [HACKERS] Listen / Notify - what to do when the queue is full
Heikki Linnakangas wrote:
> Joachim Wieland wrote:
>> On Thu, Nov 19, 2009 at 4:12 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> Yes, I have been thinking about that also.
>> So what should happen when you prepare a transaction that has sent a NOTIFY before? From the user's point of view, nothing should happen at prepare.
> At a quick glance, it doesn't seem hard to support 2PC. Messages should be put to the queue at prepare, as just before normal commit, but the backends won't see them until they see that the XID has committed.

Yeah, but if the server is restarted after the PREPARE but before the COMMIT, the notification will be lost, since all notification queue entries are lost upon restart with the slru design, no?

best regards,
Florian Pflug
Re: [HACKERS] Listen / Notify - what to do when the queue is full
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> writes:
>> A better approach is to do something similar to what we do now: at prepare, just store the notifications in the state file like we do already. In notify_twophase_postcommit(), copy the messages to the shared queue. Although it's the same approach we have now, it becomes a lot cleaner with the patch, because we're not piggybacking the messages on the backend-private queue of the current transaction, but sending the messages directly on behalf of the prepared transaction being committed.
> This is still ignoring the complaint: you are creating a clear risk that COMMIT PREPARED will fail. I'm not sure that it's really worth it, but one way this could be made safe would be for PREPARE to reserve the required amount of queue space, such that nobody else could use it during the window from PREPARE to COMMIT PREPARED.

I'd see no problem with COMMIT PREPARED failing, as long as it was possible to retry the COMMIT PREPARED at a later time. There surely are other failure cases for COMMIT PREPARED too, like an IO error that prevents the clog bit from being set, or a server crash half-way through COMMIT PREPARED.

best regards,
Florian Pflug
Re: [HACKERS] Listen / Notify - what to do when the queue is full
Tom Lane wrote:
> Florian G. Pflug <f...@phlo.org> writes:
>> Tom Lane wrote:
>>> This is still ignoring the complaint: you are creating a clear risk that COMMIT PREPARED will fail.
>> I'd see no problem with COMMIT PREPARED failing, as long as it was possible to retry the COMMIT PREPARED at a later time. There surely are other failure cases for COMMIT PREPARED too, like an IO error that prevents the clog bit from being set, or a server crash half-way through COMMIT PREPARED.
> Yes, there are failure cases that are outside our control. That's no excuse for creating one that's within our control.

True. On the other hand, people might prefer having to deal with (very unlikely) *transient* COMMIT PREPARED failures over not being able to use NOTIFY together with 2PC at all. Especially since any credible distributed transaction manager has to deal with COMMIT PREPARED failures anyway.

Just my $0.02, though.

best regards,
Florian Pflug
[HACKERS] DEFAULT of domain ignored in plpgsql (8.4.1)
Hi,

It seems that pl/pgsql ignores the DEFAULT value of domains for local variables. With the following definitions in place:

create domain myint as int default 0;
create or replace function myint() returns myint as $body$
declare
  v_result myint;
begin
  return v_result;
end;
$body$ language plpgsql immutable;

issuing "select myint();" returns NULL, not 0, on postgres 8.4.1. If the line "v_result myint;" is changed to "v_result myint default 0;" then 0 is returned as expected.

I've tried to create a patch, but didn't see how I'd convert the result from get_typedefault() (a Node*, presumably the parse tree corresponding to the default expression?) into a plan that I could store in a PLpgSQL_expr. I guess I'd need something like SPI_prepare_plan that takes a parse tree instead of a query string. Or am I on a completely wrong track there?

While trying to cook up a patch I've also stumbled over what I perceive as a bug relating to DOMAINs and column DEFAULTs. I'll write that up in a second e-mail to avoid confusion.

best regards,
Florian Pflug
[HACKERS] column DEFAULTs and prepared statements
Hi,

While trying to come up with a patch to handle domain DEFAULTs in plpgsql, I've stumbled across the following behavior regarding domain DEFAULTs and prepared statements.

session 1: create domain myint as int default 0;
session 1: create table mytable (i myint);
session 2: prepare ins as insert into mytable (i) values (default);
session 2: execute ins;
session 1: alter domain myint set default 1;
session 2: execute ins;

"select * from mytable" returns:

 i
---
 0
 0

while I'd have expected:

 i
---
 0
 1

After doing the same without using a domain:

session 1: create table mytable (i int default 0);
session 2: prepare ins as insert into mytable (i) values (default);
session 2: execute ins;
session 1: alter table mytable alter column i set default 1;
session 2: execute ins;

"select * from mytable" returns:

 i
---
 0
 1

As far as I understand the code, this happens because the dependency on the domain (for the default value) is not recorded in the plan cache entry. This would imply that the same error also occurs if the INSERT happens from a pl/pgsql function instead of a manually prepared statement, but I haven't tested that.

If someone gives me a general idea where to start, I could try to come up with a patch.

best regards,
Florian Pflug
Re: [HACKERS] Inspection of row types in pl/pgsql and pl/sql
Tom Lane wrote:
> Florian G. Pflug <f...@phlo.org> writes:
>> Tom Lane wrote:
>>> Trying to do this in plpgsql is doomed to failure and heartache,
>> Well, the proposed functions at least allow for some more flexibility in working with row types, given that you know in advance which types you will be dealing with (but not necessarily the precise ordering and number of the record's fields). They might feel a bit kludgy because of the anyelement dummy argument that bridges the gap between the statically typed nature of SQL and the rather dynamic RECORDs, but the kludgy-ness factor is still within reasonable limits I think.
> It sounds pretty d*mn klugy to me, and I stand by my comment that it isn't going to work anyway. When you try it you are going to run into "parameter type doesn't match that while preparing the plan" errors.

Ok, I must be missing something. I currently fail to see how my proposed

record_value(record, name, anyelement) returns anyelement

function differs (from the type system's point of view) from

value_from_string(text, anyelement) returns anyelement

which simply casts the text value to the given type and can easily be implemented in plpgsql:

create or replace function value_from_string(v_value text, v_type_dummy anyelement)
returns anyelement as $body$
declare
  v_result v_type_dummy%type;
begin
  if v_value is null then
    return null;
  end if;
  v_result := v_value;
  return v_result;
end;
$body$ language plpgsql immutable;

-- Returns 124
select value_from_string('123', NULL::int) + 1;

-- Returns {1,2,3,4}
select value_from_string('{1,2,3}', NULL::int[]) || array[4];

best regards,
Florian Pflug
Re: [HACKERS] Inspection of row types in pl/pgsql and pl/sql
Tom Lane wrote:
> Florian G. Pflug <f...@phlo.org> writes:
>> Ok, I must be missing something. I currently fail to see how my proposed record_value(record, name, anyelement) returns anyelement function differs (from the type system's point of view) from value_from_string(text, anyelement) returns anyelement which simply casts the text value to the given type and can easily be implemented in plpgsql.
> The problem is at the call site --- if you try to call it with different record types on different calls you're going to get a failure. Or so I expect anyway.

Ah, OK - so it's really the record type, and not my anyelement kludge, that might cause problems.

Actually, I do now realize that record is a way more special case than I'd have initially thought. For example, I could have sworn that it's possible to pass record values to pl/pgsql functions, but just found out the hard way that it isn't. Seems that the possibility of declaring record variables lulled me into thinking it's a pretty standard type when it actually isn't.

best regards,
Florian Pflug
Re: [HACKERS] Inspection of row types in pl/pgsql and pl/sql
Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: Yes, and I have used it, but it really would be nicer to have some introspection facilities built in, especially for use in triggers. Maybe, but the proposal at hand is spectacularly ugly --- in particular it seems designed around the assumption that a given trigger will only care about handling a predetermined set of datatypes, which hardly fits with PG's normal goals for datatype extensibility. If the argument is that you don't like hstore or other PLs because they'll smash everything to text, then I think you have to do better than this. While I agree that handling arbitrary datatypes at runtime would be nice, I really don't see how that could ever be done from within a plpgsql procedure, unless plpgsql somehow morphs into a dynamically typed language. Plus, the set of datatypes an application deals with is usually much smaller than the set of tables, and less likely to change over time. I'd also argue that this restriction does not conflict with PG's goal of datatype extensibility at all. Datatype extensibility in PG boils down to being able to create new datatypes without modifying postgres itself - but it still expects that you do so while designing your application. Which also is when trigger functions that use record_value() or a similar function would be written. Plus, fully generic handling of data of arbitrary type is a somewhat strange notion anyway, because it leaves you with very few operations guaranteed to be defined for those values. In the case of PG, you'd be pretty much limited to casting those values from and to text. best regards, Florian Pflug
Re: [HACKERS] Inspection of row types in pl/pgsql and pl/sql
Tom Lane wrote: Florian G. Pflug f...@phlo.org writes: While I agree that handling arbitrary datatypes at runtime would be nice, I really don't see how that could ever be done from within a plpgsql procedure, unless plpgsql somehow morphs into a dynamically typed language. Which is not likely to happen, which is why this is fundamentally a dead end. I don't think it's appropriate to put ugly, hard-to-use band-aids over the fact that plpgsql isn't designed to do this. One of the principal reasons why we work so hard to support multiple PLs is that they have different strengths. If you need something that's more dynamically typed than plpgsql, you should go use something else. In principle, I agree. In practice, however, the company I'm doing my current project for has settled on plpgsql and isn't willing to use other PLs in their software, because they lack the skill to maintain code written in other PLs. Therefore I'm trying to find an at least somewhat acceptable solution using plpgsql. Plus, fully generic handling of data of arbitrary type is a somewhat strange notion anyway, because it leaves you with very few operations guaranteed to be defined for those values. In the case of PG, you'd be pretty much limited to casting those values from and to text. Well, that's the wrong way to look at it. To me, the right design would involve saying that my trigger needs to do operation X on the data, and therefore it should support all datatypes that can do X. It should not need a hard-wired list of which types those are. True, but that'd require fairly large changes to plpgsql AFAICS. Perhaps it would help if we looked at some specific use-cases that people need, rather than debating abstractly. What do you need your generic trigger to *do*? I need to build a global index table of all values of a certain type together with a pointer to the row and table that contains them. Since all involved tables have an id column, storing that pointer is the easy part. 
The hard part is collecting all those values in an insert/update/delete trigger so that I can update the global index accordingly. Currently, a set of plpgsql functions generates a separate trigger function for each table. Yuck! Instead of this nearly impossible-to-read code-generating function I want to create a generic trigger function that works for any of the involved tables. Preferably in plpgsql, because of the skill issue mentioned above. best regards, Florian Pflug
Re: [HACKERS] Inspection of row types in pl/pgsql and pl/sql
Tom Lane wrote: Florian G. Pflug f...@phlo.org writes: Tom Lane wrote: Perhaps it would help if we looked at some specific use-cases that people need, rather than debating abstractly. What do you need your generic trigger to *do*? I need to build a global index table of all values of a certain type together with a pointer to the row and table that contains them. Since all involved tables have an id column, storing that pointer is the easy part. The hard part is collecting all those values in an insert/update/delete trigger so that I can update the global index accordingly. So in this case it seems like you don't actually need any polymorphism at all; the target columns are always of a known datatype. You just don't want to commit to their names. I wonder though why you're willing to pin down the name of the id column but not the name of the data column. There might be more than one column of the type to be indexed (or none at all). I need to process all such columns (each of them produces a separate record in the index table). Plus, this schema is relatively volatile - new fields are added about once a month or so. Currently, a set of plpgsql functions generates a separate trigger function for each table. Yuck! Would you be happy with an approach similar to what Andrew mentioned, ie, you generate CREATE TRIGGER commands that list the names of the target column(s) as TG_ARGV arguments? The alternative to that seems to be that you iterate at runtime through all the table columns to see which ones are of the desired type. Which might be less trouble to set up, but the performance penalty of figuring out basically-unchanging information again on every single tuple update seems awfully high. Hm.. I had hoped to get away without any need to modify the trigger definitions if the schema changes. But having a function that does DROP TRIGGER; CREATE TRIGGER... 
is already a huge improvement over having one that does CREATE FUNCTION. I've now played around with the EXECUTE 'select $1.' || quote_ident(fieldname) USING NEW/OLD trick, and simply look up the existing fields with

SELECT attname FROM pg_attribute
WHERE attrelid = TG_RELID
  AND atttypid IN (...)
  AND attname NOT IN ('referenced_by', 'self')
  AND attnum > 0
  AND NOT attisdropped

This at least gives me a working proof-of-concept implementation of the trigger. Still, doing that SELECT seems rather silly, since NEW and OLD already contain the required information. So I still believe that having something like record_names() and record_types() would be useful. And at least these functions have less of an issue with the type system... best regards, Florian Pflug
Re: [HACKERS] Check constraint on domain over an array not executed for array literals
Heikki Linnakangas wrote: Agreed, it's a bug. A simpler example is just: [snipped] Will the fix for this be included in 8.4.2 (or .3), or will it have to wait for 8.5 because it changes behavior? There's a special case in the transformExpr function to handle the ARRAY[...]::arraytype construct, which skips the usual type-casting and just constructs an ArrayExpr with the right target type. However, it's not taking into account that the target type can be a domain. Attached patch fixes that. Anyone see a problem with it? I'm not familiar with the parser, so I can't really judge this. However, I've applied the patch to my development db and it seems to work fine, and fixes the bug. Thanks for the quick response! best regards, Florian Pflug
[HACKERS] Inspection of row types in pl/pgsql and pl/sql
Hi

I'm currently working on a project where we need to build a global cache table containing all values of certain types found in any of the other tables. Currently, a separate insert, update and delete (plpgsql) trigger function exists for each table in the database, which is auto-generated by a (plpgsql) function that queries the system catalogs to find all fields with a certain type, and then generates the appropriate plpgsql function using EXECUTE '...'. I'd like to replace this function-generating function by a generic trigger function that works for all tables. Due to the lack of any way to inspect the *structure* of a record type, however, I'd have to use a C-language function for that, which induces quite some maintenance headaches (especially if deployed on windows).

I'm therefore thinking about implementing the following general-purpose inspection functions for row types:

record_length(record) returns smallint
  Returns the number of fields in the given record.

record_names(record) returns name[]
  Returns the names of the record's fields. Array will contain NULLs if one or more fields are unnamed.

record_types(record) returns regtype[]
  Returns the type OIDs of the record's fields. Array won't contain NULLs.

record_value(record, name, anyelement) returns anyelement
  Returns the value of a certain (named) field. The type of the third argument defines the return type (its value is ignored). The field's value is cast to that type if possible, otherwise an error is raised.

record_value(record, smallint, anyelement) returns anyelement
  Returns the value of the field at the given position.

record_values(record, regtype, anyelement) returns anyarray
  Returns an array of the values of all fields with the given type or whose type is a domain over the given type. No casting is done.

Any comment/critique is appreciated. Would anyone else find those functions useful?

best regards, Florian Pflug
Re: [HACKERS] Inspection of row types in pl/pgsql and pl/sql
Tom Lane wrote: Florian G. Pflug f...@phlo.org writes: I'd like to replace this function-generating function by a generic trigger function that works for all tables. Due to the lack of any way to inspect the *structure* of a record type, however, I'd have to use a C language function for that, which induces quite some maintenance headaches (especially if deployed on windows). Trying to do this in plpgsql is doomed to failure and heartache, because it's fundamentally a strongly typed language. The proposed functions won't fix that and hence will be unusable in practice. I'd suggest either using C, or using one of the less-strongly-typed PLs. Well, the proposed functions at least allow for some more flexibility in working with row types, given that you know in advance which types you will be dealing with (but not necessarily the precise ordering and number of the record's fields). They might feel a bit kludgy because of the anyelement dummy argument that bridges the gap between the statically typed nature of SQL and the rather dynamic RECORDs, but the kludgy-ness factor is still within reasonable limits I think. Since all the other PLs (except C) are not nearly as integrated with the postgres type system, using them for this task does not really buy anything IMHO. AFAIK, all these PLs will convert any SQL type which isn't specifically mapped to one of the PLs types to a string. *That* I can do with pl/pgsql too, by simply using record_out() and then parsing the result... C of course lets me work around all these problems - but at the cost of a longer development time and (more importantly) more maintenance headaches (especially on windows, where a C compiler is not just one apt-get/yum/whatever call away). 
Regarding usability - the proposed functions would, for example, allow you to implement a wide range of row-to-text conversion functions in pure pl/pgsql by calling record_value(record, name, anyelement) with NULL::varchar as the last argument for each field, and then concatenating the resulting text together any way you like. best regards, Florian Pflug
[HACKERS] Check constraint on domain over an array not executed for array literals
Hi

While trying to create a domain over an array type to enforce a certain shape or certain contents of an array (like the array being only one-dimensional or not containing NULLs), I've stumbled over what I believe to be a bug in postgresql 8.4.

It seems that check constraints on domains are *not* executed for literals of the domain-over-array-type - in other words, for expressions like array[...]::my-domain-over-array-type. They are, however, executed if I first force the array to be of the base type, and then cast it to the domain type. Here is an example that reproduces the problem:

create domain myintarray as int[] check (
  -- Check that the array is neither null, nor empty,
  -- nor multi-dimensional
  (value is not null) and
  (array_length(value,1) is not null) and
  (array_length(value,1) > 0) and
  (array_length(value,2) is null)
);

select null::myintarray;                      -- Fails (Right)
select array[]::myintarray;                   -- Succeeds (Wrong)
select array[]::int[]::myintarray;            -- Fails (Right)
select array[1]::myintarray;                  -- Succeeds (Right)
select array[1]::int[]::myintarray;           -- Succeeds (Right)
select array[array[1]]::myintarray;           -- Succeeds (Wrong)
select array[array[1]]::int[][]::myintarray;  -- Fails (Right)

I guess the reason is that the ::arraytype part of array[...]::arraytype isn't really a cast at all, but instead part of the array literal syntax. Hence, array[]::myintarray probably creates an empty myintarray instance, and then adds the elements between the square brackets (none) - with none of these steps triggering a run of the check constraint.

I still have the feeling that this is a bug, though. First, because it leaves you with no way to guarantee that values of a given domain always fulfill certain constraints. And second, because array[...]::arraytype at least *looks* like a cast, and hence should behave like one too.

best regards, Florian Pflug
Re: [HACKERS] Transaction Snapshots and Hot Standby
Simon Riggs wrote: On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote: The main idea was to invert the meaning of the xid array in the snapshot struct - instead of storing all the xids between xmin and xmax that are to be considered in-progress, the array contained all the xids > xmin that are to be considered completed. The downside is that the size of the read-only snapshot is theoretically unbounded, which poses a bit of a problem if it's supposed to live inside shared memory... Why do it inverted? That clearly has problems. Because it solves the problem of spontaneously appearing XIDs in the WAL. At least prior to 8.3 with virtual xids, a transaction might have allocated its xid long before actually writing anything to disk, and therefore long before this XID ever shows up in the WAL. And with a non-inverted snapshot such an XID would be considered to be completed by transactions on the slave... So, one either needs to periodically log a snapshot on the master or log XID allocations, which both seem to cause considerable additional load on the master. With an inverted snapshot, it's sufficient to log the current RecentXmin - a value that is readily available on the master, and therefore the cost amounts to just one additional 4-byte field per xlog entry. regards, Florian Pflug
Re: [HACKERS] Transaction Snapshots and Hot Standby
Simon Riggs wrote: On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote: The current read-only snapshot (with current meaning the corresponding state on the master at the time the last replayed wal record was generated) was maintained in shared memory. Its xmin field was continually updated with the (newly added) XLogRecord.xl_xmin field, which contained the xid of the oldest running query on the master, with a pruning step after each ReadOnlySnapshot.xmin update to remove all entries < xmin from the xid array. If a commit was seen for an xid, that xid was added to the ReadOnlySnapshot.xid array. The advantage of this concept is that it handles snapshotting on the slave without too much additional work for the master (the only change is the addition of the xl_xmin field to XLogRecord). It especially removes the need to track ShmemVariableCache->nextXid. Snapshots only need to know which transactions are currently running during WAL apply. The standby can't remove any tuples itself, so it doesn't need to know what the master's OldestXmin is. I used the logged xmin value to track the shared snapshot's xmin, which in turn allowed me to prune the xid array, eliminating all xids below that xmin. regards, Florian Pflug
Re: [HACKERS] Transaction Snapshots and Hot Standby
Heikki Linnakangas wrote: BTW, we haven't talked about how to acquire a snapshot in the slave. You'll somehow need to know which transactions have not yet committed, but will in the future. In the master, we keep track of in-progress transactions in the ProcArray, so I suppose we'll need to do the same in the slave. Very similar to prepared transactions, actually. I believe the Abort records, which are not actually needed for normal operation, become critical here. The slave will need to put an entry into the ProcArray for any new XLogRecord.xl_xid it sees in the WAL, and remove the entry at a Commit and Abort record. And clear them all at a shutdown record. For reference, here is how I solved the snapshot problem in my Summer-of-Code project last year, which dealt exactly with executing read-only queries on PITR slaves (but sadly never came out of alpha stage due to both my and Simon's lack of time). The main idea was to invert the meaning of the xid array in the snapshot struct - instead of storing all the xids between xmin and xmax that are to be considered in-progress, the array contained all the xids > xmin that are to be considered completed. The current read-only snapshot (with current meaning the corresponding state on the master at the time the last replayed wal record was generated) was maintained in shared memory. Its xmin field was continually updated with the (newly added) XLogRecord.xl_xmin field, which contained the xid of the oldest running query on the master, with a pruning step after each ReadOnlySnapshot.xmin update to remove all entries < xmin from the xid array. If a commit was seen for an xid, that xid was added to the ReadOnlySnapshot.xid array. The advantage of this concept is that it handles snapshotting on the slave without too much additional work for the master (the only change is the addition of the xl_xmin field to XLogRecord). It especially removes the need to track ShmemVariableCache->nextXid. 
The downside is that the size of the read-only snapshot is theoretically unbounded, which poses a bit of a problem if it's supposed to live inside shared memory... regards, Florian Pflug
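To make the inverted-snapshot idea above concrete, here is a toy model of the visibility bookkeeping it describes - a hedged Python sketch with invented names, not actual PostgreSQL code: an xid counts as completed if it is below the snapshot's xmin or appears in the array of completed xids > xmin, and the pruning step after an xmin advance simply drops entries that fell below the new xmin.

```python
# Toy model of the "inverted" snapshot described above: instead of
# tracking in-progress xids, the snapshot tracks completed xids > xmin.
# Purely illustrative -- names and structure are invented.

class InvertedSnapshot:
    def __init__(self, xmin):
        self.xmin = xmin          # oldest xid still running on the master
        self.completed = set()    # committed xids > xmin seen in the WAL

    def record_commit(self, xid):
        if xid >= self.xmin:
            self.completed.add(xid)

    def advance_xmin(self, new_xmin):
        # Pruning step: entries below the new xmin are implied by xmin itself.
        self.xmin = new_xmin
        self.completed = {x for x in self.completed if x >= new_xmin}

    def is_completed(self, xid):
        return xid < self.xmin or xid in self.completed


snap = InvertedSnapshot(xmin=100)
snap.record_commit(105)
snap.record_commit(103)

assert snap.is_completed(99)        # below xmin: implicitly completed
assert snap.is_completed(105)       # explicitly committed
assert not snap.is_completed(104)   # a "spontaneously appearing" xid is
                                    # correctly treated as still in progress
snap.advance_xmin(104)              # prunes 103, keeps 105
assert snap.completed == {105}
```

Note how the model captures the point made in the follow-up message: an xid that shows up in the WAL without prior announcement (104 here) is treated as in-progress by default, which a non-inverted snapshot would get wrong.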
Re: [HACKERS] Getting to universal binaries for Darwin
Tom Lane wrote: You can get around that by hacking up the generated config files with #ifdef __i386__ and so on to expose the correct values of the hardware-dependent symbols to each build. Of course you have to know what the correct values are --- if you don't have a sample of each architecture handy to run configure against, it'd be easy to miss some things. And even then it's pretty tedious. I am not sure if it is possible or worth the trouble to try to automate this part better. Hm - configure *does* the right thing if CFLAGS is set to *just* -arch i386 or -arch ppc (at least on intel hardware, because OSX can run ppc binaries there, but not vice versa), right? If this is true, we need some way to run configure multiple times, once for each arch, but then still get *one* set of Makefiles that have all the archs in their CFLAGS.. Modulo the above problems, I was able to build i386+ppc binaries that do in fact work on both architectures. I haven't got any 64-bit Apple machines to play with, so there might be 64-bit issues I missed. Still, this is a huge step forward compared to what was discussed here: http://archives.postgresql.org/pgsql-general/2008-02/msg00200.php I think that my MacBook should be able to build and run 64-bit binaries, so I can test that if you want. Do you have a script that does the necessary config file magic, or did you do that by hand? regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [PATCHES] [HACKERS] Solaris ident authentication using unix domain sockets
Josh Berkus wrote: Tom, Indeed. If the Solaris folk feel that getpeerucred() is insecure, they had better explain why their kernel is that broken. This is entirely unrelated to the known shortcomings of the ident IP protocol. The Solaris security kernel folks do, actually. However, there's no question that TRUST is inherently insecure, and that's what people are going to use if they can't get IDENT to work. I'd be *very* interested in how they come to that assessment. I'd have thought that the only alternative to getpeereid/getpeerucred is password-based or certificate-based authentication - which seems *less* secure because a) they also rely on the client having the correct uid or gid (to read the password/private key), plus b) the risk of the password/private key getting into the wrong hands. How is that sort of authentication handled by services shipping with Solaris? regards, Florian Pflug, hoping to be enlightened beyond his limited posix-ish view of the world...
Re: [HACKERS] VirtualXactLockTableInsert
Simon Riggs wrote: When we move from having a virtual xid to having a real xid I don't see any attempt to re-arrange the lock queues. Surely if there are people waiting on the virtual xid, they must be moved across to wait on the actual xid? Otherwise the locking queue will not be respected, because we have two things on which people might queue. Anybody explain that? Locks on real xids serve a different purpose than locks on virtual xids. Locks on real xids are used to wait for transactions that touched a certain tuple (in which case they certainly must have acquired a real xid) to end. Locks on vxids, on the other hand, are used to wait for the ending of transactions which either hold a certain lock or use a snapshot with an xmin earlier than some point in time. indexcmds.c is the only place where VirtualXactLockTableWait() is used - the concurrent index creation needs to wait for all transactions to end which either might not know about the index (after phase 1 and 2), or who might still see tuples not included in the index (before marking the index valid). In cases where we know we will assign a real xid, can we just skip the assignment of the virtual xid completely? For example, where an implicit transaction is started by a DML statement. Otherwise we have to wait for 2 lock table inserts, not just one. A more general solution would be to get rid of vxid locks completely. This is possible if we figure out a way to handle the first two waiting phases of concurrent index builds in another way. One idea I had for approaching this was to extend the lock manager by adding some sort of WaitForCurrentLockHolders(LockTag) function. I felt (and still feel) that I didn't understand the locking code well enough to start hacking it, though, and Tom didn't like the idea either. His argument was that it wasn't clear how deadlock detection would cope with such a facility IIRC. 
regards, Florian Pflug
Re: [HACKERS] Hint Bits and Write I/O
Kevin Grittner wrote: On Wed, May 28, 2008 at 6:26 PM, in message [EMAIL PROTECTED], Florian G. Pflug [EMAIL PROTECTED] wrote: I think we should put some randomness into the decision, to spread the I/O caused by hint-bit updates after a batch load. Currently we have a policy of doing a VACUUM FREEZE ANALYZE on a table after a bulk load, or on the entire database after loading a pg_dump of a database. We do this before putting the table or database into production. This avoids surprising clusters of writes at unpredictable times. Please don't defeat that. (I'm not sure whether your current suggestion would.) No, VACUUM (and therefore VACUUM FREEZE) dirty all buffers they set hint bits on anyway, since they also update the xmin values. But a more I/O-friendly approach to setting hint bits might make that VACUUM FREEZE step unnecessary ;-) regards, Florian Pflug
Re: [HACKERS] Hint Bits and Write I/O
Simon Riggs wrote: Hmm, I think the question is: How many hint bits need to be set before we mark the buffer dirty? (N) Should it be 1, as it is now? Should it be never? Never is a long time. As N increases, clog accesses increase. So it would seem there is likely to be an optimal value for N. After further thought, I begin to think that the number of times we set a dirty hint-bit shouldn't influence the decision of whether to dirty the page too much. Instead, we should look at the *age* of the last xid which modified the tuple. The idea is that the clog pages showing the status of young xids are far more likely to be cached than the pages for older xids. This makes a lost hint-bit update much cheaper for young than for old xids, because we probably won't waste any I/O if we have to set the hint-bit again later, because the buffer was evicted from shared_buffers before being written out. Additionally, I think we should put some randomness into the decision, to spread the I/O caused by hint-bit updates after a batch load. All in all, I envision a formula like

chance_of_dirtying = min(1, alpha * floor((next_xid - last_modifying_xid) / clog_page_size) / clog_buffers)

This means that a hint-bit update never triggers dirtying if the last modifying xid belongs to the same clog page as the next unused xid - which sounds good, since that clog page gets touched on every commit and abort, and therefore is cached nearly for sure. For xids on older pages, the chance of dirtying grows (more aggressively for larger alpha values). For alpha = 1, a hint-bit update dirties a buffer for sure only if the xid is older than clog_page_size * clog_buffers. regards, Florian Pflug
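The envisioned formula can be pinned down with a small sketch. This is only an illustration: the CLOG constants below are invented placeholders, not PostgreSQL's actual geometry, and the function name is made up.

```python
# Sketch of the proposed dirtying heuristic: the probability of dirtying
# the page on a hint-bit-only update grows with the age (in clog pages)
# of the last modifying xid. Constants are assumed placeholders.
CLOG_PAGE_SIZE = 8192   # xids covered per clog page (assumption)
CLOG_BUFFERS = 8        # number of cached clog pages (assumption)

def chance_of_dirtying(next_xid, last_modifying_xid, alpha=1.0):
    age_in_pages = (next_xid - last_modifying_xid) // CLOG_PAGE_SIZE
    return min(1.0, alpha * age_in_pages / CLOG_BUFFERS)

# Same clog page as the next unused xid: never dirty on a hint-bit-only update.
assert chance_of_dirtying(1000, 900) == 0.0
# With alpha = 1, dirtying becomes certain once the xid is older than
# clog_page_size * clog_buffers.
assert chance_of_dirtying(100_000, 100_000 - CLOG_PAGE_SIZE * CLOG_BUFFERS) == 1.0
```

The two assertions reproduce the two boundary cases stated in the message: probability zero on the hot clog page, probability one beyond clog_page_size * clog_buffers.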
Re: [HACKERS] Hint Bits and Write I/O
Simon Riggs wrote: After some discussions at PGCon, I'd like to make some proposals for hint bit setting with the aim to reduce write overhead. Currently, when we see an un-hinted row we set the bit, if possible, and then dirty the block. If we were to set the bit but *not* dirty the block we may be able to find a reduction in I/O. In many cases this would make no difference at all, since we often set hints on an already dirty block. In other cases, particularly random INSERTs, UPDATEs and DELETEs against large tables, this would reduce I/O, though possibly increase accesses to clog. Hm, but the I/O overhead of hint-bit setting occurs only once, while the pressure on the clog is increased until we set the hint-bit. This looks like not writing the hint-bit update to disk results in worse throughput unless there are many updates, and only very few selects. But not too many updates either, because if a page gets hit by tuple updates faster than the bgwriter writes it out, you won't waste any I/O on hint-bit-only writes either. That might turn out to be a pretty slim window which actually shows substantial I/O savings... My proposal is to have this as a two-stage process. When we set the hint on a tuple in a clean buffer we mark it BM_DIRTY_HINTONLY, if not already dirty. If we set a hint on a buffer that is BM_DIRTY_HINTONLY then we mark it BM_DIRTY. The objective of this is to remove effects of single index accesses. So effectively, only the first hint-bit update hitting a previously clean buffer gets treated specially - the second hint-bit update flags the buffer as dirty, just as it does now? That sounds a bit strange - why is it exactly the *second* write that triggers the dirtying? Or did I missunderstand what you wrote? regards, Florian Pflug
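For clarity, the two-stage proposal as read here amounts to this small state machine - a hedged Python sketch of my reading of it, with invented names (only BM_DIRTY_HINTONLY and BM_DIRTY come from the proposal):

```python
# Sketch of the proposed two-stage dirtying: the first hint-bit-only
# update on a clean buffer marks it BM_DIRTY_HINTONLY (no write forced);
# a second hint update promotes it to BM_DIRTY. Illustrative only.
CLEAN, DIRTY_HINTONLY, DIRTY = "clean", "dirty_hintonly", "dirty"

def set_hint_bit(state):
    if state == CLEAN:
        return DIRTY_HINTONLY   # remember the hint, but don't dirty the block
    return DIRTY                # second hint (or already dirty): really dirty

state = CLEAN
state = set_hint_bit(state)
assert state == DIRTY_HINTONLY  # a single index access forces no write
state = set_hint_bit(state)
assert state == DIRTY           # the second hint update dirties the buffer
```

This makes the question in the reply concrete: the transition rule singles out exactly the second hint-bit update on a previously clean buffer as the one that triggers dirtying.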
Re: [HACKERS] triggers on prepare, commit, rollback... ?
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: trigger on prepare, commit, rollback, savepoint, This is a sufficiently frequently asked question that I wish someone would add an entry to the FAQ about it, or add it to the TODO list's Features we don't want section. OK, remind me why we don't want it again? I'm sure I've ranted on this several times before, but a quick archive search doesn't find anything. So, here are a few points to chew on: * Trigger on rollback: what's that supposed to do? The current transaction is already aborted, so the trigger has no hope of making any database changes that will ever be visible to anyone. * Trigger on commit: what do you do if the transaction fails after calling the trigger? The reductio ad absurdum for this is to consider having two on-commit triggers, where obviously the second one could fail. I fully agree that having triggers on rollback and on commit of 2PC transactions is broken by design. Triggers on COMMIT (for non-2PC transactions) and PREPARE (for 2PC transactions) seem workable though - I'd expect such a trigger to be executed *before* any actual commit handling takes place. Essentially, doing BEGIN some work COMMIT in the presence of an on-commit trigger would be equivalent to doing BEGIN some work SELECT my_trigger_function COMMIT. A possible use-case for that is aggregating some statistics collected during a transaction. One could e.g. maintain a cache of table rowcounts by summing up the number of inserted and deleted records per table with some per-row ON INSERT and ON DELETE (presumably C-language) triggers, and then update a global cache at transaction end. regards, Florian Pflug
Re: [HACKERS] Maybe some more low-hanging fruit in the latestCompletedXid patch.
Bruce Momjian wrote: Is this a TODO? It's far from clear that avoiding an exclusive ProcArray lock on subxact abort will bring a measurable performance benefit, so probably not. I've actually coded a prototype for this a few months ago, to check if it would bring any benefit at all, though I ran out of time before benchmarking it, and I probably also lack the hardware for running high-concurrency tests. --- Florian G. Pflug wrote: Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: Currently, we do not assume that either the childXids array or the xid cache in the proc array is sorted by ascending xid order. I believe that we could simplify the code, further reduce the locking requirements, and enable a transaction to de-overflow its xid cache if we assume that those arrays are in ascending xid order. de-overflowing the cache sounds completely unsafe, as other backends need that state to determine whether they need to look into pg_subtrans. We'd only de-overflow if we abort *all* xids that are missing from the xid cache. And only after marking them as aborted in the clog. If someone concurrently checks for an overflow, and already sees the new (non-overflowed) state, then he'll assume the xid is not running if he hasn't found it in the array. Which is correct - we just aborted it. Plus, removing the exclusive lock doesn't depend on de-overflowing. It's just something that seems rather easy to do once the subxid handling is in a state that allows concurrent removal of entries. If it turns out that it's not that easy, then I'll just drop the idea again. I still don't believe you can avoid taking exclusive lock, either; your argument here did not address latestCompletedXid. Sorry, not addressing latestCompletedXid was an oversight :-(. My point is that we only *need* to advance latestCompletedXid on COMMITs. We do so for aborts only to avoid running with unnecessarily low xmins after a transaction ABORT. 
That corner case can only happen after a toplevel ABORT, though - aborting subxacts cannot change the xmin, because the toplevel xact will have a lower xid than any of its subtransactions anyway. We can therefore just remember the largest assigned xid for a given transaction, and update latestCompletedXid to that on toplevel commit or abort. That prevents that corner case too, without updating latestCompletedXid during subxact abort. But the main point remains this: there is no evidence whatsoever that these code paths are sufficiently performance-critical to be worth speeding up by making the code more fragile. The gain will be less than that of the locking improvements done so far. It will be a win for heavy users of plpgsql BEGIN/END/EXCEPTION blocks, though, I think. We'll also save some cycles in TransactionIdIsInProgress, because we can use a binary search, but that's just an added bonus. I'm currently trying to code up a patch, since it's easier to judge the correctness of actual code than that of a mere proposal. I'll do some benchmarking when the patch is done to see if it brings measurable benefits. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
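The binary-search bonus Florian mentions is easy to illustrate. A minimal sketch (Python, with made-up names; this toy ignores the fact that real transaction ids are 32 bits and wrap around, so a production version must compare xids modulo 2^32) of what TransactionIdIsInProgress could do if the per-backend subxid cache were kept in ascending order:

```python
import bisect

def subxid_is_cached(sorted_subxids, xid):
    """Check whether xid appears in a backend's subxid cache.

    If the cache is maintained in ascending order, membership is a
    binary search (O(log n)) instead of the linear scan an unsorted
    array requires.
    """
    i = bisect.bisect_left(sorted_subxids, xid)
    return i < len(sorted_subxids) and sorted_subxids[i] == xid
```

With, say, 64 cached subxids the scan drops from up to 64 comparisons to at most 7.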
Re: [HACKERS] Read-ahead and parallelism in redo recovery
Pavan Deolasee wrote: What I am thinking is if we can read ahead these blocks in the shared buffers and then apply redo changes to them, it can potentially improve things a lot. If there are multiple read requests, the kernel (or controller?) can probably schedule the reads more efficiently. The same holds true for index scans, though. Maybe we can find a solution that benefits both cases - something along the line of a bgreader process. Btw, isn't our redo recovery completely physical in nature? I mean, can we replay redo logs related to a block independent of other blocks? The reason I am asking is that if that's the case, ISTM we can introduce parallelism in recovery by splitting and reordering the xlog records and then running multiple processes to do the redo recovery. I'd say it's physical on the tuple level (We just log the new tuple on an update, not how to calculate it from the old one), but logical on the page level (We log the fact that a tuple was inserted on a page, but e.g. the physical location of the tuple on the page can come out differently upon replay). It's even more logical for indices, because we log page splits as multiple wal records, letting the recovery process deal with synthesizing upper-level updates should we crash in the middle of a page split. Additionally, we log full-page images as a safeguard against torn page writes. Those would need to be considered as a kind of reorder barrier in any parallel restore scenario, I guess. I know that Simon has some ideas about parallel restore, though I don't know how he wants to solve the dependency issues involved. Perhaps by not parallelizing within one table or index... ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
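The split-and-reorder idea can be sketched with a toy model. Assuming (hypothetically) each WAL record were tagged with the single block it touches, a dealer could partition the stream into per-block queues for parallel workers, with full-page images acting as the reorder barrier described above: an FPI restores the whole page, so earlier records for that block become redundant. Note the real xlog format is not this simple - index page splits span multiple records and blocks, which is exactly the cross-block dependency that makes parallelizing within one index hard:

```python
from collections import defaultdict

def partition_redo(records):
    """Partition a toy WAL stream into per-block replay queues.

    `records` is a list of (block_id, is_fpi, payload) tuples - a toy
    model, not the real xlog record format. Records for the same block
    keep their order; a full-page image (FPI) discards everything queued
    before it for that block, since replaying the FPI recreates the
    entire page.
    """
    queues = defaultdict(list)
    for block, is_fpi, payload in records:
        if is_fpi:
            queues[block].clear()
        queues[block].append((block, is_fpi, payload))
    return dict(queues)
```

Each queue could then be handed to a separate redo worker, since (in this toy model) no record depends on a block outside its own queue.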
Re: [HACKERS] Read-ahead and parallelism in redo recovery
Greg Stark wrote: Florian G. Pflug wrote: The same holds true for index scans, though. Maybe we can find a solution that benefits both cases - something along the line of a bgreader process I posted a patch to do readahead for bitmap index scans using posix_fadvise. Experiments showed it works great on raid arrays on Linux. Solaris will need to use libaio though, which I haven't tried yet. Cool! I'd like to try it out - is that patch available in the pg-patches archives? Doing it for normal index scans is much much harder. You can readahead a single page by using the next pointer if it looks like you'll need it. But I don't see a convenient way to get more than that. I was thinking that after reading a page from the index, the backend could post a list of heap pages referenced from that index page to shared memory. A background process would repeatedly scan that list, and load those pages into the buffer cache. regards, Florian Pflug
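The mechanism behind Greg's patch, and behind what a hypothetical bgreader would do with the posted page list, is posix_fadvise(POSIX_FADV_WILLNEED). A minimal sketch (Python on Linux; the function name and the 8 kB page size are assumptions, and the hint is purely advisory):

```python
import os

def prefetch_heap_pages(path, page_nos, page_size=8192):
    """Ask the kernel to read the given heap pages ahead of time.

    This is what a background reader could do after a backend posts the
    heap pages referenced from an index page: issue WILLNEED hints so
    the kernel can schedule the reads efficiently (e.g. across a RAID
    array), before any backend actually blocks on them.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Sorting the requests tends to be friendlier to the I/O scheduler.
        for n in sorted(set(page_nos)):
            os.posix_fadvise(fd, n * page_size, page_size,
                             os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```

Since the call only hints, issuing it for pages already cached (or never subsequently read) costs little, which is why it suits speculative readahead.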
Re: [HACKERS] CREATE TABLE, load and freezing
Pavan Deolasee wrote: In a typical scenario, a user might create a table and load data in the table as part of a single transaction (e.g pg_restore). In this case, it would help if we create the tuples in the *frozen* state to avoid any wrap-around related issues with the table. Without this, very large read-only tables would require one round of complete freezing if there are lots of transactional activities in the other parts of the database. And when that happens, it would generate lots of unnecessary IOs on these large tables. If that works, then we might also want to set the visibility hint bits. Not because lookup of that information is expensive - the tuples all came from the same transaction, virtually guaranteeing that the relevant pg_clog page stays in memory after the first few pages. But by setting them immediately we'd save some IO, since we won't dirty all pages during the first scan. I don't know if this is a real problem for anybody, but I could think of its use case, at least in theory. I cannot speak for freeze-on-restore, but in a project I'm currently working on, the IO caused (I guess) by hint-bit updates during the first scan of the table is at least noticeable... regards, Florian Pflug
Re: [HACKERS] An idea for parallelizing COPY within one backend
Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: ... Neither the dealer nor the workers would need access to either the shared memory or the disk, thereby not messing with the one backend is one transaction is one session dogma. ... Unfortunately, this idea has far too narrow a view of what a datatype input function might do. Just for starters, consider enum input, which certainly requires catalog access. We have also explicitly acknowledged the idea that datatype I/O functions might try to store typmod-related data in some special catalog somewhere. Hm... how many in-core datatypes are there which need catalog access in their input or output functions? Maybe we could change the API for i/o functions in a way that allows us to request all needed information to be cached? regards, Florian Pflug
Re: [HACKERS] An idea for parallelizing COPY within one backend
Dimitri Fontaine wrote: Of course, the backends still have to parse the input given by pgloader, which only pre-processes data. I'm not sure having the client prepare the data some more (binary format or whatever) is a wise idea, as you mentioned and wrt Tom's follow-up. But maybe I'm all wrong, so I'm all ears! As far as I understand, pgloader starts N threads or processes that open up N individual connections to the server. In that case, moving the text-to-binary conversion from the backend into the loader won't give any additional performance, I'd say. The reason that I'd love some within-one-backend solution is that it'd allow you to utilize more than one CPU for a restore within a *single* transaction. This is something that a client-side solution won't be able to deliver, unless major changes to the architecture of postgres happen first... regards, Florian Pflug
Re: [HACKERS] An idea for parallelizing COPY within one backend
Brian Hurt wrote: Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: ... Neither the dealer nor the workers would need access to either the shared memory or the disk, thereby not messing with the one backend is one transaction is one session dogma. ... Unfortunately, this idea has far too narrow a view of what a datatype input function might do. Just for starters, consider enum input, which certainly requires catalog access. We have also explicitly acknowledged the idea that datatype I/O functions might try to store typmod-related data in some special catalog somewhere. Would it be possible to determine when the copy is starting that this case holds, and not use the parallel parsing idea in those cases? In theory, yes. In practice, I don't want to be the one who has to answer to an angry user who just suffered a major drop in COPY performance after adding an ENUM column to his table. I was thinking more along the line of letting a datatype specify a function void* ioprepare(typmod) which returns some opaque object specifying all that the input and output function needs to know. We could then establish the rule that input/output functions may not access the catalog, and instead pass them a pointer to that opaque object. All pretty pie-in-the-sky at the moment, though... regards, Florian Pflug
Re: [HACKERS] An idea for parallelizing COPY within one backend
Andrew Dunstan wrote: Florian G. Pflug wrote: Would it be possible to determine when the copy is starting that this case holds, and not use the parallel parsing idea in those cases? In theory, yes. In practice, I don't want to be the one who has to answer to an angry user who just suffered a major drop in COPY performance after adding an ENUM column to his table. I am yet to be convinced that this is even theoretically a good path to follow. Any sufficiently large table could probably be partitioned and then we could use the parallelism that is being discussed for pg_restore without any modification to the backend at all. Similar tricks could be played by an external bulk loader for third party data sources. That assumes that some specific bulkloader like pg_restore, pgloader or similar is used to perform the load. Plain libpq users would either need to duplicate the logic these loaders contain, or wouldn't be able to take advantage of fast loads. Plus, I'd see this as a kind of testbed for gently introducing parallelism into postgres backends (especially thinking about sorting here). CPUs gain more and more cores, so in the long run I fear that we will have to find ways to utilize more than one of those to execute a single query. But of course the architectural details need to be sorted out before any credible judgement about the feasibility of this idea can be made... regards, Florian Pflug
Re: [HACKERS] An idea for parallelizing COPY within one backend
Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: Plus, I'd see this as a kind of testbed for gently introducing parallelism into postgres backends (especially thinking about sorting here). This thinking is exactly what makes me scream loudly and run in the other direction. I don't want threads introduced into the backend, whether gently or otherwise. The portability and reliability hits that we'll take are too daunting. Threads that invoke user-defined code (as anything involved with datatype-specific operations must) are especially fearsome, as there is precisely 0 chance of that code being thread-safe. Exactly my thinking. That is why I was looking for a way to introduce parallelism *without* threading. Though it's not so much the user-defined code that scares me, but rather the portability issues. The differences between NPTL and non-NPTL threads on linux alone make me shudder... What I was saying is that there might be a chance to get some parallelism without threading, by executing well-defined pieces of code with controlled dependencies in separate processes. COPY seemed like an ideal testbed for that idea, since the conversion of received lines into tuples seemed reasonably self-contained, and with few outside dependencies. If the idea can't be made to work there, it probably won't work anywhere. If it turns out that it does (with an API change for input/output functions), however, then it *might* be possible to apply it to other relatively self-contained parts in the future... To restate, I don't want threaded backends. Not in the foreseeable future at least. But I'd still love to see a single transaction using more than one core. regards, Florian Pflug
[HACKERS] An idea for parallelizing COPY within one backend
As far as I can see, the main difficulty in making COPY run faster (on the server) is that pretty involved conversion from plain-text lines into tuples. Trying to get rid of this conversion by having the client send something that resembles the data stored in on-disk tuples is not a good answer, either, because it ties the client too closely to backend-version specific implementation details. But those problems only arise if the *client* needs to deal with the binary format. What I envision is parallelizing that conversion step on the server, controlled by a backend process, kind of like a filter between the server and the client. Upon reception of a COPY INTO command, a backend would .) Retrieve all catalog information required to convert a plain-text line into a tuple .) Fork off a dealer and N worker processes that take over the client connection. The dealer distributes lines received from the client to the N workers, while the original backend receives them as tuples back from the workers. Neither the dealer nor the workers would need access to either the shared memory or the disk, thereby not messing with the one backend is one transaction is one session dogma. Now I'm eagerly waiting to hear all the reasons why this idea is broken as hell ;-) regards, Florian Pflug
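The dealer/worker scheme above can be sketched in miniature. This is a toy model in Python, not the proposed C implementation: a pool of worker processes converts plain-text lines into tuples while the "dealer" fans lines out and collects results in order. The tab-separated format and the two-column schema are invented for illustration, and the parser is deliberately side-effect free - the catalog-access problem Tom raises in the follow-ups is exactly what this model cannot capture:

```python
from multiprocessing import Pool

def parse_line(line):
    """Worker step: convert one text line into a tuple.

    Stands in for calling the datatype input functions. Works here only
    because it needs no shared state - the case the proposal targets.
    """
    a, b = line.rstrip("\n").split("\t")
    return (int(a), b)

def parallel_copy(lines, workers=2):
    """Dealer step: distribute lines to N workers, collect parsed
    tuples back in input order (imap preserves ordering)."""
    with Pool(workers) as pool:
        return list(pool.imap(parse_line, lines, chunksize=64))
```

The ordered collection matters: the original backend must insert tuples as if it had parsed the stream serially, so the dealer cannot simply take results as they complete.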
Re: [HACKERS] dblink doesn't honor interrupts while waiting a result
Marko Kreen wrote: On 2/25/08, Florian G. Pflug [EMAIL PROTECTED] wrote: I'm not sure what a proper fix for this could look like, since the blocking actually happens inside libpq - but this certainly makes working with dblink painful... Proper fix would be to use async libpq API, then loop on poll(2) with small timeout. You can look at pl/proxy for example code. Ah, cool, I'll check out pl/proxy. regards, Florian Pflug
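The shape of the fix Marko describes - poll with a short timeout and re-check for interrupts on every iteration instead of blocking forever - can be sketched generically. This Python sketch uses a plain file descriptor; `is_interrupted` stands in for the backend's CHECK_FOR_INTERRUPTS, and the libpq pieces it abstracts away are PQsendQuery followed by PQconsumeInput/PQisBusy once data is ready:

```python
import select

def wait_for_result(sock_fd, is_interrupted, timeout_ms=100):
    """Interruptibly wait for the remote server's reply.

    Instead of one poll() with an infinite timeout (which the kernel
    restarts after a signal, as the strace in the original report shows),
    loop with a short timeout so cancellation is noticed within
    timeout_ms milliseconds.
    """
    p = select.poll()
    p.register(sock_fd, select.POLLIN)
    while True:
        if is_interrupted():
            return None            # caller cancels the remote query
        if p.poll(timeout_ms):     # data ready: go consume the result
            return "ready"
```

The trade-off is a wakeup every timeout_ms while idle, which is why the timeout is kept small but not zero.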
[HACKERS] Behaviour of rows containg not-null domains in plpgsql
Hi I just stumbled over the following behaviour, introduced with 8.3, and wondered if this is by design or an oversight. If you define a domain over some existing type, constrain it to non-null values, and use that domain as a field type in a table definition, it seems to be impossible to declare pl/pgsql variables of that table's row type. The problem seems to be that upon declaration, the row variable is filled with nulls - but since the domain is marked not-null, that immediately triggers an exception. Here is an example CREATE DOMAIN d AS varchar NOT NULL; CREATE TABLE t (txt d); CREATE FUNCTION f() RETURNS VOID AS $$ DECLARE v_t t; BEGIN END; $$ LANGUAGE 'plpgsql' VOLATILE; SELECT f(); Note that the following works. CREATE TABLE t2 (txt varchar not null); CREATE FUNCTION f2() RETURNS VOID AS $$ DECLARE v_t t2; BEGIN END; $$ LANGUAGE 'plpgsql' VOLATILE; SELECT f2(); If you guys agree that this is a bug, I'll try to find a fix and send a patch. greetings, Florian Pflug
Re: [HACKERS] Behaviour of rows containg not-null domains in plpgsql
Andrew Dunstan wrote: Florian G. Pflug wrote: If you define a domain over some existing type, constrain it to non-null values, and use that domain as a field type in a table definition, it seems to be impossible to declare pl/pgsql variables of that table's row type. The problem seems to be that upon declaration, the row variable is filled with nulls - but since the domain is marked not-null, that immediately triggers an exception. Here is an example snipped example What seems worse is that it still fails even if you declare the domain to have a default value. I didn't try that, but what I *did* try was providing a default value for the row variable - which doesn't work either, since we do not currently support row variable defaults. The only workaround I found was to define the variable as record. regards, Florian Pflug
Re: [HACKERS] Behaviour of rows containg not-null domains in plpgsql
Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: I just stumbled over the following behaviour, introduced with 8.3, and wondered if this is by design or an oversight. No, this was in 8.2. Ah, sorry - I'm porting an app from 8.1 straight to 8.3, and blindly assumed that it'd have worked with 8.2... If you define a domain over some existing type, constrain it to non-null values, and use that domain as a field type in a table definition, it seems to be impossible to declare pl/pgsql variables of that table's row type. The problem seems to be that upon declaration, the row variable is filled with nulls - but since the domain is marked not-null, that immediately triggers an exception. What else would you expect it to do? AFAICS any other behavior would be contrary to spec. It's the inconsistency between row types (where the not-null constraint in the table definition *doesn't* prevent a declaration like myvar mytable in pl/pgsql), and domains (where the not-null constraint *does* prevent such a declaration) that bugs me. Plus, the fact that we don't support default specifications in pl/pgsql for row types turns this inconvenience into a major PITA, forcing you to use record when you know the correct type perfectly well... Is there some difficulty in implementing row-type defaults, or is it just that nobody cared enough about them to do the work? regards, Florian Pflug
Re: [HACKERS] Behaviour of rows containg not-null domains in plpgsql
Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: Plus, the fact that we don't support default specifications in pl/pgsql for row types turns this inconvenience into a major PITA, You mean initialization expressions, not defaults, correct? (I would consider the latter to mean that whatever attrdef entries were attached to the rowtype's parent table would be used implicitly.) Yeah, I mean writing declare v_var schema.table default row(); regards, Florian Pflug
[HACKERS] dblink doesn't honor interrupts while waiting a result
Hi dblink in 8.3 blocks without any possibility of interrupting it while waiting for an answer from the remote server. Here is a strace [pid 27607] rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0 [pid 27607] sendto(56, Q\0\0\0008lock table travelhit.booking_code in exclusive mode\0, 57, 0, NULL, 0) = 57 [pid 27607] rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0 [pid 27607] poll([{fd=56, events=POLLIN|POLLERR}], 1, -1) = ? ERESTART_RESTARTBLOCK (To be restarted) [pid 27607] --- SIGTERM (Terminated) @ 0 (0) --- [pid 27607] rt_sigreturn(0xf) = -1 EINTR (Interrupted system call) [pid 27607] poll( As you can see, I'm trying to lock the table travelhit.booking_code, which blocks because someone else is already holding that lock. When I send a SIGTERM to the backend, the poll() syscall is interrupted - but immediately restarted. I'm not sure what a proper fix for this could look like, since the blocking actually happens inside libpq - but this certainly makes working with dblink painful... regards, Florian Pflug
Re: [HACKERS] configurability of OOM killer
Tom Lane wrote: Florian Weimer [EMAIL PROTECTED] writes: * Alvaro Herrera: I am wondering if we can set the system up so that it skips postmaster, How much does that help? The postmaster still needs to be shut down when a regular backend dies due to SIGKILL. The $64 problem is that if the parent postmaster process is victimized by the OOM killer, you won't get an automatic restart. In most people's eyes that is considerably worse than the momentary DOS imposed by a kill of a child backend. And what we now find, which is truly staggeringly stupid on the kernel's part, is that it *preferentially* kills the parent instead of whatever child might actually be eating the memory. Maybe we should just react equally brute-force, and just disable the OOM killer for the postmaster if we're running on linux. It seems that something like echo -17 > /proc/<pid>/oom_adj should do the trick. And maybe add a note to the docs telling people to disable memory overcommit on dedicated database servers if that isn't already there... regards, Florian Pflug
Re: [HACKERS] configurability of OOM killer
Tom Lane wrote: Another thought is to tell people to run the postmaster under a per-process memory ulimit that is conservative enough so that the system can't get into the regime where the OOM killer activates. ulimit actually behaves the way we want, ie, it's polite about telling you you can't have more memory ;-). That will only work if postgres is the only service running on the machine, though, no? If the postmaster and its children use up 80% of the available memory, then launching a forkbomb will still lead to the postmaster being killed (Since it will get the most points). Or at least this is how I interpret the link posted originally. And *if* postgres is the only service, does setting a ulimit have an advantage over disabling memory overcommitting? AFAICS, memory overcommit helps if a program creates 50mb of mostly read-only data, and then forks 10 times, or if it maps a large amount of memory but writes to that block only sparsely. Since postgres does neither, a dedicated postgres server won't see any benefits from overcommitting memory, I'd think. regards, Florian Pflug
Re: [HACKERS] CLUSTER and synchronized scans and pg_dump et al
Guillaume Smet wrote: On Jan 27, 2008 9:07 PM, Markus Bertheau [EMAIL PROTECTED] wrote: 2008/1/28, Tom Lane [EMAIL PROTECTED]: Do we have nominations for a name? The first idea that comes to mind is synchronized_scanning (defaulting to ON). synchronized_sequential_scans is a bit long, but contains the keyword sequential scans, which will ring a bell with many, more so than synchronized_scanning. synchronize_seqscans? How about enable_syncscan, or enable_seqscan_sync? It's not strictly something that influences the planner, but maybe it's similar enough to justify a similar naming? regards, Florian Pflug
Re: [HACKERS] CLUSTER and synchronized scans and pg_dump et al
Steve Atkins wrote: On Jan 28, 2008, at 8:36 AM, Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Kevin Grittner wrote: It would seem reasonable to me for pg_dump to use ORDER BY to select data from clustered tables. What will be the performance hit from doing that? That worries me too. Also, in general pg_dump's charter is to reproduce the state of the database as best it can, not to improve it. One common use of cluster around here is to act as a faster version of vacuum full when there's a lot of dead rows in a table. There's no intent to keep the table clustered on that index, and the cluster flag isn't removed with alter table (why bother, the only thing it affects is the cluster command). I'm guessing that's not unusual, and it'd lead to sorting tables as part of pg_dump. I've done that too - and every time I typed that CLUSTER ... I thought why, oh why isn't there something like REWRITE TABLE table, which would work just like CLUSTER, but without the sorting ;-) Maybe something to put on the TODO list... We might even call it VACUUM REWRITE ;-) regards, Florian Pflug
Re: [HACKERS] Transaction Snapshot Cloning
Tom Lane wrote: I'm not sure what the most convenient user API would be for an on-demand hard-read-only mode, but we can't use SET TRANSACTION READ ONLY for it. It'd have to be some other syntax. Maybe just use a GUC variable instead of bespoke syntax? SET TRANSACTION is really just syntactic sugar for GUC SET operations anyway ... We could reuse the transaction_read_only GUC, adding strict as a 3rd allowed value besides on and off. And maybe make ansi an alias for on, to emphasize that one behavior is what the standard wants, and the other is a postgres extension. regards, Florian Pflug
Re: [HACKERS] Postgresql Materialized views
Tom Lane wrote: Well, my point is that taking automatic rewriting as a required feature has at least two negative impacts: * it rules out any form of lazy update, even though for many applications an out-of-date summary view would be acceptable for some purposes; * requiring MVCC consistency will probably hugely reduce the variety of views that we can figure out how to materialize, and cost performance even for the ones we can do at all. It's not zero-cost, even if you consider implementation effort and complexity as free (which I don't). There is one big additional advantage of automatic rewriting though, I believe. If we had the infrastructure to recognize the possibility of using a predefined (materialized) view for executing a query, we could also use that infrastructure to implement a kind of optimizer hint. How? We'd need statistics-materialized views, which don't materialize the full result of the view, but instead compute its statistical properties (the same ones which ANALYZE computes for a table). When planning a query we could then substitute the guessed values for rowcount and friends of a subplan with the values computed for the corresponding statistics-materialized view. However, until someone figures out *how* to *actually* recognize the possibility of using a MV for executing a query, this is just hand-waving, of course... regards, Florian Pflug
Re: [HACKERS] Transaction Snapshot Cloning
Tom Lane wrote: Chris Browne [EMAIL PROTECTED] writes: Note that we required that the provider transaction have the attributes IsXactIsoLevelSerializable and XactReadOnly both being true, so we have the mandates that the resultant backend process: a) Is in read only mode, and b) Is in serializable mode. If XactReadOnly were a hard read only constraint, that argument might be worth the electrons it's written on. I quote TFM: Now I think someone was looking into a hard read only mode for use in doing read-only queries against a PITR slave; if that ever happens it might be adaptable to serve this purpose too. But we haven't got it today. That would have been me, then ;-) I think that lazy xid assignment actually got us 90% of the way towards a hard transaction read-only constraint - nearly all data-modifying operations surely depend on the xact having an xid assigned, no? (The only exception might be nextval() and friends). I seem to remember there being some pushback against the idea of changing the semantics of set transaction read only from soft to hard semantics though - on the basis that it might break existing applications. If that has changed (or my memory tricks me ;-) ) I'd volunteer to create a patch for 8.4 to make set transaction read only a hard constraint. regards, Florian Pflug
Re: [HACKERS] module archive
Peter Eisentraut wrote: On Thursday, 25 October 2007, Andrew Dunstan wrote: From time to time people have raised the idea of a CPAN-like mechanism for downloading, building and installing extensions and the like (types, functions, sample dbs, anything not requiring Postgres itself to be rebuilt), and I have been thinking on this for the last few days. What sort of requirements would people have of such a mechanism? How do people envision it working? Downloading, building, and installing extensions is actually fairly standardized already (well, perhaps there are 2 or 3 standards, but CPAN has that as well). I think the inhibitions relate more to the management of what is installed. I imagine we need a package manager inside of PostgreSQL to manage installation, setup, removal, dependencies and so on. Much like rpm or dpkg really. That should replace the current run this .sql file mechanism, much like rpm and dpkg replaced the run make install and trust me mechanism. I have some of this mapped out in my head if there is interest. The major challenge that I see is getting updates right, especially when the package/module contains tables which the user might have added data to (Like for example pre-8.3 tsearch). Both for updates of the packages, and for upgrading postgres to a new major revision. Maybe there are some schema-versioning tools available already that might help, though... We'd also need easy integration with the real rpm and dpkg, so that distribution packages can be built easily and I can run apt-get install postgresql extension1 extension2 Wow, that is ambitious ;-) I haven't yet seen a single distribution that gets this right for CPAN, ruby gems, or anything the like - if you know one, I'd be very interested in trying it out. Speaking for myself, I'd already be very happy if I could do apt-get install postgresql postgresql-pkg and then postpkg database install whatever module.
That'd also allow postpkg to deal with the database-specific requirements of a package manager (Like specifying which db to add the module to). regards, Florian Pflug
Re: [HACKERS] PostgreSQL 8.3, libpq and WHERE CURRENT OF
Sebastien FLAESCH wrote: Forget this one, just missing the WITH HOLD option... Must teach myself a bit more before sending further mails. Seb AFAIK you cannot use WITH HOLD together with updateable cursors. I might be wrong, though... regards, Florian Pflug
Re: [HACKERS] full text search in 8.3
andy wrote: Is there any chance there is an easier way to backup/restore? On one hand, it's not too bad, and it'll only be once (correct?). Now that fts is in core, future backup/restores will work, right? I think it's analogous to telling someone they are updating from tsearch2 to tsearch3, and it might be a little more painful than just a backup/restore. On the other hand I think a backup/restore will pollute the new db with a bunch of functions and types that won't ever be used, so it's so much cleaner to build it by hand. Are there other fts users that might have opinions on that? I'm not really a tsearch user (just played with it a bit once). But I wondered if you are aware that you can prevent certain objects from being restored quite easily if you use pg_dump and pg_restore together with the custom format (-Fc). There is an option to pg_restore that reads the dump, and outputs a table of contents. You can then remove some entries from that list, and pass the modified list to pg_restore, which will skip entries that do not show up on your modified list. Maybe we could document some regexp, awk script, or similar that strips the tsearch stuff from such a table of contents? regards, Florian Pflug
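The filter Florian suggests documenting could look something like this. A sketch in Python rather than awk; the regexp is a starting point I made up, not an exhaustive match for every tsearch2 object, so it would need checking against a real `pg_restore -l` listing:

```python
import re

# Names that look like old contrib/tsearch2 objects (assumed pattern).
TSEARCH_RE = re.compile(r"\b(tsvector|tsquery|ts_parse|pg_ts_|tsearch)", re.I)

def strip_tsearch(toc_lines):
    """Drop tsearch-looking entries from a pg_restore table of contents.

    Workflow: `pg_restore -l dump.fc > toc`, filter the toc with this
    function, then `pg_restore -L toc.filtered dump.fc` restores only
    the surviving entries.
    """
    return [line for line in toc_lines if not TSEARCH_RE.search(line)]
```

Commenting lines out with a leading `;` instead of deleting them would also work, and keeps the filtered TOC easier to audit.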
Re: [HACKERS] Plan invalidation vs temp sequences
Gregory Stark wrote: Tom Lane [EMAIL PROTECTED] writes: There doesn't seem to be any very nice way to fix this. There is not any existing support mechanism (comparable to query_tree_walker) for scanning whole plan trees, which means that searching a cached plan for regclass Consts is going to involve a chunk of new code no matter how we approach it. We might want to do that someday --- in particular, if we ever try to extend the plan inval mechanism to react to redefinitions of non-table objects, we'd likely need some such thing anyway. I'm disinclined to try to do it for 8.3 though. The use-case for temp sequences seems a bit narrow and there are several workarounds (see followups to bug report), so I'm feeling this is a fix-some-other-day kind of issue. Given that sequences are in fact relations, is there some way to work around the issue, at least in this case, by stuffing the sequence's relid someplace which the plan invalidation code can check for it? Hm... couldn't this be worked around by doing create or replace function dynamic_oid(text) returns regclass as 'select $1::regclass' language sql stable; And then writing nextval(dynamic_oid('mysequence')). I didn't test this, but if it actually works, maybe we should just stick this into the docs somewhere. It's probably too late to add that function to the backend, though... As long as mysequence is really a temporary sequence, this won't even have search_path issues I think, because temporary schemas are always at the front of the search path anyway, aren't they? greetings, Florian Pflug
Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled
Heikki Linnakangas wrote: Tom Lane wrote: I tend to agree that truncating the file, and extending the fsync request mechanism to actually delete it after the next checkpoint, is the most reasonable route to a fix. Ok, I'll write a patch to do that. What is the argument against making relfilenodes globally unique by adding the xid and epoch of the creating transaction to the filename? Those 64 bits could be stuffed into 13 bytes by base-36 encoding (A-Z,0-9). The maximum length of a relfilenode would then be 10 + 1 + 13 = 24, which any reasonable filesystem should support IMHO. regards, Florian Pflug
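The arithmetic behind the 13-byte claim can be sketched like this (a toy encoder with made-up names, not actual PostgreSQL code): since 36^13 is about 1.7e20 and 2^64 is about 1.8e19, thirteen base-36 digits always suffice for a 64-bit epoch+xid value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: pack the creating transaction's epoch and xid
 * (64 bits total) into at most 13 base-36 digits (0-9, A-Z). */
static void encode_base36(uint32_t epoch, uint32_t xid, char *buf)
{
    static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    uint64_t v = ((uint64_t) epoch << 32) | xid;
    char tmp[14];
    int n = 0, i = 0;

    do {
        tmp[n++] = digits[v % 36];      /* least significant digit first */
        v /= 36;
    } while (v > 0);

    while (n > 0)                       /* reverse into caller's buffer */
        buf[i++] = tmp[--n];
    buf[i] = '\0';
}
```

Even the largest possible value (epoch and xid both 0xFFFFFFFF) encodes to exactly 13 characters, matching the length estimate in the mail above.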
Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled
Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: What is the argument against making relfilenodes globally unique by adding the xid and epoch of the creating transaction to the filename? 1. Zero chance of ever backpatching. (I know I said I wasn't excited about that, but it's still a strike against a proposed fix.) 2. Adds new fields to RelFileNode, which will be a major code change, and possibly a noticeable performance hit (bigger hashtable keys). 3. Adds new columns to pg_class, which is a real PITA ... 4. Breaks oid2name and all similar code that knows about relfilenode. Ah, Ok. I was under the impression that relfilenode in pg_class is a string of some kind. In that case only GetNewRelFileNode would have needed patching... But that is obviously not the case, as I realized now :-( Thanks for setting me straight ;-) regards, Florian Pflug
Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled
Heikki Linnakangas wrote: I wrote: Unfortunately I don't see any easy way to fix it. One approach would be to avoid reusing the relfilenodes until next checkpoint, but I don't see any nice place to keep track of OIDs that have been dropped since last checkpoint. Ok, here's one idea: Instead of deleting the file immediately on commit of DROP TABLE, the file is truncated to release the space, but not unlink()ed, to avoid reusing that relfilenode. The truncated file can be deleted after next checkpoint. Now, how does checkpoint know what to delete? We can use the fsync request mechanism for that. When a file is truncated, a new kind of fsync request, a deletion request, is sent to the bgwriter, which collects all such requests into a list. Before checkpoint calculates the new RedoRecPtr, the list is swapped with an empty one, and after writing the new checkpoint record, all the files that were in the list are deleted. We would leak empty files on crashes, but we leak files on crashes anyway, so that shouldn't be an issue. This scheme wouldn't require catalog changes, so it would be suitable for backpatching. Any better ideas? Couldn't we fix this by forcing a checkpoint before we commit the transaction that created the new pg_class entry for the clustered table? Or rather, more generally, before committing a transaction that created a new non-temporary relfilenode but didn't WAL-log any subsequent inserts. That's of course a rather sledgehammer-like approach to this problem - but at least for the back branches the fix would be less intrusive... regards, Florian Pflug
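The truncate-now, unlink-after-checkpoint scheme described above could be modeled roughly like this (a toy in-memory sketch with made-up names; the real thing would ride on the fsync-request machinery between backends and the bgwriter):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model: DROP truncates the file and queues a deletion
 * request; the actual unlink happens only after the next checkpoint
 * record has been written. */
#define MAXREQ 64

static const char *pending[MAXREQ];   /* deletion requests collected so far */
static int         npending;
static const char *unlinked[MAXREQ];  /* stand-in for actually unlink()ing */
static int         nunlinked;

static void forward_delete_request(const char *relfilenode)
{
    pending[npending++] = relfilenode;
}

static void checkpoint(void)
{
    const char *batch[MAXREQ];
    int n = npending;

    /* swap the list with an empty one before computing the new RedoRecPtr,
     * so requests arriving later wait for the *next* checkpoint */
    memcpy(batch, pending, sizeof pending);
    npending = 0;

    /* ... write the checkpoint record here ... */

    for (int i = 0; i < n; i++)   /* only now may the files be unlinked */
        unlinked[nunlinked++] = batch[i];
}
```

Requests queued after the swap survive into the next cycle, which is what prevents a relfilenode from being reused before the checkpoint that makes its deletion durable.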
Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled
Simon Riggs wrote: On Wed, 2007-10-17 at 12:11 +0100, Heikki Linnakangas wrote: Simon Riggs wrote: On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote: Second, suppose that no checkpoint has occurred during the upper series--although not quite possible; That part is irrelevant. It's forced out to disk and doesn't need recovery, with or without the checkpoint. There's no hole that I can see. No, Jacky is right. The same problem exists at least with CLUSTER, and I think there are other commands that rely on immediate fsync as well. Attached is a shell script that demonstrates the problem on CVS HEAD with CLUSTER. It creates two tables, T1 and T2, both with one row. Then T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file happens to get the same relfilenode that T1 had. Then we crash the server, forcing a WAL replay. After that, T2 is empty. Oops. Unfortunately I don't see any easy way to fix it. So, what you are saying is that re-using relfilenodes can cause problems during recovery in any command that alters the relfilenode of a relation? From what I understand, I'd say that creating a relfilenode *and* subsequently inserting data without WAL-logging causes the problem. If the relfilenode was recently deleted, the inserts might be effectively undone upon recovery (because we first replay the delete), but later *not* redone (because we didn't WAL-log the inserts). That brings me to another idea for a fix that is less heavyweight than my previous checkpoint-before-commit suggestion. We could make relfilenodes globally unique if we added the xid and epoch of the creating transaction to the filename. Those are 64 bits, so if we encode them in base 36 (using A-Z,0-9), that'd increase the length of the filenames by 13. regards, Florian Pflug
Re: [HACKERS] pg_restore oddity?
Heikki Linnakangas wrote: Mario Weilguni wrote: I cannot use -1 for performance, because some gist stuff has changed and the restore fails. But there seems to be no option for pg_restore to use transactions for data restore, so it's very very slow (one million records, each obviously in its own transaction - because a separate session select count(1) from logins shows a growing number). By default, pg_dump/pg_restore uses a COPY command for each table, and each COPY executes as a single transaction, so you shouldn't see the row count growing like that. Is the dump file in --inserts format? It would be nice to use transactions for the data stuff itself, but not for schema changes or functions. I know I can use separate pg_restore runs for schema and data, but it's complicated IMHO. pg_restore -s foo pg_restore -a -1 foo doesn't seem too complicated to me. Am I missing something? Doesn't pg_restore create the indices *after* loading the data if you let it restore the schema *and* the data in one step? The above workaround would disable that optimization, thereby making the data-restore phase much more costly. Now that I think about it, I remember that I've often wished that we not only had --schema-only and --data-only, but also --schema-unconstrained-only and --constraints-only. regards, Florian Pflug
Re: [HACKERS] Locales and Encodings
Tom Lane wrote: Peter Eisentraut [EMAIL PROTECTED] writes: Am Freitag, 12. Oktober 2007 schrieb Gregory Stark: It would make Postgres inconsistent and less integrated with the rest of the OS. How do you explain that Postgres doesn't follow the system's configurations and the collations don't agree with the system collations? We already have our own encoding support (for better or worse), and I don't think having one's own locale support would be that much different. Well, yes it would be, because encodings are pretty well standardized; there is not likely to be any user-visible difference between one platform's idea of UTF8 and another's. This is very very far from being the case for locales. See for instance the recent thread in which we found out that the en_US locale has utterly different sort orders on Linux and OS X. For me, this paragraph is more of an argument *in favour* of having our own locale support. At least for me, consistency between PG running on different platforms would bring more benefits than consistency between PG and the platform it runs on. At the company I used to work for, we had all our databases running with encoding=utf-8 and locale=C, because I didn't want our applications to depend on platform-specific locale issues. Plus, some of the applications supported multiple languages, making a cluster-global locale unworkable anyway - a restriction which would go away if we went with ICU. regards, Florian Pflug
Re: [HACKERS] Including Snapshot Info with Indexes
Csaba Nagy wrote: Can we frame a set of guidelines, or maybe some test procedure, which can declare a certain function as deterministic? You mean postgres should check your function if it is really immutable? I can't imagine any way to do it correctly in reasonable time :-) Imagine a function of 10 parameters which returns the sum of the parameters all the time except for parameters all 1 it will randomly return a value _once in a thousand executions_... please find a generic algorithm which spots this function as not immutable in reasonable execution time ;-) So this example is a bit extreme, but don't underestimate the user ;-) I think you're overly pessimistic here ;-) This classification can be done quite efficiently as long as your language is static enough. The trick is not to execute the function, but to scan the code to find all other functions and SQL statements a given function may possibly call. If your function calls no SQL statements, and only other functions already marked IMMUTABLE, then it must be IMMUTABLE itself. It does seem that only pl/pgsql is static enough for this to work, though, making this idea rather unappealing. I am just saying this off the top of my head. Even otherwise, if we can restrict this indexing to only built-in deterministic functions, don't you think it would help the cause of a majority? I have just made the proposal to create the index with snapshot an optional one. Restrictions like this are always confusing for the end user (i.e. why can I use built-ins here and not my own?). I leave it to the actual coders to say anything about code maintenance concerns... Yes, and some built-ins have gotten that classification wrong too in the past IIRC. Which probably is a good reason not to trust our users to get it right ;-) greetings, Florian Pflug
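The static classification sketched above - don't execute the function, just check what it can call - amounts to a transitive-closure test over the call graph. A minimal illustration with an entirely made-up toy call graph (this is not how any real PL validator works):

```c
#include <stdbool.h>

#define NFUNCS 4

/* Hypothetical call graph: callee[i][j] means function i calls function j.
 * f0 calls f1 (a builtin marked IMMUTABLE); f2 calls f3 (marked VOLATILE). */
static const bool callee[NFUNCS][NFUNCS] = {
    {false, true,  false, false},   /* f0 -> f1 */
    {false, false, false, false},   /* f1: leaf  */
    {false, false, false, true },   /* f2 -> f3 */
    {false, false, false, false},   /* f3: leaf  */
};
static const bool executes_sql[NFUNCS]     = {false, false, false, false};
static const bool marked_immutable[NFUNCS] = {false, true,  false, false};
static const bool marked_volatile[NFUNCS]  = {false, false, false, true };

/* A function is provably immutable if it runs no SQL, is not declared
 * volatile, and everything it calls is (provably or declaredly) immutable.
 * Assumes an acyclic call graph for simplicity. */
static bool provably_immutable(int f)
{
    if (executes_sql[f] || marked_volatile[f])
        return false;
    if (marked_immutable[f])
        return true;
    for (int g = 0; g < NFUNCS; g++)
        if (callee[f][g] && !provably_immutable(g))
            return false;
    return true;
}
```

This is exactly why the analysis only works for "static enough" languages: for pl/perl or pl/c the callee set cannot be computed from the source at all.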
Re: [HACKERS] Including Snapshot Info with Indexes
Andrew Dunstan wrote: Florian G. Pflug wrote: I think you're overly pessimistic here ;-) This classification can be done quite efficiently as long as your language is static enough. The trick is not to execute the function, but to scan the code to find all other functions and SQL statements a given function may possibly call. If your function calls no SQL statements, and only other functions already marked IMMUTABLE, then it must be IMMUTABLE itself. It does seem that only pl/pgsql is static enough for this to work, though, making this idea rather unappealing. How would you propose to analyse C functions, for which you might not have the C code? Scanning the binary, together with symbol annotations for immutability of course ;-)) No, seriously. I do *not* advocate that we actually autoclassify functions, for a lot of reasons. I just wanted to refute the statement that doing so is generally impossible - it's not. It's trivial for some languages (in Haskell, for example, all functions that don't use monads are immutable, and their signatures tell whether they use monads or not), realistic for others (pl/pgsql, where we do have the source code), and utterly impossible for others (pl/{ruby,python,perl,...}, pl/c, ...). Besides - AFAICS *anything* that makes VACUUM depend on IMMUTABLE to be correct would instantly break tsearch, no? At least as long as we allow changing stopwords and the like of dictionaries used by an index - which we'd better allow, unless we want the DBAs to come with pitchforks after us... regards, Florian Pflug, who shudders when imagining DBAs with pitchforks...
Re: [HACKERS] Including Snapshot Info with Indexes
Gokulakannan Somasundaram wrote: Hi Heikki, I am always slightly late in understanding things. Let me try to understand the use of DSM. It is a bitmap index on whether all the tuples in a particular block are visible to all the backends, or whether a particular block contains tuples which are invisible to everyone. But I think this will be subject to the same limitations as a bitmap index. Even Oracle suggests the use of bitmap indexes only for data warehousing tables, where the bitmap indexes will be dropped and recreated after every bulk load. This is not a viable alternative for OLTP transactions. But I think I am late in the game as I haven't participated in those discussions. While the DSM might be similar in spirit to a bitmap index, the actual implementation has a lot more freedom I'd say, since you can tailor it exactly to the need of tracking some summarized visibility info. So not all shortcomings of bitmap indices must necessarily apply to the DSM also. But of course that's mostly handwaving... One bitmap index block usually maps to a lot of blocks in the heap. So locking of one page to update the DSM for update/delete/insert would hurt concurrency. But again all these are my observations w.r.t. Oracle bitmap indexes. Maybe I am missing something in DSM. A simple DSM would probably contain a bit per page that says all xmin < GlobalXmin, and all xmax unset or aborted. That bit would only get SET during VACUUM, and only unset during INSERT/UPDATE/DELETE. If setting it is protected by a VACUUM-grade lock on the page, we might get away with no locking during the unset, making the locking overhead pretty small. I couldn't find that piece of discussion in the archive which discusses the design of Retail Vacuum. So please advise me again here. Let's take up Retail Vacuuming again. A user-defined function which would return different values at different times can be classified as a non-deterministic function. 
We can say that this index cannot be created on a non-deterministic function. This is the way it is implemented in Oracle. What they have done is they have classified certain built-in operators and functions as deterministic. Similarly they have classified a few as non-deterministic operators and functions. Can we follow a similar approach? Postgres already distinguishes VOLATILE, STABLE, and IMMUTABLE functions. It doesn't, however, risk physical data corruption, even if you get that classification wrong. The worst that happens AFAIK are wrong query results - but fixing your function, followed by a REINDEX, always corrects the problem. If you start poking holes into that safety net, there'll be a lot of pushback I believe - and IMHO rightly so, because people do, and always will, get such classifications wrong. greetings, Florian Pflug
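The bit-per-page bookkeeping described above ("set only by VACUUM, cleared by INSERT/UPDATE/DELETE") can be sketched in a few lines (a toy model with made-up names, ignoring the locking questions entirely):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical DSM sketch: one bit per heap page; a set bit means every
 * tuple on that page is visible to all backends. */
#define NPAGES 1024

static uint8_t dsm[NPAGES / 8];

/* set by VACUUM when it finds a page all-visible */
static void dsm_set(int page)
{
    dsm[page / 8] |= (uint8_t) (1 << (page % 8));
}

/* cleared by INSERT/UPDATE/DELETE touching the page */
static void dsm_clear(int page)
{
    dsm[page / 8] &= (uint8_t) ~(1 << (page % 8));
}

static bool dsm_all_visible(int page)
{
    return (dsm[page / 8] >> (page % 8)) & 1;
}
```

At one bit per 8 KB page the map is roughly 16 KB per GB of heap, which is what makes the "summarized visibility info" cheap enough to keep around.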
Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Kevin Grittner wrote: I omitted the code I was originally considering to have it work against files in place rather than as a filter. It seemed much simpler this way, we didn't actually have a use case for the additional functionality, and it seemed safer as a filter. Thoughts? A special non-filter mode could save some I/O and disk space by not actually writing all those zeros, but instead just seeking to SizeOfWal-1 after writing the last valid byte and writing one more zero. Of course, if you're gonna compress the WAL anyway, there is no point... greetings, Florian Pflug
[HACKERS] GCC builtins for atomic-test-and-set, memory barriers, and such
Hi When reading Tom's comment about the bug in my "use latestCompletedXid to slightly speed up TransactionIdIsInProgress" patch, I remembered that I recently stumbled across GCC builtins for atomic test-and-set and read/write reordering barriers... Has anyone looked into those? It seems that those could be used to provide a fallback spinlock implementation - though maybe we already cover all interesting targets, and it's not worth the effort. Anyway, here is the link to the GCC docs. It says that the naming of these follows some Intel spec, so presumably the Intel compiler supports the same builtins... http://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html greetings, Florian Pflug
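A fallback spinlock on top of those builtins might look roughly like this (a minimal sketch, not PostgreSQL's actual s_lock code; __sync_lock_test_and_set and __sync_lock_release are available in GCC >= 4.1 and clang):

```c
/* Minimal spinlock sketch using the GCC __sync builtins. */
typedef volatile int slock_t;

static void s_lock(slock_t *lock)
{
    /* __sync_lock_test_and_set stores 1, returns the previous value,
     * and acts as an acquire barrier */
    while (__sync_lock_test_and_set(lock, 1))
        ;                       /* spin until we observed 0 */
}

static void s_unlock(slock_t *lock)
{
    __sync_lock_release(lock);  /* stores 0 with release semantics */
}
```

Note that __sync_lock_test_and_set is only an acquire barrier and __sync_lock_release only a release barrier, which is exactly the pairing a spinlock needs - a real port would still have to provide full memory barriers separately where the platform requires them.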
Re: [HACKERS] SPI access to PostgreSQL query plan
Cristiano Duarte wrote: 2007/9/17, Tom Lane [EMAIL PROTECTED]: Cristiano Duarte [EMAIL PROTECTED] writes: Is there a way to have access to PostgreSQL query plan and/or predicates inside a function using spi (or any other way)? No. Hi Tom, No means: there is no way since the query plan is stored in a secret/safe/protected/non-visible/fort-knox like place :) or it means, there is no friendly way to do it with spi or casual c language programming? No as in: Your function is not told by the executor which filters are applied to its results, and since it might be called multiple times within a query you cannot figure that out yourself, even if you somehow got hold of the currently executed plan. So unless you start to hack the executor in serious ways, you'll either have to pass the filter condition manually to your function, or live with it producing unnecessary output rows. That only holds true for functions in languages other than sql (which is *not* the same as pl/pgsql) - SQL functions can be inlined by the executor, and then are subject to the usual optimizations. (So they essentially behave like views.) greetings, Florian Pflug
[HACKERS] A small mistake in the initial latestCompletedXid idea
Hi When I initially proposed to use the latest *committed* xid as the xmax instead of ReadNewTransactionId(), I believed that this would cause tuples created by a later aborted transaction not to be vacuumed until another transaction (with a higher xid) commits later. The idea was therefore modified to store the latest *completed* xid, instead of the latest committed one. I just realized that my fear was unjustified. AFAICS, VACUUM will always remove tuples created by aborted transactions, even if the xid is >= OldestXmin. Therefore, I suggest that we rename latestCompletedXid to latestCommittedXid, and update it only on commits. Admittedly, this won't bring any measurable performance benefit in itself (it will slightly reduce the average snapshot size, though), but not doing so might stand in the way of possible future optimizations in that area. I'll submit a patch to the patches list shortly. greetings, Florian Pflug
Re: [HACKERS] A small mistake in the initial latestCompletedXid idea
Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: Therefore, I suggest that we rename latestCompletedXid to latestCommittedXid, and update it only on commits. Admittedly, this won't bring any measurable performance benefit in itself (it will slightly reduce the average snapshot size, though), but not doing so might stand in the way of possible future optimizations in that area. This is a bad idea. As you say, it doesn't directly save anything, and the downside is that it may result in RecentGlobalXmin not moving forward. Consider a situation where there's a long string of aborts and nary a commit. latestCommittedXid won't advance, therefore each new transaction continues to compute xmin = xmax = latestCommittedXid+1, and so the window between global xmin and the newest active XIDs gets wider and wider. That puts performance stress on pg_clog and pg_subtrans buffers --- if it goes on long enough, we get into a context swap storm caused by pg_subtrans buffer thrashing. We need to be sure that xmin/xmax move forward when XIDs exit the ProcArray, whether they commit or not. Hm.. Ok.. I see your point. Maybe we should then make your initial argument against this hold, by adding a check for xid > latestCompletedXid into TransactionIdIsInProgress. That is cheap, might save some unnecessary proc array scanning, and justifies advancing latestCommittedXid during subxact abort (where your argument above doesn't hold, as you say yourself later on). Your post made me think for awhile about whether we really need to serialize aborts at all. From a transactional correctness standpoint I think maybe we don't, but again the difficulty is with xmin tracking. Agreed - For transactional correctness, aborted and in-progress transactions are similar enough that it doesn't matter much in which pot our snapshot puts them. The cases where the difference matters recheck with TransactionIdIsInProgress anyway and/or wait on the xid's lock. 
If an aborting xact can remove its XID from ProcArray without locking, then it is possible that two concurrent scans of ProcArray arrive at different xmin values, which means that GetOldestXmin might deliver an incorrectly large answer, and that's fatal. (One of the possible consequences is truncating pg_subtrans too soon, but I believe there are other ones too.) I'm not yet sure if that deviation in the xmin calculations poses any real risk, or not. After all, even the transaction ending up with the larger xmin won't actually see xids between the smaller and the larger xmin as in-progress. I haven't yet been able to come up with either a counterexample, or an argument that this is indeed safe. Subtransactions don't affect xmin, of course, so there may be an argument here that we don't have to do this stuff for a subtransaction abort. But I remain unconvinced that optimizing subtransaction abort will really buy a performance gain worth taking any risk for. That depends largely on the workload, I would think. If there is any benefit, I'd expect to see it for workloads involving deeply nested BEGIN/END/EXCEPTION blocks. Especially because we currently roll back each sub-transaction separately AFAICS, meaning we might take that exclusive lock many times in short succession. greetings, Florian Pflug
Re: [HACKERS] Final Thoughts for 8.3 on LWLocking and Scalability
Simon Riggs wrote: On Tue, 2007-09-11 at 10:21 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: 1. The ProcArrayLock is acquired Exclusive-ly by only one remaining operation: XidCacheRemoveRunningXids(). Reducing things to that level is brilliant work, Florian and Tom. It would be brilliant if it were true, but it isn't. Better look again. On the more detailed explanation, I say in normal operation. My analytical notes attached to the original post show ProcArrayLock is acquired exclusively during backend start, exit and while making a prepared (twophase) commit. So yes, it is locked Exclusively in other places, but they happen rarely and they actually add/remove procs from the array, so it's unlikely anything can change there anyhow. Well, and during normal COMMIT and ABORT, which might happen rather frequently ;-) I do agree, however, that XidCacheRemoveRunningXids() is the only site left where getting rid of it might be possible, and might bring measurable benefit for some workloads. With more effort, we might not even need it during ABORT, but I doubt that the effort would be worth it. While some (plpgsql-intensive) workloads might abort subxacts rather frequently, I doubt the same holds true for toplevel aborts. I'm actually working on a patch to remove that lock from XidCacheRemoveRunningXids(), but I'm not yet completely sure that my approach is safe. Tom had some objections that I take rather seriously. We'll see ;-) greetings, Florian Pflug
Re: [HACKERS] Maybe some more low-hanging fruit in the latestCompletedXid patch.
Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: Currently, we do not assume that either the childXids array or the xid cache in the proc array is sorted by ascending xid order. I believe that we could simplify the code, further reduce the locking requirements, and enable a transaction to de-overflow its xid cache if we assume that those arrays are in ascending xid order. de-overflowing the cache sounds completely unsafe, as other backends need that state to determine whether they need to look into pg_subtrans. We'd only de-overflow if we abort *all* xids that are missing from the xid cache. And only after marking them as aborted in the clog. If someone concurrently checks for an overflow, and already sees the new (non-overflowed) state, then he'll assume the xid is not running if he hasn't found it in the array. Which is correct - we just aborted it. Plus, removing the exclusive lock doesn't depend on de-overflowing. It's just something that seems rather easy to do once the subxid handling is in a state that allows concurrent removal of entries. If it turns out that it's not that easy, then I'll just drop the idea again. I still don't believe you can avoid taking exclusive lock, either; your argument here did not address latestCompletedXid. Sorry, not addressing latestCompletedXid was an oversight :-(. My point is that we only *need* to advance latestCompletedXid on commits. We do so for aborts only to avoid running with unnecessarily low xmins after a transaction ABORT. That corner case can only happen after a toplevel ABORT, though - aborting subxacts cannot change the xmin, because the toplevel xact will have a lower xid than any of its subtransactions anyway. We can therefore just remember the largest assigned xid for a given transaction, and update latestCompletedXid to that on toplevel commit or abort. That prevents that corner case too, without updating latestCompletedXid during subxact abort. 
But the main point remains this: there is no evidence whatsoever that these code paths are sufficiently performance-critical to be worth speeding up by making the code more fragile. The gain will be less than that of the locking improvements done so far. It will be a win for heavy users of plpgsql BEGIN/END/EXCEPTION blocks, though, I think. We'll also save some cycles in TransactionIdIsInProgress, because we can use a binary search, but that's just an added bonus. I'm currently trying to code up a patch, since it's easier to judge the correctness of actual code than that of a mere proposal. I'll do some benchmarking when the patch is done to see if it brings measurable benefits. greetings, Florian Pflug ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[HACKERS] Maybe some more low-hanging fruit in the latestCompletedXid patch.
Hi, I've already posted this idea, but I feel that I explained it rather badly. So here comes a new try. Currently, we do not assume that either the childXids array or the xid cache in the proc array is sorted in ascending xid order. I believe that we could simplify the code, further reduce the locking requirements, and enable a transaction to de-overflow its xid cache if we assume that those arrays are in ascending xid order. ** Sortedness ** Presumably, the existing code *already* guarantees that both the childXids and the xid cache *are* in fact sorted in ascending xid order - for the following reasons: A transaction (including its subtransactions) forms a tree. Each xact (or node) may have any number of children - assume that they are added left to right in the order of their creation. For every (sub)xact, all subtrees but the rightmost one are closed (meaning that no new nodes will ever be created inside them). That last assertion follows from the fact that each non-rightmost subtree is either aborted or subcommitted. Since adding a new sibling to a (sub)xact is only possible if all the other subtrees are subcommitted or aborted, it will surely get a higher xid than any node in any sibling subtree. Since the xids in the subcommitted siblings were already added to the original (sub)xact upon subcommit, the childXids of that xact will have the following structure: xids of 1st subcommitted child, xids of 2nd subcommitted child, ... Each of the sublists is guaranteed to include only xids larger than those inside its predecessors - which leads to a completely sorted array on the whole. The proc-array cache is surely sorted in ascending xid order, because we add entries as we generate them (until we overflow). ** Benefits ** Since we know that both the childXids and the xid cache array are sorted, removing entries from the cache becomes much easier.
Since we can only abort a rightmost subtree (all others are either already aborted, or subcommitted), we know that the to-be-aborted xids are the N largest xids in our whole transaction. To remove them from the proc array we overwrite their xids with InvalidTransactionId from right to left, and afterwards set the new array length. This allows us to replace the ExclusiveLock with a ShareLock. Since we overwrite the xids right-to-left, a snapshot taken concurrently will miss some of the rightmost xids. This is exactly the same situation as if we had aborted those xids one by one, instead of all in one go. (Xmin calculations are completely unaffected - the toplevel xid is *not* removed this way, and that one is surely smaller than any subxact xid.) The rather easy xid-cache removal algorithm will also detect if a previously overflowed cache is de-overflowed by the removal. (This needs a bit more checking - de-overflowing the xid cache is a bit tricky because we must not race with TransactionIdIsInProgress. But that seems doable, it'll just need the right ordering of things - maybe updating the CLOG *before* updating the proc array is already sufficient, even.) If there is interest in doing this, I can come up with a patch. greetings, Florian Pflug
Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?
Simon Riggs wrote: On Fri, 2007-09-07 at 06:36 +0200, Florian G. Pflug wrote: Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: - I actually think with just a little bit more work, we can go even further, and get rid of the ReadNewTransactionId() call completely during snapshotting. [ squint... ] This goes a bit far for me. In particular, I think this will fail in the edge case when there are no live XIDs visible in ProcArray. You cannot go back and do ReadNewTransactionId afterward, at least not without re-scanning the ProcArray a second time, which makes it at best a questionable win. Why would it? I think the additional suggestion goes a bit too far. You may be right, but I don't want to change the transaction system in advanced ways this close to the next release. We may have difficulty spotting bugs in that thinking during beta. Ok, those were two clear votes against doing this, so I'll stop arguing ;-). I do think that we should have another look at this when 8.4 opens, though. greetings, Florian Pflug
Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?
Tom Lane wrote: I've spent the past hour or so trying to consolidate the comments in GetSnapshotData and related places into a single chunk of text to be added to src/backend/access/transam/README. Attached is what I have so far --- this incorporates the idea of not taking ProcArrayLock to exit an XID-less transaction, but not yet Florian's idea. I think it'd get simpler if we changed to that, but am posting this for comments. Interlocking transaction begin, transaction end, and snapshots -- We try hard to minimize the amount of overhead and lock contention involved in the frequent activities of beginning/ending a transaction and taking a snapshot. Unfortunately, we must have some interlocking for this, because it is critical that all backends agree on the commit order of transactions. This is actually still a slightly stronger requirement than what we really need, I think. To simplify the following discussion, A -> B shall mean that transaction A saw B as committed. Conversely, A !-> B shall mean that A treats B as in-progress. If A was in read-committed mode, the visibility refers to the latest snapshot that A used. Now assume A and B commit at nearly the same time, and for two other transactions C and D the following holds: C -> A, C !-> B but D !-> A, D -> B. This would violate the requirement that the commit order is globally agreed upon, yet as long as both A !-> B and B !-> A hold, there is no conflict. (Note that if A and B are serializable, A !-> B and B !-> A imply that A and B cannot have touched the same record and have both committed - one would have been aborted due to a SerializationError.) I must admit, though, that this is a quite academic case, since the prerequisite A !-> B and B !-> A is something we have no control over for read-committed transactions - who knows when they might have taken their last snapshot... Still, I wanted to mention this because I believe that the minimal requirement that we actually *need* to enforce is: A -> B and B -> C imply A -> C.
(T1) The actual implementation will probably always have to enforce something slightly stronger, but it's still nice to know the minimal guarantee needed to be able to judge correctness. For example, suppose an UPDATE in xact A is blocked by xact B's prior update of the same row, and xact B is doing commit while xact C gets a snapshot. Xact A can complete and commit as soon as B releases its locks. If xact C's GetSnapshotData sees xact B as still running, then it had better see xact A as still running as well, or it will be able to see two tuple versions - one deleted by xact B and one inserted by xact A. In my notation this becomes: A -> B and C !-> B imply C !-> A. This then follows from (T1) - assume that A -> B, C !-> B but C -> A; then with (T1), C -> B follows from C -> A and A -> B, which contradicts C !-> B. We enforce this by not allowing any transaction to exit the set of running transactions while a snapshot is being taken. (This rule is probably stronger than necessary, but see the next problem.) The implementation of this is that GetSnapshotData takes the ProcArrayLock in shared mode (thereby allowing multiple backends to take snapshots in parallel), but xact.c must take the ProcArrayLock in exclusive mode while clearing MyProc->xid at transaction end (either commit or abort). Agreed. We *do* enforce a strict global ordering of commits and snapshots. This then guarantees (T1) because if A -> B and B -> C, then A *must* have taken its snapshot after B committed, and B in turn *must* have taken its snapshot after C committed, so surely A -> C will hold too. Here is another variant of the risk scenario:
1. Xact A is running (in Read Committed mode).
2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is swapped out before it can acquire ProcArrayLock.
3. Xact B gets new XID (= C's xmax), makes changes and commits.
4. Xact A changes some row R changed by xact B and commits.
5. Xact C finishes getting its snapshot data.
It sees xact A as done, but sees xact B as still running (since B >= xmax). Now C will see R changed by xact B and then xact A, *but* does not see other changes made by xact B. If C is supposed to be in Serializable mode, this is wrong. I never really grasped why we need to assume serializable here - this seems wrong if C is read-committed too. Seeing only half of a transaction's changes can never be right, can it? To prevent this it is necessary that GetSnapshotData acquire ProcArrayLock before it calls ReadNewTransactionId. This prevents xact A from exiting the set of running transactions seen by xact C. Therefore both A and B will be seen as still running = no inconsistency. Another point of view is that determining the xmax of a snapshot really *is* part of taking the snapshot. Since we already established that we need to serialize snapshotting and committing, it follows that we must not allow commits to happen between the
Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?
Tom Lane wrote: Here's some revised text for the README file, based on using Florian's idea of a global latestCompletedXid variable. As I worked through it I realized that in this design, XidGenLock gates entry of new XIDs into the ProcArray while ProcArrayLock gates their removal. Which is an interesting sort of symmetry property. It also turns out that the reason we need to gate entry with XidGenLock is to keep from breaking GetOldestXmin, rather than to ensure correctness of snapshots per se. I believe it would break both, no? If an xid <= latestCompletedXid is not included in the snapshot, but later used for updates, the snapshot will see those changes as committed when they really are not. But other than that, it really sounds fine. It certainly explains things much better than the comments in the existing code. I noticed two rather cosmetic issues: .) latestCompletedXid sounds as if it might refer to the *last* completed xid, but it actually refers to the largest / highest completed xid. So maybe we should call it highestCompletedXid or largestCompletedXid. .) Since you mention that we assume reading and writing int4s are atomic operations, maybe we should mention that for safety's sake we mark the corresponding pointers with volatile? greetings, Florian Pflug
Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?
Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: So I believe you're right, and we can skip taking the lock in the no xid case - I actually think with just a little bit more work, we can go even further, and get rid of the ReadNewTransactionId() call completely during snapshotting. [ squint... ] This goes a bit far for me. In particular, I think this will fail in the edge case when there are no live XIDs visible in ProcArray. You cannot go back and do ReadNewTransactionId afterward, at least not without re-scanning the ProcArray a second time, which makes it at best a questionable win. Why would it? The idea was to remember the largest committed xid, and that won't go away just because the proc array is rather empty xid-wise. Actually, in that case the largest committed xid+1 will (nearly) be what ReadNewTransactionId() returns. (Nearly because the transaction with the xid ReadNewTransactionId()-1 might have aborted, so largestCommittedXid might be a bit further behind ReadNewTransactionId().) (That slight lagging of largestCommittedXid might cause some tuples not to be VACUUMed though, so we might want to update largestCommittedXid for ABORTs too, and probably rename it to largestNonRunningXid or whatever ;-) ). I would go as far as saying that largestCommittedXid+1 is the natural choice for xmax - after all, xmax is the cutoff point after which an xid *cannot* be seen as committed, and largestCommittedXid+1 is the smallest xmax that guarantees that we see xacts committed before the snapshot as committed. The xmin computation won't change - apart from using some other initial value. This would rid us of the rather complicated entanglement of XidGenLock and the ProcArrayLock, lessen the lock contention, and reduce the average snapshot size a bit. greetings, Florian Pflug
Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?
Tom Lane wrote: Simon was complaining a bit ago that we still have problems with excessive contention for the ProcArrayLock, and that much of this stems from the need for transaction exit to take that lock exclusively. The lazy-XID patch, as committed, doesn't help that situation at all, saying
/*
 * Lock ProcArrayLock because that's what GetSnapshotData uses.
 * You might assume that we can skip this step if we had no
 * transaction id assigned, because the failure case outlined
 * in GetSnapshotData cannot happen in that case. This is true,
 * but we *still* need the lock guarantee that two concurrent
 * computations of the *oldest* xmin will get the same result.
 */
I think the comment is correct in principle - if we remove the oldest xmin without locking, then two concurrent OldestXmin calculations will get two different results. The question is if that has any negative effects, though. That leaves xmin, which AFAICS is only interesting for the computations of GetOldestXmin() and RecentGlobalXmin. And I assert it doesn't matter if those numbers advance asynchronously, so long as they never go backward. Yes, the xmin is surely the only field that might need the locking. It was this comment in GetSnapshotData that made me keep the locking in the first place:
 * It is sufficient to get shared lock on ProcArrayLock, even if we are
 * computing a serializable snapshot and therefore will be setting
 * MyProc->xmin. This is because any two backends that have overlapping
 * shared holds on ProcArrayLock will certainly compute the same xmin
 * (since no xact, in particular not the oldest, can exit the set of
 * running transactions while we hold ProcArrayLock --- see further
 * discussion just below). So it doesn't matter whether another backend
 * concurrently doing GetSnapshotData or GetOldestXmin sees our xmin as
 * set or not; he'd compute the same xmin for himself either way.
 * (We are assuming here that xmin can be set and read atomically,
 * just like xid.)
But now that I read this again, I think that comment is just misleading - especially the part So it doesn't matter whether another backend concurrently doing GetSnapshotData or GetOldestXmin sees our xmin as set or not; he'd compute the same xmin for himself either way. This sounds as if the Proc->xmin that *one* backend announces had influence over the Proc->xmin that *another* backend might compute. Which isn't true - it only influences the GlobalXmin that another backend might compute. So I believe you're right, and we can skip taking the lock in the no xid case - I actually think with just a little bit more work, we can go even further, and get rid of the ReadNewTransactionId() call completely during snapshotting. There are two things we must ensure when it comes to snapshots, commits and xid assignment. 1) A transaction must either be not in progress, be in our snapshot, or have an xid >= xmax. 2) If transaction A sees B as committed, and B sees C as committed, then A must see C as committed. ad 1): We guarantee that by storing the xid in the proc array before releasing the XidGenLock. Therefore, when we later obtain our xmax value, we can be sure that we see all xacts in the proc array that have an xid < xmax and are in progress. ad 2): We guarantee that by serializing snapshotting against committing. Since we use ReadNewTransactionId() as the snapshot's xmax this implies that we take the ProcArrayLock *before* reading our xmax value. Now, ReadNewTransactionId() is actually larger than necessary as an xmax. The minimal xmax that we can set is largest committed xid+1. We can easily track that value during commit when we hold the ProcArrayLock (if we have no xid, and therefore don't have to hold the lock, we also don't need to update that value). If we used this LatestCommittedXid as xmax, we'd still guarantee (2), but without having to hold the XidGenLock during GetSnapshotData().
I wouldn't have dared to suggest this for 8.3, but since you came up with locking improvements in the first place... ;-) greetings, Florian Pflug
Re: [HACKERS] loose ends in lazy-XID-assigment patch
Tom Lane wrote: I've committed Florian's patch, but there remain a couple of things that need work: * Should CSV-mode logging include the virtual transaction ID (VXID) in addition to, or instead of, XID? There will be many situations where there is no XID. Maybe make %x show both, or only the xid if that is set, and the vxid otherwise? That would probably be what most existing users of %x want. For those who want them separated, we'd have %v (vxid), and maybe %X (xid only). Seems a bit like overkill, though... * As things stand, when a two-phase transaction is prepared, it drops its lock on the original VXID; this seems necessary for reasons previously discussed. I made the code put an invalid VXID into the gxact structure for the prepared xact, which means that pg_locks shows things like this:
regression=# select * from pg_locks;
   locktype    | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |      mode       | granted
---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+-----------------+---------
 transactionid |          |          |      |       |            |         21774 |         |       |          | -1/0               |       | ExclusiveLock   | t
 relation      |   126093 |   126124 |      |       |            |               |         |       |          | -1/0               |       | AccessShareLock | t
 relation      |   126093 |    10969 |      |       |            |               |         |       |          | 1/260              | 20592 | AccessShareLock | t
 virtualxid    |          |          |      |       |            |               |         |       |          | 1/260              | 20592 | ExclusiveLock   | t
(4 rows)
This seems fairly undesirable :-( not least because you can't tell one prepared xact from another and thus can't see which locks belong to each. But I'm unsure what to do about it. We could have the prepared xact continue to display the original VXID, but there would be no certainty about the VXID remaining unique, which seems bad. Another possibility is to put back the transaction ID column, but since that's not unique for read-only transactions, we still don't have anything usable as a join key.
The best idea I can think of is to make the virtualtransaction column read out the VXID for regular transactions and the transaction ID for prepared transactions, or maybe the transaction ID for any transaction that has one and the VXID just for read-only xacts. We can get away with that because the column is only text and not any better-defined datatype. It seems mighty ugly though; and changing the ID shown for a transaction mid-stream isn't very pleasant either. We could make the VXID in the gxact struct be backendId=InvalidBackendId, lxid=xid. That'd still be an invalid vxid, but not the same for every prepared transaction. If we take this further, we could get rid of the lock on the xid completely, I believe. We'd define some PermanentBackendId (let's say, -2, since -1 is taken). When preparing the xact, we'd drop the lock on the old VXID, and instead acquire one on (PermanentBackendId, xid). Waiting on an xid would become a bit tricky, but doable I think. We'd have to first check the procarray - if we find the xid there, we translate it to a vxid, and wait on that. Afterwards (whether we found a vxid, or not) we wait on (PermanentBackendId, xid). That doesn't exactly make XactLockTableWait cheaper, but that might be OK. I haven't really thought this through, though. I think that with careful ordering of things we can control the race conditions this might pose - but I'm not sure at this point. greetings, Florian Pflug
Re: [HACKERS] Per-function GUC settings: trickier than it looked
Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: Tom Lane wrote: So it seems that only SET LOCAL within a function with per-function GUC settings is at issue. I think that there is a pretty strong use-case for saying that if you have a per-function setting of a particular variable foo, then any SET LOCAL foo within the function ought to vanish at function end --- for instance a function could want to try a few different search_path settings and automatically revert to the caller's setting on exit. Agreed. The question is what about SET LOCAL on a variable that *hasn't* been explicitly SET by the function definition. Either approach we take with it could be surprising, but probably having it revert at function end is more surprising... At least for me, the least surprising behaviour would be to revert it too. Then the rule becomes a function is always executed in a pseudo-subtransaction that affects only GUCs. Since at least for pl/pgsql, a function body *already* is a BEGIN/END block - and therefore syntactically even looks like a subtransaction - this seems quite logical. And it would mean that the semantics of SET LOCAL won't change, just because you add an EXCEPTION clause to the function's toplevel BEGIN/END block. greetings, Florian Pflug
Re: [HACKERS] Per-function GUC settings: trickier than it looked
Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: At least for me, the least surprising behaviour would be to revert it too. Then the rule becomes a function is always executed in a pseudo-subtransaction that affects only GUCs. Only if it has at least one SET clause. The overhead is too high to insist on this for every function call. In that case, I agree that only variables specified in a SET-clause should be reverted. Otherwise, adding or removing SET-clauses (e.g., because you chose a different implementation of a function that suddenly doesn't need regexps anymore) will cause quite arbitrary behavior changes. And the rule becomes (I tend to forget things, so I like simple rules that I can remember ;-) ): For each SET-clause, there is a pseudo-subtransaction affecting only *this* GUC. greetings, Florian Pflug
Re: [HACKERS] Per-function GUC settings: trickier than it looked
Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: And the rule becomes (I tend to forget things, so I like simple rules that I can remember ;-) ): For each SET-clause, there is a pseudo-subtransaction affecting only *this* GUC. The other question is whether we want to change the behavior of SET LOCAL even in the absence of function SET-clauses. The current rule is that a LOCAL setting goes away at subtransaction commit, leading to this behavior:
regression=# show regex_flavor;
 regex_flavor
--------------
 advanced
(1 row)

regression=# begin;
BEGIN
regression=# savepoint x;
SAVEPOINT
regression=# set local regex_flavor to basic;
SET
regression=# release x;
RELEASE
regression=# show regex_flavor;
 regex_flavor
--------------
 advanced
(1 row)
which makes some sense if you think of release as subtransaction end, but not a lot if you think of it as forgetting a savepoint. Likewise, SET LOCAL within a plpgsql exception block goes away at successful block exit, which is not the first thing you'd expect. Neither of these behaviors are documented anywhere AFAIR; certainly the SET reference page doesn't explain 'em. I think we should probably take this opportunity to fix that, and make SET LOCAL mean persists until end of current top-level transaction, unless rolled back earlier or within a function SET clause. So:
* Plain SET takes effect immediately and persists unless rolled back or overridden by another explicit SET. In particular the value will escape out of a function that has a SET-clause for the same variable.
* SET LOCAL takes effect immediately and persists until rolled back, overridden by another SET, or we exit a function that has a SET-clause for the same variable.
* Rollback of a transaction or subtransaction cancels any SET or SET LOCAL within it. Otherwise, the latest un-rolled-back SET or SET LOCAL determines the active value within a transaction, and the latest un-rolled-back SET determines the value that will prevail after the transaction commits.
* A function SET clause saves the entry-time value, and restores it at function exit, except when overridden by an un-rolled-back SET (but not SET LOCAL) within the function. Clear to everyone? Any objections? That makes SET LOCAL completely equivalent to SET, except when used inside a function that has a corresponding SET-clause, right? So I think *if* this is done, SET LOCAL should be renamed to SET FUNCTION. This would also prevent confusion, because everyone who currently uses SET LOCAL will have to change his code anyway, since the semantics change for every use-case apart from functions with SET-clauses, which don't exist in 8.2. Or am I overlooking something? And renaming SET LOCAL also emphasizes the point that we are taking away functionality here - even if that functionality might not seem very useful. BTW, I *did* check the documentation before responding to Simon's original mail, and I *did* read it as SET LOCAL goes away at subtransaction end. I figured that since there is no word on subtransactions in that part of the documentation, transaction will apply equally to both toplevel and subtransaction. It might very well be that I'm the only one who read it that way, though ;-) And I must admit that I wasn't completely sure, so I *did* try it out before I posted... I'd strongly prefer SET LOCAL to keep its current semantics, only that SET LOCAL changes will now be rolled back if the function has a matching SET-clause. For multiple reasons:
.) It's useful to be able to temporarily change GUCs from a client, and being able to reset them afterwards. Using a subtransaction for this is maybe a bit wasteful, but at least it works.
.) In pl/pgsql, the fact that SET LOCAL goes away after the current BEGIN/END block seems entirely logical.
.) It doesn't take away existing functionality
greetings, Florian Pflug
Re: [HACKERS] Per-function GUC settings: trickier than it looked
Tom Lane wrote: Florian G. Pflug [EMAIL PROTECTED] writes: Tom Lane wrote: Clear to everyone? Any objections? That makes SET LOCAL completely equivalent to SET, except when used inside a function that has a corresponding SET-clause, right? Maybe it wasn't clear :-(. They aren't equivalent because in the absence of rollback, SET's effects persist past main-transaction end; SET LOCAL's don't. That's the way they were defined originally (pre-subtransactions) and it still seems to make sense. Ah, OK - things make much more sense now. So I think *if* this is done, SET LOCAL should be renamed to SET FUNCTION. This would also prevent confusion, because everyone who currently uses SET LOCAL will have to change his code anyway, since the semantics change for every use-case apart from functions with SET-clauses, which don't exist in 8.2. I'm not sure how many people have really written code that depends on the behavior of SET LOCAL rolling back at successful subtransaction end. I think we'd have heard about it if very many people had noticed, because it's not what the manual says. For the one use we've actually advocated (setting a temporary value within a function and then reverting to the old setting before exit), there isn't any visible change in behavior, since abandonment of the restored value at subtransaction end still ends up with the same result. And renaming SET LOCAL also emphasizes the point that we are taking away functionality here - even if that functionality might not seem very useful. We can't break the officially advocated solution for secure search_path. However, that particular coding pattern will still work with the change I'm proposing. It's only where you *don't* manually restore the prior value that you might notice a difference. BTW, I *did* check the documentation before responding to Simon's original mail, and I *did* read it as SET LOCAL goes away at subtransaction end.
I figured that since there is no word on subtransactions in that part of the documentation, transaction will apply equally to both toplevel and subtransaction. Yeah, but you know that it's subtransactions under the hood, whereas someone who's thinking in terms of SAVEPOINT/RELEASE and BEGIN/EXCEPTION probably hasn't a clue about that. I plead guilty here ;-). That whole SAVEPOINT/RELEASE thing always seemed strange to me - I just accepted it at some point, but still translated it into something hierarchical, I guess. .) In pl/pgsql, the fact that SET LOCAL goes away after the current BEGIN/END block seems entirely logical. I don't think so ... your other side-effects such as table updates don't disappear, so why should SET's? I guess because LOCAL to me implies some lexical locality - like the surrounding BEGIN/END block. I'm not necessarily averse to inventing a third version of SET, but I don't see a well-thought-out proposal here. In particular, we should be making an effort to *not* expose the concept of subtransaction at the SQL level at all, because that's not what the spec has. Thanks for your explanation - I can see your point now, after realizing why the spec has SAVEPOINT/RELEASE and *not* nested BEGIN/COMMIT blocks. So, at least on the SQL-level, I guess I agree - your new semantics fit better with the SQL spec, even if they seemed quite strange to me at first sight. Though maybe we should add SET TRANSACTION as a synonym for SET LOCAL? - the former seems to convey your new semantics much better than the latter. It still seems a bit strange that SET LOCAL is undone at function-exit, if the function has a matching SET-clause. But we need that for backwards-compatibility of the secure-search_path workaround, right? Maybe we could make SET TRANSACTION different from SET LOCAL in pl/pgsql, and warn if SET LOCAL is used? That would enable us to either get rid of SET LOCAL in the long term, or to really make it local to the surrounding BEGIN/END block.
So, to reiterate, my idea is
.) Make SET TRANSACTION a synonym for SET LOCAL at the SQL-Level.
.) In pl/pgsql, SET TRANSACTION sets a new value that is kept after the function exits, even if the function has a matching SET-clause.
.) SET LOCAL in pl/pgsql sets a new value that is kept if the function has no matching SET-clause. If it has one, the value is restored. In any case, we emit a warning that SET LOCAL is going away.
.) One day, make SET LOCAL in pl/pgsql mean local to the surrounding BEGIN/END block. Independent of any SET-clauses the function might or might not have.
The last idea might seem to create an inconsistency between the SQL-level and pl/pgsql, but I think it does not. SET LOCAL is local to the surrounding BEGIN/{END|COMMIT} block in both cases - it's just that you have nested such blocks in pl/pgsql, but not in plain SQL. greetings, Florian Pflug
Re: [HACKERS] Per-function GUC settings: trickier than it looked
Tom Lane wrote:
>> So, to reiterate, my idea is:
>> .) Make SET TRANSACTION a synonym for SET LOCAL at the SQL level.
>> .) In pl/pgsql, SET TRANSACTION sets a new value that is kept after the function exits, even if the function has a matching SET clause.
>> .) SET LOCAL in pl/pgsql sets a new value that is kept if the function has no matching SET clause. If it has one, the value is restored. In any case, we emit a warning that SET LOCAL is going away.
>> .) One day, make SET LOCAL in pl/pgsql mean local to the surrounding BEGIN/END block, independent of any SET clauses the function might or might not have.
> I don't think it's a good idea to change SET LOCAL now and plan on changing it again later ;-). If we really want BEGIN-block-local SET capability, I'd prefer to think of some new keyword for that. But I'm not convinced it's interesting --- given the proposed behavior of function SET-clauses, attaching a SET to your function seems like it'll cover the need for restoring outer values.

Hm... could we still have SET TRANSACTION as a synonym for SET LOCAL? That would blend nicely with SET TRANSACTION ISOLATION LEVEL and SET TRANSACTION READ ONLY.

[ thinking... ] Hey, wait a moment. Regarding SET TRANSACTION READ ONLY - this is not strictly speaking a GUC, but still, if we pretend that there are no subtransactions, that command should also propagate to the outermost transaction on release, shouldn't it?
This is what happens currently (CVS HEAD with at least your initial function-SET-clause patch already in):

regression=# begin ;
BEGIN
regression=# savepoint s1 ;
SAVEPOINT
regression=# set transaction read only ;
SET
regression=# release s1 ;
RELEASE
regression=# create table test (id int) ;
CREATE TABLE
regression=# commit ;
COMMIT

compared to:

regression=# begin ;
BEGIN
regression=# set transaction read only ;
SET
regression=# create table test (id int) ;
ERROR: transaction is read-only

I believe that, for consistency's sake, the "set transaction read only" should have propagated to the outermost transaction on "release s1".

greetings, Florian Pflug
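[Editorial note: the consistency argument in the transcript above can be modeled in a few lines of Python. This is an illustrative sketch, not PostgreSQL code; `Transaction`, `create_table`, etc. are made-up names. It shows what the transcript argues *should* happen: once a READ ONLY mode set inside a savepoint is released, it binds the whole transaction, so the subsequent CREATE TABLE would fail.]

```python
# Conceptual model (NOT PostgreSQL code): one read-only flag per
# (sub)transaction level; RELEASE propagates the flag outward, so the
# CREATE TABLE in the first transcript would raise an error, matching
# the second transcript.

class Transaction:
    def __init__(self):
        self.read_only_stack = [False]  # index 0 = top-level transaction

    def savepoint(self):
        self.read_only_stack.append(self.read_only_stack[-1])

    def set_read_only(self):
        self.read_only_stack[-1] = True

    def release(self):
        # proposed behavior: the flag survives the subtransaction
        flag = self.read_only_stack.pop()
        self.read_only_stack[-1] = self.read_only_stack[-1] or flag

    def create_table(self, name):
        if self.read_only_stack[-1]:
            raise RuntimeError("transaction is read-only")
        return f"CREATE TABLE {name}"
```

With propagate-on-release semantics, `savepoint(); set_read_only(); release()` followed by `create_table("test")` raises "transaction is read-only" instead of silently succeeding.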
Re: [HACKERS] [PATCH] Lazy xid assingment V2
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> I had an idea this morning that might be useful: back off the strength of what we try to guarantee. Specifically, does it matter if we leak a file on crash, as long as it isn't occupying a lot of disk space? (I suppose if you had enough crashes to accumulate many thousands of leaked files, the directory entries would start to be a performance drag, but if your DB crashes that much you have other problems.) This leads to the idea that we don't really need to protect the open(O_CREAT) per se. Rather, we can emit a WAL entry *after* successful creation of a file, while it's still empty. This eliminates all the issues about logging an action that might fail. The WAL entry would need to include the relfilenode and the creating XID. Crash recovery would track these until it saw the commit or abort or prepare record for the XID, and if it didn't find any, would remove the file.
> That idea, like all other approaches based on tracking WAL records, fails if there's a checkpoint after the WAL record (and that's quite likely to happen if the file is large). WAL replay wouldn't see the file-creation WAL entry, and wouldn't know to track the xid. We'd need a way to carry the information over checkpoints.

Yes, checkpoints would need to include a list of created-but-not-yet-committed files. I think the hardest part is figuring out a way to get that information to the backend doing the checkpoint - my idea was to track them in shared memory, but that would impose a hard limit on the number of concurrent file creations. Not nice :-(

But wait... I just had an idea. We already have such a central list of created-but-uncommitted files - pg_class itself. There is a small window between file creation and inserting the name into pg_class - but as Tom says, if we leak a file then, it won't use up much space anyway. So maybe we should just scan pg_class on VACUUM, and obtain a list of files that are referenced only from DEAD tuples.
Those files we can then safely delete, no?

If we *do* want a strict no-leakage guarantee, then we'd have to update pg_class before creating the file, and flush the WAL. If we take Alvaro's idea of storing temporary relations in a separate directory, we could skip the flush for those, because we can just clean out that directory after recovery. Having to flush the WAL when creating non-temporary relations doesn't sound too bad - those operations won't occur very often, I'd say.

greetings, Florian Pflug
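[Editorial note: the VACUUM-time cleanup proposed above can be sketched as a set computation. This is a hedged illustration, not PostgreSQL code; the function name and input shapes are invented. It deliberately leaves files that appear in *no* pg_class tuple alone, since those may sit in the small creation-to-insert window the mail says we tolerate.]

```python
# Conceptual sketch (NOT PostgreSQL code): scan pg_class row versions,
# keep any file referenced by at least one live tuple, and report files
# referenced only from dead tuples as safe to delete.

def orphaned_files(pg_class_rows, files_on_disk):
    """pg_class_rows: iterable of (relfilenode, tuple_is_live) pairs,
    one entry per row version. files_on_disk: set of relfilenodes."""
    live = {node for node, is_live in pg_class_rows if is_live}
    referenced = {node for node, _ in pg_class_rows}
    # referenced only by dead tuples -> leaked, safe to remove;
    # unreferenced files are left alone (creation window)
    return {f for f in files_on_disk
            if f in referenced and f not in live}
```

For example, a relfilenode that appears only in dead pg_class tuples (say, from a CREATE TABLE whose transaction aborted) would be flagged, while one that also has a live row version is kept.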
Re: [HACKERS] [PATCH] Lazy xid assingment V2
August Zajonc wrote:
>> Yes, checkpoints would need to include a list of created-but-not-yet-committed files. I think the hardest part is figuring out a way to get that information to the backend doing the checkpoint - my idea was to track them in shared memory, but that would impose a hard limit on the number of concurrent file creations. Not nice :-(
> I'm confused about this. As long as we assert the rule that the file name can't change on the move, then after commit the file can be in only one of two places. The name of the file is known (i.e., pg_class). The directories are known. What needs to be carried forward past a checkpoint? We don't even look at WAL, so checkpoints are irrelevant, it seems. If there is a crash just after commit and before the move, no harm. You just move on startup. If the move fails, no harm, you can emit a warning and open in /pending (or simply error, even easier).

If you're going to open the file from /pending, what's the point of moving it in the first place? The idea would have to be that you move on commit (or on COMMIT-record replay, in case of a crash), and then, after recovering the whole WAL, you could remove leftover files in /pending.

The main problem is that you have to do the move *after* flushing the COMMIT record to disk - otherwise you're going to leak the file if you crash between moving and flushing. But that implies that the transaction is *already* committed when you do the move. Others won't know that yet (you do the move *after* flushing, but *before* updating the CLOG) - but still, since the COMMIT record is on disk, you cannot roll back anymore (since if you crash and replay the COMMIT record, the transaction *will* be committed). So, what are you going to do if the move fails? You cannot roll back, and you cannot update the CLOG (because then others would see your new table, but no datafile). The only option is to PANIC.
This will lead to a server restart, WAL recovery, and probably another PANIC once the COMMIT record is replayed (since the move probably still won't be possible).

It might be even worse - I'm not sure that a rename is an atomic operation on most filesystems. If it's not, then you might end up with two files if power fails *just* as you rename, or, worse, with no file at all. Even a slight possibility of the second case seems unacceptable - it means losing a committed transaction.

I agree that we should eventually find a way to guarantee either no file leakage, or at least an upper bound on the amount of wasted space. But doing so at the cost of PANICing if the move fails seems like a bad tradeoff...

greetings, Florian Pflug
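[Editorial note: the ordering argument above can be made concrete with a small sketch. This is an assumed sequence of steps, not actual PostgreSQL commit code; `PanicError` and the callback names are invented. It shows why, once the COMMIT record is durable, a failed move leaves no recoverable option.]

```python
# Conceptual sketch (NOT PostgreSQL code) of the commit sequence the
# mail argues against: flush COMMIT record, move the datafile, update
# the clog. A failed move lands in a state that can neither be rolled
# back nor completed, hence PANIC.

class PanicError(Exception):
    pass

def commit_with_move(flush_commit_record, move_datafile, update_clog):
    flush_commit_record()   # after this, crash recovery WILL commit us
    try:
        move_datafile()     # e.g. rename out of a /pending directory
    except OSError as e:
        # cannot roll back (the COMMIT record is durable) and cannot
        # update the clog (others would see the table but no datafile)
        raise PanicError("could not move committed relation file") from e
    update_clog()           # only now do other backends see the commit
```

Reordering the steps doesn't help: moving before the flush leaks the file on a crash between move and flush, which is exactly the window the move was supposed to close.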
Re: [HACKERS] [PATCH] Lazy xid assingment V2
Tom Lane wrote:
> Florian G. Pflug [EMAIL PROTECTED] writes:
>> It might be even worse - I'm not sure that a rename is an atomic operation on most filesystems.
> rename(2) is specified to be atomic by POSIX, but relinking a file into a different directory can hardly be --- it's not even provided as a single kernel call, is it?

I'd have thought that they only guarantee that if the new name already exists, it's atomically replaced. But I might be wrong.

> And there's still the problem that changing the filename on-the-fly is going to break tons of low-level stuff, most of which is not supposed to know about transactions at all, notably bgwriter.

Good point - I thought that we wouldn't have to care about this, because we could close the relation before renaming in the committing backend and be done with it, since other backends won't see the new file before we update the clog. But you're right, bgwriter is a problem, and one not easily solved... So that rename-on-commit idea seems to be quite dead...

> What I was thinking about was a flag file separate from the data file itself, a bit like what we use for archiver signaling. If nnn is the new data file, then touch nnn.new to mark the file as needing to be deleted on restart. Remove these files just *before* commit. This leaves you with a narrow window between removing the flag file and actually committing, but there's no risk of having to PANIC --- if the remove fails, you just abort the transaction.

Hm... we could call the file nnn.xid.new, and delete it after the commit, silently ignoring any failures. During both database-wide VACUUM and after recovery we'd remove any leftover *.xid.new files, but only if the xid is marked committed in the clog. After that cleanup step, we'd delete any files which still have an associated flag file.
Processing those nnn.xid.new files during VACUUM is only needed to avoid problems due to xid wraparound - it could perhaps be avoided by naming the file nnn.epoch.xid.new instead.

> However, this has nonzero overhead compared to the current behavior. I'm still dubious that we have a problem that needs solving ...

I agree that file leakage is not a critical problem - if it were, there'd be many more complaints... But it's still something that a postgres DBA has to be aware of, because it might bite you quite badly. Since IMHO admin-friendliness is one of the strengths of PostgreSQL, removing the possibility of leakage would be nice in the long term. Nothing that needs any rushing, though - and nothing that we'd want to pay for in terms of performance.
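[Editorial note: the nnn.xid.new cleanup protocol discussed above can be sketched as a pure function over a directory listing. This is a hedged illustration, not PostgreSQL code; `recovery_cleanup` and `clog_committed` are invented stand-ins for a real directory scan and clog lookup.]

```python
# Conceptual sketch (NOT PostgreSQL code) of the flag-file cleanup:
# leftover flags are removed in either case; data files whose creating
# xid never committed are removed together with their flag.

def recovery_cleanup(files_on_disk, clog_committed):
    """files_on_disk: set of names like {'16401', '16401.5.new'}.
    clog_committed(xid) -> bool stands in for a clog lookup.
    Returns (flags_to_remove, datafiles_to_remove)."""
    flags_to_remove, datafiles_to_remove = set(), set()
    for name in files_on_disk:
        if not name.endswith('.new'):
            continue
        relfilenode, xid, _ = name.split('.')
        flags_to_remove.add(name)  # leftover flags go away either way
        if not clog_committed(int(xid)) and relfilenode in files_on_disk:
            # creator aborted or crashed uncommitted: the file is leaked
            datafiles_to_remove.add(relfilenode)
    return flags_to_remove, datafiles_to_remove
```

Run after recovery (and, per the wraparound caveat above, during database-wide VACUUM), this drops exactly the files whose creating transaction never committed, while committed creators merely lose their now-useless flag file.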