Re: [HACKERS] How to avoid base backup in automated failover
On Wednesday, October 17, 2012 11:22 AM chinnaobi wrote:
> Hey Haas,
>
> What does the standby server have to wait for replication to catch up
> before promoting ?? Is there any parameter to configure this ??
>
> Few more questions on this part
>
> 1. How could we ensure the standby has received all transactions sent by
> primary till the point primary server is dead. (Meaning the dead primary
> and standby server are exactly same, so that the dead primary comes back
> it can be turned to standby without any issues).
> 2. When the dead primary is turned to standby the streaming is not
> happening due to current_wal_location is ahead in the standby server is
> ahead of wal_sent_location. In this case how can I start streaming
> without taking a fresh base backup ??
> 3. When the dead primary comes back the DB still accepts data and it
> goes to out of sync with the current primary and streaming won't start.
> Is there any solution for this case ??

I think points 2 & 3 can be addressed by the new feature being implemented by Heikki, provided the standby has received all WAL of the primary up to the point it went down.

https://commitfest.postgresql.org/action/patch_view?id=933

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] CREATE QUEUE (log-only table) for londiste/pgQ compatibility
On 16 October 2012 23:03, Josh Berkus wrote:
> Can you explain in more detail how this would be used on the receiving
> side? I'm unable to picture it from your description.

This will allow implementation of pgq in core, as discussed many times at cluster hackers meetings.

> I'm also a bit reluctant to call this a "message queue", since it lacks
> the features required for it to be used as an application-level queue.

It's the input end of an application-level queue. In this design the queue is like a table, so we need SQL grammar to support this new type of object. "Replication message" doesn't describe this, since it has little if anything to do with replication and, if anything, it's a message type, not a message.

You're right that Hannu needs to specify the rest of the design and outline the API. The storage of the queue is "in WAL", which raises questions about how the API will guarantee we read just once from the queue and what happens when the queue overflows. The simple answer would be to put everything in a table somewhere else, but that needs more careful specification to show we have both ends of the queue and a working design.

Do we need a new object at all? Can we not just define a record type, then define messages using that type? At the moment I think the named-object approach works better, but we should consider that.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] How to avoid base backup in automated failover
Hey Haas,

What does the standby server have to wait for replication to catch up before promoting ?? Is there any parameter to configure this ??

Few more questions on this part:

1. How could we ensure the standby has received all transactions sent by the primary up to the point the primary server died? (Meaning the dead primary and the standby server are exactly the same, so that when the dead primary comes back it can be turned into a standby without any issues.)

2. When the dead primary is turned into a standby, streaming does not happen because the current WAL location on the old primary is ahead of the WAL location that had been sent to the standby. In this case how can I start streaming without taking a fresh base backup ??

3. When the dead primary comes back, the DB still accepts data, so it goes out of sync with the current primary and streaming won't start. Is there any solution for this case ??

Reddy.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/How-to-avoid-base-backup-in-automated-failover-tp5711147p5728518.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
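One practical way to approach question 1 with stock tooling (a sketch, not a complete failover recipe): compare WAL positions on the two servers before trusting that the standby has everything. The function names below are the 9.x-era spellings; later releases renamed them to the pg_*_lsn variants.

```sql
-- On the primary: the WAL position generated so far.
SELECT pg_current_xlog_location();

-- On the standby: the WAL positions received and replayed.
-- The standby holds everything the primary produced only when the
-- received location has caught up to the primary's current location.
SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();
```

Comparing these immediately before promotion is still racy if the primary dies mid-stream, which is exactly why the thread points at the timeline/rewind work linked above.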
Re: [HACKERS] WebSphere Application Server support for postgres
On Mon, Oct 15, 2012 at 10:28 AM, Dave Page wrote:
> [Removing all lists except -hackers. Please do not cross-post to every
> list again!]
>
> On Mon, Oct 15, 2012 at 9:22 AM, John Nash wrote:
>> Hi,
>>
>> Our IT company's systems architecture is based on IBM WebSphere
>> Application Server. We would like to migrate our databases to
>> Postgres; the main problem which stops us from doing that is that
>> Postgres is not supported by IBM WebSphere Application Server.
>> There is a Request for Enhancement that has been opened on IBM's site
>> in order to solve this issue. If you are interested in this
>> enhancement being done, please vote for it at the following link:
>>
>> http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=27313
>
> A login is required to access that site. Can you provide a link or
> info that doesn't require login please?

The relevant content of the ticket is basically just:

Description: WebSphere Application Server support for JDBC access to PostgreSQL databases.

Use case: In WAS, be able to define and use JDBC providers and datasources pointing to PostgreSQL databases.

Florent

--
Florent Guillaume, Director of R&D, Nuxeo
Open Source, Java EE based, Enterprise Content Management (ECM)
http://www.nuxeo.com   http://www.nuxeo.org   +33 1 40 33 79 87
[HACKERS] Doc patch "only relevant" -> "relevant only"
Hi,

As long as I'm sending in trivial fixes to the docs, here's a bit of wording that's been bugging me. In a number of places the docs read "only relevant"; this patch reverses this to read "relevant only". I believe this reads better because it quickly answers the question "is what?" with "is relevant", making the sentence less of a strain to read. "Only relevant" would be better if you really wanted to emphasize the "only", which I don't think is called for.

(Sending in such trivial patches makes me feel like I'm bikeshedding. Feel free to ignore them without comment.)

Regards,

Karl
Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 71cf59e..02763b5 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -3596,7 +3596,7 @@ SELECT person.name, holidays.num_weeks FROM person, holidays
     as the canonical form for MAC addresses, and specifies the first
     form (with colons) as the bit-reversed notation, so that
     08-00-2b-01-02-03 = 01:00:4D:08:04:0C.  This convention is widely
-    ignored nowadays, and it is only relevant for obsolete network
+    ignored nowadays, and it is relevant only for obsolete network
     protocols (such as Token Ring).  PostgreSQL makes no provisions
     for bit reversal, and all accepted formats use the canonical LSB
     order.
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 4503830..b19f15e 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -427,7 +427,7 @@ PostgreSQL documentation
     Specify the superuser user name to use when disabling triggers.
-    This is only relevant if --disable-triggers is used.
+    This is relevant only if --disable-triggers is used.
     (Usually, it's better to leave this out, and instead start the
     resulting script as superuser.)
@@ -602,7 +602,7 @@ PostgreSQL documentation
     --disable-triggers
-    This option is only relevant when creating a data-only dump.
+    This option is relevant only when creating a data-only dump.
     It instructs pg_dump to include commands
     to temporarily disable triggers on the target tables while
     the data is reloaded.  Use this if you have referential
diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml
index 7c49c03..253ee01 100644
--- a/doc/src/sgml/ref/pg_dumpall.sgml
+++ b/doc/src/sgml/ref/pg_dumpall.sgml
@@ -190,7 +190,7 @@ PostgreSQL documentation
     Specify the superuser user name to use when disabling triggers.
-    This is only relevant if --disable-triggers is used.
+    This is relevant only if --disable-triggers is used.
     (Usually, it's better to leave this out, and instead start the
     resulting script as superuser.)
@@ -283,7 +283,7 @@ PostgreSQL documentation
     --disable-triggers
-    This option is only relevant when creating a data-only dump.
+    This option is relevant only when creating a data-only dump.
     It instructs pg_dumpall to include commands
     to temporarily disable triggers on the target tables while
     the data is reloaded.  Use this if you have referential
diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index b276da6..2993369 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -383,7 +383,7 @@
     Specify the superuser user name to use when disabling triggers.
-    This is only relevant if --disable-triggers is used.
+    This is relevant only if --disable-triggers is used.
@@ -458,7 +458,7 @@
     --disable-triggers
-    This option is only relevant when performing a data-only restore.
+    This option is relevant only when performing a data-only restore.
     It instructs pg_restore to execute commands
     to temporarily disable triggers on the target tables while the
     data is reloaded.  Use this if you have referential
[HACKERS] Doc patch, put commas in the right place in pg_restore docs
Hi,

Simple punctuation change to the pg_restore docs.

Karl
Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein

diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index b276da6..e3520c0 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -468,9 +468,9 @@
     Presently, the commands emitted for
---disable-triggers must be done as superuser.  So, you
-should also specify a superuser name with -S, or
-preferably run pg_restore as a
+--disable-triggers must be done as superuser.  So you
+should also specify a superuser name with -S or,
+preferably, run pg_restore as a
     PostgreSQL superuser.
Re: [HACKERS] Suggestion for --truncate-tables to pg_restore
Hi,

Attached is version 3.

The convention seems to be to leave the operator at the end of the line when breaking long lines, so do that. Add extra () -- make operator precedence explicit and have indentation reflect operator precedence.

On 09/23/2012 08:52:07 PM, Karl O. Pinc wrote:
> On 09/23/2012 12:24:27 AM, Karl O. Pinc wrote:
>> On 09/23/2012 12:19:07 AM, Karl O. Pinc wrote:
>>> On 09/21/2012 10:54:05 AM, Karl O. Pinc wrote:
>>>> On 09/20/2012 12:24:49 PM, Karl O. Pinc wrote:
>>>>
>>>>> I've had problems using pg_restore --data-only when
>>>>> restoring individual schemas (which contain data which
>>>>> has had bad things done to it).  --clean does not work
>>>>> well because of dependent objects in other schemas.
>>>
>>> Since there wasn't much more to do I've gone ahead
>>> and written the patch.  Works for me.
>>>
>>> Against git master.  Passes regression tests, but there's no
>>> regression test for pg_restore so this does not say much.
>>> Since there's no regression tests I've not written one.
>>>
>>> Since this is a real patch for application I've given
>>> it a new name (it's not a v2).
>>>
>>> Truncate done right before COPY, since that's what
>>> the parallel restores do.

Karl
Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein

diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index b276da6..488d8dc 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -539,6 +539,26 @@
+    --truncate-tables
+
+    This option is only relevant when performing a data-only
+    restore.  It instructs pg_restore
+    to execute commands to truncate the target tables while the
+    data is reloaded.  Use this when restoring tables or schemas
+    and --clean cannot be used because dependent
+    objects would be destroyed.
+
+    The --disable-triggers option will almost
+    always need to be used in conjunction with this option to
+    disable check constraints on foreign keys.
+
     --use-set-session-authorization
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 3b49395..0aaf1d3 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -101,6 +101,8 @@ typedef struct _restoreOptions
 	int			noTablespace;		/* Don't issue tablespace-related commands */
 	int			disable_triggers;	/* disable triggers during data-only
 									 * restore */
+	int			truncate_tables;	/* truncate tables during data-only
+									 * restore */
 	int			use_setsessauth;	/* Use SET SESSION AUTHORIZATION commands
 									 * instead of OWNER TO */
 	int			no_security_labels; /* Skip security label entries */
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 722b3e9..43b5806 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -311,6 +311,11 @@ RestoreArchive(Archive *AHX)
 	if (ropt->createDB && ropt->dropSchema)
 		exit_horribly(modulename, "-C and -c are incompatible options\n");
 
+	/* When the schema is dropped and re-created there is no point in
+	 * truncating tables. */
+	if (ropt->dropSchema && ropt->truncate_tables)
+		exit_horribly(modulename, "-c and --truncate-tables are incompatible options\n");
+
 	/*
 	 * -C is not compatible with -1, because we can't create a database inside
 	 * a transaction block.
@@ -412,6 +417,10 @@ RestoreArchive(Archive *AHX)
 		}
 	}
 
+	/* Truncate tables only when restoring data. */
+	if (!ropt->dataOnly && ropt->truncate_tables)
+		exit_horribly(modulename, "--truncate-tables requires the --data-only option\n");
+
 	/*
 	 * Setup the output file if necessary.
 	 */
@@ -553,6 +562,7 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te,
 	int			retval = 0;
 	teReqs		reqs;
 	bool		defnDumped;
+	bool		truncate;
 
 	AH->currentTE = te;
@@ -687,15 +697,22 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te,
 				 * server, so no need to see if we should issue BEGIN.
 				 */
 				StartTransaction(AH);
+				truncate = 1;
+			}
+			else
+				/* Truncate the table when asked to. */
+				truncate = ropt->truncate_tables;
 
+			if (truncate)
 			{
 				/*
 				 * If the server version is >= 8.4, make sure we issue
 				 * TRUNCATE with ONLY so that child tables are not
-				 * wiped.
+				 * wiped.  If we don't know the server version
+				 * then err on the side of safety.
 				 */
 				ahprintf(AH, "TRUNCATE TABLE %s%s;\n\n",
-						 (PQserverVersion(AH->connection) >= 80400 ?
-						  "ONLY " : ""),
+						 ((!AH->connection ||
+						   PQserverVersion(AH->connection) >= 80400) ?
+						  "ONLY " : ""),
						 fmtId(te->tag));
			}
diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c
index f6c
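The ONLY handling in the hunk above matters because, on 8.4 and later, TRUNCATE recurses into inheritance children by default. A quick illustration with throwaway table names:

```sql
CREATE TABLE parent (id int);
CREATE TABLE child () INHERITS (parent);
INSERT INTO child VALUES (1);

TRUNCATE TABLE ONLY parent;   -- child's row survives
SELECT count(*) FROM parent;  -- still finds the row, via the child

TRUNCATE TABLE parent;        -- without ONLY, child is truncated too
```

That is why the restore code must emit the ONLY keyword whenever it can be sure the server understands it: a data-only restore of one table must not silently empty that table's children.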
Re: [HACKERS] [RFC] CREATE QUEUE (log-only table) for londiste/pgQ compatibility
Hannu,

Can you explain in more detail how this would be used on the receiving side? I'm unable to picture it from your description.

I'm also a bit reluctant to call this a "message queue", since it lacks the features required for it to be used as an application-level queue. "REPLICATION MESSAGE", maybe?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] change in LOCK behavior
On Thu, Oct 11, 2012 at 7:53 PM, Tom Lane wrote:
> Maybe what we really need is to find a way to make taking a snapshot a
> lot cheaper, such that the whole need for this patch goes away.  We're
> not going to get far with the idea of making SnapshotNow MVCC-safe
> unless it becomes a lot cheaper to get an MVCC snapshot.  I recall some
> discussion of trying to reduce a snapshot to a WAL offset --- did that
> idea crash and burn, or is it still viable?

This was mostly covered in the cheaper snapshots thread. [1] Robert decided to abandon the idea after concluding that the memory overhead was untenable with very old snapshots. [2] I had a really hand-wavy idea of lazily converting snapshots from sequence-number-based snapshots to traditional list-of-xids snapshots to limit the overhead. That idea was promptly shot down because in that incarnation it needed snapshots to be stored in shared memory. [3]

I have done some more thinking on this topic, although I have to admit that it has been on the backburner. It seems to me that the problems are all surmountable. To recap briefly, the idea is to define visibility and snapshots through commit sequence numbers (LSNs have problems due to async commit). The tricky part is the data structure needed to support fast xid-to-csn lookup for visibility checks. Enough information needs to be kept so that the oldest CSN-based snapshot can resolve its xmin-xmax range to CSNs.

My current idea is to have two fixed-size shared memory buffers and an overflow log. The first ring buffer is a dense array mapping xids to CSNs. The overflow entries from the dense ring buffer are checked to see whether they might be invisible to any CSN-based snapshot, and if so they are inserted into the sparse buffer. The sparse buffer is a sorted array containing xid-csn pairs that are still running or are concurrent with an active CSN-based snapshot. Once the sparse buffer fills up, the smallest xid-csn pairs are evicted to the CSN log. Long-running CSN-based snapshots then need to read this log to build up the SnapshotData->xip/subxip arrays. The backends can either discover that their snapshots' CSN values have overflowed by checking the appropriate horizon value, or be signaled via an interrupt to enable CSN log cleanup ASAP.

I still have to work out some details on how to handle subtransaction overflow, how to maintain reasonably fresh values for different horizons, and which ordering barriers are necessary to get lock-free visibility checks. The idea currently seems workable and would make taking snapshots really cheap, while the worst-case maintenance overhead is mostly shifted to sessions that acquire lots of writing transactions and hold snapshots open for a long time.

If anyone is interested I can do a slightly longer write-up detailing what I have worked out so far.

Ants Aasma

[1] http://archives.postgresql.org/message-id/CA%2BTgmoaAjiq%3Dd%3DkYt3qNj%2BUvi%2BMB-aRovCwr75Ca9egx-Ks9Ag%40mail.gmail.com
[2] http://archives.postgresql.org/message-id/CA%2BTgmoYD6EhYy1Rb%2BSEuns5smreY1_3rAMeL%3D76rX8deijy56Q%40mail.gmail.com
[3] http://archives.postgresql.org/message-id/CA%2BCSw_uDfg2SBMicGNu13bpr2upbnVL_edoTbzvacR1FrNrZ1g%40mail.gmail.com

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
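For context on what a single commit sequence number would be replacing: the traditional list-of-xids snapshot that every backend currently carries can be inspected from SQL. The output value shown in the comment is only illustrative.

```sql
-- txid_current_snapshot() returns xmin:xmax:xip_list -- the xmin/xmax
-- horizon plus the full list of in-progress transaction ids that the
-- backend must consult for every visibility check. A CSN-based design
-- would collapse this per-snapshot xid list into one sequence number,
-- at the cost of the xid-to-csn lookup structures described above.
SELECT txid_current_snapshot();
-- e.g. 2034:2042:2035,2038
```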
Re: [HACKERS] tuplesort memory usage: grow_memtuples
On 16 October 2012 22:18, Greg Stark wrote:
> That's assuming my committer bits haven't lapsed and people are ok
> with me stepping back into things?

I personally have no objections.

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] tuplesort memory usage: grow_memtuples
On Tue, Oct 16, 2012 at 9:47 PM, Peter Geoghegan wrote:
> The patch has now been marked "ready for committer". Does this need
> doc changes, in light of what is arguably a behavioural difference?
> You only mentioned release notes.

I'm happy to look at this one, probably next week at pgconf.eu. It seems like a reasonably sized patch to get back into things.

That's assuming my committer bits haven't lapsed and people are ok with me stepping back into things?

--
greg
Re: [HACKERS] tuplesort memory usage: grow_memtuples
On 16 October 2012 14:24, Simon Riggs wrote:
> If you describe in detail that it is a heuristic and why that is
> proposed over other approaches that should be sufficient for future
> generations to read and understand.

I've done so, in the attached revision. Things have been simplified somewhat too.

The same basic strategy for sizing the tuplesort memtuples array also exists in tuplestore. I wonder if we should repeat this there? I suppose that could follow later.

The patch has now been marked "ready for committer". Does this need doc changes, in light of what is arguably a behavioural difference? You only mentioned release notes.

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

sortmem_grow-v3.patch
Description: Binary data
Re: [HACKERS] proposal - assign result of query to psql variable
2012/10/16 Shigeru HANADA:
> Hi Pavel,
>
> On Tue, Oct 16, 2012 at 6:59 AM, Pavel Stehule wrote:
>> here is updated patch, I moved a lot of code from the lexer to
>> command.c, and now \gset doesn't disable other backslash commands on
>> the same line.
>
> * lexer changes
> IIUC, the new function psql_scan_varlist_varname scans input and returns
> a variable name or a comma at each call, and command.c handles errors
> such as an invalid number of variables.  This new design seems better
> than the old one.
>
> However, IMHO the name psql_scan_varlist_varname sounds redundant and
> unintuitive.  I'd prefer psql_scan_slash_varlist, because it indicates
> that the function is expected to be used for arguments of backslash
> commands, like psql_scan_slash_command and psql_scan_slash_option.
> Thoughts?
>
> * multiple meta commands
> Now both of the command sequences
>
> $ SELECT 1, 2 \gset var1, var2 \g foo.txt
> $ SELECT 1, 2 \g foo.txt \gset var1, var2
>
> set var1 and var2 to "1" and "2" respectively, and also write the result
> into foo.txt.  This would be what users expect.
>
> * duplication of variables
> I found an issue we have not discussed.  Currently \gset accepts
> duplicate variable names in the list, and stores the last SELECT item in
> the duplicated variable.  For instance,
>
> $ SELECT 1, 2 \gset var, var
>
> stores "2" into var.  I think this behavior is acceptable, but it might
> be worth mentioning in the documentation.
>
> * extra fixes
> I fixed some minor issues below.  Please see the attached v10 patch for
> details.
>
> * remove unused macro OT_VARLIST
> * remove unnecessary #include directive for common.h
> * fill comment within 80 columns
> * indent short variable name with tab
> * add regression test case for combination of \g and \gset
>
> * bug on FETCH_COUNT = 1
> When FETCH_COUNT is set to 1, and the number of rows returned is 1 too,
> \gset shows an extra "(1 row)".  This would be a bug in
> ExecQueryUsingCursor.  Please see the last test case in regression test
> psql_cmd.

I fixed this bug.

Regards

Pavel

> I'll mark this patch as "waiting on author".
>
> Regards,
> --
> Shigeru HANADA

gset_11.diff
Description: Binary data
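To summarize the behavior under review, here is a short psql session using the patch's proposed syntax (which is not necessarily the form that was eventually committed):

```
postgres=# SELECT 1, 2 \gset var1, var2 \g foo.txt
postgres=# \echo :var1 :var2
1 2
```

The same result set is both stored into the psql variables var1 and var2 and written to foo.txt, matching the "multiple meta commands" case tested above.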
Re: [HACKERS] proposal - assign result of query to psql variable
Hello

here is the updated patch. I moved a lot of code from the lexer to command.c, and now \gset doesn't disable other backslash commands on the same line.

Regards

Pavel

2012/10/15 Pavel Stehule:
> 2012/10/15 Pavel Stehule:
>> 2012/10/15 Shigeru HANADA:
>>> Hi Pavel,
>>>
>>> First of all, I'm sorry that my previous review was rough.  I looked
>>> at your patch and the existing code closely again.
>>>
>>> On 2012/10/15, at 12:57, Pavel Stehule wrote:
>>>> 2012/10/14 Tom Lane:
>>>>> * ExecQueryUsingCursor's concept of how to process command results
>>>>> (why? surely we don't need \gset to use a cursor)
>>>>
>>>> There were two possibilities, but not using a cursor is clearly the
>>>> better way
>>>
>>> +1 for supporting the case when FETCH_COUNT > 0, because users might
>>> set it for other queries, and they would want to avoid changing the
>>> FETCH_COUNT setting around every query followed by \gset.
>>>
>>>>> * the psql lexer (adding a whole bunch of stuff that probably
>>>>> doesn't belong there)
>>>>
>>>> ??
>>>
>>> I think that Tom is talking about psql_scan_slash_vars().  It seems
>>> too specific to the \gset command.  How about improving
>>> psql_scan_slash_options() so that it can handle a comma-separated
>>> variable list?  Although you might have tried it before.
>>> # The unused OT_VARLIST macro gave me the idea.
>>
>> yes, it is possible - I'll look at it in the evening
>
> reusing psql_scan_slash_options is not a good idea - the interface of
> this function does not fit my purposes - and I cannot signal an error
> from this procedure.  But I can minimize psql_scan_slash_var and I can
> move a lot of code out of the lexer file.
>
>>>>> * the core psql settings construct (to store something that is in
>>>>> no way a persistent setting)
>>>>
>>>> ??
>>>
>>> I thought that having \gset arguments in pset is reasonable, since \g
>>> uses pset.gfname to hold its one-shot setting.  Or, should we
>>> refactor \g to fit with \gset?  I might be missing Tom's point...
>>>
>>>>> Surely there is a less ugly and invasive way to do this.  The fact
>>>>> that the reviewer keeps finding bizarre bugs like "another backslash
>>>>> command on the same line doesn't work" seems to me to be a good
>>>>> indication that this is touching things it shouldn't.
>>>>
>>>> - all these bugs are based on the lexer construct.  A little
>>>> modification of the lexer is possible
>>>
>>> IMHO those issues come from the design rather than the implementation
>>> of the lexer.  AFAIK we don't have consensus about the case where both
>>> \g and \gset are used for a query like this:
>>>
>>> postgres=# SELECT 1 \gset var \\ \g foo.txt
>>>
>>> This seems legal.  Should we store "1" into var and write the result
>>> into foo.txt?  Or only one of them?  It's just an idea and it requires
>>> a new special character, but how about using the \g command for file,
>>> pipe, and variable?  In the case we choose '&' for the variable
>>> prefix:
>>>
>>> postgres=# SELECT 'hello', 'wonderful', 'world!' \g &var1,,var2
>>>
>>> Anyway, we've had no psql meta command which processes query results
>>> other than \g.  So we might have more considerable issues about the
>>> design.
>>
>> the current design is rigid - a small implementation change can stop
>> parsing the target list when another backslash statement is detected
>>
>>> BTW, what does the word "comma_expected" mean?  It's in the comment
>>> above psql_scan_slash_vars().  It might be a remnant of the old
>>> implementation.
>>
>> This identifier is mistaken - the comment is wrong and related to the
>> old implementation - sorry.  The first design was replaced by a state
>> machine described by VarListParserState
>>
>>> Regards,
>>> --
>>> Shigeru HANADA
>>> shigeru.han...@gmail.com

gset09.diff
Description: Binary data
Re: [HACKERS] smgrsettransient mechanism is full of bugs
Tom Lane wrote:
> After further review, I have become convinced that in fact it's
> completely broken and needs to be redone from scratch.  The temp-file
> marking at the fd.c level can easily get out of sync with the marking
> at the smgr level, and that marking isn't too consistent with reality
> either, which means we have all of the following problems:

Oops. Sorry about this. Fortunately, as far as I can see, it only results in excessive resource consumption, not data corruption or loss.

> I believe that we probably ought to revert this mechanism entirely, and
> build a new implementation based on these concepts:
>
> * An SMgrRelation is transient if and only if it doesn't have an
> "owning" relcache entry.  Keep a list of all such SMgrRelations, and
> close them all at transaction end.  (Obviously, an SMgrRelation gets
> removed from the list if it acquires an owner mid-transaction.)
>
> * There's no such concept as FD_XACT_TRANSIENT at the fd.c level.
> Rather, we close and delete the VFD entry when told to by SMgrRelation
> closure.

Makes sense. It does seem simpler than the original approach.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Truncate if exists
On Mon, 15 Oct 2012 10:21:18 -0700, Robert Haas wrote: > Yeah, I think the functionality that we need is pretty much there > already today. What we need to do is to get the syntax to a point > where people can write the code they want to write without getting > tangled up by it. > > I think the invention of DO was a big step in the right direction > ... > With DO, you can write the logic you want > as an SQL statement, it's just a clunky and awkward SQL statement. In > my view the goal ought to be to refine that mechanism to remove the > clunkiness and awkwardness, rather than to invent something completely > new. As someone who has worked with a number of databases now, none of them really get this DDL integration completely right. What I'd like to see is 1) a predicate to easily test things about the schema (does this table, column, index, schema, etc. exist? does it have the right type?) and 2) a way to conditionally execute DDL (and DML, which should fall right out, but it isn't really what this discussion is covering). I would propose extending the current EXISTS / NOT EXISTS predicate as follows: [NOT] EXISTS TABLE tab [ ( col [type] [ , col [type]]... ) ] [NOT] EXISTS COLUMN tab.col [type] [NOT] EXISTS INDEX tab.idx [ ( col [ , col]... ) ] [NOT] EXISTS CONSTRAINT tab.cstrt -- not sure what else might be useful here [NOT] EXISTS ( select ) -- this is the current EXISTS predicate, of course [NOT] EXISTS ANY ROWS FROM tab [join] [WHERE predicate] [GROUP BY col [ , col]...] -- exactly the same as -- (select 1 FROM etc.) -- only because I like -- it better (the latter [which by no means am I nuts over; it's just that when extending EXISTS I can't stop neatening it up to my personal preferences] could be extended with [NOT] EXISTS MORE THAN n ROWS FROM and [NOT] EXISTS EXACTLY n ROWS FROM.) There is a new SQL statement: IF predicate true-statement [ELSE false-statement]. 
To actually execute this new IF statement, the executor would need an IF node that evaluates the predicate (with ANDs and ORs, just like all SQL predicates) and then executes the rest only if the predicate came out TRUE (or NOT FALSE; I forget which is usually used, and the difference with NULL could be useful, as long as it matches other predicates). This moves one more bit of procedural logic into the executor.

Another wrinkle is that the dependent statement might not compile, due to missing tables or whatnot. Actually executing it while it doesn't compile is an error, but we want to defer that error until we actually decide we need to execute it. Also, it's probably good to try compiling it again at that point. So my thought would be to try planning the dependent statement(s); if they compile, hook them to the IF node; if not, hook a DEFERRED node to the IF node. The DEFERRED node has the parse tree (or raw string, whatever makes sense) of the statement; on execution it tries again to plan that statement; if it succeeds, run it; if not, error out.

I'd also add a SEQUENCE node to the executor. It just runs its children in order (could be n-ary, or if fixed arity nodes are what is in the planner/executor today, could be binary, first left, then right, and right could be another SEQUENCE). The DEFERRED node means that a CREATE statement could precede use of what is created in the same sequence and all could get planned (with some deferral to execution time) in advance and run in one lump. This implements DO at the executor level.

The biggest concepts left from plpgsql are looping and variables. Most variables could be modeled as a single row value; SQL already can update a row, so the planning of assignments and calculations of scalars (and arrays, I think) already fits into things the planner knows about. Table variables (which I don't know that plpgsql supports, but someday it should) are less defined.
Adding plpgsql's loops to the executor would let whole functions run under one trip through the executor. This is beyond just improving the DDL support for scripts.

I have written a number of database upgrade scripts. Over time we've made them less fragile, by checking for the existence of tables, indexes, and most recently, columns. The usual sequence is:

1) check the existence of an index; check that the first few columns are correct; if not, drop the index
2) repeat for other indexes that have changed definition over time
3) check the existence of the table; create with current layout if it is missing
4) check for the presence of a column; if missing, alter the table to add it (of course, we can only add new columns at the end, and occasionally delete a column)
5) repeat for more columns
6) check the existence of
Re: [HACKERS] proposal - assign result of query to psql variable
Hi Pavel,

On Tue, Oct 16, 2012 at 6:59 AM, Pavel Stehule wrote:
> here is updated patch, I moved lot of code from lexer to command.com,
> and now more \gset doesn't disable other backslash commands on same
> line.

* lexer changes

IIUC, the new function psql_scan_varlist_varname scans input and returns a variable name or a comma at each call, and command.c handles errors such as an invalid number of variables. This new design seems better than the old one. However, IMHO the name psql_scan_varlist_varname sounds redundant and unintuitive. I'd prefer psql_scan_slash_varlist, because it indicates that the function is expected to be used for arguments of backslash commands, like psql_scan_slash_command and psql_scan_slash_option. Thoughts?

* multiple meta command

Now both of the command sequences

$ SELECT 1, 2 \gset var1, var2 \g foo.txt
$ SELECT 1, 2 \g foo.txt \gset var1, var2

set var1 and var2 to "1" and "2" respectively, and also write the result into foo.txt. This would be what users expected.

* Duplication of variables

I found an issue we have not discussed. Currently \gset accepts the same variable name more than once in the list, and stores the last SELECT item in the duplicated variable. For instance,

$ SELECT 1, 2 \gset var, var

stores "2" into var. I think this behavior is acceptable, but it might be worth mentioning in the document.

* extra fixes

I fixed some minor issues below. Please see attached v10 patch for details.

* remove unused macro OT_VARLIST
* remove unnecessary #include directive for common.h
* fill comment within 80 columns
* indent short variable name with tab
* add regression test case for combination of \g and \gset

* bug on FETCH_COUNT = 1

When FETCH_COUNT is set to 1, and the number of rows returned is 1 too, \gset shows extra "(1 row)". This would be a bug in ExecQueryUsingCursor. Please see the last test case in regression test psql_cmd.

I'll mark this patch as "waiting author".
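Putting the behaviors above together, a session would (as this review understands the v10 patch — this is a sketch from the descriptions, not an actual capture) look like:

```
foo=> SELECT 1, 2 \g foo.txt \gset var1, var2
foo=> \echo :var1 :var2
1 2
foo=> SELECT 1, 2 \gset var, var
foo=> \echo :var
2
```

The first statement both writes the result to foo.txt and sets the variables; the second shows the duplicated-variable case, where the last SELECT-list item wins.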
Regards,
--
Shigeru HANADA

*** a/doc/src/sgml/ref/psql-ref.sgml
--- b/doc/src/sgml/ref/psql-ref.sgml
***************
*** 1483,1490 ****
  testdb=>
      way. Use \i for that.) This means that if the query ends with (or
      contains) a semicolon, it is immediately executed. Otherwise it will
      merely wait in the
!     query buffer; type semicolon or \g to send it, or
!     \r to cancel.
--- 1483,1490 ----
      way. Use \i for that.) This means that if the query ends with (or
      contains) a semicolon, it is immediately executed. Otherwise it will
      merely wait in the
!     query buffer; type semicolon, \g or
!     \gset to send it, or \r to cancel.
***************
*** 1617,1622 ****
  Tue Oct 26 21:40:57 CEST 1999
--- 1617,1644 ----
+     \gset variable [ ,variable ... ]
+
+     Sends the current query input buffer to the server and stores the
+     query's output into corresponding variable. The preceding query must
+     return only one row, and the number of variables must be same as the
+     number of elements in SELECT list. If you don't
+     need any of items in SELECT list, you can omit
+     corresponding variable.
+     Example:
+
+     foo=> SELECT 'hello', 'wonderful', 'world!' \gset var1,,var3
+     foo=> \echo :var1 :var3
+     hello world!
+
  \h or \help [ command ]

*** a/src/bin/psql/command.c
--- b/src/bin/psql/command.c
***************
*** 71,77 ****
  static void printSSLInfo(void);
  static void checkWin32Codepage(void);
  #endif

! /*--
   * HandleSlashCmds:
--- 71,86 ----
  static void checkWin32Codepage(void);
  #endif

! /*
!  * Possible states for simple state machine, that is used for
!  * parsing target list - list of varnames separated by comma.
!  */
! typedef enum
! {
! 	VARLIST_INITIAL,
! 	VARLIST_EXPECTED_COMMA,
! 	VARLIST_EXPECTED_COMMA_OR_IDENT
! } VarlistParserState;

  /*--
   * HandleSlashCmds:
***************
*** 748,753 **** exec_command(const char *cmd,
--- 757,831 ----
  		status = PSQL_CMD_SEND;
  	}

+ 	/* \gset send query and store result */
+ 	else if (strcmp(cmd, "gset") == 0)
+ 	{
+ 		bool		value_is_valid;
+ 		char	   *value;
+ 		VarlistParserState state = VARLIST_INITIAL;
+
+ 		/* expected valid target list */
+ 		success = true;
+
+ 		pset.gvars = NULL;
+ 		while ((value = psql_scan_varlist_varname(scan_state, &value_is_valid)))
+ 		{
+ 			if (value_is_valid && success)
+ 			{
+ 				if (strcmp(value, ",") == 0)
+ 				{
+ 					if (state == VARLIST_INITIAL ||
+ 						state == VARLIST_EXPECTED_COMMA_OR_IDENT)
+ 						pset.gvars = tglist_add(pset.gvars, NULL);
+ 					state = VARLIST_EXPECTED_COMMA_OR_IDENT;
+ 				}
+ 				else
+ 				{
+ 					if (state == VARLIST_INITIAL ||
+ 						state == VARLIST_EXPECTED_COMMA_OR_IDENT)
+ 					{
+ 						pset.gvars = tglist_add(pset.gvars, value);
+
Re: [HACKERS] Global Sequences
Simon Riggs writes: > On 16 October 2012 17:15, Tom Lane wrote: >> So I fully >> expect that we're going to need something different from bog-standard >> CREATE SEQUENCE. > There's no point in that at all, as explained. It's sequences that > need to work. We can already call my_nextval() rather than nextval() > if we want a roll-your own sequence facility and can rewrite > applications to call that, assuming UUID isn't appropriate. I wasn't objecting to the concept of allowing nextval() to have overloaded behaviors; more saying that that wasn't where to start the design process. In particular, the reason proposing a hook first seems backwards is that if we have a catalog-level representation that some sequences are local and others not, we should be using that to drive the determination of whether to call a substitute function --- and maybe which one to call. For instance, I could see attaching a function OID to each sequence and then having nextval() call that function, instead of a hook per se. Or maybe better, invent a level of indirection like a "sequence access method" (comparable to index access methods) that provides a compatible set of substitute functions for sequence operations. If you want to override nextval() for a sequence, don't you likely also need to override setval(), currval(), etc? Not to mention overriding ALTER SEQUENCE's behavior. regards, tom lane
Re: [HACKERS] Global Sequences
On 16 October 2012 17:17, Peter Eisentraut wrote: > On 10/16/12 9:20 AM, Simon Riggs wrote: >> I've proposed a plugin for the allocation only. So the allocation >> looks like anything you want. > > Are you planning to provide a reference implementation of some kind? I'll provide hooks and a stub for testing. A full implementation relies upon the physical transport used. For BDR, there will be a full working version that relies upon that. TPL, PGDG. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Global Sequences
On 16 October 2012 17:15, Tom Lane wrote: > Simon Riggs writes: >> On 16 October 2012 15:15, Tom Lane wrote: >>> I think this is a fundamentally wrong way to go about doing what you >>> want to do. It presumes that DDL-level manipulation of global sequences >>> is exactly like local sequences; an assumption that is obviously false. > >> The message queue concept doesn't exist at all yet, so when we create >> it we can specify anything we want. That is a different situation and >> hence a different solution. CREATE SEQUENCE is SQL Standard and used >> by SERIAL, many people's SQL, SQL generation tools etc.. My objective >> is to come up with something that makes the standard code work >> correctly in a replicated environment. > > I challenge you to find anything in the SQL standard that suggests that > sequences have any nonlocal behavior. No need for challenge, I agree, the SQL standard doesn't speak about that. I didn't say it did. > If anything, what you propose > violates the standard, And so that doesn't follow, but anyway... > it doesn't make us follow it more closely. I wasn't arguing that my proposal did that (made us follow standard more closely). My point is that application code exists that expects sequences to Just Work, and so the aim of the proposal was to do that in a replicated environment as well as single node. > Furthermore, I find it hard to believe that people won't want both > local and global sequences in the same database --- so one way or the > other we need some DDL-level reflection of the difference. Agreed, which is why I proposed some DDL-level syntax. Was that OK? > A larger point though is that the various implementation choices you > mentioned probably need to be configurable by DDL options. I doubt that > it will work well to say "install plugin A to get behavior X, or install > plugin B to get behavior Y, and whichever you choose is not further > configurable, it'll be the same behavior for all sequences". 
Again, I accept that as of my last post, and I proposed syntax to provide it. > So I fully > expect that we're going to need something different from bog-standard > CREATE SEQUENCE. There's no point in that at all, as explained. It's sequences that need to work. We can already call my_nextval() rather than nextval() if we want a roll-your-own sequence facility and can rewrite applications to call that, assuming UUID isn't appropriate. Please don't force people to rewrite their applications; it might not go in the direction we want. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Global Sequences
On Tue, Oct 16, 2012 at 5:54 AM, Peter Eisentraut wrote: > On 10/15/12 5:33 PM, Simon Riggs wrote: >> There are a few options >> 1) Manual separation of the value space, so that N1 has 50% of >> possible values and N2 has 50%. That has problems when we reconfigure >> the cluster, and requires complex manual reallocation of values. So it >> starts good but ends badly. >> 2) Automatic separation of the value space. This could mimic the >> manual operation, so it does everything for you - but thats just >> making a bad idea automatic >> 3) Lazy allocation from the value space. When a node is close to >> running out of values, it requests a new allocation and coordinates >> with all nodes to confirm the new allocation is good. > > What would the allocation service look like? Is it another PostgreSQL > server? What's the communication protocol? How would backups work? > Crash recovery? As a reasonable proxy for the first question, one can look at how Twitter uses their home-grown software snowflake. https://github.com/twitter/snowflake A colleague, Blake Mizerany, wrote a smaller version called "noeqd", based on the same ideas, but he wanted something with fewer dependencies. Unless you are very Java-library-and-toolchain adept you might find this more fun to play with. https://github.com/bmizerany/noeqd -- fdr
Re: [HACKERS] Fix for log_line_prefix and session display
Applied. --- On Mon, Oct 15, 2012 at 02:22:33PM -0400, Bruce Momjian wrote: > On Mon, Oct 15, 2012 at 10:01:29AM +0200, Albe Laurenz wrote: > > Bruce Momjian wrote: > > > Currently, our session id, displayed by log_line_prefix and CSV > > output, > > > is made up of the session start time epoch seconds and the process id. > > > The problem is that the printf mask is currently %lx.%x, causing a > > > process id less than 4096 to not display a full four hex digits after > > > the decimal point. I think this is confusing because the number .423 > > > appears higher than .1423, though it is not. Here is what our current > > > output looks like with log_line_prefix="%c: ": > > > > > > 50785b3e.7ff9: ERROR: syntax error at or near "test" at > > character 1 > > > 50785b3e.7ff9: STATEMENT: test > > > 50785b3e.144: ERROR: syntax error at or near "test" at > > character 1 > > > 50785b3e.144: STATEMENT: test > > > > > > With my fix, here is the updated output: > > > > > > 507864d3.7ff2: ERROR: syntax error at or near "test" at > > character 1 > > > 507864d3.7ff2: STATEMENT: test > > > 507864d3.013d: ERROR: syntax error at or near "test" at > > character 1 > > > 507864d3.013d: STATEMENT: test > > > > > > Patch attached. > > > > Do you think that anybody wants to apply a linear ordering on > > the second part of the session ID? If you need the pid, you > > can use %p. > > > > I would say that this change makes sense if it causes disturbance > > that the part after the period can be than 4 characters long > > (it did not disturb me when I wrote a log file parser). > > > > If that need is not urgent enough, maybe it would be better to > > preserve the current behaviour in the (unlikely) event that somebody > > relies on it. > > I don't think anyone is picking apart the session id, but I do think the > current output is confusing because the session id string length is > pretty variable. 
Anyone who is parsing the current session id will > easily be able to parse the more consistent output. > > -- > Bruce Momjian http://momjian.us > EnterpriseDB http://enterprisedb.com > > + It's impossible for everything to be true. + -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Global Sequences
On 10/16/12 9:20 AM, Simon Riggs wrote: > I've proposed a plugin for the allocation only. So the allocation > looks like anything you want. Are you planning to provide a reference implementation of some kind?
Re: [HACKERS] Global Sequences
Simon Riggs writes: > On 16 October 2012 15:15, Tom Lane wrote: >> I think this is a fundamentally wrong way to go about doing what you >> want to do. It presumes that DDL-level manipulation of global sequences >> is exactly like local sequences; an assumption that is obviously false. > The message queue concept doesn't exist at all yet, so when we create > it we can specify anything we want. That is a different situation and > hence a different solution. CREATE SEQUENCE is SQL Standard and used > by SERIAL, many people's SQL, SQL generation tools etc.. My objective > is to come up with something that makes the standard code work > correctly in a replicated environment. I challenge you to find anything in the SQL standard that suggests that sequences have any nonlocal behavior. If anything, what you propose violates the standard, it doesn't make us follow it more closely. Furthermore, I find it hard to believe that people won't want both local and global sequences in the same database --- so one way or the other we need some DDL-level reflection of the difference. A larger point though is that the various implementation choices you mentioned probably need to be configurable by DDL options. I doubt that it will work well to say "install plugin A to get behavior X, or install plugin B to get behavior Y, and whichever you choose is not further configurable, it'll be the same behavior for all sequences". So I fully expect that we're going to need something different from bog-standard CREATE SEQUENCE. Exactly what isn't clear --- but I think modifying nextval's behavior is way down the list of concerns. regards, tom lane
[HACKERS] Bugs in planner's equivalence-class processing
I looked into the problem reported in bug #7604 (Bill MacArthur was kind enough to send me a reproducer off-list). The error can be demonstrated by this example in the regression test database:

select f1, unique2, case when unique2 is null then f1 else 0 end
from int4_tbl a left join tenk1 b on f1 = unique2
where (case when unique2 is null then f1 else 0 end) = 0;

If you look at the output of the basic join:

regression=# select f1, unique2, case when unique2 is null then f1 else 0 end
regression-# from int4_tbl a left join tenk1 b on f1 = unique2;
     f1      | unique2 |    case
-------------+---------+-------------
           0 |       0 |           0
      123456 |         |      123456
     -123456 |         |     -123456
  2147483647 |         |  2147483647
 -2147483647 |         | -2147483647
(5 rows)

it's obvious that only the first of these rows passes the added WHERE condition, so the query should produce just that row; but what you actually get in 9.2 is all five rows! EXPLAIN gives a clue what's going wrong:

 Nested Loop Left Join  (cost=0.00..22.51 rows=1 width=8)
   ->  Seq Scan on int4_tbl a  (cost=0.00..1.05 rows=5 width=4)
   ->  Index Only Scan using tenk1_unique2 on tenk1 b  (cost=0.00..4.28 rows=1 width=4)
         Index Cond: (unique2 = a.f1)
         Filter: (CASE WHEN (unique2 IS NULL) THEN a.f1 ELSE 0 END = 0)
(5 rows)

The filter condition has been dropped down to the scan of tenk1 (relying on a.f1 being passed in as a parameter), which is wrong since there it is unable to filter null-extended rows produced by the left join operator.

At first I didn't see how this could be happening: the code for placing quals looks at RestrictInfo.nullable_relids, and this qual should have tenk1 in its nullable_relids since it's above a left join to tenk1. But gdb soon told me that that wasn't what join_clause_is_movable_to was seeing. Eventually I figured out that the equivalence-class machinery was eating the clause (forming an EquivalenceClass consisting of the CASE expression and the constant 0) and reconstructing a new clause that didn't have nullable_relids set.
So this is essentially an oversight in the patch that added tracking of nullable_relids. I got confused about the difference between outerjoin_delayed (this clause, as a whole, is not outerjoin_delayed because its natural semantic level would be at the join anyway) and having nonempty nullable_relids, and thought that equivalence-class processing would never see such clauses because it doesn't accept outerjoin_delayed clauses. So there's no code in equivclass.c to compute nullable_relids sets for constructed clauses. At some point it might be worth adding same, but for now I'm just going to tweak distribute_qual_to_rels to not believe that clauses with nonempty nullable_relids can be equivalence clauses.

It gets worse though. Tracking clauses' nullable_relids is new in 9.2, but variants of this example can be demonstrated to fail as far back as 7.4. Here's one:

select q1, unique2, thousand, hundred
from int8_tbl a left join tenk1 b on q1 = unique2
where coalesce(thousand,123) = q1 and q1 = coalesce(hundred,123);

If you look at just the bare join output, there are no rows that should pass the WHERE clause, but actually what you get in 7.4-9.1 is

 q1  | unique2 | thousand | hundred
-----+---------+----------+---------
 123 |         |          |
 123 |         |          |
(2 rows)

which is completely broken because it's not even a subset of the bare join output :-(. 9.1 shows this EXPLAIN output:

 Nested Loop Left Join  (cost=0.00..42.48 rows=1 width=20)
   Filter: (a.q1 = COALESCE(b.thousand, 123))
   ->  Seq Scan on int8_tbl a  (cost=0.00..1.05 rows=5 width=8)
   ->  Index Scan using tenk1_unique2 on tenk1 b  (cost=0.00..8.27 rows=1 width=12)
         Index Cond: (a.q1 = unique2)
         Filter: (COALESCE(thousand, 123) = COALESCE(hundred, 123))

What's happening here is that because the two WHERE clauses are not outerjoin_delayed, they're being deconstructed as equivalence clauses, and then we synthesize an equivalence constraint on the two COALESCE expressions that can be applied at the tenk1 scan level.
That gets rid of the row that should join to q1=123, allowing the bogus null-extended join rows to be formed. So this shows that, although 9.2 has a more extensive form of the disease, the use of outerjoin_delayed to decide whether a clause can be an equivalence clause is really quite wrong. I believe what I need to do to fix this is back-port the 9.2 logic that computes a nullable_relids set for each clause. We won't need to store it, just check whether it's empty before letting the clause be treated as an equivalence clause. Is anybody concerned about the compatibility implications of fixing this bug in the back branches? I'm worried about people complaining that we broke their application in a minor release. Maybe they were depending on incorrect behavior, but they mig
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
2012/10/16 2:40, Jeff Janes wrote: On Sun, Oct 14, 2012 at 9:43 AM, Tom Lane wrote: Satoshi Nagayasu writes: (2012/10/14 13:26), Fujii Masao wrote: The tracing lwlock usage seems to still cause a small performance overhead even if reporting is disabled. I believe some users would prefer to avoid such overhead even if pg_stat_lwlocks is not available. It should be up to a user to decide whether to trace lwlock usage, e.g., by using trace_lwlock parameter, I think. Frankly speaking, I do not agree with disabling performance instrument to improve performance. DBA must *always* monitor the performance metrix when having such heavy workload. This brings up a question that I don't think has been honestly considered, which is exactly whom a feature like this is targeted at. TBH I think it's of about zero use to DBAs (making the above argument bogus). It is potentially of use to developers, but a DBA is unlikely to be able to do anything about lwlock-level contention even if he has the knowledge to interpret the data. Waiting on BufFreelistLock suggests increasing shared_buffers. Waiting on ProcArrayLock perhaps suggests use of a connection pooler (or does it?) WALWriteLock suggests doing something about IO, either moving logs to different disks, or getting BBU, or something. WALInsertLock suggests trying to adapt your data loading process so it can take advantage of the bulk, or maybe increasing wal_buffers. And a lot of waiting on any of the locks gives a piece of information the DBA can use when asking the mailing lists for help, even if it doesn't allow him to take unilateral action. So I feel it isn't something that should be turned on in production builds. I'd vote for enabling it by a non-default configure option, and making sure that it doesn't introduce any overhead when the option is off. I think hackers would benefit from getting reports from DBAs in the field with concrete data on bottlenecks. 
If the only way to get this is to do some non-standard compile and deploy it to production, or to create a "benchmarking" copy of the production database system including a realistic work-load driver and run the non-standard compile there; either of those is going to dramatically cut down on the participation. Agreed. The hardest part of investigating a performance issue is reproducing the situation in an environment different from the production one. I often see people struggling to reproduce a situation with different hardware and (similar but) different workload. It is very time consuming, and also it often fails. So, we need to collect any piece of information which would help us to understand what's going on within the production PostgreSQL, without any changes of binaries and configurations in the production environment. That's the reason why I stick to a "built-in" instrument, and I disagree with disabling such an instrument even if it has minor performance overhead. A flight recorder must not be disabled. Collecting performance data must be top priority for a DBA. Regards, -- Satoshi Nagayasu Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
On 10/15/2012 3:25 PM, Andres Freund wrote: On Monday, October 15, 2012 09:18:57 PM Peter Geoghegan wrote: On 15 October 2012 19:19, Bruce Momjian wrote: > I think Robert is right that if Slony can't use the API, it is unlikely > any other replication system could use it. I don't accept that. Clearly there is a circular dependency, and someone has to go first - why should the Slony guys invest in adopting this technology if it is going to necessitate using a forked Postgres with an uncertain future? Well. I don't think (hope) anybody proposed making something release worthy for slony but rather a POC patch that proves the API is generic enough to be used by them. If I (or somebody else familiar with this) work together with somebody familiar with slony internals I think such a POC shouldn't be too hard to do. I think some more input from that side is a good idea. I plan to send out an email to possibly interested parties in about two weeks...

What Slony essentially sends to the receiver node is a COPY stream in the format Christopher described. That stream is directly copied into the receiving node's sl_log_N table and picked up there by an apply trigger BEFORE INSERT, that performs the corresponding INSERT/UPDATE/DELETE operation via prepared plans to the user tables. For a POC I think it is sufficient to demonstrate that this copy stream can be generated out of the WAL decoding.

Note that Slony today does not touch columns in an UPDATE that have not changed in the original UPDATE on the origin. Sending toasted column values that haven't changed would be a substantial change to the storage efficiency on the replica. The consequence of this is that the number of columns that need to be in the UPDATE's SET clause varies. The log_cmdupdncols is to separate the new column/value pairs from the column/key pairs of the updated row. The old row "key" in Slony is based on a unique index (preferably a PK, but any unique key will do).
This makes cmdupdncols simply the number of column/value pairs minus the number of key columns. So it isn't too hard to figure out. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] Global Sequences
On 16 October 2012 15:15, Tom Lane wrote: > Simon Riggs writes: >> So, proposal is to allow nextval() allocation to access a plugin, >> rather than simply write a WAL record and increment. If the plugin is >> loaded all sequences call it (not OIDs). > > I think this is a fundamentally wrong way to go about doing what you > want to do. It presumes that DDL-level manipulation of global sequences > is exactly like local sequences; an assumption that is obviously false. > What you really want is something vaguely like nextval but applied to > a distinct type of object. That is, I think we first need a different > kind of object called a "global sequence" with its own DDL operations. > > The nearby thread about "message queue" objects seems rather closely > related. Perhaps it would be fruitful to think about the commonalities > involved in two (or more?) new relkinds for global objects.

The message queue concept doesn't exist at all yet, so when we create it we can specify anything we want. That is a different situation and hence a different solution. CREATE SEQUENCE is SQL Standard and used by SERIAL, many people's SQL, SQL generation tools etc. My objective is to come up with something that makes the standard code work correctly in a replicated environment.

If rewriting the application were acceptable, we could just do as Peter suggests and say "use UUIDs". Many other people who think rewriting everything is OK spell that "CouchDB" etc. But that doesn't solve the problem at hand, which is making existing things work, rather than forcing people to rethink and rewrite. So CREATE GLOBAL SEQUENCE as a new kind of object altogether wouldn't solve the problem I'm trying to address.

I guess we could use a decoration syntax on a sequence, like this...

ALTER SEQUENCE foo GLOBAL

or

ALTER SEQUENCE foo ALLOCATION FUNCTION myglobalalloc();
-- if we cared to specify the alloc function on a per object basis.
or

ALTER SEQUENCE foo WITH (allocation=global)

so that we can explicitly specify which sequences to replicate globally and which locally. We'd need to have a userset GUC

default_sequence_allocation = local (default) | global

so that SERIAL and other new sequences could pick up the required behaviour when created. In any case, we need to have a plugin/user-definable allocation function for the next few years at least. Thoughts? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Bug in -c CLI option of pg_dump/pg_restore
On Sat, 2012-10-13 at 16:47 +0200, Guillaume Lelarge wrote: > Hi, > > One of my colleagues, Jehan-Guillaume de Rorthais, found a weird > behaviour of the "-c" command line option in the pg_restore tool while > doing a training. Here is the following steps he followed: > > createdb foo > > pg_dump -Fc foo > foo.dump > createdb bar > pg_restore -c -d bar foo.dump > > bar contains the same objects as foo (nothing unusual here), but... foo > is no longer present. Actually, if you use the "-c" command line option, > you get a "DROP DATABASE" statement. To me, it feels like a quite > terrible bug. > > It's quite easy to reproduce. Just create a database, and use pg_dump > with the "-c" option: > > createdb foo > pg_dump -s -c foo | grep DATABASE > > and you end up with this: > > DROP DATABASE foo; > > I tried from 8.3 till 9.2, and only 9.2 has this behaviour. > > You'll find attached a patch that fixes this issue. Another colleague, > Gilles Darold, tried it in every possible way, and it works. I'm not > sure the test I added makes it a very good patch, but it fixes the bug. > Any comments on this? -- Guillaume http://blog.guillaume.lelarge.info http://www.dalibo.com
Re: [HACKERS] [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
On 10/15/2012 4:43 PM, Simon Riggs wrote: Jan spoke at length at PgCon, for all to hear, that what we are building is a much better way than the trigger logging approach Slony uses. I don't take that as carte blanche for approval of everything being done, but its going in the right direction with an open heart, which is about as good as it gets. The mechanism you are building for capturing changes is certainly a lot better than what Bucardo, Londiste and Slony are doing today. That much is true. The flip side of the coin however is that all of today's logical replication systems are designed Postgres version agnostic to a degree. This means that the transition time from the existing, trigger based approach to the new WAL based mechanism will see both technologies in parallel, which is no small thing to support. And that transition time may last for a good while. We still have people installing Slony 1.2 because 2.0 (3 years old by now) requires Postgres 8.3 minimum. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Global Sequences
Simon Riggs writes: > So, proposal is to allow nextval() allocation to access a plugin, > rather than simply write a WAL record and increment. If the plugin is > loaded all sequences call it (not OIDs). I think this is a fundamentally wrong way to go about doing what you want to do. It presumes that DDL-level manipulation of global sequences is exactly like local sequences; an assumption that is obviously false. What you really want is something vaguely like nextval but applied to a distinct type of object. That is, I think we first need a different kind of object called a "global sequence" with its own DDL operations. The nearby thread about "message queue" objects seems rather closely related. Perhaps it would be fruitful to think about the commonalities involved in two (or more?) new relkinds for global objects. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tuplesort memory usage: grow_memtuples
On 16 October 2012 13:42, Peter Geoghegan wrote: > On 14 October 2012 09:19, Simon Riggs wrote: >> This is a very useful optimisation, for both the low and the high end. > > Well, I'm about ready to mark this one "ready for committer". There is > this outstanding issue in my revision of August 17th, though: > > + /* > +* XXX: This feels quite brittle; is there a better > principled approach, > +* that does not violate modularity? > +*/ > + newmemtupsize = (int) floor(oldmemtupsize * allowedMem / > memNowUsed); > + state->fin_growth = true; > > I suppose that I should just recognise that this *is* nothing more > than a heuristic, and leave it at that. It's a simple and reasonable heuristic, and a great improvement on the previous situation. If you describe in detail that it is a heuristic and why that is proposed over other approaches that should be sufficient for future generations to read and understand. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
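For readers skimming the thread, the heuristic under discussion can be sketched in Python — the function name and scaffolding are illustrative, not the actual tuplesort.c code:

```python
import math

def grow_memtuples(old_size, mem_now_used, allowed_mem, fin_growth):
    """Sketch of the proposed growth policy: keep doubling the memtuples
    array while memory permits; on the last step, instead of doubling,
    extrapolate from the observed memory-per-slot ratio how many slots
    fit in the remaining budget (the quoted one-line heuristic)."""
    if fin_growth:
        return old_size, True          # final, clamped growth already done
    if mem_now_used * 2 <= allowed_mem:
        return old_size * 2, False     # room to double as usual
    # The quoted heuristic: newmemtupsize = floor(old * allowedMem / used)
    return int(math.floor(old_size * allowed_mem / mem_now_used)), True
```

The point of the clamp is that the final allocation lands close to, but not over, `allowedMem`, rather than wasting up to half the budget by stopping at the last successful doubling.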
Re: [HACKERS] Global Sequences
On 16 October 2012 13:54, Peter Eisentraut wrote: > On 10/15/12 5:33 PM, Simon Riggs wrote: >> There are a few options >> 1) Manual separation of the value space, so that N1 has 50% of >> possible values and N2 has 50%. That has problems when we reconfigure >> the cluster, and requires complex manual reallocation of values. So it >> starts good but ends badly. >> 2) Automatic separation of the value space. This could mimic the >> manual operation, so it does everything for you - but thats just >> making a bad idea automatic >> 3) Lazy allocation from the value space. When a node is close to >> running out of values, it requests a new allocation and coordinates >> with all nodes to confirm the new allocation is good. > > What would the allocation service look like? Is it another PostgreSQL > server? What's the communication protocol? How would backups work? > Crash recovery? I've proposed a plugin for the allocation only. So the allocation looks like anything you want. Crash recovery and backups would not need changes. > Option 4 is of course to use UUIDs. That is a user level option. If user chooses sequences, then we must support them. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Global Sequences
On Tuesday, October 16, 2012 02:58:11 PM Andrew Dunstan wrote: > On 10/16/2012 08:54 AM, Peter Eisentraut wrote: > > Option 4 is of course to use UUIDs. > > Yeah, I was wondering what this would really solve that using UUIDs > wouldn't solve. Large indexes over random values perform notably worse than mostly/completely ordered ones as they can be perfectly packed. Beside the fact that uuids have 2/4 times the storage overhead of int4/int8. That has influences both in query and modification performance. Also, not allowing plainly numeric pk's makes porting an application pretty annoying... Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Global Sequences
On 10/16/2012 08:54 AM, Peter Eisentraut wrote: Option 4 is of course to use UUIDs. Yeah, I was wondering what this would really solve that using UUIDs wouldn't solve. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Global Sequences
On 10/15/12 5:33 PM, Simon Riggs wrote: > There are a few options > 1) Manual separation of the value space, so that N1 has 50% of > possible values and N2 has 50%. That has problems when we reconfigure > the cluster, and requires complex manual reallocation of values. So it > starts good but ends badly. > 2) Automatic separation of the value space. This could mimic the > manual operation, so it does everything for you - but that's just > making a bad idea automatic > 3) Lazy allocation from the value space. When a node is close to > running out of values, it requests a new allocation and coordinates > with all nodes to confirm the new allocation is good. What would the allocation service look like? Is it another PostgreSQL server? What's the communication protocol? How would backups work? Crash recovery? Option 4 is of course to use UUIDs. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
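Option 3 is straightforward to picture. A minimal Python sketch, with a simulated coordinator standing in for whatever consensus service would answer the allocation requests (all names hypothetical):

```python
class Coordinator:
    """Simulated allocation service handing out disjoint value ranges.
    In reality this step needs agreement among all nodes; here it is a
    single object purely for illustration."""
    def __init__(self, block_size):
        self.block_size = block_size
        self.next_start = 1

    def lease(self):
        start = self.next_start
        self.next_start += self.block_size
        return start, start + self.block_size - 1


class NodeSequence:
    """Per-node sequence: serves values from its leased range and only
    contacts the coordinator when the range runs out (lazy allocation)."""
    def __init__(self, coordinator):
        self.coordinator = coordinator
        self.current, self.last = 1, 0     # empty range forces a lease

    def nextval(self):
        if self.current > self.last:
            self.current, self.last = self.coordinator.lease()
        val = self.current
        self.current += 1
        return val
```

Two nodes sharing one coordinator then produce disjoint, locally ordered values — the "gappy but unique" behaviour this option trades for availability.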
Re: [HACKERS] [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown
On 15.10.2012 19:31, Fujii Masao wrote: On Mon, Oct 15, 2012 at 11:27 PM, Heikki Linnakangas wrote: On 15.10.2012 13:13, Heikki Linnakangas wrote: Oh, I didn't remember that we've documented the specific structs that we pass around. It's quite bogus anyway to explain the messages the way we do currently, as they are actually dependent on the underlying architecture's endianess and padding. I think we should refactor the protocol to not transmit raw structs, but use pq_sentint and friends to construct the messages. This was discussed earlier (see http://archives.postgresql.org/message-id/4fe2279c.2070...@enterprisedb.com), I think there's consensus that 9.3 would be a good time to do that as we changed the XLogRecPtr format anyway. This is what I came up with. The replication protocol is now architecture-independent. The WAL format itself is still architecture-independent, of course, but this is useful if you want to e.g use pg_receivexlog to back up a server that runs on a different platform. I chose the int64 format to transmit timestamps, even when compiled with --disable-integer-datetimes. Please review if you have the time.. Thanks for the patch! When I ran pg_receivexlog, I encountered the following error. Yeah, clearly I didn't test this near enough... I fixed the bugs you bumped into, new version attached. + hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64); + hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char); These should be macro, to avoid calculation overhead? The compiler will calculate this at compilation time, it's going to be a constant at runtime. - Heikki diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index 3d72a16..5a32517 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -1366,7 +1366,8 @@ The commands accepted in walsender mode are: WAL data is sent as a series of CopyData messages. 
(This allows other information to be intermixed; in particular the server can send an ErrorResponse message if it encounters a failure after beginning - to stream.) The payload in each CopyData message follows this format: + to stream.) The payload of each CopyData message from server to the + client contains a message of one of the following formats: @@ -1390,34 +1391,32 @@ The commands accepted in walsender mode are: - Byte8 + Int64 - The starting point of the WAL data in this message, given in - XLogRecPtr format. + The starting point of the WAL data in this message. - Byte8 + Int64 - The current end of WAL on the server, given in - XLogRecPtr format. + The current end of WAL on the server. - Byte8 + Int64 - The server's system clock at the time of transmission, - given in TimestampTz format. + The server's system clock at the time of transmission, as + microseconds since midnight on 2000-01-01. @@ -1445,25 +1444,12 @@ The commands accepted in walsender mode are: continuation records can be sent in different CopyData messages. - Note that all fields within the WAL data and the above-described header - will be in the sending server's native format. Endianness, and the - format for the timestamp, are unpredictable unless the receiver has - verified that the sender's system identifier matches its own - pg_control contents. - - If the WAL sender process is terminated normally (during postmaster shutdown), it will send a CommandComplete message before exiting. This might not happen during an abnormal shutdown, of course. - The receiving process can send replies back to the sender at any time, - using one of the following message formats (also in the payload of a - CopyData message): - - - @@ -1495,12 +1481,23 @@ The commands accepted in walsender mode are: - Byte8 + Int64 - The server's system clock at the time of transmission, - given in TimestampTz format. + The server's system clock at the time of transmission, as + microseconds since midnight on 2000-01-01. 
+ + + + + + Byte1 + + + + 1 means that the client should reply to this message as soon as + possible, to avoid a timeout disconnect. 0 otherwise. @@ -1512,6 +1509,12 @@ The commands accepted in walsender mode are: + The receiving process can send re
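The practical upshot of the refactoring is that a receiver can parse the 'w' (WAL data) message header as fixed-width big-endian fields instead of guessing the sender's struct layout. A sketch of both sides in Python, following the field list in the documentation diff above (helper names are mine):

```python
import struct
from datetime import datetime, timedelta, timezone

# Timestamps on the wire are microseconds since midnight 2000-01-01,
# per the revised protocol documentation.
PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)

def pack_wal_data_header(data_start, wal_end, sent_at):
    """Build a 'w' message header: three big-endian Int64s -- start LSN
    of the payload, current end of WAL, and the server send time."""
    usecs = int((sent_at - PG_EPOCH) / timedelta(microseconds=1))
    return b"w" + struct.pack("!qqq", data_start, wal_end, usecs)

def unpack_wal_data_header(msg):
    """Inverse of the above; identical on every architecture."""
    assert msg[:1] == b"w"
    data_start, wal_end, usecs = struct.unpack_from("!qqq", msg, 1)
    return data_start, wal_end, PG_EPOCH + timedelta(microseconds=usecs)
```

Because every field has a defined width and byte order, the round trip no longer depends on the endianness or padding of the sending server — which is what makes it feasible to use pg_receivexlog against a server on a different platform.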
Re: [HACKERS] tuplesort memory usage: grow_memtuples
On 14 October 2012 09:19, Simon Riggs wrote: > This is a very useful optimisation, for both the low and the high end. Well, I'm about ready to mark this one "ready for committer". There is this outstanding issue in my revision of August 17th, though: + /* +* XXX: This feels quite brittle; is there a better principled approach, +* that does not violate modularity? +*/ + newmemtupsize = (int) floor(oldmemtupsize * allowedMem / memNowUsed); + state->fin_growth = true; I suppose that I should just recognise that this *is* nothing more than a heuristic, and leave it at that. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Global Sequences
On 16 October 2012 13:26, Markus Wanner wrote: > Why does a "Global Sequences" API necessarily hook at the nextval() and > setval() level? That sounds like it yields an awkward amount of > duplicate work. Reading this thread, so far it looks like we agree that > option 3) is the most feasible optimization (the strict ordering being > the un-optimized starting point). Do we really need an API that allows > for implementations of options 1) and 2)? Where else would you put the hook? The hook's location as described won't change whether you decide you want 1, 2 or 3. > What I'd appreciate more is a common implementation for option 3) with > an API to plug in different solutions to the underlying consensus problem. Implementations will be similar, differing mostly in the topology and transport layer, which means its not going to be possible to provide such a thing initially without slowing it down to the point we don't actually get it at all. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Global Sequences
On 10/16/2012 12:47 AM, Josh Berkus wrote: > I'd also love to hear from the PostgresXC folks on whether this solution > works for them. Postgres-R too. In Postgres-R, option 3) is implemented. Though, by default sequences work just like on a single machine, giving you monotonically increasing sequence values - independent from the node you call nextval() from. IMO that's the user's expectation. (And yes, this has a performance penalty. But no, there's no compromise in availability). It is implemented very much like the per-backend cache we already have in vanilla Postgres, but taken to the per-node level. This gives the user a nice compromise between strongly ordered and entirely random values, allowing fine-tuning the trade off between performance and laziness in the ordering (think of CACHE 10 vs. CACHE 1). > If it works for all three of those > tools, it's liable to work for any potential new tool. In Postgres-R, this per-node cache uses additional attributes in the pg_sequence system catalog to store state of this cache. This is something I'm sure is not feasible to do from within a plugin. Why does a "Global Sequences" API necessarily hook at the nextval() and setval() level? That sounds like it yields an awkward amount of duplicate work. Reading this thread, so far it looks like we agree that option 3) is the most feasible optimization (the strict ordering being the un-optimized starting point). Do we really need an API that allows for implementations of options 1) and 2)? What I'd appreciate more is a common implementation for option 3) with an API to plug in different solutions to the underlying consensus problem. Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] First draft of snapshot building design document
Hi All, On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote: > The design document [2] really just explains the problem (which is the > need for catalog metadata at a point in time to make sense of heap > tuples), without describing the solution that this patch offers with > any degree of detail. Rather, [2] says "How we build snapshots is > somewhat intricate and complicated and seems to be out of scope for > this document", which is unsatisfactory. I look forward to reading the > promised document that describes this mechanism in more detail. Here's the first version of the promised document. I hope it answers most of the questions. Input welcome! Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services = Snapshot Building = :author: Andres Freund, 2ndQuadrant Ltd == Why do we need timetravel catalog access == When doing WAL decoding (see DESIGN.txt for reasons to do so) we need to know how the catalog looked at the point a record was inserted into the WAL, because without that information we don't know anything about the record beyond its length; it's just an arbitrary bunch of bytes without further information. Unfortunately, due to the possibility of the table definition changing, we cannot just access a newer version of the catalog and assume the table definition is just the same. If only the type information were required it might be enough to annotate the WAL records with a bit more information (table oid, table name, column name, column type), but as we want to be able to convert the output to more useful formats like text we need to be able to call output functions. Those need a normal environment, including the usual caches and normal catalog access to look up operators, functions and other types. Our solution to this is to add the capability to access the catalog in a way that makes it look like it did when the record was inserted into the WAL.
The locking used during WAL generation guarantees the catalog is/was in a consistent state at that point. Interesting cases include: - enums - composite types - extension types - non-C functions - relfilenode to table oid mapping Due to Postgres' MVCC nature, regular modifications of a table's contents are theoretically non-destructive. The problem is that there is no way to access arbitrary points in time even if the data for them is there. This module adds the capability to do so in the very limited set of circumstances we need it in for WAL decoding. It does *not* provide a facility to do so in general. A 'Snapshot' is the data structure used in Postgres to describe which tuples are visible and which are not. We need to build a Snapshot which can be used to access the catalog the way it looked when the WAL record was inserted. Restrictions: * Only works for catalog tables * Snapshot modifications are somewhat expensive * it cannot build initial visibility information for every point in time, it needs a specific set of circumstances for that * limited window in which we can build snapshots == How do we build timetravel snapshots == Hot Standby added infrastructure to build snapshots from WAL during recovery in the 9.0 release. Most of that can be reused for our purposes. We cannot reuse all of the HS infrastructure because: * we are not in recovery * we need to look *inside* transactions * we need the capability to have multiple different snapshots around at the same time We need to provide two kinds of snapshots that are implemented rather differently in their plain Postgres incarnation: * SnapshotNow * SnapshotMVCC We need both because if any operators use normal functions they will get executed with SnapshotMVCC semantics, while the catcache and related things will rely on SnapshotNow semantics. Note that SnapshotNow here cannot be a normal SnapshotNow because we wouldn't access the old version of the catalog in that case.
Instead it is something like an MVCC snapshot with the correct visibility information. That also means that snapshot won't have some of the race issues a normal SnapshotNow has. Every time a transaction that changed the catalog commits, all other transactions will need a new snapshot that marks that transaction (and its subtransactions) as visible. Our snapshot representation is a bit different from normal snapshots, but we still reuse the normal SnapshotData struct: * Snapshot->xip contains all transactions we consider committed * Snapshot->subxip contains all transactions belonging to our transaction, including the toplevel one The meaning of ->xip is inverted in comparison with non-timetravel snapshots, because usually only a tiny percentage of committed transactions will have modified the catalog between xmin and xmax. It also makes subtransaction handling easier (we cannot query pg_subtrans). == Building of initial snapshot == We can start building an initial snapshot as soon as we find either an XLOG_RUNNING_XACTS or an XLOG_CHECKPOINT
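The inverted ->xip convention can be illustrated with a small Python sketch of the visibility test (field names follow the description above; the function itself is illustrative, not the patch's code):

```python
def xid_visible(xid, snapshot):
    """Visibility under a timetravel snapshot as described above:
    'xip' lists catalog-modifying transactions treated as *committed* --
    the inverse of a normal MVCC snapshot, where xip lists the
    in-progress ones -- and 'subxip' is our own transaction tree."""
    if xid in snapshot["subxip"]:
        return True                    # our own (sub)transaction
    if xid < snapshot["xmin"]:
        return True                    # committed before the window
    if xid >= snapshot["xmax"]:
        return False                   # started after the snapshot
    # In the xmin..xmax window, only explicitly listed xids are visible.
    return xid in snapshot["xip"]
```

The inversion pays off exactly because the listed set — catalog-modifying committers between xmin and xmax — is tiny compared with the set of all committed transactions.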
Re: [HACKERS] [RFC] CREATE QUEUE (log-only table) for londiste/pgQ compatibility
On 10/16/2012 11:29 AM, Hannu Krosing wrote: On 10/16/2012 11:18 AM, Simon Riggs wrote: On 16 October 2012 09:56, Hannu Krosing wrote: Hallo postgresql and replication hackers This mail is an additional RFC which proposes a simple way to extend the new logical replication feature so it can cover most usages of skytools/pgq/londiste While the current work for BDR/LCR (bi-directional replication/logical replication) using WAL is theoretically enought to cover _replication_ offered by Londiste it falls short in one important way - there is currently no support for pure queueing, that is for "streams" of data which does not need to be stored in the source database. Fortunately there is a simple solution - do not store it in the source database :) The only thing needed for adding this is to have a table type which a) generates a INSERT record in WAL and b) does not actually store the data in a local file If implemented in userspace it would be a VIEW (or table) with a before/instead trigger which logs the inserted data and then cancels the insert. I'm sure this thing could be implemented, but I leave the tech discussion to those who are currently deep in WAL generation/reconstruction . If we implement logged only tables / queues we would not only enable a more performant pgQ replacement for implementing full Londiste / skytools functionality but would also become a very strong player to be used as persistent basis for message queueing solutions like ActiveMQ, StorMQ, any Advanced Message Queuing Protocol (AMQP) and so on. Hmm, I was assuming that we'd be able to do that by just writing extra WAL directly. But now you've made me think about it, that would be very ugly. Doing it this was, as you suggest, would allow us to write WAL records for queuing/replication to specific queue ids. It also allows us to have privileges assigned. So this looks like a good idea and might even be possible for 9.3. 
I've got a feeling we may want the word QUEUE again in the future, so I think we should call this a MESSAGE QUEUE. CREATE MESSAGE QUEUE foo; DROP MESSAGE QUEUE foo; I would like this to be very similar to a table, so it would be CREATE MESSAGE QUEUE(fieldname type, ...) foo; perhaps even allowing defaults and constraints. again, this depends on how complecxt the implementation would be. for the receiving side it would look like a table with only inserts, and in this case there could even be a possibility to use it as a remote log table. To clarify - this is intended to be a mirror image of UNLOGGED table That is , as much as possible a full table, except that no data gets written, which means that a) indexes do not make any sense b) exclusion and unique constraints dont make any sense c) select, update and delete always see an empty table all these should probably throw and error, analogous to how VIEWs currently work. It could be also described as a write-only table, except that it is possible to materialise it as a real table on the receiving side GRANT INSERT ON MESSAGE QUEUE foo TO ...; REVOKE INSERT ON MESSAGE QUEUE foo TO ...; Rules wouldn't. DELETE and UPDATE wouldn't work, nor would SELECT. Things for next release: Triggers, SELECT sees a stream of changes, CHECK clauses to constrain what can be written. One question: would we require the INSERT statement to parse against a tupledesc, or would it be just a single blob of TEXT or can we send any payload? I'd suggest just a single blob of TEXT, since that can be XML or JSON etc easily enough. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] CREATE QUEUE (log-only table) for londiste/pgQ compatibility
On 16 October 2012 10:29, Hannu Krosing wrote: > I would like this to be very similar to a table, so it would be > > CREATE MESSAGE QUEUE(fieldname type, ...) foo; > > perhaps even allowing defaults and constraints. again, this > depends on how complex the implementation would be. Presumably just CHECK constraints, not UNIQUE or FKs. Indexes would not be allowed. > for the receiving side it would look like a table with only inserts, > and in this case there could even be a possibility to use it as > a remote log table. The queue data would be available via the API, so it can look like anything. It would be good to identify this with a new rmgr id. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] CREATE QUEUE (log-only table) for londiste/pgQ compatibility
On 10/16/2012 11:18 AM, Simon Riggs wrote: On 16 October 2012 09:56, Hannu Krosing wrote: Hallo postgresql and replication hackers This mail is an additional RFC which proposes a simple way to extend the new logical replication feature so it can cover most usages of skytools/pgq/londiste While the current work for BDR/LCR (bi-directional replication/logical replication) using WAL is theoretically enought to cover _replication_ offered by Londiste it falls short in one important way - there is currently no support for pure queueing, that is for "streams" of data which does not need to be stored in the source database. Fortunately there is a simple solution - do not store it in the source database :) The only thing needed for adding this is to have a table type which a) generates a INSERT record in WAL and b) does not actually store the data in a local file If implemented in userspace it would be a VIEW (or table) with a before/instead trigger which logs the inserted data and then cancels the insert. I'm sure this thing could be implemented, but I leave the tech discussion to those who are currently deep in WAL generation/reconstruction . If we implement logged only tables / queues we would not only enable a more performant pgQ replacement for implementing full Londiste / skytools functionality but would also become a very strong player to be used as persistent basis for message queueing solutions like ActiveMQ, StorMQ, any Advanced Message Queuing Protocol (AMQP) and so on. Hmm, I was assuming that we'd be able to do that by just writing extra WAL directly. But now you've made me think about it, that would be very ugly. Doing it this was, as you suggest, would allow us to write WAL records for queuing/replication to specific queue ids. It also allows us to have privileges assigned. So this looks like a good idea and might even be possible for 9.3. I've got a feeling we may want the word QUEUE again in the future, so I think we should call this a MESSAGE QUEUE. 
CREATE MESSAGE QUEUE foo; DROP MESSAGE QUEUE foo; I would like this to be very similar to a table, so it would be CREATE MESSAGE QUEUE(fieldname type, ...) foo; perhaps even allowing defaults and constraints. again, this depends on how complecxt the implementation would be. for the receiving side it would look like a table with only inserts, and in this case there could even be a possibility to use it as a remote log table. GRANT INSERT ON MESSAGE QUEUE foo TO ...; REVOKE INSERT ON MESSAGE QUEUE foo TO ...; Rules wouldn't. DELETE and UPDATE wouldn't work, nor would SELECT. Things for next release: Triggers, SELECT sees a stream of changes, CHECK clauses to constrain what can be written. One question: would we require the INSERT statement to parse against a tupledesc, or would it be just a single blob of TEXT or can we send any payload? I'd suggest just a single blob of TEXT, since that can be XML or JSON etc easily enough. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] CREATE QUEUE (log-only table) for londiste/pgQ compatibility
On 16 October 2012 09:56, Hannu Krosing wrote: > Hallo postgresql and replication hackers > > This mail is an additional RFC which proposes a simple way to extend the > new logical replication feature so it can cover most usages of > skytools/pgq/londiste > > While the current work for BDR/LCR (bi-directional replication/logical > replication) using WAL is theoretically enought to cover _replication_ > offered by > Londiste it falls short in one important way - there is currently no support > for pure > queueing, that is for "streams" of data which does not need to be stored in > the source > database. > > Fortunately there is a simple solution - do not store it in the source > database :) > > The only thing needed for adding this is to have a table type which > > a) generates a INSERT record in WAL > > and > > b) does not actually store the data in a local file > > If implemented in userspace it would be a VIEW (or table) with a > before/instead > trigger which logs the inserted data and then cancels the insert. > > I'm sure this thing could be implemented, but I leave the tech discussion to > those who are currently deep in WAL generation/reconstruction . > > If we implement logged only tables / queues we would not only enable a more > performant pgQ replacement for implementing full Londiste / skytools > functionality > but would also become a very strong player to be used as persistent basis > for message queueing solutions like ActiveMQ, StorMQ, any Advanced Message > Queuing Protocol (AMQP) and so on. Hmm, I was assuming that we'd be able to do that by just writing extra WAL directly. But now you've made me think about it, that would be very ugly. Doing it this was, as you suggest, would allow us to write WAL records for queuing/replication to specific queue ids. It also allows us to have privileges assigned. So this looks like a good idea and might even be possible for 9.3. 
I've got a feeling we may want the word QUEUE again in the future, so I think we should call this a MESSAGE QUEUE. CREATE MESSAGE QUEUE foo; DROP MESSAGE QUEUE foo; GRANT INSERT ON MESSAGE QUEUE foo TO ...; REVOKE INSERT ON MESSAGE QUEUE foo TO ...; Rules wouldn't. DELETE and UPDATE wouldn't work, nor would SELECT. Things for next release: Triggers, SELECT sees a stream of changes, CHECK clauses to constrain what can be written. One question: would we require the INSERT statement to parse against a tupledesc, or would it be just a single blob of TEXT or can we send any payload? I'd suggest just a single blob of TEXT, since that can be XML or JSON etc easily enough. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [RFC] CREATE QUEUE (log-only table) for londiste/pgQ compatibility
Hello PostgreSQL and replication hackers This mail is an additional RFC which proposes a simple way to extend the new logical replication feature so it can cover most usages of skytools/pgq/londiste While the current work for BDR/LCR (bi-directional replication/logical replication) using WAL is theoretically enough to cover _replication_ offered by Londiste, it falls short in one important way - there is currently no support for pure queueing, that is, for "streams" of data which do not need to be stored in the source database. Fortunately there is a simple solution - do not store it in the source database :) The only thing needed for adding this is to have a table type which a) generates an INSERT record in WAL and b) does not actually store the data in a local file If implemented in userspace it would be a VIEW (or table) with a before/instead trigger which logs the inserted data and then cancels the insert. I'm sure this thing could be implemented, but I leave the tech discussion to those who are currently deep in WAL generation/reconstruction. If we implement log-only tables / queues we would not only enable a more performant pgQ replacement for implementing full Londiste / skytools functionality but would also become a very strong player to be used as a persistent basis for message queueing solutions like ActiveMQ, StorMQ, any Advanced Message Queuing Protocol (AMQP) implementation and so on. comments ? Hannu Krosing -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
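The a)/b) behaviour above is easy to model. A toy Python sketch of a log-only table — everything here is illustrative, not a proposed API:

```python
class LogOnlyQueue:
    """Models the proposed semantics: INSERT emits a WAL-style record
    for downstream consumers, but nothing is kept locally, so reads
    always see an empty table."""
    def __init__(self, name, wal):
        self.name = name
        self.wal = wal                     # shared append-only log

    def insert(self, payload):
        self.wal.append(("INSERT", self.name, payload))   # a) WAL record

    def select(self):
        return []                          # b) no local storage


wal = []
q = LogOnlyQueue("events", wal)
q.insert('{"type": "signup", "user": 42}')
q.insert('{"type": "login", "user": 42}')
assert q.select() == []     # the source side stores nothing
assert len(wal) == 2        # but both messages reached the log
```

A consumer on the receiving side would read the stream and, if desired, materialise it into a real table — which is all that Londiste-style queueing needs from the source.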
Re: [HACKERS] Global Sequences
On 2012-10-15 23:33, Simon Riggs wrote: So, proposal is to allow nextval() allocation to access a plugin, rather than simply write a WAL record and increment. If the plugin is loaded all sequences call it (not OIDs). +1. It is currently impossible to alter nextvals behaviour, without making changes in core. It is possible to define an alternative implementation and try to force to use it by using the search_path, but serial datatypes are always bound to pg_catalog.nextval(). This would enable every distributed PostgreSQL system to make a cleaner implementation for global sequences than they currently have, and would also encourage reuse of distributed nextval implementations. regards, Yeb -- Yeb Havinga http://www.mgrid.net/ Mastering Medical Data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers