[HACKERS] preproc.c compilation error
Hello all: This is my first mail to the list. Yesterday I tried to compile PostgreSQL on my Linux machine, but an error kept bothering me.

Environment:
- Ubuntu 13.10
- Linux rugal-TM8473 3.11.0-15-generic #23-Ubuntu SMP Mon Dec 9 18:17:04 UTC 2013 x86_64 GNU/Linux
- gcc version 4.8.1 (Ubuntu/Linaro 4.8.1-10ubuntu9)
- bison (GNU Bison) 2.7.12-4996 (version may be a bit dated)

[rugal@rugal-TM8473 postgresql] git pull
Already up-to-date.

make -C preproc all
make[4]: Entering directory `/home/rugal/workspace/postgresql/src/interfaces/ecpg/preproc'
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -pthread -D_REENTRANT -D_THREAD_SAFE -D_POSIX_PTHREAD_SEMANTICS -DECPG_COMPILE -I../include -I../../../../src/interfaces/ecpg/include -I. -I. -DMAJOR_VERSION=4 -DMINOR_VERSION=10 -DPATCHLEVEL=0 -I../../../../src/include -D_GNU_SOURCE -c -o preproc.o preproc.c -MMD -MP -MF .deps/preproc.Po
In file included from preproc.y:15004:0:
pgc.l: In function ‘base_yylex’:
pgc.l:403:24: error: ‘ET_FATAL’ undeclared (first use in this function)
  BEGIN(state_before);
                        ^
pgc.l:403:24: note: each undeclared identifier is reported only once for each function it appears in
In file included from preproc.y:15004:0:
pgc.l: In function ‘parse_include’:
pgc.l:1366:28: error: ‘ET_FATAL’ undeclared (first use in this function)
  if (!yyin)
                            ^
make[4]: *** [preproc.o] Error 1

The problem kept bothering me even after I used [./configure --enable-depend], and I found a solution. The latest source removes ET_FATAL from type.h:

--- a/src/interfaces/ecpg/preproc/type.h
+++ b/src/interfaces/ecpg/preproc/type.h
@@ -186,7 +186,7 @@ struct assignment
 enum errortype
 {
-	ET_WARNING, ET_ERROR, ET_FATAL
+	ET_WARNING, ET_ERROR
 };

After adding ET_FATAL back to this enum, compilation became successful! Are there any mistakes or typos in the latest version? Regards!
Re: [HACKERS] preproc.c compilation error
And even after I successfully compiled PG, I got a warning:

preproc.y: In function ‘vmmerror’:
preproc.y:76:2: warning: enumeration value ‘ET_FATAL’ not handled in switch [-Wswitch]
  switch(type)
  ^

2014/1/9 Rugal Bernstein <ryujinwr...@gmail.com>:
> [quoted original report, trimmed]
Re: [HACKERS] [bug fix] multibyte messages are displayed incorrectly on the client
On Tue, Jan 7, 2014 at 8:56 AM, MauMau <maumau...@gmail.com> wrote:
> I suppose we know (or at least believe) those encodings during backend startup:
> * client encoding - the client_encoding parameter passed in the startup packet, or, if that's not present, the client_encoding GUC value.
> * server encoding - the encoding of strings gettext() returns; that is what GetPlatformEncoding() returns.

Suppose the startup packet itself is malformed. How will you report the error?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Standalone synchronous master
From: Andres Freund <and...@2ndquadrant.com>
> On 2014-01-08 14:42:37 -0800, Joshua D. Drake wrote:
>> If we have the following: db0 -> db1:down
>> Using the model (as I understand it) that is being discussed, we have increased our failure rate, because the moment db1 is down we also lose db0. The node db0 may be up, but if it isn't going to process transactions it is useless. I can tell you that I have exactly 0 customers who would want that model, because a single node failure would cause a double node failure.
>
> That's why you should configure a second standby as another (candidate) synchronous replica, also listed in synchronous_standby_names.

Let me ask a (probably) stupid question. How is sync rep different from RAID-1? When I first saw sync rep, I expected it to provide the same guarantees as RAID-1 in terms of durability (data is always mirrored on two servers) and availability (if one server goes down, the other continues full service). The cost is reasonable with RAID-1, but sync rep requires a high cost to get both durability and availability --- three servers. Am I expecting too much?

Regards
MauMau
Re: [HACKERS] Failed assertion root-hasLateralRTEs on initsplan.c
On Tue, Jan 7, 2014 at 1:15 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Emre Hasegeli <e...@hasegeli.com> writes:
>> I get an assertion failure on initsplan.c line 1325 while executing the following query on HEAD (edc43458d797a5956f4bf39af18cf62abb0077db). It works fine without --enable-cassert.
>>
>> update subscriber set properties = hstore(a)
>>     from (select firstname, lastname from player
>>           where player.id = subscriber.id) as a;
>
> Hm, AFAICS this query should absolutely *not* work; the reference to subscriber.id inside the sub-select is illegal. It might be legal with LATERAL, but not otherwise. So I think this is a parser bug, and there's nothing wrong with the planner's Assert. 9.2 and earlier throw the error I'd expect, so probably something in the LATERAL patches broke this case; will look.
>
> The next question is if we should allow it with LATERAL. That would essentially be treating subscriber as having implicitly appeared at the start of the FROM list, which I guess is all right ... but does anyone want to argue against it? I seem to recall some old discussions about allowing the update target to be explicitly shown in FROM, in case you wanted, say, to left join it against something else. Allowing this implicit appearance might limit our options if we ever get around to trying to do that. On the other hand, those discussions were a long time back, so maybe it'll never happen anyway.

I still think that would be a good thing to do, but I don't see a problem. The way I imagine it would work is: if the alias used for the update target also appears in the FROM clause, then we treat them as the same thing (after checking that they refer to the same table in both cases). Otherwise, we add the update target as an additional from-list item.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] [bug fix] multibyte messages are displayed incorrectly on the client
From: Robert Haas <robertmh...@gmail.com>
> Suppose the startup packet itself is malformed. How will you report the error?

I think we have no choice but to report the error in English, because we don't know what the client wants.

Regards
MauMau
Re: [HACKERS] preproc.c compilation error
You have to rebuild the auto-generated pgc.c, preproc.c and preproc.y, or simply remove them to force a rebuild.

Michael
--
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at gmail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL
Re: [HACKERS] Recovery to backup point
On 12/09/2013 03:05 PM, MauMau wrote:
> From: Heikki Linnakangas <hlinnakan...@vmware.com>
>> Thanks. Looks sane, although I don't much like the proposed interface to trigger this, setting recovery_target_time='backup_point'. What the code actually does is to stop recovery as soon as you reach consistency, which might not have anything to do with a backup. If you set it on a warm standby server, for example, it will end recovery as soon as it reaches consistency, but there was probably no backup taken at that point.
>
> Thank you for reviewing so rapidly. I thought I would check the end of backup in recoveryStopsHere(), by matching XLOG_BACKUP_END and ControlFile->backupStartPoint for backups taken on the primary, and comparing the current redo location with ControlFile->backupEndPoint for backups taken on the standby. However, that would duplicate much code in XLOG_BACKUP_END redo processing and checkRecoveryConsistency(). Besides, the code works only when the user explicitly requests recovery to the backup point, not when he starts the warm standby server. (I wonder if I'm answering correctly.)

I was thinking that you have a warm standby server, and you decide to stop using it as a warm standby and promote it. You'd do that by stopping it, modifying recovery.conf to remove standby_mode and set a recovery target, and then restarting.

After some refactoring and fixing bugs in the existing code, I came up with the attached patch. I called the option simply recovery_target, with the only allowed value of 'immediate'. IOW, if you want to stop recovery as early as possible, you add recovery_target='immediate' to recovery.conf. Now that we have four different options to set the recovery target with, I rearranged the docs slightly. How does this look to you?
- Heikki

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index a2361d7..854b5fd 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1124,7 +1124,7 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'
    <para>
     If you want to recover to some previous point in time (say, right before
     the junior DBA dropped your main transaction table), just specify the
-    required stopping point in <filename>recovery.conf</>.  You can specify
+    required <link linkend="recovery-target-settings">stopping point</link> in <filename>recovery.conf</>.  You can specify
     the stop point, known as the <quote>recovery target</>, either by
     date/time, named restore point or by completion of a specific transaction
     ID.  As of this writing only the date/time and named restore point options
diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index 550cdce..a723338 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -199,8 +199,33 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
   <sect1 id="recovery-target-settings">
     <title>Recovery Target Settings</title>
+    <para>
+     By default, recovery will recover to the end of the WAL log. The
+     following parameters can be used to specify an earlier stopping point.
+     At most one of <varname>recovery_target</>,
+     <varname>recovery_target_name</>, <varname>recovery_target_time</>, or
+     <varname>recovery_target_xid</> can be specified.
+    </para>
     <variablelist>
+     <varlistentry id="recovery-target" xreflabel="recovery_target_name">
+      <term><varname>recovery_target</><literal> = 'immediate'</literal></term>
+      <indexterm>
+        <primary><varname>recovery_target</> recovery parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        This parameter specifies that recovery should end as soon as a
+        consistency is reached, ie. as early as possible. When restoring from an
+        online backup, this means the point where taking the backup ended.
+       </para>
+       <para>
+        Technically, this is a string parameter, but <literal>'immediate'</literal>
+        is currently the only allowed value.
+       </para>
+      </listitem>
+     </varlistentry>
+
     <varlistentry id="recovery-target-name" xreflabel="recovery_target_name">
      <term><varname>recovery_target_name</varname> (<type>string</type>)
@@ -212,10 +237,6 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
       <para>
        This parameter specifies the named restore point, created with
        <function>pg_create_restore_point()</> to which recovery will proceed.
-       At most one of <varname>recovery_target_name</>,
-       <xref linkend="recovery-target-time"> or
-       <xref linkend="recovery-target-xid"> can be specified.  The default is to
-       recover to the end of the WAL log.
       </para>
      </listitem>
     </varlistentry>
@@ -231,10 +252,6 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
       <para>
        This parameter specifies the time stamp up to which recovery
        will proceed.
-       At most one of <varname>recovery_target_time</>,
-       <xref linkend="recovery-target-name"> or
-
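With the proposed option, a minimal recovery.conf that ends recovery at the consistency point might look like this (a sketch; the restore_command path is just an example):

```
# recovery.conf -- end recovery as soon as consistency is reached
restore_command = 'cp /mnt/server/archivedir/%f %p'
recovery_target = 'immediate'
```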
Re: [HACKERS] Planning time in explain/explain analyze
Andreas, Robert,

* Andreas Karlsson (andr...@proxel.se) wrote:
> A patch with updated documentation is attached.

Thanks for working on this!

> On 01/02/2014 04:08 AM, Robert Haas wrote:
>> I'm wondering whether the time should be stored inside the PlannedStmt node instead of being passed around separately. One possible problem with the way you've done things here is that, in the case of a prepared statement, EXPLAIN ANALYZE will emit the time needed to call GetCachedPlan(), even if that function didn't do any replanning. Now you could argue that this is the correct behavior, but I think there's a decent argument that what we ought to show there is the amount of time that was required to create the plan that we're displaying, at the time it was created, rather than the amount of time that was required to figure out that we didn't need to replan.

Agreed, and really, it'd be nice to know *both*. If we're worried about the timing cost when going through a no-op GetCachedPlan(), then perhaps we don't add that, but if we actually *do* re-plan, it'd be handy to know that and to know the timing it took.

>> A minor side benefit of this approach is that you wouldn't need to change the signature for ExplainOnePlan(), which would avoid breaking extensions that may call it.

Agreed.

> A possible argument against printing the time to create the plan is that unless it was created when running EXPLAIN we will not know it.

This is perhaps the biggest point against- if we keep it this way...

> I do not think we want to always measure the time it took to generate a plan, due to slow clocks on some architectures. Also I feel that such a patch would be more invasive.

The slow-clock argument is really quite frustrating for those of us who are looking to add more and more metrics to PG. We're nowhere near the level that we need to be, and it shows (particularly for users coming from $OTHER-RDBMS). Perhaps we should try to come up with a solution to address those cases (turn off metrics à la turning off stats?) while not preventing us from gathering metrics on more reasonable systems.

Thanks,
Stephen
Re: [HACKERS] array_length(anyarray)
On 19 December 2013 08:05, Pavel Stehule <pavel.steh...@gmail.com> wrote:
> 2013/12/19 David Fetter <da...@fetter.org>:
>> On Wed, Dec 18, 2013 at 09:27:54PM +0100, Marko Tiikkaja wrote:
>>> Hi, Attached is a patch to add support for array_length(anyarray), which only works for one-dimensional arrays, returns 0 for empty arrays, and complains if the array's lower bound isn't 1. In other words, it does the right thing when used with the arrays people use 99% of the time.
>>
>> +1 for adding this.
>
> +1

I think that having two functions called array_length() that each behave differently for empty arrays would just lead to confusion. The SQL standard defines a function called cardinality() that returns the number of elements of a collection (array or multiset), so why don't we call it that?

> length should be irrelevant to the fact that an array starts from 1, 0 or anything else

Yes, this should just return the number of elements, and 0 for an empty array. How it should behave for multi-dimensional arrays is less clear, but I'd argue that it should return the total number of elements, i.e. cardinality('{{1,2},{3,4}}'::int[][]) = 4. That would make it consistent with the choices we've already made for unnest() and ordinality:

- cardinality(foo) = (select count(*) from unnest(foo)).
- unnest with ordinality would always result in ordinals in the range [1, cardinality].

Regards,
Dean
Re: [HACKERS] nested hstore patch
On 01/08/2014 04:29 PM, Oleg Bartunov wrote:
> Attached is a new version of the patch, which addresses most issues raised by Andres. It's long holidays in Russia now, and it happened that Teodor is traveling with family, so Teodor asked me to reply. Comments in the code will be added asap.

Oleg,

Please merge in the jsonb work and resubmit. See https://github.com/feodor/postgres/commits/jsonb_and_hstore — I note that this repo does not apparently contain any of your latest changes.

cheers

andrew
Re: [HACKERS] preproc.c compilation error
Thanks, it seems it is because pgc.c and the others were generated the first time, but [make distclean] did not clean them, which led to this problem! After [rm pgc.c preproc.h preproc.c preproc.y] it now compiles successfully without any warning! Thank you!

Java Developer; Mysql/Oracle DBA; C/Java/Python/Bash; Vim; Linux; Pianist; Hadoop/Spark/Storm
blog http://rugal.ml
github https://github.com/Rugal
twitter https://twitter.com/ryujinwrath

On Thu, Jan 9, 2014 at 9:45 PM, Michael Meskes <mes...@postgresql.org> wrote:
> You have to rebuild the auto-generated pgc.c, preproc.c and preproc.y, or simply remove them to force a rebuild.
>
> Michael
Re: [HACKERS] preproc.c compilation error
On 2014-01-09 22:12:53 +0800, Rugal Bernstein wrote:
> Thanks, it seems it is because pgc.c and the others were generated the first time, but [make distclean] did not clean them, which led to this problem! After [rm pgc.c preproc.h preproc.c preproc.y] it now compiles successfully without any warning! Thank you!

maintainer-clean removes those; distclean doesn't, because they are distributed so that people without flex/bison can compile postgres.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Fri, Dec 6, 2013 at 6:41 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> Agreed. Summarization of the data for LZ/chunk-wise encoding, especially for non-compressible (hundred tiny fields, all changed/half changed) or less compressible data (hundred tiny fields, half nulled) w.r.t. CPU usage, is as below:
> a. For hard disk, there is an overhead of 7~16% with LZ delta encoding and an overhead of 5~8% with chunk-wise encoding.
> b. For tmpfs (which means operating on RAM as disk), there is an overhead of 19~26% with LZ delta encoding and an overhead of 9~18% with chunk-wise encoding.
>
> There might be some variation in the data (in your last mail the overhead for the chunk-wise method on tmpfs was 12%), but in general the data suggests that chunk-wise encoding has less overhead than LZ encoding for non-compressible data, and for other data it is better or equal.
>
> Now, I think we have the following options for this patch:
> a. If the performance overhead for the worst case is acceptable (we can try to reduce it some more, but I don't think it will be anything drastic), then this can be done without any flag/option.
> b. Have it with a table-level option to enable/disable WAL compression.
> c. Drop this patch, as for worst cases there is some performance overhead.
> d. Go back and work more on it, if there are any further suggestions for improvement.

Based on the data posted previously for both approaches (lz_delta, chunk_wise_encoding) and the above options, I have improved the last version of the patch by keeping the chunk-wise approach and providing a table-level option to the user.

Changes in this version of the patch:
--------------------------------------
1. Implement decoding; it is almost similar to pglz_decompress, as the format to store encoded data is not changed much.
2. Provide a new reloption to specify WAL compression for update operations on a table:
       Create table tbl(c1 char(100)) With (compress_wal = true);
   Alternative options:
   a. compress_wal can take an operation as input, e.g. 'insert', 'update'
   b. use alternate syntax: Create table tbl(c1 char(100)) Compress Wal For Update;
   c. anything better?
3. Fixed the below 2 defects in encoding:
   a. In function pgrb_find_match(), if the last byte of a chunk matched, it considered the whole chunk as a match.
   b. If there is no match, it copies the chunk as-is to the encoded data; while copying, it was ignoring the last byte.
   Due to the defect fixes, the data can vary, but I don't think there can be any major change.

Points to consider
-----------------------
1. As the current algorithm stores the entry for the same chunks at the head of the list, it will always find the last-but-one chunk (we don't store the last 4 bytes) for a long matching string during the match phase in encoding (pgrb_delta_encode). We can improve this either by storing the same chunks at the end of the list instead of at the head, or by trying to find a good_match technique as used in the LZ algorithm. The good_match technique can have overhead in some cases when there is actually no match.
2. Another optimization we can do in pgrb_find_match() is that currently, if it doesn't find the first chunk (the chunk got by hash index) matching, it continues to find the match in other chunks. I am not sure there is any benefit in searching other chunks if the first one does not match.
3. We can move the code from pg_lzcompress.c to some new file pg_rbcompress.c; if we move it, then we need to either duplicate some common macros like pglz_out_tag, or keep them common but perhaps change the names.
4. Decide on min and max chunk size (currently kept as 2 and 4 respectively). The point to consider is that bigger chunk sizes can save us CPU cycles but give less reduction in WAL; on the other side, smaller chunk sizes can give better WAL reduction but consume more CPU cycles.
5. Kept a GUC variable 'wal_update_compression_ratio' for test purposes; we can remove it before commit.
7. Docs need to be updated; tab completion needs some work.
8. We can extend Alter Table to set the compress option for a table.

Thoughts/Suggestions?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment: pgrb_delta_encoding_v3.patch
Re: [HACKERS] Standalone synchronous master
On 01/09/2014 05:09 AM, Robert Treat wrote:
> On Wed, Jan 8, 2014 at 6:15 PM, Josh Berkus <j...@agliodbs.com> wrote:
>> Stephen,
>>> I'm aware, my point was simply that we should state, up-front in 25.2.7.3 *and* where we document synchronous_standby_names, that it requires at least three servers to be involved to be a workable solution.
>>
>> It's a workable solution with 2 servers. That's a low-availability, high-integrity solution; the user has chosen to double their risk of not accepting writes against never losing a write. That's a perfectly valid configuration, and I believe that NTT runs several applications this way.
>>
>> In fact, that can already be looked at as a kind of auto-degrade mode: if there aren't two nodes, then the database goes read-only.
>>
>> Might I also point out that transactions are synchronous or not individually? The sensible configuration is for only the important writes to be synchronous -- in which case auto-degrade makes even less sense.
>>
>> I really think that demand for auto-degrade is coming from users who don't know what sync rep is for in the first place. The fact that other vendors are offering auto-degrade as a feature instead of the ginormous foot-gun it is adds to the confusion, but we can't help that.
>
> I think the problem here is that we tend to have a limited view of the right way to use sync rep. If I have 5 nodes, and I set 1 synchronous and the other 3 asynchronous, I've set up a known successor in the event that the leader fails. But there is no guarantee that the synchronous replica actually is ahead of the async ones. In this scenario though, if the successor fails, you actually probably want to keep accepting writes, since you weren't using synchronous for durability but for operational simplicity.

I suspect there are probably other scenarios where users are willing to trade latency for improved and/or directed durability, but not at the expense of availability, don't you?

Cheers

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
Re: [HACKERS] Standalone synchronous master
On 01/09/2014 12:05 AM, Stephen Frost wrote:
> * Andres Freund (and...@2ndquadrant.com) wrote:
>> On 2014-01-08 17:56:37 -0500, Stephen Frost wrote:
>>> * Andres Freund (and...@2ndquadrant.com) wrote:
>>>> That's why you should configure a second standby as another (candidate) synchronous replica, also listed in synchronous_standby_names.
>>>
>>> Perhaps we should stress in the docs that this is, in fact, the *only* reasonable mode in which to run with sync rep on? Where there are multiple replicas, because otherwise Drake is correct that you'll just end up having both nodes go offline if the slave fails.
>>
>> Which, as it happens, is actually documented.
>
> I'm aware; my point was simply that we should state, up-front in 25.2.7.3 *and* where we document synchronous_standby_names, that it requires at least three servers to be involved to be a workable solution. Perhaps we should even log a warning if only one value is found in synchronous_standby_names...

You can have only one name in synchronous_standby_names and have multiple slaves connecting with that name.

Also, I can attest that I have had clients who want exactly that - a system stop until admin intervention in case of a designated sync standby failing. And they actually run more than one standby; they just want to make sure that sync rep to the 2nd data center always happens.

Cheers

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
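For instance (names are hypothetical), one entry in synchronous_standby_names can cover a whole group of standbys that connect with the same application_name:

```
# postgresql.conf on the master -- sketch
synchronous_standby_names = 'dc2'
# Every standby whose primary_conninfo sets application_name=dc2 is a
# candidate synchronous replica; one of them is chosen as the current
# synchronous standby, and another takes over if it disconnects.
```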
Re: [HACKERS] Standalone synchronous master
On 01/08/2014 11:49 PM, Tom Lane wrote:
> Joshua D. Drake <j...@commandprompt.com> writes:
>> On 01/08/2014 01:55 PM, Tom Lane wrote:
>>> Sync mode is about providing a guarantee that the data exists on more than one server *before* we tell the client it's committed. If you don't need that guarantee, you shouldn't be using sync mode. If you do need it, it's not clear to me why you'd suddenly not need it the moment the going actually gets tough.
>>
>> As I understand it, what is being suggested is that if a subscriber or target goes down, then the master will just sit there and wait. When I read that, I read that the master will no longer process write transactions. If I am wrong in that understanding, then cool. If I am not, then that is a serious problem with a production scenario. There is an expectation that a master will continue to function if the target is down, synchronous or not.
>
> Then you don't understand the point of sync mode, and you shouldn't be using it. The point is *exactly* to refuse to commit transactions unless we can guarantee the data's been replicated.

For a single-host scenario, this would be similar to asking for a mode which turns fsync=off in case of disk failure :)

Cheers

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
Re: [HACKERS] Standalone synchronous master
On 01/09/2014 01:57 PM, MauMau wrote:
> From: Andres Freund <and...@2ndquadrant.com>
>> On 2014-01-08 14:42:37 -0800, Joshua D. Drake wrote:
>>> If we have the following: db0 -> db1:down
>>> Using the model (as I understand it) that is being discussed, we have increased our failure rate, because the moment db1 is down we also lose db0. The node db0 may be up, but if it isn't going to process transactions it is useless. I can tell you that I have exactly 0 customers who would want that model, because a single node failure would cause a double node failure.
>>
>> That's why you should configure a second standby as another (candidate) synchronous replica, also listed in synchronous_standby_names.
>
> Let me ask a (probably) stupid question. How is sync rep different from RAID-1? When I first saw sync rep, I expected that it would provide the same guarantees as RAID-1 in terms of durability (data is always mirrored on two servers) and availability (if one server goes down, another server continues full service).

What you describe is most like async rep. Sync rep makes sure that data is always replicated before confirming to the writer.

Cheers

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
Re: [HACKERS] Standalone synchronous master
On 01/09/2014 02:01 AM, Jim Nasby wrote: On 1/8/14, 6:05 PM, Tom Lane wrote: Josh Berkusj...@agliodbs.com writes: On 01/08/2014 03:27 PM, Tom Lane wrote: What we lack, and should work on, is a way for sync mode to have M larger than one. AFAICS, right now we'll report commit as soon as there's one up-to-date replica, and some high-reliability cases are going to want more. Sync N times is really just a guarantee against data loss as long as you lose N-1 servers or fewer. And it becomes an even lower-availability solution if you don't have at least N+1 replicas. For that reason, I'd like to see some realistic actual user demand before we take the idea seriously. Sure. I wasn't volunteering to implement it, just saying that what we've got now is not designed to guarantee data survival across failure of more than one server. Changing things around the margins isn't going to improve such scenarios very much. It struck me after re-reading your example scenario that the most likely way to figure out what you had left would be to see if some additional system (think Nagios monitor, or monitors) had records of when the various database servers went down. This might be what you were getting at when you said logging, but the key point is it has to be logging done on an external server that could survive failure of the database server. postmaster.log ain't gonna do it. Yeah, and I think that the logging command that was suggested allows for that *if configured correctly*. *But* for relying on this, we would also need to make logging *synchronous*, which would probably not go down well with many people, as it makes things even more fragile from availability viewpoint (and slower as well). Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Recovery to backup point
From: Heikki Linnakangas hlinnakan...@vmware.com After some refactoring and fixing bugs in the existing code, I came up with the attached patch. I called the option simply recovery_target, with the only allowed value of 'immediate'. IOW, if you want to stop recovery as early as possible, you add recovery_target='immediate' to recovery.conf. Now that we have four different options to set the recovery target with, I rearranged the docs slightly. How does this look to you? I'm almost comfortable with your patch. There are two comments: C1. The following parts seem to be mistakenly taken from my patch. These aren't necessary for your patch, are they?

@@ -6238,6 +6277,10 @@ StartupXLOG(void)
 			ereport(LOG,
 					(errmsg("starting point-in-time recovery to XID %u",
 							recoveryTargetXid)));
+		else if (recoveryTarget == RECOVERY_TARGET_TIME &&
+				 recoveryTargetTime == 0)
+			ereport(LOG,
+					(errmsg("starting point-in-time recovery to backup point")));
 		else if (recoveryTarget == RECOVERY_TARGET_TIME)
 			ereport(LOG,
 					(errmsg("starting point-in-time recovery to %s",

@@ -6971,6 +7017,22 @@ StartupXLOG(void)
 		if (switchedTLI && AllowCascadeReplication())
 			WalSndWakeup();

+		/*
+		 * If we have reached the end of the base backup during recovery
+		 * to the backup point, exit the redo loop.
+		 */
+		if (recoveryTarget == RECOVERY_TARGET_TIME &&
+			recoveryTargetTime == 0 && reachedConsistency)
+		{
+			if (recoveryPauseAtTarget)
+			{
+				SetRecoveryPause(true);
+				recoveryPausesHere();
+			}
+			reachedStopPoint = true;
+			break;
+		}
+
 		/* Exit loop if we reached inclusive recovery target */
 		if (recoveryStopsAfter(record))
 		{

@@ -7116,6 +7178,9 @@ StartupXLOG(void)
 					 "%s transaction %u",
 					 recoveryStopAfter ? "after" : "before",
 					 recoveryStopXid);
+		else if (recoveryTarget == RECOVERY_TARGET_TIME &&
+				 recoveryStopTime == 0)
+			snprintf(reason, sizeof(reason), "at backup point");
 		else if (recoveryTarget == RECOVERY_TARGET_TIME)
 			snprintf(reason, sizeof(reason),
 					 "%s %s\n",

C2.
recovery_target = 'immediate' sounds less intuitive than my suggestion recovery_target_time = 'backup_point', at least for those who want to recover to the backup point. Although I don't have a good naming sense in English, the value should be a noun, not an adjective like immediate, because the value specifies the target (point) of recovery. Related to C2, I wonder if users would understand the following part in the documentation:

+     This parameter specifies that recovery should end as soon as a
+     consistent state is reached, i.e. as early as possible.

The subsequent sentence clarifies the use case for recovery from an online backup, but in what use cases do they specify this parameter? For example, when do the users face the following situation? I was thinking that you have a warm standby server, and you decide to stop using it as a warm standby, and promote it. You'd do that by stopping it, modifying recovery.conf to remove standby_mode, setting a recovery target, and then restarting. Regards MauMau
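For concreteness, the option under discussion would be used like this in recovery.conf (a sketch; the restore_command path is made up):

```
# recovery.conf -- stop recovery as soon as a consistent state
# is reached, i.e. at the end of the base backup
restore_command = 'cp /mnt/server/archivedir/%f %p'
recovery_target = 'immediate'
# optionally pause instead of promoting when the target is reached:
# pause_at_recovery_target = true
```

This contrasts with the other target options (recovery_target_time, recovery_target_xid, recovery_target_name), all of which name a point to recover *up to* rather than asking for the earliest consistent stop.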
Re: [HACKERS] [bug fix] ECPG app crashes due to SIGBUS on SPARC Solaris
On Sun, Jan 05, 2014 at 03:42:42PM +0900, MauMau wrote: I ran the ECPG regression test with the unpatched 64-bit PostgreSQL 9.2.4 on SPARC Solaris, and it succeeded (all 54 tests passed). For ... Thanks a lot. Patch applied to HEAD and all the backbranches. Will push shortly. Michael -- Michael Meskes Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org) Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org Jabber: michael.meskes at gmail dot com VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On 15 December 2013 01:57, Tom Lane t...@sss.pgh.pa.us wrote: Josh Berkus j...@agliodbs.com writes: I think even the FLOAT case deserves some consideration. What's the worst-case drift? Complete loss of all significant digits. The case I was considering earlier of single-row windows could be made safe (I think) if we apply the negative transition function first, before incorporating the new row(s). Then for example if you've got float8 1e20 followed by 1, you compute (1e20 - 1e20) + 1 and get the right answer. It's not so good with two-row windows though:

	Table	correct sum of		negative-transition
		this + next value	result
	1e20	1e20			1e20 + 1 = 1e20
	1	1			1e20 - 1e20 + 0 = 0
	0	0

In general, folks who do aggregate operations on FLOATs aren't expecting an exact answer, or one which is consistent beyond a certain number of significant digits. Au contraire. People who know what they're doing expect the results to be what an IEEE float arithmetic unit would produce for the given calculation. They know how the roundoff error ought to behave, and they will not thank us for doing a calculation that's not the one specified. I will grant you that there are plenty of clueless people out there who *don't* know this, but they shouldn't be using float arithmetic anyway. And Dave is right: how many bug reports would we get about "NUMERIC is fast, but FLOAT is slow"? I've said this before, but: we can make it arbitrarily fast if we don't have to get the right answer. I'd rather get "it's slow" complaints than "this is the wrong answer" complaints. Hi, Reading over this, I realised that there is a problem with NaN handling --- once the state becomes NaN, it can never recover.
So the results using the inverse transition function don't match HEAD in cases like this:

create table t(a int, b numeric);
insert into t values(1,1),(2,2),(3,'NaN'),(4,3),(5,4);
select a, b, sum(b) over(order by a rows between 1 preceding and current row) from t;

which in HEAD produces:

 a |  b  | sum
---+-----+-----
 1 |   1 |   1
 2 |   2 |   3
 3 | NaN | NaN
 4 |   3 | NaN
 5 |   4 |   7
(5 rows)

but with this patch produces:

 a |  b  | sum
---+-----+-----
 1 |   1 |   1
 2 |   2 |   3
 3 | NaN | NaN
 4 |   3 | NaN
 5 |   4 | NaN
(5 rows)

Regards, Dean
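The stickiness Dean describes is easy to reproduce outside the database. Below is a minimal Python sketch (an illustration of the inverse-transition technique, not PostgreSQL code) of a moving sum maintained incrementally; it shows both the NaN poisoning and, for floats, the roundoff hazard from Tom's example:

```python
import math

def moving_sums(values, window):
    """Moving sum over `window` rows, maintained incrementally:
    add the new value, then subtract the value leaving the window
    (the 'inverse transition')."""
    state = 0.0
    out = []
    for i, v in enumerate(values):
        state += v
        if i >= window:
            state -= values[i - window]  # inverse transition
        out.append(state)
    return out

# Once a NaN enters the state, subtracting it back out yields NaN,
# so the sum never recovers: the last result is NaN where it should be 7.
print(moving_sums([1, 2, float('nan'), 3, 4], 2))

# Float roundoff: after 1e20 + 1 collapses to 1e20, removing the 1e20
# leaves 0.0 where the true sum of the remaining window is 1.0.
print(moving_sums([1e20, 1.0, 0.0], 2))
```

Recomputing each window from scratch (the "hard way") avoids both problems at the cost of O(window) work per row, which is exactly the trade-off the patch is trying to escape.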
Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.
Hi, is somebody reading this thread? This problem still remains on REL9_3_STABLE. Many users would face this problem, so we should resolve it in the next release. I think his patch is a reasonable fix for this problem. Please check this again. regards, -- Tomonari Katsumata 2013/12/12 Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp Hello, we happened to see a server crash on archive recovery under a certain condition. After the TLI was incremented, there can be a case where the WAL file for the older timeline is archived but the one for the same segment id on the newer timeline is not. Archive recovery fails in that case with a PANIC error like the following:

| PANIC: record with zero length at 0/1820D40

A replay script is attached. This issue occurred on 9.4dev and 9.3.2, and not on 9.2.6 and 9.1.11. The latter two search pg_xlog for the TLI before trying the archive for older TLIs. This occurs while fetching the checkpoint redo record in archive recovery:

if (checkPoint.redo < RecPtr)
{
	/* back up to find the record */
	record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

And it is caused by the segment file for the older timeline in the archive directory being preferred to that for the newer timeline in pg_xlog. Looking into pg_xlog before trying the older TLIs in the archive, as 9.2 and earlier do, fixes this issue. The attached patch is one possible solution for 9.4dev. Attached files are:

- recvtest.sh: Replay script. Steps 1 and 2 set up the condition and step 3 triggers the issue.
- archrecvfix_20131212.patch: The patch that fixes the issue. With it, archive recovery reads pg_xlog before trying an older TLI in the archive, similarly to 9.1 and earlier.

regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] Standalone synchronous master
From: Hannu Krosing ha...@2ndquadrant.com On 01/09/2014 01:57 PM, MauMau wrote: Let me ask a (probably) stupid question. How is sync rep different from RAID-1? When I first saw sync rep, I expected that it would provide the same guarantees as RAID-1 in terms of durability (data is always mirrored on two servers) and availability (if one server goes down, the other server continues full service). What you describe is more like async rep. Sync rep makes sure that data is always replicated before confirming to the writer. Really? Is RAID-1 async? Regards MauMau
Re: [HACKERS] WIP patch (v2) for updatable security barrier views
Dean Rasheed dean.a.rash...@gmail.com writes: My first thought was that it should just preprocess any security barrier quals in subquery_planner() in the same way as other quals are preprocessed. But thinking about it further, those quals are destined to become the quals of subqueries in the range table, so we don't actually want to preprocess them at that stage --- that will happen later when the new subquery is planned by recursion back into subquery_planner(). So I think the right answer is to make adjust_appendrel_attrs() handle recursion into sublink subqueries. TBH, this sounds like doubling down on a wrong design choice. I see no good reason that updatable security views should require any fundamental rearrangements of the order of operations in the planner; and I doubt that this is the last bug you'll have if you insist on doing that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.
On Thu, Dec 12, 2013 at 11:00 AM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Hello, we happened to see a server crash on archive recovery under a certain condition. After the TLI was incremented, there can be a case where the WAL file for the older timeline is archived but the one for the same segment id on the newer timeline is not. Archive recovery fails in that case with a PANIC error like the following:

| PANIC: record with zero length at 0/1820D40

A replay script is attached. This issue occurred on 9.4dev and 9.3.2, and not on 9.2.6 and 9.1.11. The latter two search pg_xlog for the TLI before trying the archive for older TLIs. This occurs while fetching the checkpoint redo record in archive recovery:

if (checkPoint.redo < RecPtr)
{
	/* back up to find the record */
	record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

And it is caused by the segment file for the older timeline in the archive directory being preferred to that for the newer timeline in pg_xlog. Looking into pg_xlog before trying the older TLIs in the archive, as 9.2 and earlier do, fixes this issue. The attached patch is one possible solution for 9.4dev. Attached files are:

- recvtest.sh: Replay script. Steps 1 and 2 set up the condition and step 3 triggers the issue.
- archrecvfix_20131212.patch: The patch that fixes the issue. With it, archive recovery reads pg_xlog before trying an older TLI in the archive, similarly to 9.1 and earlier.

regards, Horiguchi-san, I wonder if the following commit in the 9.2 branch was to address a similar/same problem?

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4aed94f1660fb55bc825bf7f3135379dab28eb55

-- Amit Langote
Re: [HACKERS] Failed assertion root-hasLateralRTEs on initsplan.c
Robert Haas robertmh...@gmail.com writes: On Tue, Jan 7, 2014 at 1:15 PM, Tom Lane t...@sss.pgh.pa.us wrote: The next question is if we should allow it with LATERAL. That would essentially be treating subscriber as having implicitly appeared at the start of the FROM list, which I guess is all right ... but does anyone want to argue against it? I seem to recall some old discussions about allowing the update target to be explicitly shown in FROM, in case you wanted say to left join it against something else. Allowing this implicit appearance might limit our options if we ever get around to trying to do that. On the other hand, those discussions were a long time back, so maybe it'll never happen anyway. I still think that would be a good thing to do, but I don't see a problem. The way I imagine it would work is: if the alias used for the update target also appears in the FROM clause, then we treat them as the same thing (after checking that they refer to the same table in both cases). Otherwise, we add the update target as an additional from-list item. Um, well, no; this does make it harder. Consider update t1 ... from lateral (select...) ss join (t1 left join ...) You propose that we identify t1 in the sub-JOIN clause with the target table. What if we have already resolved some outer references in subselect ss as belonging to t1? Now we have an illegal reference structure in the FROM clause, which is likely to lead to all sorts of grief. I'm sure we could forbid this combination of features, with some klugy parse-time check or other, but it feels like we started from wrong premises somewhere. It might be better if we simply didn't allow lateral references to the target table for now. We could introduce them in combination with the other feature, in which case we could say that the lateral reference has to be to an explicit reference to the target table in FROM, ie, if you want a lateral reference to t1 in ss you must write update t1 ... 
from t1 join lateral (select...) ss; The fly in the ointment is that we've already shipped a couple of 9.3.x releases that allowed lateral references to the target table. Even though this wasn't suggested or documented anywhere, somebody might be relying on it already. I'm inclined though to pull it back anyway, now that I've thought about it some more. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Standalone synchronous master
On 01/09/2014 04:15 PM, MauMau wrote: From: Hannu Krosing ha...@2ndquadrant.com On 01/09/2014 01:57 PM, MauMau wrote: Let me ask a (probably) stupid question. How is the sync rep different from RAID-1? When I first saw sync rep, I expected that it would provide the same guarantees as RAID-1 in terms of durability (data is always mirrored on two servers) and availability (if one server goes down, another server continues full service). What you describe is most like A-sync rep. Sync rep makes sure that data is always replicated before confirming to writer. Really? RAID-1 is a-sync? Not exactly, as there is no master just controller writing to two equal disks. But having a degraded mode makes it more like async - it continues even with single disk and syncs later if and when the 2nd disk comes back. Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
Dean Rasheed dean.a.rash...@gmail.com writes: Reading over this, I realised that there is a problem with NaN handling --- once the state becomes NaN, it can never recover. So the results using the inverse transition function don't match HEAD in cases like this: Ouch! That takes out numeric, float4, and float8 in one fell swoop. Given the relative infrequency of NaNs in most data, it seems like it might still be possible to get a speedup if we could use inverse transitions until we hit a NaN, then do it the hard way until the NaN is outside the window, then go back to inverse transitions. I'm not sure though if this is at all practical from an implementation standpoint. We certainly don't want the core code knowing about anything as datatype-specific as a NaN, but maybe the inverse transition function could have an API that allows reporting "I can't do it here, fall back to the hard way." regards, tom lane
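Tom's "report failure, fall back to the hard way" API can be sketched in a few lines of Python (an illustration of the idea only, not of any actual patch): the inverse transition returns None when it cannot safely invert, and the caller recomputes the window from scratch.

```python
import math

def inverse(state, v):
    """Inverse transition for sum(); None means 'cannot invert here'."""
    if math.isnan(state) or math.isnan(v):
        return None
    return state - v

def moving_sums(values, window):
    out, state = [], 0.0
    for i, v in enumerate(values):
        state += v
        if i >= window:
            new_state = inverse(state, values[i - window])
            if new_state is None:
                # the hard way: rescan the rows still in the window
                state = sum(values[i - window + 1 : i + 1])
            else:
                state = new_state
        out.append(state)
    return out

# NaN rows still yield NaN while a NaN is inside the window, but once
# it moves out, the rescan restores the exact running sum.
print(moving_sums([1, 2, float('nan'), 3, 4], 2))
```

On Dean's data this reproduces the HEAD results (1, 3, NaN, NaN, 7), falling back to an O(window) rescan only while a NaN is in scope.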
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan 9, 2014, at 17:15, Tom Lane t...@sss.pgh.pa.us wrote: Dean Rasheed dean.a.rash...@gmail.com writes: Reading over this, I realised that there is a problem with NaN handling --- once the state becomes NaN, it can never recover. So the results using the inverse transition function don't match HEAD in cases like this: Ouch! That takes out numeric, float4, and float8 in one fell swoop. For numeric, it seems that this could be overcome by having the state be a pair (s numeric, n numeric). s would track the sum of the non-NaN summands, and n would track the number of NaN summands. The final function would return NaN if n > 0, and s otherwise. The pair could be represented as a value of type numeric[] to avoid having to invent a new type for this. For float4 and float8, wasn't the consensus that the potential lossy-ness of addition makes this impossible anyway, even without the NaN issue? But... Given the relative infrequency of NaNs in most data, it seems like it might still be possible to get a speedup if we could use inverse transitions until we hit a NaN, then do it the hard way until the NaN is outside the window, then go back to inverse transitions. I'm not sure though if this is at all practical from an implementation standpoint. We certainly don't want the core code knowing about anything as datatype-specific as a NaN, but maybe the inverse transition function could have an API that allows reporting "I can't do it here, fall back to the hard way." that sounds like it might be possible to make things work for float4 and float8 after all, if we can determine whether a particular addition was lossy or not. best regards, Florian Pflug
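Florian's (s, n) pair can likewise be sketched in Python (again an illustration of the idea, not the patch itself): NaN inputs are counted rather than folded into the sum, so removing one "un-poisons" the state.

```python
import math

class NanAwareSum:
    """Sum whose state is a pair (s, n): s is the sum of the non-NaN
    summands, n counts the NaN summands currently in the window."""
    def __init__(self):
        self.s, self.n = 0.0, 0
    def add(self, v):           # forward transition
        if math.isnan(v):
            self.n += 1
        else:
            self.s += v
    def remove(self, v):        # inverse transition
        if math.isnan(v):
            self.n -= 1
        else:
            self.s -= v
    def result(self):           # final function: NaN wins while present
        return float('nan') if self.n > 0 else self.s

# Reproduce Dean's window (ROWS BETWEEN 1 PRECEDING AND CURRENT ROW):
agg, vals, out = NanAwareSum(), [1, 2, float('nan'), 3, 4], []
for i, v in enumerate(vals):
    agg.add(v)
    if i >= 2:
        agg.remove(vals[i - 2])
    out.append(agg.result())
print(out)  # final entry is 7.0 once the NaN has left the window
```

In the actual proposal the pair would live in a numeric[] state value, with the final function doing the n > 0 check.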
Re: [HACKERS] array_length(anyarray)
On Jan9, 2014, at 14:57 , Dean Rasheed dean.a.rash...@gmail.com wrote: On 19 December 2013 08:05, Pavel Stehule pavel.steh...@gmail.com wrote: length should be irrelevant to fact so array starts from 1, 0 or anything else Yes, this should just return the number of elements, and 0 for an empty array. +1. Anything that complains about arrays whose lower bound isn't 1 really needs a *way* less generic name than array_length(). How it should behave for multi-dimensional arrays is less clear, but I'd argue that it should return the total number of elements, i.e. cardinality('{{1,2},{3,4}}'::int[][]) = 4. That would make it consistent with the choices we've already made for unnest() and ordinality: - cardinality(foo) = (select count(*) from unnest(foo)). - unnest with ordinality would always result in ordinals in the range [1, cardinality]. +1 best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS]
On Tue, Jan 7, 2014 at 10:55 PM, Dilip kumar dilip.ku...@huawei.com wrote: Below attached patch is implementing following todo item.. machine-readable pg_controldata? http://www.postgresql.org/message-id/4b901d73.8030...@agliodbs.com Possible approaches: 1. Implement as backend function and provide a view to user. I think this would be useful. - But In this approach user can only get this information when server is running. That is true, but we also have the command-line tool for when it isn't. 2. Extend pg_controldata tool to provide value of an individual attribute. A first draft version of the patch is attached in the mail, implemented using approach 2. I think this is really ugly, and I don't see what it accomplishes. If the user wants only one setting from pg_controldata, they can just grep for the line that contains that value. You mention that the patch skips printing the name of the field, avoiding parsing difficulties for the user, but the parsing here is trivial: just skip everything up to the first colon, plus any subsequent whitespace. It's worth keeping in mind that the information in pg_controldata is, by and large, not something we expect users to need all the time. Are there use cases for wanting to look at it? Sure. But not a ton of them. If we thought that everyone who works with PostgreSQL would need to write a script to fetch, say, the catalog version number, then it might be worth having pg_controldata --catalog-version to make that easy so that everyone doesn't have to write pg_controldata | grep '^Catalog version' | sed 's/.*: *//' , but considering that it's only something that comes up rarely, I don't really think it's worth the extra code and documentation to add an option for it. And similarly for the other pg_controldata fields. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
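The parsing Robert describes as trivial really is. A Python sketch (the sample lines and values are illustrative, not the full pg_controldata output):

```python
sample = """pg_control version number:            937
Catalog version number:               201306121
Database cluster state:               in production"""

def parse_controldata(text):
    """Split each pg_controldata line at the first colon and strip
    the whitespace padding, as suggested upthread."""
    fields = {}
    for line in text.splitlines():
        name, _, value = line.partition(':')
        fields[name.strip()] = value.strip()
    return fields

print(parse_controldata(sample)['Catalog version number'])  # -> 201306121
```

This is the same transformation as the grep/sed pipeline in the email, just producing a dict of all fields at once.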
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs si...@2ndquadrant.com wrote: VACUUM cleans up blocks, which is nice because it happens offline in a lazy manner. We also make SELECT clean up blocks as it goes. That is useful in OLTP workloads, but it means that large SQL queries and pg_dump effectively do much the same work as VACUUM, generating huge amounts of I/O and WAL on the master, the cost and annoyance of which is experienced directly by the user. That is avoided on standbys. On a pgbench workload, though, essentially all page cleanup happens as a result of HOT cleanups, like 99.9%. It might be OK to have that happen for write operations, but it would be a performance disaster if updates didn't try to HOT-prune. Our usual argument for doing HOT pruning even on SELECT cleanups is that not doing so pessimizes repeated scans, but there are clearly cases that end up worse off as a result of that decision. I'm not entirely wild about adding a parameter in this area because it seems that we're increasingly choosing to further expose what arguably ought to be internal implementation details. The recent wal_log_hints parameter is another recent example of this that I'm not thrilled with, but in that case, as in this one, I can see the value of it. Still, I think it'd be loads better to restrict what you're talking about here to the SELECT-only case; I have a strong feeling that this will be a disaster on write workloads. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] array_length(anyarray)
On 1/9/14 5:44 PM, Florian Pflug wrote: On Jan9, 2014, at 14:57 , Dean Rasheed dean.a.rash...@gmail.com wrote: On 19 December 2013 08:05, Pavel Stehule pavel.steh...@gmail.com wrote: length should be irrelevant to fact so array starts from 1, 0 or anything else Yes, this should just return the number of elements, and 0 for an empty array. +1. Anything that complains about arrays whose lower bound isn't 1 really needs a *way* less generic name than array_length(). Problem is, if you're operating on an array which could have a lower bound that isn't 1, why would you look at the length in the first place? You can't access any elements by index, you'd need to look at array_lower(). You can't iterate over the array by index, you'd need to do array_lower() .. array_lower() + array_length(), which doesn't make sense. And then there's the myriad of stuff you can do with unnest() without actually having to look at the length. Same goes for multi-dimensional arrays: you have even less things you can do there with only a length. So if we give up these constraints, we also make this function completely useless. Regards, Marko Tiikkaja -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
Florian Pflug f...@phlo.org writes: For float4 and float8, wasn't the consensus that the potential lossy-ness of addition makes this impossible anyway, even without the NaN issue? But... Well, that was my opinion, I'm not sure if it was consensus ;-). But NaN is an orthogonal problem I think. I'm not sure whether it has analogues in other data types. regards, tom lane
Re: [HACKERS] Standalone synchronous master
On Thu, Jan 9, 2014 at 04:55:22PM +0100, Hannu Krosing wrote: On 01/09/2014 04:15 PM, MauMau wrote: From: Hannu Krosing ha...@2ndquadrant.com On 01/09/2014 01:57 PM, MauMau wrote: Let me ask a (probably) stupid question. How is the sync rep different from RAID-1? When I first saw sync rep, I expected that it would provide the same guarantees as RAID-1 in terms of durability (data is always mirrored on two servers) and availability (if one server goes down, another server continues full service). What you describe is most like A-sync rep. Sync rep makes sure that data is always replicated before confirming to writer. Really? RAID-1 is a-sync? Not exactly, as there is no master just controller writing to two equal disks. But having a degraded mode makes it more like async - it continues even with single disk and syncs later if and when the 2nd disk comes back. I think RAID-1 is a very good comparison because it is successful technology and has similar issues. RAID-1 is like Postgres synchronous_standby_names mode in the sense that the RAID-1 controller will not return success until writes have happened on both mirrors, but it is unlike synchronous_standby_names in that it will degrade and continue writes even when it can't write to both mirrors. What is being discussed is to allow the RAID-1 behavior in Postgres. One issue that came up in discussions is the insufficiency of writing a degrade notice in a server log file because the log file isn't durable from server failures, meaning you don't know if a fail-over to the slave lost commits. The degrade message has to be stored durably against a server failure, e.g. on a pager, probably using a command like we do for archive_command, and has to return success before the server continues in degrade mode. I assume degraded RAID-1 controllers inform administrators in the same way. 
I think RAID-1 controllers operate successfully with this behavior because they are seen as durable and authoritative in reporting the status of mirrors, while with Postgres, there is no central authority that can report that degrade status of master/slaves. Another concern with degrade mode is that once Postgres enters degrade mode, how does it get back to synchronous_standby_names mode? We could have each commit wait for the timeout before continuing, but that is going to make degrade mode unusably slow. Would there be an admin command? With a timeout to force degrade mode, a temporary network outage could cause degrade mode, while our current behavior would recover synchronous_standby_names mode once the network was repaired. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On Wed, Jan 8, 2014 at 2:39 PM, knizhnik knizh...@garret.ru wrote: I wonder what the intended use case of dynamic shared memory is. Is it primarily oriented toward PostgreSQL extensions, or will it also be used in the PostgreSQL core? My main motivation is that I want to use it to support parallel query. There is unfortunately quite a bit of work left to be done before we can make that a reality, but that's the goal. Maybe I am wrong, but I do not see any reason for creating multiple DSM segments from the same extension. Right. And the total number of DSM segments is expected to be not very large (< 10). The same is true for the synchronization primitives (LWLocks, for example) needed to synchronize access to these DSM segments. So I am not sure the possibility of placing locks in DSM is really so critical... We could just reserve some space for LWLocks which can be used by extensions, so that LWLockAssign() can be used without RequestAddinLWLocks, or RequestAddinLWLocks can be used not only from a preloaded extension. If you're doing all of this at postmaster startup time, that all works fine. If you want to be able to load up an extension on the fly, then it doesn't. You can only RequestAddinLWLocks() at postmaster start time, not afterwards, so currently any extension that wants to use lwlocks has to be loaded at postmaster startup time, or you're out of luck. Well, technically we reserve something like 3 extra lwlocks that could be assigned later. But relying on those to be available is not very reliable, and also, 3 is not very many, considering that we have something north of 32k core lwlocks in the default configuration. IMHO the main trouble with DSM is the lack of a guarantee that a segment is always mapped at the same virtual address. Without such a guarantee it is not possible to use direct (normal) pointers inside DSM. But there seems to be no reasonable solution. Yeah, that basically sucks. But it's very hard to do any better.
At least on a 64-bit platform, there's an awful lot of address space available, and in theory it ought to be possible to find a portion of that address space that isn't in use by any Postgres process and have all of the backends map the shared memory segment there. But there's no portable way to do that, and it seems like it would require an awful lot of IPC to achieve consensus on where to put a new mapping. On non-Windows platforms, Noah had the idea that we could reserve a large chunk of address space mapped as PROT_NONE and then overwrite it with mappings later as needed. However, I'm not sure how portable that is or whether it'll cause performance consequences (like page table bloat) if the space doesn't end up getting used (or if it does). And unless you have an awful lot of space available, it's hard to be sure that new mappings are going to fit. And then there's Windows. It would be nice to have better operating system support for this. For example, IIUC, 64-bit Linux has 128TB of address space available for user processes. When you clone(), it can either share the entire address space (i.e. it's a thread) or none of it (i.e. it's a process). There's no option to, say, share 64TB and not the other 64TB, which would be ideal for us. We could then map dynamic shared memory segments into the shared portion of the address space and do backend-private allocations in the unshared part. Of course, even if we had that, it wouldn't be portable, so who knows how much good it would do. But it would be awfully nice to have the option. I haven't given up hope that we'll some day find a way to make same-address mappings work, at least on some platforms. But I don't expect it to happen soon. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
Robert Haas robertmh...@gmail.com writes: On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs si...@2ndquadrant.com wrote: We also make SELECT clean up blocks as it goes. That is useful in OLTP workloads, but it means that large SQL queries and pg_dump effectively do much the same work as VACUUM, generating huge amounts of I/O and WAL on the master, the cost and annoyance of which is experienced directly by the user. That is avoided on standbys. On a pgbench workload, though, essentially all page cleanup happens as a result of HOT cleanups, like 99.9%. It might be OK to have that happen for write operations, but it would be a performance disaster if updates didn't try to HOT-prune. Our usual argument for doing HOT pruning even on SELECT cleanups is that not doing so pessimizes repeated scans, but there are clearly cases that end up worse off as a result of that decision. My recollection of the discussion when HOT was developed is that it works that way not because anyone thought it was beneficial, but simply because we didn't see an easy way to know when first fetching a page whether we're going to try to UPDATE some tuple on the page. (And we can't postpone the pruning, because the query will have tuple pointers into the page later.) Maybe we should work a little harder on passing that information down. It seems reasonable to me that SELECTs shouldn't be tasked with doing HOT pruning. I'm not entirely wild about adding a parameter in this area because it seems that we're increasingly choosing to further expose what arguably ought to be internal implementation details. I'm -1 for a parameter as well, but I think that just stopping SELECTs from doing pruning at all might well be a win. It's at least worthy of some investigation. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Thu, Jan 9, 2014 at 12:21 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs si...@2ndquadrant.com wrote: We also make SELECT clean up blocks as it goes. That is useful in OLTP workloads, but it means that large SQL queries and pg_dump effectively do much the same work as VACUUM, generating huge amounts of I/O and WAL on the master, the cost and annoyance of which is experienced directly by the user. That is avoided on standbys. On a pgbench workload, though, essentially all page cleanup happens as a result of HOT cleanups, like 99.9%. It might be OK to have that happen for write operations, but it would be a performance disaster if updates didn't try to HOT-prune. Our usual argument for doing HOT pruning even on SELECT cleanups is that not doing so pessimizes repeated scans, but there are clearly cases that end up worse off as a result of that decision. My recollection of the discussion when HOT was developed is that it works that way not because anyone thought it was beneficial, but simply because we didn't see an easy way to know when first fetching a page whether we're going to try to UPDATE some tuple on the page. (And we can't postpone the pruning, because the query will have tuple pointers into the page later.) Maybe we should work a little harder on passing that information down. It seems reasonable to me that SELECTs shouldn't be tasked with doing HOT pruning. I'm not entirely wild about adding a parameter in this area because it seems that we're increasingly choosing to further expose what arguably ought to be internal implementation details. I'm -1 for a parameter as well, but I think that just stopping SELECTs from doing pruning at all might well be a win. It's at least worthy of some investigation. Unfortunately, there's no categorical answer. 
You can come up with workloads where HOT pruning on selects is a win; just create a bunch of junk and then read the same pages lots of times in a row. And you can also come up with workloads where it's a loss; create a bunch of junk and then read them just once. I don't know how easy it's going to be to set that parameter in a useful way for some particular environment, and I think that's possibly an argument against having it. But the argument that we don't need a parameter because one behavior is best for everyone is not going to fly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] newlines at end of generated SQL
On Wed, Jan 8, 2014 at 10:17 PM, Peter Eisentraut pete...@gmx.net wrote: Is there a reason why the programs in src/bin/scripts all put newlines at the end of the SQL commands they generate? This produces useless empty lines in the server log (and client output, if selected). If you're asking whether you can go ahead and fix that, +1 from me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Standalone synchronous master
On Wed, Jan 8, 2014 at 3:00 PM, Josh Berkus j...@agliodbs.com wrote: On 01/08/2014 01:49 PM, Tom Lane wrote: Josh Berkus j...@agliodbs.com writes: If we really want auto-degrading sync rep, then we'd (at a minimum) need a way to determine *from the replica* whether or not it was in degraded mode when the master died. What good do messages to the master log do you if the master no longer exists? How would it be possible for a replica to know whether the master had committed more transactions while communication was lost, if the master dies without ever restoring communication? It sounds like pie in the sky from here ... Oh, right. Because the main reason for a sync replica degrading is that it's down. In which case it isn't going to record anything. This would still be useful for sync rep candidates, though, and I'll document why below. But first, lemme demolish the case for auto-degrade. So here's the case that we can't possibly solve for auto-degrade. Anyone who wants auto-degrade needs to come up with a solution for this case as a first requirement: It seems like the only deterministically useful thing to do is to send a NOTICE to the *client* that the commit has succeeded, but in degraded mode, so keep your receipts and have your lawyer's number handy. Whether anyone is willing to add code to the client to process that message is doubtful, as well as whether the client will even ever receive it if we are in the middle of a major disruption. But I think there is a good probabilistic justification for an auto-degrade mode. (And really, what else is there? There are never any real guarantees of anything. Maybe none of your replicas ever come back up. Maybe none of your customers do, either.) 1. A data center network/power event starts. 2. The sync replica goes down. 3. A short time later, the master goes down. 4. Data center power is restored. 5. The master is fried and is a permanent loss. The replica is ok, though. 
Question: how does the DBA know whether data has been lost or not? What if he had a way of knowing that some data *has* been lost? What can he do about it? What is the value in knowing it was lost after the fact, but without the ability to do anything about it? But let's say that instead of a permanent loss, the master can be brought back up in a few days after replacing a few components, or in a few weeks after sending the drives out to clean-room data recovery specialists. Writing has already failed over to the replica, because you couldn't wait that long to bring things back up. Once you get your old master back, you can see if transactions have been lost, and if they have been you can dump the tables out to a human-readable format, use PITR and restore a copy of the replica to the point just before the failover (although I'm not really sure exactly how to identify that point) and dump that out, then use 'diff' tools to figure out what changes to the database were lost, consult with the application specialists to figure out what the application was doing that led to those changes (if that is not obvious) and business operations people to figure out how to apply the analogous changes to the top of the database, and the customer service VP or someone to figure out how to retroactively fix transactions that were done after the failover which would have been done differently had the lost transactions not been lost. Or instead of all that, you could look at the recovered data and learn that in fact nothing had been lost, so nothing further needs to be done. If you were running in async replication mode on a busy server, there is a virtual certainty that some transactions have been lost. If you were running in sync mode with the possibility of auto-degrade, it is far from certain. That depends on how long the power event lasted, compared to how long you had the timeout set to.
Or rather than a data-center-wide power spike, what if your master just done fell over with no drama to the rest of the neighborhood? Inspection after the fail-over to the replica shows the RAID controller card failed. There is no reason to think that a RAID controller, in the process of failing, would have caused the replication to kick into degraded mode. You know from the surviving logs that the master spent 60 seconds total in degraded mode over the last 3 months, so there is a 99.999% chance no confirmed transactions were lost. To be conservative, let's drop it to 99.99% because maybe some unknown mechanism did allow a failing RAID controller to blip the network card without leaving any evidence behind. That's a lot better than the chances of lost transactions while in async replication mode, which could be 99.9% in the other direction. Cheers, Jeff
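[Editor's note] The 99.999% figure above is plain interval arithmetic, under the (strong) assumption that the failure instant is uniformly distributed over the observation window: the chance of the failure landing inside a degraded interval is total degraded time divided by window length. A sketch:

```c
/*
 * Probability that no confirmed transactions were lost, assuming the
 * failure instant is uniform over the observation window: the failure
 * must miss every degraded interval, whose total length we know from
 * the surviving logs.
 */
static double
prob_no_loss(double degraded_seconds, double window_seconds)
{
	return 1.0 - degraded_seconds / window_seconds;
}
```

With 60 seconds of degraded mode over roughly 3 months (~90 days, 7,776,000 seconds), this gives about 0.99999, matching the figure quoted above.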
Re: [HACKERS] Add CREATE support to event triggers
On Wed, Jan 8, 2014 at 10:27 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Craig Ringer escribió: Instead, can't we use your already proposed subclause structure? {authorization:{authorization_role:some guy, output:AUTHORIZATION %i{authorization_role}}, if_not_exists: {output: IF NOT EXISTS}, name:some schema, output:CREATE SCHEMA %{if_not_exists} %i{name} %{authorization}} i.e. if_not_exists becomes an object. All clauses are objects, all non-object values are user data. (right?). If the clause is absent, the output key is the empty string. The issue with that (and with your original proposal) is that you can't tell what these clauses are supposed to be if they're not present in the original query. You can't *enable* IF NOT EXISTS without pulling knowledge of that syntax from somewhere else. Depending on the problem you intend to solve there, that might be fine. Hmm. This seems like a reasonable thing to do, except that I would like the output to always be the constant, and have some other way to enable the clause or disable it. With your present boolean: so if_not_exists: {output: IF NOT EXISTS, present: true/false} Why not: if_not_exists: true/false -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] integrate pg_upgrade analyze_new_cluster.sh into vacuumdb
On Wed, Jan 8, 2014 at 10:32 PM, Tom Lane t...@sss.pgh.pa.us wrote: Peter Eisentraut pete...@gmx.net writes: pg_upgrade creates a script analyze_new_cluster.{sh|bat} that runs vacuumdb --analyze-only in three stages with different statistics target settings to get a fresh cluster analyzed faster. I think this behavior is also useful for clusters or databases freshly created by pg_restore or any other loading mechanism, so it's suboptimal to have this constrained to pg_upgrade. Therefore, I suggest to add this functionality into the vacuumdb program. Seems reasonable. There are some details to be considered about how pg_upgrade would call this. For example, would we keep creating the script and just have the script call vacuumdb with the new option, or would we skip the script altogether and just print a message from pg_upgrade? Also, pg_upgrade contains logic to run a vacuum (not only analyze) in the final run when upgrading from PostgreSQL 8.4 to deal with the free space map. Not sure how to adapt that; maybe just keep the script and run a non-analyze vacuum after the analyze. I don't think this vacuumdb feature should deal with any version-conversion issues. So it sounds like the thing to do is keep the wrapper script, which will give us a place to put any such special actions without having to kluge up vacuumdb's behavior. That'll avoid breaking scripts that users might've built for using pg_upgrade, too. I guess I don't see what's wrong with kludging up vacuumdb. It's not like that's a very complicated utility; what will be hurt by a few more options? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan 9, 2014, at 18:09, Tom Lane t...@sss.pgh.pa.us wrote: Florian Pflug f...@phlo.org writes: For float4 and float8, wasn't the consensus that the potential lossy-ness of addition makes this impossible anyway, even without the NaN issue? But... Well, that was my opinion, I'm not sure if it was consensus ;-). I'd say your example showing how it could produce completely bogus results was pretty convincing... But NaN is an orthogonal problem I think. I'm not sure whether it has analogues in other data types. Transfer functions which are partially invertible are not that uncommon, I'd say. Browsing through 9.3's list of aggregate functions, the following come to mind:

max(): Values smaller than the maximum can be removed; removing the current maximum requires a rescan. By remembering the N largest values, the number of required rescans can be reduced, but never fully eliminated. The same works for min().

bool_or(): FALSE can be removed; removing TRUE requires a rescan. Could be made fully invertible by counting the number of TRUE and FALSE values, similar to my suggestion for how to handle NaN for sum(numeric). The same works for bool_and().

bit_or(): Like bool_or(), 0 can be removed; everything else requires a rescan. The same works for bit_and().

Plus, any aggregate with a strict transfer function would be in exactly the same situation regarding NULL as sum(numeric) is regarding NaN. AFAIK we don't have any such aggregate in core, though. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
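[Editor's sketch] The counting idea for making bool_or() fully invertible can be shown as a tiny standalone transition state. This is illustrative only (the struct and function names are made up, and this is not the patch's code): because we track how many TRUE and how many FALSE values are in the window, removing any value, including the only TRUE, never forces a rescan.

```c
#include <stdbool.h>

/*
 * Invertible transition state for a bool_or()-style aggregate:
 * count TRUE and FALSE inputs separately so any input can be
 * removed in O(1) without rescanning the window.
 */
typedef struct BoolOrState
{
	long		ntrue;
	long		nfalse;
} BoolOrState;

static void
bool_or_add(BoolOrState *s, bool v)
{
	if (v)
		s->ntrue++;
	else
		s->nfalse++;
}

static void
bool_or_remove(BoolOrState *s, bool v)
{
	if (v)
		s->ntrue--;
	else
		s->nfalse--;
}

static bool
bool_or_result(const BoolOrState *s)
{
	/* OR is true iff at least one TRUE remains in the window */
	return s->ntrue > 0;
}
```

The same pattern handles the NaN case for sum(numeric) mentioned above: count the NaN inputs, and report NaN while the count is nonzero.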
Re: [HACKERS] integrate pg_upgrade analyze_new_cluster.sh into vacuumdb
Robert Haas robertmh...@gmail.com writes: On Wed, Jan 8, 2014 at 10:32 PM, Tom Lane t...@sss.pgh.pa.us wrote: I don't think this vacuumdb feature should deal with any version-conversion issues. So it sounds like the thing to do is keep the wrapper script, which will give us a place to put any such special actions without having to kluge up vacuumdb's behavior. That'll avoid breaking scripts that users might've built for using pg_upgrade, too. I guess I don't see what's wrong with kludging up vacuumdb. It's not like that's a very complicated utility; what will be hurt by a few more options? Carrying kluges forever, and exposing them to users' view. The particular example Peter gave was only relevant to upgrades from 8.4; why would we be putting code into vacuumdb now for that, and expecting to support it forevermore? What if the code to fix up something doesn't even *work* unless we're updating from version M.N? Putting such code into vacuumdb means you have to make it bulletproof against other invocation circumstances, and document what it does (since it's a user-visible switch), and just in general greatly increases the development overhead. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Standalone synchronous master
On Thu, Jan 9, 2014 at 09:36:47AM -0800, Jeff Janes wrote: Oh, right. Because the main reason for a sync replica degrading is that it's down. In which case it isn't going to record anything. This would still be useful for sync rep candidates, though, and I'll document why below. But first, lemme demolish the case for auto-degrade. So here's the case that we can't possibly solve for auto-degrade. Anyone who wants auto-degrade needs to come up with a solution for this case as a first requirement: It seems like the only deterministically useful thing to do is to send a NOTICE to the *client* that the commit has succeeded, but in degraded mode, so keep your receipts and have your lawyer's number handy. Whether anyone is willing to add code to the client to process that message is doubtful, as well as whether the client will even ever receive it if we are in the middle of a major disruption. I don't think clients are the right place for notification. Clients running on a single server could have fsync=off set by the admin or lying drives and never know it. I can't imagine a client only willing to run if synchronous_standby_names is set. The synchronous slave is something the administrator has set up and is responsible for, so the administrator should be notified. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On Thu, Jan 9, 2014 at 2:22 PM, Robert Haas robertmh...@gmail.com wrote: It would be nice to have better operating system support for this. For example, IIUC, 64-bit Linux has 128TB of address space available for user processes. When you clone(), it can either share the entire address space (i.e. it's a thread) or none of it (i.e. it's a process). There's no option to, say, share 64TB and not the other 64TB, which would be ideal for us. We could then map dynamic shared memory segments into the shared portion of the address space and do backend-private allocations in the unshared part. Of course, even if we had that, it wouldn't be portable, so who knows how much good it would do. But it would be awfully nice to have the option. You can map a segment at fork time, and unmap it after forking. That doesn't really use RAM, since it's supposed to be lazily allocated (it can be forced to be so, I believe, with PROT_NONE and MAP_NORESERVE, but I don't think that's portable). That guarantees it's free. Next, you can map shared memory at explicit addresses (Linux's mmap has support for that, and I seem to recall Windows did too). All you have to do is some book-keeping in shared memory (so all processes can coordinate new mappings). -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] integrate pg_upgrade analyze_new_cluster.sh into vacuumdb
On Thu, Jan 9, 2014 at 12:44 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Wed, Jan 8, 2014 at 10:32 PM, Tom Lane t...@sss.pgh.pa.us wrote: I don't think this vacuumdb feature should deal with any version-conversion issues. So it sounds like the thing to do is keep the wrapper script, which will give us a place to put any such special actions without having to kluge up vacuumdb's behavior. That'll avoid breaking scripts that users might've built for using pg_upgrade, too. I guess I don't see what's wrong with kludging up vacuumdb. It's not like that's a very complicated utility; what will be hurt by a few more options? Carrying kluges forever, and exposing them to users' view. The particular example Peter gave was only relevant to upgrades from 8.4; why would we be putting code into vacuumdb now for that, and expecting to support it forevermore? What if the code to fix up something doesn't even *work* unless we're updating from version M.N? Putting such code into vacuumdb means you have to make it bulletproof against other invocation circumstances, and document what it does (since it's a user-visible switch), and just in general greatly increases the development overhead. I was referring to the analyze-in-stages logic, which is not specific to 8.4. I don't see a reason not to put that into vacuumdb. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add CREATE support to event triggers
Robert Haas escribió: On Wed, Jan 8, 2014 at 10:27 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Hmm. This seems like a reasonable thing to do, except that I would like the output to always be the constant, and have some other way to enable the clause or disable it. With your present boolean: so if_not_exists: {output: IF NOT EXISTS, present: true/false} Why not: if_not_exists: true/false Yeah, that's another option. If we do this, though, the expansion function would have to know that an if_not_exists element expands to IF NOT EXISTS. Maybe that's okay. Right now, the expansion function is pretty stupid, which is nice. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
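[Editor's sketch] The trade-off being debated can be made concrete. Under the bare-boolean form, knowledge of the SQL spelling moves out of the JSON blob and into the expansion function, as in this toy illustration (the function name and the simplifications are mine: no identifier quoting, no %i escapes, a fixed template):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Toy expansion for the "if_not_exists: true/false" form.  Note the
 * expander itself must know that the flag spells "IF NOT EXISTS";
 * with the {output, present} form that knowledge would stay in the
 * JSON blob and the expander could remain "stupid".
 */
static void
expand_create_schema(char *buf, size_t buflen,
					 bool if_not_exists, const char *name)
{
	snprintf(buf, buflen, "CREATE SCHEMA %s%s",
			 if_not_exists ? "IF NOT EXISTS " : "", name);
}
```

This is exactly Alvaro's point: the boolean form is terser for consumers, at the cost of teaching the expansion function one clause spelling per flag.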
Re: [HACKERS] integrate pg_upgrade analyze_new_cluster.sh into vacuumdb
Robert Haas robertmh...@gmail.com writes: I was referring to the analyze-in-stages logic, which is not specific to 8.4. I don't see a reason not to put that into vacuumdb. Right, that's Peter's core proposal, which I agreed with. The issue was what to do with some other steps that pg_upgrade sometimes sticks into the analyze_new_cluster.sh script. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
Robert Haas escribió: Unfortunately, there's no categorical answer. You can come up with workloads where HOT pruning on selects is a win; just create a bunch of junk and then read the same pages lots of times in a row. And you can also come up with workloads where it's a loss; create a bunch of junk and then read them just once. I don't know how easy it's going to be to set that parameter in a useful way for some particular environment, and I think that's possibly an argument against having it. But the argument that we don't need a parameter because one behavior is best for everyone is not going to fly. In the above, there's the underlying assumption that it doesn't matter *what* we do with the page after doing or not doing pruning. But this is not necessarily the case: in the case of an UPDATE, having the space be freed beforehand is beneficial because there's the option of putting the new version of the tuple in the same page, potentially saving lots of I/O (bring up another destination page for the new tuple, write the new tuple there, end up dirtying two pages instead of one). But in a SELECT, the effect is only that you will have to skip fewer dead tuples, which is not as exciting. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Standalone synchronous master
Robert, I think the problem here is that we tend to have a limited view of the right way to use synch rep. If I have 5 nodes, and I set 1 synchronous and the other 3 asynchronous, I've set up a known successor in the event that the leader fails. In this scenario though, if the successor fails, you actually probably want to keep accepting writes; since you weren't using synchronous for durability but for operational simplicity. I suspect there are probably other scenarios where users are willing to trade latency for improved and/or directed durability but not at the expense of availability, don't you? That's a workaround for a completely different limitation though; the inability to designate a specific async replica as first. That is, if there were some way to do so, you would be using that rather than sync rep. Extending the capabilities of that workaround is not something I would gladly do until I had exhausted other options. The other problem is that *many* users think they can get improved availability, consistency AND durability on two nodes somehow, and to heck with the CAP theorem (certain companies are happy to foster this illusion). Having a simple, easily-accessible auto-degrade without treating degrade as a major monitoring event will feed this self-deception. I know I already have to explain the difference between synchronous and simultaneous to practically every one of my clients for whom I set up replication. Realistically, degrade shouldn't be something that happens inside a single PostgreSQL node, either the master or the replica. It should be controlled by some external controller which is capable of deciding on degrade or not based on a more complex set of circumstances (e.g. Is the replica actually down or just slow?). Certainly this is the case with Cassandra, VoltDB, Riak, and the other serious multinode databases. This isn't to say there isn't a lot of confusion around the issue. 
Designing, implementing, and configuring different guarantees in the presence of node failures is a non-trivial problem. Still, I'd prefer to see Postgres head in the direction of providing more options in this area rather than drawing a firm line at being a CP-oriented system. I'm not categorically opposed to having any form of auto-degrade at all; what I'm opposed to is a patch which adds auto-degrade **without adding any additional monitoring or management infrastructure at all**. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Relocation of tablespaces in pg_basebackup
On 01/09/2014 06:58 PM, Steeve Lennmark wrote: This patch adds the ability to relocate tablespaces by adding the command line argument --tablespace (-T) which takes a required argument in the format oid:tablespacedir. After all tablespaces are fetched this code updates the symlink to point to the new tablespace location. I would have loved to be able to pass tablespacename:tablespacedir though, but sadly I wasn't able to figure out how to retrieve that information without creating another connection to the database. This feature would be a nice addition to pg_basebackup, and I agree that it would be preferable to use names instead of oids if possible. This feature might be missing because of some other limitation I fail to see, if so let me know. Please be gentle, this is my first patch ;-) It seems like you have attached the wrong patch. The only attachment I see is 0001-SQL-assertions-prototype.patch. Best regards, Andreas -- Andreas Karlsson -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Relocation of tablespaces in pg_basebackup
On Thu, Jan 9, 2014 at 6:58 PM, Steeve Lennmark stee...@handeldsbanken.se wrote: Currently pg_basebackup is pretty invasive when using tablespaces, at least using the plain format. This since it requires the tablespace to be written to the same location as on the server being backed up. This both breaks backing up locally using -Fp (since the tablespace would be written to the same location) and requires the backup user to have write permissions in locations it shouldn't need to have access to. Yeah, this has been sitting on my TODO for a long time :) Glad to see someone is picking it up. This patch adds the ability to relocate tablespaces by adding the command line argument --tablespace (-T) which takes a required argument in the format oid:tablespacedir. After all tablespaces are fetched this code updates the symlink to point to the new tablespace location. I would have loved to be able to pass tablespacename:tablespacedir though, but sadly I wasn't able to figure out how to retrieve that information without creating another connection to the database. You could also use the format olddir:newdir, because you do know that. It's not the name of the tablespace, but I think it's still more user-friendly than using the oid. This feature might be missing because of some other limitation I fail to see, if so let me know. Please be gentle, this is my first patch ;-) Nope, I think it's just been limited on time. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
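[Editor's sketch] Either spelling (oid:tablespacedir or olddir:newdir) needs the same argument parsing. A minimal sketch of that split, with hypothetical names; a real implementation would also have to cope with Windows paths, where a drive letter ("c:\...") adds its own colon:

```c
#include <string.h>

/*
 * Split an "oldpart:newpart" -T argument at the first colon.
 * Returns 0 on success, -1 if the argument is malformed or a part
 * would not fit in the caller's buffers of size partlen.
 */
static int
split_tablespace_arg(const char *arg, char *oldpart, char *newpart,
					 size_t partlen)
{
	const char *colon = strchr(arg, ':');

	if (colon == NULL || colon == arg || colon[1] == '\0')
		return -1;				/* need non-empty text on both sides */
	if ((size_t) (colon - arg) >= partlen || strlen(colon + 1) >= partlen)
		return -1;
	memcpy(oldpart, arg, colon - arg);
	oldpart[colon - arg] = '\0';
	strcpy(newpart, colon + 1);
	return 0;
}
```

With the olddir:newdir spelling Magnus suggests, the first part would then be matched against the symlink targets fetched from the server instead of interpreted as an OID.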
Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.
On 12/12/2013 04:00 AM, Kyotaro HORIGUCHI wrote: Hello, we happened to see a server crash on archive recovery under some condition. After TLI was incremented, there should be the case that the WAL file for the older timeline is archived but not that of the same segment id for the newer timeline. Archive recovery should fail for that case with a PANIC error like the following, | PANIC: record with zero length at 0/1820D40 Replay script is attached. This issue occurred for 9.4dev and 9.3.2, and not for 9.2.6 and 9.1.11. The latter search pg_xlog for the TLI before trying the archive for older TLIs. This occurs while fetching the checkpoint redo record in archive recovery:

if (checkPoint.redo < RecPtr)
{
	/* back up to find the record */
	record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

Hmm. After running the script, pg_controldata says: Latest checkpoint location: 0/290 Prior checkpoint location: 0/290 Latest checkpoint's REDO location: 0/290 The PANIC is actually coming from here:

/*
 * Re-fetch the last valid or last applied record, so we can identify the
 * exact endpoint of what we consider the valid portion of WAL.
 */
record = ReadRecord(xlogreader, LastRec, PANIC, false);

If recovery started from an online checkpoint, then I think you'd get a similar PANIC from the code you quoted. And this is caused by the fact that the segment file for the older timeline in the archive directory is preferred to that for the newer timeline in pg_xlog. Yep. Looking into pg_xlog before trying the older TLIs in the archive, like 9.2 and earlier, fixes this issue. The attached patch is one possible solution for 9.4dev. Attached files are: - recvtest.sh: Replay script. Steps 1 and 2 make the condition and step 3 causes the issue. - archrecvfix_20131212.patch: The patch fixes the issue. With this patch, archive recovery reads pg_xlog before trying an older TLI in the archive, similarly to 9.1 and earlier. Hmm, that seems reasonable at a quick glance. 
I think it also needs a change to the state transition code earlier in the loop, to not move from XLOG_FROM_ARCHIVE to XLOG_FROM_PG_XLOG, if we've already tried reading pg_xlog. Another way to look at this is that it's wrong that we immediately PANIC if we successfully restore a file from archive, but then fail to read the record we're looking for. Instead, we should advance to next state in the state machine, ie. try reading the same record from pg_xlog, and only give up if the read fails from all sources. Yet another way to look at this is that we shouldn't even try to read the file with TLI 1 from the archive, when we know the checkpoint record is on timeline 2. It can't possibly work. I've been thinking for some time that we should stop doing the scan of all possible TLIs, and only try to read the exact file that contains the record we're reading. We have that information now, in the timeline history file, and we already do that during streaming replication. I was afraid of changing the behavior of archive recovery in 9.4, but maybe that's the way to go in the long term. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
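The fallback Heikki describes, trying the same record from every WAL source before giving up, can be sketched as a small state machine. The names below are illustrative only, not the actual xlog.c identifiers.

```c
#include <stdbool.h>

/* Illustrative sketch: advance through WAL sources in order and only
 * report failure (the caller would PANIC) after every source has been
 * tried, instead of PANICing on the first failed archive read. */
typedef enum
{
    WAL_FROM_ARCHIVE,           /* restored via restore_command */
    WAL_FROM_PG_XLOG,           /* local pg_xlog directory */
    WAL_NUM_SOURCES
} WalSource;

/* Returns the first source that can supply the record, or -1 if all
 * sources fail and it is finally time to give up. */
int
first_available_source(const bool available[WAL_NUM_SOURCES])
{
    for (int src = WAL_FROM_ARCHIVE; src < WAL_NUM_SOURCES; src++)
        if (available[src])
            return src;
    return -1;
}
```

In the reported crash, the record is unreadable from the archive (older-TLI file) but present in pg_xlog, so the loop would succeed at the second source rather than PANICing at the first.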
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 9 January 2014 17:21, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs si...@2ndquadrant.com wrote: We also make SELECT clean up blocks as it goes. That is useful in OLTP workloads, but it means that large SQL queries and pg_dump effectively do much the same work as VACUUM, generating huge amounts of I/O and WAL on the master, the cost and annoyance of which is experienced directly by the user. That is avoided on standbys. On a pgbench workload, though, essentially all page cleanup happens as a result of HOT cleanups, like 99.9%. It might be OK to have that happen for write operations, but it would be a performance disaster if updates didn't try to HOT-prune. Our usual argument for doing HOT pruning even on SELECT cleanups is that not doing so pessimizes repeated scans, but there are clearly cases that end up worse off as a result of that decision. My recollection of the discussion when HOT was developed is that it works that way not because anyone thought it was beneficial, but simply because we didn't see an easy way to know when first fetching a page whether we're going to try to UPDATE some tuple on the page. (And we can't postpone the pruning, because the query will have tuple pointers into the page later.) Maybe we should work a little harder on passing that information down. It seems reasonable to me that SELECTs shouldn't be tasked with doing HOT pruning. I'm not entirely wild about adding a parameter in this area because it seems that we're increasingly choosing to further expose what arguably ought to be internal implementation details. I'm -1 for a parameter as well, but I think that just stopping SELECTs from doing pruning at all might well be a win. It's at least worthy of some investigation. Turning HOT off completely would be an absolute disaster for OLTP on high update use cases against medium-large tables. That scenario is well represented by pgbench and TPC-C. 
I am *not* suggesting we recommend that and would look for very large caveats in the docs. (That may not have been clear, I guess I just assumed people would know I was heavily involved in the HOT project and understood its benefits). As stated, I am interested in turning off HOT in isolated, user specified situations, perhaps just for isolated tables. I'm not crazy about exposing magic parameters either but then I'm not crazy about either automatic settings or deferring things because we don't know how to set it. In general, I prefer the idea of having a user settable parameter in one release then automating it in a later release if clear settings emerge from usage. I'll submit a patch with parameter, to allow experimentation, for possible removal at commit or beta. If I had to suggest a value for an internal parameter, I would say that each SELECT statement should clean no more than 4 blocks. That way current OLTP behaviour is mostly preserved while the big queries and pg_dump don't suck in unpredictable ways. I'll submit the patch and we can talk some more. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
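The per-statement budget Simon suggests might look roughly like this. The constant and all names are invented here for illustration, not taken from heapam.

```c
#include <stdbool.h>

/* Sketch of an internal cap: let a SELECT HOT-prune at most a handful
 * of blocks per statement, so OLTP keeps most of the benefit while
 * large scans and pg_dump stop doing VACUUM's work.  Hypothetical
 * names; the real hook point would be near heap_page_prune(). */
#define SELECT_PRUNE_BLOCK_BUDGET 4

typedef struct ScanPruneState
{
    int blocks_pruned;          /* blocks this statement has cleaned */
} ScanPruneState;

/* Called once per heap page; true means "go ahead and prune". */
bool
select_may_prune_block(ScanPruneState *scan)
{
    if (scan->blocks_pruned >= SELECT_PRUNE_BLOCK_BUDGET)
        return false;           /* budget exhausted: skip pruning */
    scan->blocks_pruned++;
    return true;
}
```

A sequential scan over thousands of pages would then prune only the first four, leaving the rest to UPDATEs and VACUUM.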
Re: [HACKERS] Turning off HOT/Cleanup sometimes
Alvaro Herrera alvhe...@2ndquadrant.com writes: Robert Haas escribió: But the argument that we don't need a parameter because one behavior is best for everyone is not going to fly. In the above, there's the underlying assumption that it doesn't matter *what* we do with the page after doing or not doing pruning. But this is not necessarily the case: in the case of an UPDATE, having the space be freed beforehand is beneficial because there's the option of putting the new version of the tuple in the same page, potentially saving lots of I/O (bring up another destination page for the new tuple, write the new tuple there, end up dirtying two pages instead of one). But in a SELECT, the effect is only that you will have to skip less dead tuples, which is not as exciting. Yeah. Once they're hinted dead, it doesn't cost that much to skip over them. Not to mention that you might well never visit them at all, if this is an indexscan that knows which TIDs it needs to look at. It's possible that it can be shown that different use-cases have sufficiently different behaviors that we really do need a user-visible parameter. I don't want to start from that position though. If we did have a simple GUC parameter, it'd likely end up in the same boat as, say, enable_seqscan, which is way too blunt an instrument for real world use --- so I'm afraid this would soon bloat into a request for per-table settings, planner hints, or god knows what to try to confine the effects to the queries where it's appropriate. Let's not go there without proof that we have to. It's a much better thing if we can get the system's native behavior to be tuned well enough by depending on things it already knows. regards, tom lane
Re: [HACKERS] Standalone synchronous master
On 8 January 2014 21:40, Tom Lane t...@sss.pgh.pa.us wrote: Kevin Grittner kgri...@ymail.com writes: I'm torn on whether we should cave to popular demand on this; but if we do, we sure need to be very clear in the documentation about what a successful return from a commit request means. Sooner or later, Murphy's Law being what it is, if we do this someone will lose the primary and blame us because the synchronous replica is missing gobs of transactions that were successfully committed. I'm for not caving. I think people who are asking for this don't actually understand what they'd be getting. Agreed. Just to be clear, I made this mistake initially. Now I realise Heikki was right and if you think about it long enough, you will too. If you still disagree, think hard, read the archives until you do. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Turning off HOT/Cleanup sometimes
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: But in a SELECT, the effect is only that you will have to skip less dead tuples, which is not as exciting. Agreed. There's also the option to have it be done based on some expectation of future work- that is, if we have to traverse X number of dead tuples during a select, then don't bother with HOT pruning, but if we get up to X+Y dead tuples, then do HOT pruning. That said, I'm not entirely convinced that traversing these dead tuples is all *that* painful during SELECT. If there's that many levels then hopefully it's not long til an UPDATE comes along and cleans them up. Thanks, Stephen
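Stephen's X / X+Y idea, as a sketch with invented thresholds. Only the shape of the heuristic is from the thread; the numbers and names are hypothetical.

```c
#include <stdbool.h>

/* Two-threshold heuristic: a SELECT tolerates up to X dead tuples on a
 * page (skipping hinted-dead tuples is cheap), but prunes once the
 * count passes X+Y, when repeated skipping starts to outweigh the
 * one-time cleanup cost.  Thresholds are invented for illustration. */
#define DEAD_TUPLES_TOLERATED   8    /* X: just skip them */
#define DEAD_TUPLES_PRUNE_AT    12   /* X+Y: pruning pays off */

bool
select_should_prune_page(int dead_tuples_on_page)
{
    return dead_tuples_on_page >= DEAD_TUPLES_PRUNE_AT;
}
```

The gap between the two thresholds gives hysteresis: pages hovering around X never flip-flop into pruning on every visit.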
Re: [HACKERS] Turning off HOT/Cleanup sometimes
* Simon Riggs (si...@2ndquadrant.com) wrote: I'm -1 for a parameter as well, but I think that just stopping SELECTs from doing pruning at all might well be a win. It's at least worthy of some investigation. Turning HOT off completely would be an absolute disaster for OLTP on high update use cases against medium-large tables. That scenario is well represented by pgbench and TPC-C. I am *not* suggesting we recommend that and would look for very large caveats in the docs. This is true even if we're only talking about turning it off for the SELECT case...? That's what's under discussion here, after all. Certainly, we wouldn't turn it off completely... (That may not have been clear, I guess I just assumed people would know I was heavily involved in the HOT project and understood its benefits). I'm certainly aware that you were heavily involved in HOT but I don't think anyone is arguing to turn it off for everything. As stated, I am interested in turning off HOT in isolated, user specified situations, perhaps just for isolated tables. I tend to agree w/ Tom on this point- having this be a per-table configurable doesn't sound very appealing to me and it wouldn't address the case you mentioned around pg_dump, but I'm sure that'd be the next step for this and a per-session GUC wouldn't be sufficient. I'm not crazy about exposing magic parameters either but then I'm not crazy about either automatic settings or deferring things because we don't know how to set it. In general, I prefer the idea of having a user settable parameter in one release then automating it in a later release if clear settings emerge from usage. I'll submit a patch with parameter, to allow experimentation, for possible removal at commit or beta. Ugh, adding GUCs is bad *because* we end up never being able to remove them. If I had to suggest a value for an internal parameter, I would say that each SELECT statement should clean no more than 4 blocks. 
That way current OLTP behaviour is mostly preserved while the big queries and pg_dump don't suck in unpredictable ways. Right, this was one idea that I had also, as noted in the other subthread. I'm not convinced that it's a great idea and it'd probably be good to do a bit of testing to see just what the cost is; perhaps even just come up with a worst-case example to see the difference between a clean table and one with HOT chains as deep as they can go.. I'll submit the patch and we can talk some more. Neat. Thanks! Stephen
Re: [HACKERS] [PATCH] pg_basebackup: progress report max once per second
On Thu, Nov 14, 2013 at 10:27 AM, Mika Eloranta m...@ohmu.fi wrote: On 13 Nov 2013, at 20:51, Mika Eloranta m...@ohmu.fi wrote: Prevent excessive progress reporting that can grow to gigabytes of output with large databases. Same patch as an attachment. Would it not make more sense to instead store the last number printed, and only print it if the percentage has changed? AIUI with this patch we still print the same thing on top of itself if it takes 1 second to get 1% further. (Except for verbose mode - but if you're asking for verbose mode, you are *asking* to get lots of output) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
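Magnus's suggestion, sketched as a standalone helper: remember the last percentage actually printed and emit a new line only when it changes. This is a hypothetical function, not pg_basebackup's actual progress code.

```c
#include <stdio.h>

/* Print a progress line only when the integer percentage changes, so a
 * backup crawling at 1% per many seconds no longer rewrites the same
 * number once per second.  Returns 1 if printed, 0 if suppressed. */
int
report_progress(long done, long total, int *last_percent)
{
    int percent = (total > 0) ? (int) ((done * 100) / total) : 0;

    if (percent == *last_percent)
        return 0;               /* same number as last time: stay quiet */
    *last_percent = percent;
    fprintf(stderr, "\r%ld/%ld kB (%d%%)", done, total, percent);
    return 1;
}
```

Initializing `*last_percent` to -1 guarantees the very first report is always printed.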
Re: [HACKERS] Turning off HOT/Cleanup sometimes
Stephen Frost sfr...@snowman.net writes: That said, I'm not entirely convinced that traversing these dead tuples is all *that* painful during SELECT. If there's that many levels then hopefully it's not long til an UPDATE comes along and cleans them up. There's always VACUUM ;-) If you take about ten steps back, what's happening here is that maintenance work that we'd originally delegated to VACUUM, precisely so that it wouldn't have to be done by foreground queries, is now being done by foreground queries. And oddly enough, people don't like that. There is a reasonable argument for forcing UPDATE queries to do it anyway, to improve the odds they can do same-page updates (whether HOT or otherwise). And probably an INSERT should do it on a page that it's selected as an insertion target. But I think the argument that the original do-maintenance-in-background-whenever-possible design was wrong is a lot harder to sustain for SELECT or even DELETE queries. As I said upthread, I think the current behavior was *not* chosen for performance reasons but just to limit the scope of what we had to change for HOT. regards, tom lane
Re: [HACKERS] nested hstore patch
On 01/09/2014 06:12 AM, Andrew Dunstan wrote: Oleg, Please merge in the jsonb work and resubmit. See https://github.com/feodor/postgres/commits/jsonb_and_hstore I note that this repo does not apparently contain any of your latest changes. I'll go further and say that if the Hstore2 patch doesn't support JSONB for 9.4, we should postpone it to 9.5. We really don't want to get into a situation where we need an Hstore3 because we accepted an Hstore2 which needs to be rev'd for JSON. Especially since there's no good reason for the JSON changes not to be merged already. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On Thu, Jan 9, 2014 at 12:21 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jan 7, 2014 at 10:20 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Jan 7, 2014 at 2:46 AM, Robert Haas robertmh...@gmail.com wrote: Well, right now we just reopen the same object from all of the processes, which seems to work fine and doesn't require any of this complexity. The only problem I don't know how to solve is how to make a segment stick around for the whole postmaster lifetime. If duplicating the handle into the postmaster without its knowledge gets us there, it may be worth considering, but that doesn't seem like a good reason to rework the rest of the existing mechanism. I think one has to try this to see if it works as per the need. If it's not urgent, I can try this early next week? Anything we want to get into 9.4 has to be submitted by next Tuesday, but I don't know that we're going to get this into 9.4. Using DuplicateHandle(), we can make the segment stick around for the postmaster lifetime. I have used the test below (using the dsm_demo module) to verify: Session - 1 select dsm_demo_create('this message is from session-1'); dsm_demo_create - 82712 Session - 2 - select dsm_demo_read(82712); dsm_demo_read this message is from session-1 (1 row) Session-1 \q -- till here it will work without DuplicateHandle as well Session -2 select dsm_demo_read(82712); dsm_demo_read this message is from session-1 (1 row) Session -2 \q Session -3 select dsm_demo_read(82712); dsm_demo_read this message is from session-1 (1 row) -- above shows that handle stays around. Note - Currently I have to bypass the below code in dsm_attach(), as it assumes the segment will not stay if it's removed from the control file. /* * If we didn't find the handle we're looking for in the control * segment, it probably means that everyone else who had it mapped, * including the original creator, died before we got to this point. * It's up to the caller to decide what to do about that. 
*/ if (seg->control_slot == INVALID_CONTROL_SLOT) { dsm_detach(seg); return NULL; } Could you let me know what exactly you are expecting in the patch, just a call to DuplicateHandle() after CreateFileMapping() or something else as well? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On 01/09/2014 09:22 PM, Robert Haas wrote: On Wed, Jan 8, 2014 at 2:39 PM, knizhnik knizh...@garret.ru wrote: I wonder what is the intended use case of dynamic shared memory? Is it primarily oriented at PostgreSQL extensions, or will it also be used in PostgreSQL core? My main motivation is that I want to use it to support parallel query. There is unfortunately quite a bit of work left to be done before we can make that a reality, but that's the goal. I do not want to waste your time, but this topic is very interesting to me and I will be very pleased if you drop a few words about how DSM can help to implement parallel query processing? It seems to me that the main complexity is in the optimizer - it needs to split the query plan into several subplans which can be executed concurrently and then merge their partial results. As far as I understand it is not possible to use multithreading for parallel query execution because most of PostgreSQL code is non-reentrant. So we need to execute these subplans in several processes. And unlike threads, the only way of efficiently exchanging data between processes is shared memory. So it is clear why we need shared memory for parallel query execution. But why does it have to be dynamic? Why can it not be preallocated at start time like most other resources used by PostgreSQL? May be I am wrong, but I do not see some reasons for creating multiple DSM segments by the same extension. Right. And the total number of DSM segments is expected to be not very large (10). The same is true for synchronization primitives (LWLocks for example) needed to synchronize access to these DSM segments. So I am not sure if the possibility to place locks in DSM is really so critical... We could just reserve some space for LWLocks which can be used by extensions, so that LWLockAssign() can be used without RequestAddinLWLocks, or RequestAddinLWLocks can be used not only from a preloaded extension. If you're doing all of this at postmaster startup time, that all works fine. 
If you want to be able to load up an extension on the fly, then it doesn't. You can only RequestAddinLWLocks() at postmaster start time, not afterwards, so currently any extension that wants to use lwlocks has to be loaded at postmaster startup time, or you're out of luck. Well. Technically we reserve something like 3 extra lwlocks that could be assigned later. But relying on those to be available is not very reliable, and also, 3 is not very many, considering that we have something north of 32k core lwlocks in the default configuration. 3 is definitely too small. But you agreed with me that the number of DSM segments will be not very large. And if we do not need fine grain locking (and IMHO it is not needed for most extensions), then we need just a few (most likely one) locks per DSM segment. It means that if instead of 3 we reserve let's say 30 LW-locks, then it will be enough for most extensions. And there will be almost no extra resource overhead, because as you wrote PostgreSQL has 32k locks in the default configuration. Certainly if we need an independent lock for each page of DSM memory then there will be no other choice except placing locks in the DSM segment itself. But once again - I do not think that most extensions needing shared memory will use such fine grain locking.
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On 01/09/2014 09:46 PM, Claudio Freire wrote: On Thu, Jan 9, 2014 at 2:22 PM, Robert Haas robertmh...@gmail.com wrote: It would be nice to have better operating system support for this. For example, IIUC, 64-bit Linux has 128TB of address space available for user processes. When you clone(), it can either share the entire address space (i.e. it's a thread) or none of it (i.e. it's a process). There's no option to, say, share 64TB and not the other 64TB, which would be ideal for us. We could then map dynamic shared memory segments into the shared portion of the address space and do backend-private allocations in the unshared part. Of course, even if we had that, it wouldn't be portable, so who knows how much good it would do. But it would be awfully nice to have the option. You can map a segment at fork time, and unmap it after forking. That doesn't really use RAM, since it's supposed to be lazily allocated (it can be forced to be so, I believe, with PROT_NONE and MAP_NORESERVE, but I don't think that's portable). That guarantees it's free. Next, you can map shared memory at explicit addresses (linux's mmap has support for that, and I seem to recall Windows did too). All you have to do, is some book-keeping in shared memory (so all processes can coordinate new mappings). As far as I understand, the main advantage of DSM is that a segment can be allocated at any time - not only at fork time. And it is not because of memory consumption: even without unmap, allocating some memory region doesn't cause loss of physical memory. And there are usually no problems with exhaustion of virtual space on a 64-bit architecture. But using some combination of flags (such as MAP_NORESERVE), it is usually possible to completely eliminate the overhead of reserving some address range in virtual space. But mapping a dynamically created segment (not at fork time) to the same address really seems to be a big challenge. 
[HACKERS] [PATCH] Filter error log statements by sqlstate
Allow the default log_min_error_statement to be overridden per sqlstate to make it possible to filter out some error types while maintaining a low log_min_error_statement, or enable logging for some error types when the default is to not log anything. I've tried to do something like this using rsyslog filters, but that's pretty awkward and doesn't work at all when the statement is split into multiple syslog messages.

https://github.com/saaros/postgres/compare/log-by-sqlstate

 src/backend/utils/error/elog.c | 183 -
 src/backend/utils/misc/guc.c   |  14 +++-
 src/include/utils/guc.h        |   4 +
 src/include/utils/guc_tables.h |   1 +
 4 files changed, 199 insertions(+), 3 deletions(-)

/ Oskari

From 61fe332f35f49c59257e9dcd0b5e2ff80f1f4055 Mon Sep 17 00:00:00 2001
From: Oskari Saarenmaa o...@ohmu.fi
Date: Thu, 9 Jan 2014 20:49:28 +0200
Subject: [PATCH] Filter error log statements by sqlstate

Allow the default log_min_error_statement to be overridden per sqlstate to make it possible to filter out some error types while maintaining a low log_min_error_statement or enable logging for some error types when the default is to not log anything.
---
 src/backend/utils/error/elog.c | 183 -
 src/backend/utils/misc/guc.c   |  14 +++-
 src/include/utils/guc.h        |   4 +
 src/include/utils/guc_tables.h |   1 +
 4 files changed, 199 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 3de162b..c843e1a 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -74,7 +74,9 @@
 #include "storage/ipc.h"
 #include "storage/proc.h"
 #include "tcop/tcopprot.h"
+#include "utils/builtins.h"
 #include "utils/guc.h"
+#include "utils/guc_tables.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -111,6 +113,11 @@
 char	   *Log_line_prefix = NULL;		/* format for extra log line info */
 int			Log_destination = LOG_DESTINATION_STDERR;
 char	   *Log_destination_string = NULL;
+static uint64 *log_sqlstate_error_statement = NULL;
+static size_t log_sqlstate_error_statement_len = 0;
+
+static int	get_sqlstate_error_level(int sqlstate);
+
 #ifdef HAVE_SYSLOG
@@ -2475,6 +2482,7 @@
 static void
 write_csvlog(ErrorData *edata)
 {
 	StringInfoData buf;
+	int			requested_log_level;
 	bool		print_stmt = false;

 	/* static counter for line numbers */
@@ -2618,7 +2626,10 @@ write_csvlog(ErrorData *edata)
 	appendStringInfoChar(&buf, ',');

 	/* user query --- only reported if not disabled by the caller */
-	if (is_log_level_output(edata->elevel, log_min_error_statement) &&
+	requested_log_level = get_sqlstate_error_level(edata->sqlerrcode);
+	if (requested_log_level < 0)
+		requested_log_level = log_min_error_statement;
+	if (is_log_level_output(edata->elevel, requested_log_level) &&
 		debug_query_string != NULL &&
 		!edata->hide_stmt)
 		print_stmt = true;
@@ -2691,6 +2702,7 @@
 static void
 send_message_to_server_log(ErrorData *edata)
 {
 	StringInfoData buf;
+	int			requested_log_level;

 	initStringInfo(&buf);
@@ -2775,7 +2787,10 @@ send_message_to_server_log(ErrorData *edata)
 	/*
 	 * If the user wants the query that generated this error logged, do it.
 	 */
-	if (is_log_level_output(edata->elevel, log_min_error_statement) &&
+	requested_log_level = get_sqlstate_error_level(edata->sqlerrcode);
+	if (requested_log_level < 0)
+		requested_log_level = log_min_error_statement;
+	if (is_log_level_output(edata->elevel, requested_log_level) &&
 		debug_query_string != NULL &&
 		!edata->hide_stmt)
 	{
@@ -3577,3 +3592,167 @@ trace_recovery(int trace_level)
 	return trace_level;
 }
+
+
+/*
+*/
+static int
+get_sqlstate_error_level(int sqlstate)
+{
+	uint64		left = 0,
+				right = log_sqlstate_error_statement_len;
+
+	while (left < right)
+	{
+		uint64		middle = left + (right - left) / 2;
+		int			m_sqlstate = log_sqlstate_error_statement[middle] >> 32;
+
+		if (m_sqlstate == sqlstate)
+			return log_sqlstate_error_statement[middle] & 0xffff;
+		else if (m_sqlstate < sqlstate)
+			left = middle + 1;
+		else
+			right = middle;
+	}
+	return -1;
+}
+
+bool
+check_log_sqlstate_error(char **newval, void **extra, GucSource source)
+{
+	const struct config_enum_entry *enum_entry;
+	char	   *rawstring, *new_newval, *rp;
+	List	   *elemlist;
+	ListCell   *l;
+	uint64	   *new_array = NULL;
+	int			i, new_array_len = 0;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	/* GUC wants malloced results, allocate room for as many elements on
+	 * the list plus one to hold the array size */
+	new_array = (uint64 *) malloc(sizeof(uint64) * (list_length(elemlist) + 1));
+	if (!new_array)
+	{
+		pfree(rawstring);
+
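For readers following along, the patch's packed-uint64 lookup can be reproduced standalone like this: each array entry carries a sqlstate in the high 32 bits and a log level in the low 16 bits, sorted by sqlstate so a binary search finds any override. This is a simplified sketch of the same idea as get_sqlstate_error_level above, not the patch's exact code.

```c
#include <stdint.h>
#include <stddef.h>

/* Binary-search a sorted table of (sqlstate << 32 | level) entries.
 * Returns the level for an exact sqlstate match, or -1 when there is no
 * per-sqlstate override (the caller would then fall back to the global
 * log_min_error_statement setting). */
int
lookup_sqlstate_level(const uint64_t *entries, size_t len, uint32_t sqlstate)
{
    size_t left = 0, right = len;

    while (left < right)
    {
        size_t   mid = left + (right - left) / 2;
        uint32_t mid_state = (uint32_t) (entries[mid] >> 32);

        if (mid_state == sqlstate)
            return (int) (entries[mid] & 0xffff);
        else if (mid_state < sqlstate)
            left = mid + 1;
        else
            right = mid;
    }
    return -1;
}
```

Packing both values into one uint64 keeps the table a single flat allocation, which suits the GUC check/assign hook pattern where the parsed value must be a single malloc'd blob.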
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On Thu, Jan 9, 2014 at 4:24 PM, knizhnik knizh...@garret.ru wrote: On 01/09/2014 09:46 PM, Claudio Freire wrote: On Thu, Jan 9, 2014 at 2:22 PM, Robert Haas robertmh...@gmail.com wrote: It would be nice to have better operating system support for this. For example, IIUC, 64-bit Linux has 128TB of address space available for user processes. When you clone(), it can either share the entire address space (i.e. it's a thread) or none of it (i.e. it's a process). There's no option to, say, share 64TB and not the other 64TB, which would be ideal for us. We could then map dynamic shared memory segments into the shared portion of the address space and do backend-private allocations in the unshared part. Of course, even if we had that, it wouldn't be portable, so who knows how much good it would do. But it would be awfully nice to have the option. You can map a segment at fork time, and unmap it after forking. That doesn't really use RAM, since it's supposed to be lazily allocated (it can be forced to be so, I believe, with PROT_NONE and MAP_NORESERVE, but I don't think that's portable). That guarantees it's free. Next, you can map shared memory at explicit addresses (linux's mmap has support for that, and I seem to recall Windows did too). All you have to do, is some book-keeping in shared memory (so all processes can coordinate new mappings). As far as I understand, the main advantage of DSM is that a segment can be allocated at any time - not only at fork time. And it is not because of memory consumption: even without unmap, allocating some memory region doesn't cause loss of physical memory. And there are usually no problems with exhaustion of virtual space on a 64-bit architecture. But using some combination of flags (such as MAP_NORESERVE), it is usually possible to completely eliminate the overhead of reserving some address range in virtual space. But mapping a dynamically created segment (not at fork time) to the same address really seems to be a big challenge. 
At fork time I only wrote about reserving the address space. After reserving it, all you have to do is implement an allocator that works in shared memory (protected by a lwlock of course). In essence, a hypothetical pg_dsm_alloc(region_name) would use regular shared memory to coordinate returning an already mapped region (same address which is guaranteed to work since we reserved that region), or allocate one (within the reserved address space).
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On 01/09/2014 11:09 PM, Amit Kapila wrote: On Thu, Jan 9, 2014 at 12:21 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jan 7, 2014 at 10:20 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Jan 7, 2014 at 2:46 AM, Robert Haas robertmh...@gmail.com wrote: Well, right now we just reopen the same object from all of the processes, which seems to work fine and doesn't require any of this complexity. The only problem I don't know how to solve is how to make a segment stick around for the whole postmaster lifetime. If duplicating the handle into the postmaster without its knowledge gets us there, it may be worth considering, but that doesn't seem like a good reason to rework the rest of the existing mechanism. I think one has to try this to see if it works as per the need. If it's not urgent, I can try this early next week? Anything we want to get into 9.4 has to be submitted by next Tuesday, but I don't know that we're going to get this into 9.4. Using DuplicateHandle(), we can make the segment stick around for the postmaster lifetime. I have used the test below (using the dsm_demo module) to verify: Session - 1 select dsm_demo_create('this message is from session-1'); dsm_demo_create - 82712 Session - 2 - select dsm_demo_read(82712); dsm_demo_read this message is from session-1 (1 row) Session-1 \q -- till here it will work without DuplicateHandle as well Session -2 select dsm_demo_read(82712); dsm_demo_read this message is from session-1 (1 row) Session -2 \q Session -3 select dsm_demo_read(82712); dsm_demo_read this message is from session-1 (1 row) -- above shows that handle stays around. Note - Currently I have to bypass the below code in dsm_attach(), as it assumes the segment will not stay if it's removed from the control file. /* * If we didn't find the handle we're looking for in the control * segment, it probably means that everyone else who had it mapped, * including the original creator, died before we got to this point. * It's up to the caller to decide what to do about that. 
*/ if (seg->control_slot == INVALID_CONTROL_SLOT) { dsm_detach(seg); return NULL; } Could you let me know what exactly you are expecting in the patch, just a call to DuplicateHandle() after CreateFileMapping() or something else as well? As far as I understand, DuplicateHandle() should really do the trick: protect the segment from deallocation. But should the postmaster be somehow notified about this handle? For example, if we really want to delete this segment (drop extension), we should somehow make the postmaster close this handle. How can it be done?
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On Fri, Jan 10, 2014 at 1:00 AM, knizhnik knizh...@garret.ru wrote: On 01/09/2014 11:09 PM, Amit Kapila wrote: Using DuplicateHandle(), we can make the segment stick around for the postmaster's lifetime. I used the test below (with the dsm_demo module) to verify: As far as I understand, DuplicateHandle() should really do the trick: protect the segment from deallocation. But should the postmaster somehow be notified about this handle? For example, if we really want to delete the segment (drop extension), we need some way to make the postmaster close its handle. How can that be done? I think we need to use some form of IPC to communicate it to the postmaster; I could not think of any other way at the moment. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On 01/09/2014 11:30 PM, Claudio Freire wrote: On Thu, Jan 9, 2014 at 4:24 PM, knizhnik knizh...@garret.ru wrote: On 01/09/2014 09:46 PM, Claudio Freire wrote: On Thu, Jan 9, 2014 at 2:22 PM, Robert Haas robertmh...@gmail.com wrote: It would be nice to have better operating system support for this. For example, IIUC, 64-bit Linux has 128TB of address space available for user processes. When you clone(), it can either share the entire address space (i.e. it's a thread) or none of it (i.e. it's a process). There's no option to, say, share 64TB and not the other 64TB, which would be ideal for us. We could then map dynamic shared memory segments into the shared portion of the address space and do backend-private allocations in the unshared part. Of course, even if we had that, it wouldn't be portable, so who knows how much good it would do. But it would be awfully nice to have the option. You can map a segment at fork time, and unmap it after forking. That doesn't really use RAM, since it's supposed to be lazily allocated (it can be forced to be so, I believe, with PROT_NONE and MAP_NORESERVE, but I don't think that's portable). That guarantees it's free. Next, you can map shared memory at explicit addresses (linux's mmap has support for that, and I seem to recall Windows did too). All you have to do is some book-keeping in shared memory (so all processes can coordinate new mappings). As far as I understand, the main advantage of DSM is that a segment can be allocated at any time, not only at fork time. And it is not a question of memory consumption: even without unmapping, reserving some memory region doesn't by itself cause PostgreSQL to lose physical memory. And there is usually no problem with exhaustion of virtual address space on a 64-bit architecture. Using some combination of flags (such as MAP_NORESERVE), it is usually possible to completely eliminate even the overhead of reserving an address range in virtual space.
But mapping a dynamically created segment (not at fork time) to the same address really seems to be a big challenge. At fork time I only wrote about reserving the address space. After reserving it, all you have to do is implement an allocator that works in shared memory (protected by a lwlock of course). In essence, a hypothetical pg_dsm_alloc(region_name) would use regular shared memory to coordinate returning an already mapped region (same address, which is guaranteed to work since we reserved that region), or allocate one (within the reserved address space). Why do we need named segments? There is the ShmemAlloc function in the PostgreSQL API. If RequestAddinShmemSpace could be used without the requirement to place the module in the preload list, wouldn't that be enough for most extensions? And ShmemInitHash can be used to maintain named regions if needed... So if we have some reserved address space, do we actually need some special allocator for this space to allocate new segments in it? Why is the existing shared memory API not enough?
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On Thu, Jan 9, 2014 at 4:39 PM, knizhnik knizh...@garret.ru wrote: At fork time I only wrote about reserving the address space. After reserving it, all you have to do is implement an allocator that works in shared memory (protected by a lwlock of course). In essence, a hypothetical pg_dsm_alloc(region_name) would use regular shared memory to coordinate returning an already mapped region (same address, which is guaranteed to work since we reserved that region), or allocate one (within the reserved address space). Why do we need named segments? There is the ShmemAlloc function in the PostgreSQL API. If RequestAddinShmemSpace could be used without the requirement to place the module in the preload list, wouldn't that be enough for most extensions? And ShmemInitHash can be used to maintain named regions if needed... If you want to dynamically create the segments, you need some way to identify them. That is, the name. Otherwise, RequestWhateverShmemSpace won't know whether to return an already-mapped region or not. Mind you, the name can be a number. No need to make it a string. So if we have some reserved address space, do we actually need some special allocator for this space to allocate new segments in it? Why is the existing shared memory API not enough? I don't know this existing API you mention. But I think this is quite a specific case, very unlikely to be serviced by existing APIs. You need a data structure that can map names to regions; any hash map will do, or even an array, since one wouldn't expect it to be too big or require it to be too fast. Then you need to unmap the reserve mapping and put a shared region there instead, before returning the pointer to this shared region. So, the special thing is, the book-keeping region sits in regular shared memory, whereas the allocated regions sit in newly-created segments. And segments are referenced by pointers (since the address space is fixed and shared). Is there something like that already?
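The book-keeping Claudio describes, a table in regular shared memory mapping names to regions inside the reserved range, might look roughly like the sketch below. Everything here is hypothetical (the function name, the bump allocator, the fixed-size table), and a real version would hold a lwlock around the table and actually mmap the returned range; plain process-local memory stands in for shared memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_REGIONS 64
#define NAME_LEN    64

/* One entry of the book-keeping table that would live in regular shared
 * memory; the regions themselves would be segments mapped inside the
 * reserved address range, identified here only by their offset. */
struct region
{
    char   name[NAME_LEN];
    size_t offset;
    size_t size;
};

static struct region regions[MAX_REGIONS];
static int    nregions;
static size_t next_offset;      /* bump allocator over the reserved range */

/* Return the offset of the named region inside the reserved range,
 * carving out a new one on first use.  Returns (size_t) -1 when the
 * table is full.  Real code would align sizes, bound-check against the
 * reservation, and take a lock. */
size_t dsm_alloc_sketch(const char *name, size_t size)
{
    for (int i = 0; i < nregions; i++)
        if (strcmp(regions[i].name, name) == 0)
            return regions[i].offset;   /* already allocated: reuse */

    if (nregions >= MAX_REGIONS)
        return (size_t) -1;

    struct region *r = &regions[nregions++];

    strncpy(r->name, name, NAME_LEN - 1);
    r->offset = next_offset;
    r->size = size;
    next_offset += size;
    return r->offset;
}
```

Calling it twice with the same name returns the same offset, which is what makes "same address in every process" work once the base of the reservation is fixed.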
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On Thu, Jan 9, 2014 at 4:48 PM, Claudio Freire klaussfre...@gmail.com wrote: On Thu, Jan 9, 2014 at 4:39 PM, knizhnik knizh...@garret.ru wrote: At fork time I only wrote about reserving the address space. After reserving it, all you have to do is implement an allocator that works in shared memory (protected by a lwlock of course). In essence, a hypothetical pg_dsm_alloc(region_name) would use regular shared memory to coordinate returning an already mapped region (same address, which is guaranteed to work since we reserved that region), or allocate one (within the reserved address space). Why do we need named segments? There is the ShmemAlloc function in the PostgreSQL API. If RequestAddinShmemSpace could be used without the requirement to place the module in the preload list, wouldn't that be enough for most extensions? And ShmemInitHash can be used to maintain named regions if needed... If you want to dynamically create the segments, you need some way to identify them. That is, the name. Otherwise, RequestWhateverShmemSpace won't know whether to return an already-mapped region or not. Mind you, the name can be a number. No need to make it a string. So if we have some reserved address space, do we actually need some special allocator for this space to allocate new segments in it? Why is the existing shared memory API not enough? Oh, I notice why the confusion now. The reserve mapping I was proposing was MAP_NORESERVE with PROT_NONE, i.e. forbidden access, which guarantees the OS won't try to allocate physical RAM for it. You'd have to re-map it before use, so it's not like a regular shared memory region where you can simply allocate pointers and intersperse bookkeeping data in place.
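The reserve-then-remap scheme discussed above can be sketched in a few lines of POSIX C. This is a toy, Linux-oriented illustration, not PostgreSQL code: the function name and sizes are invented, and for brevity the shared segment is an anonymous MAP_SHARED mapping created before fork() rather than a file- or shm-backed segment that a late-starting process could attach to.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define RESERVED_SIZE (256UL * 1024 * 1024)  /* address space only, no RAM */
#define SEGMENT_SIZE  ((size_t) 4 * 1024 * 1024)

/* Reserve a large address range without committing memory, then carve a
 * shared segment out of it at a known address visible to parent and
 * child.  Returns 0 on success, -1 on any failure. */
int reserve_and_share_demo(void)
{
    /* PROT_NONE + MAP_NORESERVE: the kernel hands out address space only;
     * touching it would fault, and no swap is reserved for it. */
    char *reserved = mmap(NULL, RESERVED_SIZE, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                          -1, 0);
    if (reserved == MAP_FAILED)
        return -1;

    /* Replace the start of the reservation with a real shared mapping.
     * MAP_FIXED is safe here because we own that range. */
    char *seg = mmap(reserved, SEGMENT_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (seg == MAP_FAILED)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        /* Child: write through the shared segment and exit. */
        strcpy(seg, "hello from child");
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* The parent sees the child's write because the segment is shared. */
    int ok = (strcmp(seg, "hello from child") == 0);

    munmap(reserved, RESERVED_SIZE);    /* also unmaps the segment */
    return ok ? 0 : -1;
}
```

The key point is the second mmap(): because the range was already reserved, every process that inherited the reservation can map a segment at the same virtual address, so plain pointers into it stay valid everywhere.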
Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.
On 01/09/2014 08:18 PM, Heikki Linnakangas wrote: On 12/12/2013 04:00 AM, Kyotaro HORIGUCHI wrote: Hello, we happened to see a server crash on archive recovery under a certain condition. After the TLI is incremented, there can be a case where the WAL file for the older timeline is archived, but the file for the same segment id on the newer timeline is not. Archive recovery fails in that case with a PANIC error like the following:

| PANIC: record with zero length at 0/1820D40

A replay script is attached. This issue occurred for 9.4dev and 9.3.2, and not for 9.2.6 and 9.1.11. The latter versions search pg_xlog for the TLI before trying the archive for older TLIs. This occurs while fetching the checkpoint redo record in archive recovery:

if (checkPoint.redo < RecPtr)
{
    /* back up to find the record */
    record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

Hmm. After running the script, pg_controldata says:

Latest checkpoint location: 0/290
Prior checkpoint location: 0/290
Latest checkpoint's REDO location: 0/290

The PANIC is actually coming from here:

/*
 * Re-fetch the last valid or last applied record, so we can identify the
 * exact endpoint of what we consider the valid portion of WAL.
 */
record = ReadRecord(xlogreader, LastRec, PANIC, false);

If recovery started from an online checkpoint, then I think you'd get a similar PANIC from the code you quoted. And this is caused by the segment file for the older timeline in the archive directory being preferred to the one for the newer timeline in pg_xlog. Yep. Actually, why is the partially-filled 00010002 file archived in the first place? Looking at the code, it's been like that forever, but it seems like a bad idea. If the original server is still up and running, and writing more data to that file, what will happen is that when the original server later tries to archive it, it will fail because the partial version of the file is already in the archive. Or worse, the partial version overwrites a previously archived, more complete version.

This is the code that does that:

/*
 * If we are establishing a new timeline, we have to copy data from the
 * last WAL segment of the old timeline to create a starting WAL segment
 * for the new timeline.
 *
 * Notify the archiver that the last WAL segment of the old timeline is
 * ready to copy to archival storage. Otherwise, it is not archived for a
 * while.
 */
if (endTLI != ThisTimeLineID)
{
    XLogFileCopy(endLogSegNo, endTLI, endLogSegNo);
    if (XLogArchivingActive())
    {
        XLogFileName(xlogpath, endTLI, endLogSegNo);
        XLogArchiveNotify(xlogpath);
    }
}

So, the rationale is that otherwise it would take a long time until that segment is archived. To be precise, I don't think the segment with the old TLI would ever be archived without the above, but the same segment on the new timeline would, after it fills up. Wouldn't it be better to not archive the old segment, and instead switch to a new segment after writing the end-of-recovery checkpoint, so that the segment on the new timeline is archived sooner? - Heikki
Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.
Heikki Linnakangas hlinnakan...@vmware.com writes: Actually, why is the partially-filled 00010002 file archived in the first place? ... So, the rationale is that otherwise it would take a long time until that segment is archived. To be precise, I don't think the segment with the old TLI would ever be archived without the above, but the same segment on the new timeline would, after it fills up. Wouldn't it be better to not archive the old segment, and instead switch to a new segment after writing the end-of-recovery checkpoint, so that the segment on the new timeline is archived sooner? Don't we want to archive both? If you want to recover to the end of the old timeline, you're going to need that file too, no? regards, tom lane
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On 01/09/2014 11:48 PM, Claudio Freire wrote: On Thu, Jan 9, 2014 at 4:39 PM, knizhnik knizh...@garret.ru wrote: At fork time I only wrote about reserving the address space. After reserving it, all you have to do is implement an allocator that works in shared memory (protected by a lwlock of course). In essence, a hypothetical pg_dsm_alloc(region_name) would use regular shared memory to coordinate returning an already mapped region (same address which is guaranteed to work since we reserved that region), or allocate one (within the reserved address space). Why do we need named segments? There is ShmemAlloc function in PostgreSQL API. If RequestAddinShmemSpace can be used without requirement to place module in preloaded list, then isn't it enough for most extensions? And ShmemInitHash can be used to maintain named regions if it is needed... If you want to dynamically create the segments, you need some way to identify them. That is, the name. Otherwise, RequestWhateverShmemSpace won't know when to return an already-mapped region or not. Mind you, the name can be a number. No need to make it a string. So if we have some reserved address space, do we actually need some special allocator for this space to allocate new segments in it? Why existed API to shared memory is not enough? I don't know this existing API you mention. But I think this is quite a specific case very unlikely to be serviced from existing APIs. You need a data structure that can map names to regions, any hash map will do, or even an array since one wouldn't expect it to be too big, or require it to be too fast, and then you need to unmap the reserve mapping and put a shared region there instead, before returning the pointer to this shared region. So, the special thing is, the book-keeping region sits in regular shared memory, whereas the allocated regions sit in newly-created segments. And segments are referenced by pointers (since the address space is fixed and shared). 
Is there something like that already? By the existing API I mostly mean these six functions: RequestAddinShmemSpace(), RequestAddinLWLocks(), ShmemInitStruct(), LWLockAssign(), ShmemAlloc(), ShmemInitHash(). If it were possible to use these functions without requiring the module to be included in the shared_preload_libraries list, would we really need DSM? And it could be achieved by 1. reserving the address space (as you suggested), and 2. reserving some fixed number of free LWLocks (not very many, 100). I have nothing against the creation of a dedicated allocator of named shared memory segments within the reserved address space; I am just not sure it is actually needed. In some sense RequestAddinShmemSpace() can be such an allocator.
Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.
On 01/09/2014 10:16 PM, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: Actually, why is the partially-filled 00010002 file archived in the first place? ... So, the rationale is that otherwise it would take a long time until that segment is archived. To be precise, I don't think the segment with the old TLI would ever be archived without the above, but the same segment on the new timeline would, after it fills up. Wouldn't it be better to not archive the old segment, and instead switch to a new segment after writing the end-of-recovery checkpoint, so that the segment on the new timeline is archived sooner? Don't we want to archive both? If you want to recover to the end of the old timeline, you're going to need that file too, no? Hmm. It should be the responsibility of the original server to archive the segment on the old timeline. Mind you, partial segments are never archived, except for this one case, so how did the old segment find its way to the new server? A few possibilities come to mind: the DBA manually copied it from the old server to pg_xlog, it was streamed by streaming replication, or it was included in a base backup. The OP's test script resembles the base backup case. In all of those cases, I don't think it's the new server's responsibility to archive it. If it was copied to pg_xlog manually, the administrator may also copy it to the archive if he feels like it. If it was streamed from a live server, the original server should take care of it. If it was included in a backup, well, it's included in the backup so it doesn't necessarily need to be archived. - Heikki
[HACKERS] [PATCH] pgcrypto: implement gen_random_uuid
The only useful feature of the uuid-ossp module in my opinion is the uuid_generate_v4 function and as uuid-ossp is more or less abandonware people have had trouble building and installing it. This patch implements an alternative uuid v4 generation function in pgcrypto which could be moved to core once there's a core PRNG with large enough internal state. On my test system it took 3796 msec to generate a million UUIDs with pgcrypto while uuid-ossp took 20375 msec. https://github.com/saaros/postgres/compare/pgcrypto-uuid-v4 contrib/pgcrypto/Makefile | 2 +- contrib/pgcrypto/pgcrypto--1.0--1.1.sql | 8 contrib/pgcrypto/{pgcrypto--1.0.sql = pgcrypto--1.1.sql} | 7 ++- contrib/pgcrypto/pgcrypto.c | 22 ++ contrib/pgcrypto/pgcrypto.control | 2 +- contrib/pgcrypto/pgcrypto.h | 1 + doc/src/sgml/pgcrypto.sgml| 11 +++ / Oskari From 522fef9c3739d4c4f3c107e574e84db67a0c07a2 Mon Sep 17 00:00:00 2001 From: Oskari Saarenmaa o...@ohmu.fi Date: Thu, 9 Jan 2014 22:24:36 +0200 Subject: [PATCH] pgcrypto: implement gen_random_uuid --- contrib/pgcrypto/Makefile | 2 +- contrib/pgcrypto/pgcrypto--1.0--1.1.sql | 8 ++ contrib/pgcrypto/pgcrypto--1.0.sql | 202 --- contrib/pgcrypto/pgcrypto--1.1.sql | 207 contrib/pgcrypto/pgcrypto.c | 22 contrib/pgcrypto/pgcrypto.control | 2 +- contrib/pgcrypto/pgcrypto.h | 1 + doc/src/sgml/pgcrypto.sgml | 11 ++ 8 files changed, 251 insertions(+), 204 deletions(-) create mode 100644 contrib/pgcrypto/pgcrypto--1.0--1.1.sql delete mode 100644 contrib/pgcrypto/pgcrypto--1.0.sql create mode 100644 contrib/pgcrypto/pgcrypto--1.1.sql diff --git a/contrib/pgcrypto/Makefile b/contrib/pgcrypto/Makefile index dadec95..1c85c98 100644 --- a/contrib/pgcrypto/Makefile +++ b/contrib/pgcrypto/Makefile @@ -26,7 +26,7 @@ MODULE_big = pgcrypto OBJS = $(SRCS:.c=.o) EXTENSION = pgcrypto -DATA = pgcrypto--1.0.sql pgcrypto--unpackaged--1.0.sql +DATA = pgcrypto--1.1.sql pgcrypto--1.0--1.1.sql pgcrypto--unpackaged--1.0.sql REGRESS = init md5 sha1 hmac-md5 hmac-sha1 blowfish rijndael \ 
$(CF_TESTS) \ diff --git a/contrib/pgcrypto/pgcrypto--1.0--1.1.sql b/contrib/pgcrypto/pgcrypto--1.0--1.1.sql new file mode 100644 index 000..2601669 --- /dev/null +++ b/contrib/pgcrypto/pgcrypto--1.0--1.1.sql @@ -0,0 +1,8 @@ +/* contrib/pgcrypto/pgcrypto--1.0--1.1.sql */ + +\echo Use ALTER EXTENSION pgcrypto UPDATE to load this file. \quit + +CREATE FUNCTION gen_random_uuid() +RETURNS uuid +AS 'MODULE_PATHNAME', 'pg_random_uuid' +LANGUAGE C VOLATILE; diff --git a/contrib/pgcrypto/pgcrypto--1.0.sql b/contrib/pgcrypto/pgcrypto--1.0.sql deleted file mode 100644 index 347825e..000 --- a/contrib/pgcrypto/pgcrypto--1.0.sql +++ /dev/null @@ -1,202 +0,0 @@ -/* contrib/pgcrypto/pgcrypto--1.0.sql */ - --- complain if script is sourced in psql, rather than via CREATE EXTENSION -\echo Use CREATE EXTENSION pgcrypto to load this file. \quit - -CREATE FUNCTION digest(text, text) -RETURNS bytea -AS 'MODULE_PATHNAME', 'pg_digest' -LANGUAGE C IMMUTABLE STRICT; - -CREATE FUNCTION digest(bytea, text) -RETURNS bytea -AS 'MODULE_PATHNAME', 'pg_digest' -LANGUAGE C IMMUTABLE STRICT; - -CREATE FUNCTION hmac(text, text, text) -RETURNS bytea -AS 'MODULE_PATHNAME', 'pg_hmac' -LANGUAGE C IMMUTABLE STRICT; - -CREATE FUNCTION hmac(bytea, bytea, text) -RETURNS bytea -AS 'MODULE_PATHNAME', 'pg_hmac' -LANGUAGE C IMMUTABLE STRICT; - -CREATE FUNCTION crypt(text, text) -RETURNS text -AS 'MODULE_PATHNAME', 'pg_crypt' -LANGUAGE C IMMUTABLE STRICT; - -CREATE FUNCTION gen_salt(text) -RETURNS text -AS 'MODULE_PATHNAME', 'pg_gen_salt' -LANGUAGE C VOLATILE STRICT; - -CREATE FUNCTION gen_salt(text, int4) -RETURNS text -AS 'MODULE_PATHNAME', 'pg_gen_salt_rounds' -LANGUAGE C VOLATILE STRICT; - -CREATE FUNCTION encrypt(bytea, bytea, text) -RETURNS bytea -AS 'MODULE_PATHNAME', 'pg_encrypt' -LANGUAGE C IMMUTABLE STRICT; - -CREATE FUNCTION decrypt(bytea, bytea, text) -RETURNS bytea -AS 'MODULE_PATHNAME', 'pg_decrypt' -LANGUAGE C IMMUTABLE STRICT; - -CREATE FUNCTION encrypt_iv(bytea, bytea, bytea, text) -RETURNS 
bytea -AS 'MODULE_PATHNAME', 'pg_encrypt_iv' -LANGUAGE C IMMUTABLE STRICT; - -CREATE FUNCTION decrypt_iv(bytea, bytea, bytea, text) -RETURNS bytea -AS 'MODULE_PATHNAME', 'pg_decrypt_iv' -LANGUAGE C IMMUTABLE STRICT; - -CREATE FUNCTION gen_random_bytes(int4) -RETURNS bytea -AS 'MODULE_PATHNAME', 'pg_random_bytes' -LANGUAGE C VOLATILE STRICT; - --- --- pgp_sym_encrypt(data, key) --- -CREATE FUNCTION pgp_sym_encrypt(text, text) -RETURNS bytea -AS 'MODULE_PATHNAME', 'pgp_sym_encrypt_text' -LANGUAGE C STRICT; - -CREATE FUNCTION pgp_sym_encrypt_bytea(bytea, text) -RETURNS bytea -AS
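For reference, the essence of a gen_random_uuid() implementation is stamping the RFC 4122 version and variant bits onto 16 random bytes. The sketch below shows only that bit-twiddling; it is not the patch itself, which draws its bytes from pgcrypto's strong random number generator, and the function names here are invented.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Given 16 random bytes, stamp the RFC 4122 version (4) and variant
 * (10xx) bits to make a valid version-4 UUID.  Any 16 bytes work; a real
 * implementation must use a cryptographically strong source. */
void make_uuid_v4(uint8_t buf[16])
{
    buf[6] = (buf[6] & 0x0F) | 0x40;    /* version nibble = 4 */
    buf[8] = (buf[8] & 0x3F) | 0x80;    /* variant bits = 10 */
}

/* Format as the usual 8-4-4-4-12 text form; out must hold 37 bytes. */
void uuid_to_string(const uint8_t buf[16], char out[37])
{
    snprintf(out, 37,
             "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
             "%02x%02x%02x%02x%02x%02x",
             buf[0], buf[1], buf[2], buf[3], buf[4], buf[5],
             buf[6], buf[7], buf[8], buf[9], buf[10], buf[11],
             buf[12], buf[13], buf[14], buf[15]);
}
```

Because only six bits are fixed, a v4 UUID carries 122 bits of randomness, which is why generation speed is dominated by the PRNG, the difference the benchmark above measures.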
Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.
Heikki Linnakangas hlinnakan...@vmware.com writes: On 01/09/2014 10:16 PM, Tom Lane wrote: Don't we want to archive both? If you want to recover to the end of the old timeline, you're going to need that file too, no? Hmm. It should be the responsibility of the original server to archive the segment on the old timeline. Oh ... I was thinking about the case of a timeline switch within one server, say as a result of PITR recovery. If you're sure this same case doesn't come up that way, then nevermind. regards, tom lane
Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.
On 01/09/2014 10:36 PM, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 01/09/2014 10:16 PM, Tom Lane wrote: Don't we want to archive both? If you want to recover to the end of the old timeline, you're going to need that file too, no? Hmm. It should be the responsibility of the original server to archive the segment on the old timeline. Oh ... I was thinking about the case of a timeline switch within one server, say as a result of PITR recovery. If you're sure this same case doesn't come up that way, then nevermind. I'm not sure what you mean, but it does apply to same-server PITR recovery too. In that case, again, if you copy the partial segment to pg_xlog when doing the PITR recovery, clearly you already somehow have access to the partial segment, and if you want to do another PITR to the same point, you should just copy it again. - Heikki
Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.
On 01/09/2014 12:05 PM, Heikki Linnakangas wrote: Actually, why is the partially-filled 00010002 file archived in the first place? Looking at the code, it's been like that forever, but it seems like a bad idea. If the original server is still up and running, and writing more data to that file, what will happen is that when the original server later tries to archive it, it will fail because the partial version of the file is already in the archive. Or worse, the partial version overwrites a previously archived more complete version. Oh! This explains some transient errors I've seen. Wouldn't it be better to not archive the old segment, and instead switch to a new segment after writing the end-of-recovery checkpoint, so that the segment on the new timeline is archived sooner? It would be better to zero-fill and switch segments, yes. We should NEVER be in a position of archiving two different versions of the same segment. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Planning time in explain/explain analyze
On Thu, Jan 9, 2014 at 1:50 PM, Stephen Frost sfr...@snowman.net wrote: I do not think we want to always measure the time it took to generate a plan due to slow clocks on some architectures. Also I feel that such a patch would be more invasive. The slow-clock argument is really quite frustrating for those of us who are looking to add more and more metrics to PG, especially since implementing gettimeofday quickly is a solved problem on most architectures. However I don't see the issue here. Two gettimeofday calls per query plan is not really going to hurt even on systems where it's slow. The problem we run into is EXPLAIN ANALYZE, which runs gettimeofday twice for every node for every tuple processed. For CPU-bound queries that can become the dominant cost. The only way two gettimeofday calls per query plan becomes an issue is if you're executing non-cached queries repeatedly on data that's entirely in RAM. That means the query processing is entirely CPU-bound and adding two syscalls could actually be noticeable. The mitigation strategy would be to prepare and cache the query handle to execute it again. We're nowhere near the level that we need to be and it shows (particularly for users coming from $OTHER-RDBMS). Perhaps we should try and come up with a solution to address those cases (turn off metrics, like turning off stats?) while not preventing us from gathering metrics on more reasonable systems. -- greg
Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL
On 1/9/14, 1:18 PM, knizhnik wrote: So it is clear why do we need shared memory for parallel query execution. But why it has to be dynamic? Why it can not be preallocated at start time as most of other resources used by PostgreSQL? That would limit us to doing something like allocating a fixed maximum of parallel processes (which might be workable) and only allocating a very small amount of memory for IPC. Small as in can only handle a small number of tuples. That sounds like a really inefficient way to shuffle data to and from parallel processes, especially because one or both sides would probably have to actually copy the data if we're doing it that way. With DSM if you want to do something like a parallel sort each process can put their results into memory that the parent process can directly access. Of course the other enormous win for DSM is it's the foundation for finally being able to resize things without a restart. For large dollar sites that ability would be hugely beneficial. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Planning time in explain/explain analyze
Greg Stark st...@mit.edu writes: However I don't see the issue here. Two gettimeofday calls per query plan is not really going to hurt even on systems where it's slow. I tend to agree with this, especially if the calls only occur when the user asks for the information (ie, does an EXPLAIN rather than just executing the query). The only way two gettimeofday calls per query plan becomes an issue is if you're executing non-cached queries repeatedly on data that's entirely in ram. That means the query processing is entirely cpu-bound and adding two syscalls could actually be noticeable. The mitigation strategy would be to prepare and cache the query handle to execute it again. This point weighs against the proposal that we time the work to fetch a previously-prepared query plan; if we do that then the mitigation strategy doesn't mitigate anything. In short then, I think we should just add this to EXPLAIN and be done. -1 for sticking the info into PlannedStmt or anything like that. regards, tom lane
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 1/9/14, 12:54 PM, Tom Lane wrote: Stephen Frost sfr...@snowman.net writes: That said, I'm not entirely convinced that traversing these dead tuples is all *that* painful during SELECT. If there's that many levels then hopefully it's not long til an UPDATE comes along and cleans them up. There's always VACUUM ;-) If you take about ten steps back, what's happening here is that maintenance work that we'd originally delegated to VACUUM, precisely so that it wouldn't have to be done by foreground queries, is now being done by foreground queries. And oddly enough, people don't like that. There is a reasonable argument for forcing UPDATE queries to do it anyway, to improve the odds they can do same-page updates (whether HOT or otherwise). And probably an INSERT should do it on a page that it's selected as an insertion target. But I think the argument that the original do-maintenance-in-background-whenever-possible design was wrong is a lot harder to sustain for SELECT or even DELETE queries. As I said upthread, I think the current behavior was *not* chosen for performance reasons but just to limit the scope of what we had to change for HOT. Instead of looking at how to avoid this work in SELECTs maybe it'd be more useful to look at how we can get it done more quickly in the background. The VSM is already a step in the right direction, but it seems the big use case here is when some bulk operation comes through and touches a sizeable number of blocks (but perhaps not enough to hit autovac thresholds). ISTM it wouldn't be too difficult for a backend to track how many blocks in a relation it's dirtied (keep in mind that count doesn't have to be perfect). If we tracked that info, it could be put into a maintenance queue (LISTEN/NOTIFY?) along with our XID. That gives us a list of relations to vacuum and exactly when to vacuum them. 
Thanks to the VSM we wouldn't need to track individual pages (though it might be useful to track the minimum and maximum block IDs we hit, per relation). -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Thu, Jan 9, 2014 at 1:54 PM, Tom Lane t...@sss.pgh.pa.us wrote: Stephen Frost sfr...@snowman.net writes: That said, I'm not entirely convinced that traversing these dead tuples is all *that* painful during SELECT. If there's that many levels then hopefully it's not long til an UPDATE comes along and cleans them up. There's always VACUUM ;-) If you take about ten steps back, what's happening here is that maintenance work that we'd originally delegated to VACUUM, precisely so that it wouldn't have to be done by foreground queries, is now being done by foreground queries. And oddly enough, people don't like that. People *think* they don't like that, because that's the way it works right now. If it worked some other way, there's a good chance people would be complaining about that behavior, too. I submitted a patch a few years back to limit the setting of hint bits by foreground processes to approximately 5% of the buffers they touched in a large scan, so that no single scan would incur all the cost of setting the hint bits; instead, the cost would be amortized over the first 20 or so scans. However, nobody was very enthusiastic about that patch, because while it greatly softened the blow for the first scan, subsequent scans were slower, because now they had to carry part of the burden, too. And you know what? People didn't like *that* either. The problem with saying that we should let VACUUM do this work is the same as the problem with saying that if you're late for your Concorde flight, you should go running across the tarmac and try to catch it. The cost of dead tuples is related in a linear fashion to the rate at which pages are accessed. Not coincidentally, the number of opportunities for HOT pruning is *also* related in a linear fashion to the rate at which pages are accessed. This is why it works so well. 
The rate at which vacuuming happens does not ramp up in the same way; it's limited by autovacuum cost settings (which people tend not to have set correctly, and don't adjust themselves on the fly) or by their hardware capabilities. If autovacuum can't keep up, foreground activity doesn't slow down to compensate; instead, the system just bloats out of control. While people may not like having this maintenance activity in the foreground, they like not having it at all even less. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
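Robert's earlier amortization idea — letting a single scan do maintenance (e.g. hint-bit setting) on only about 5% of the buffers it touches, so the cost spreads over roughly the first 20 scans — boils down to a simple budget counter. A hypothetical C sketch of that idea, not the actual patch:

```c
#include <stdbool.h>

/*
 * Budget that lets a scan perform cleanup on at most 1/divisor of the
 * buffers it touches, spreading the cost of a first scan over roughly
 * `divisor` subsequent scans.  Hypothetical sketch only; names are
 * invented for illustration.
 */
typedef struct ScanCleanupBudget
{
    long    touched;    /* buffers examined so far */
    long    cleaned;    /* buffers we did maintenance on */
    int     divisor;    /* 20 => at most ~5% of touched buffers */
} ScanCleanupBudget;

/* Call once per buffer; returns true if cleanup may be done now. */
bool
cleanup_allowed(ScanCleanupBudget *b)
{
    b->touched++;
    if (b->cleaned * b->divisor < b->touched)
    {
        b->cleaned++;
        return true;
    }
    return false;
}
```

With divisor = 20, cleanup fires on roughly one in every twenty buffers touched, which is exactly why subsequent scans carry part of the burden — the behavior people disliked.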
Re: [HACKERS] [PATCH] Relocation of tablespaces in pg_basebackup
Hi Steeve, On 09/01/14 22:10, Steeve Lennmark wrote: That's a much better solution, I attached a patch with the updated code. # SELECT oid, pg_tablespace_location(oid) FROM pg_tablespace; [...] 16388 | /tmp/tblspc1 16389 | /tmp/tblspc2 I'd suggest a similar solution to the one we have adopted in Barman (if you don't know it: www.pgbarman.org), that is: --tablespace NAME:LOCATION [--tablespace NAME:LOCATION] I prefer this over the location on the master as this might change over time (at least more frequently than the tablespace name) and over servers. $ pg_basebackup -Xs -D backup/data -T /tmp/tblspc1:$(pwd)/backup/t1 -T /tmp/tblspc2:$(pwd)/backup/t2 With the above example, it would become: $ pg_basebackup -Xs -D backup/data -T tblspc1:$(pwd)/backup/t1 -T tblspc2:$(pwd)/backup/t2 Thanks, Gabriele -- Gabriele Bartolini - 2ndQuadrant Italia PostgreSQL Training, Services and Support gabriele.bartol...@2ndquadrant.it | www.2ndQuadrant.it
Re: [HACKERS] Standalone synchronous master
On 1/9/14, 9:01 AM, Hannu Krosing wrote: Yeah, and I think that the logging command that was suggested allows for that *if configured correctly*. *But* for relying on this, we would also need to make logging *synchronous*, which would probably not go down well with many people, as it makes things even more fragile from an availability viewpoint (and slower as well). Not really... you only care about monitoring performance when the standby has gone AWOL *and* you haven't sent a notification yet. Once you've notified once you're done. So in this case the master won't go down unless you have a double fault: standby goes down AND you can't get to your monitoring. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 2014-01-09 16:27:23 -0500, Robert Haas wrote: People *think* they don't like that, because that's the way it works right now. If it worked some other way, there's a good chance people would be complaining about that behavior, too. I think one of the primary reasons it's causing huge slowdowns is that the ring buffer of scan strategies causes dirty buffer writes pretty much immediately, when a buffer is reused. Not that delaying the writeout would work all that effectively right now, with the current bgwriter... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH] Relocation of tablespaces in pg_basebackup
On Thu, Jan 9, 2014 at 10:29 PM, Gabriele Bartolini gabriele.bartol...@2ndquadrant.it wrote: Hi Steeve, On 09/01/14 22:10, Steeve Lennmark wrote: That's a much better solution, I attached a patch with the updated code. # SELECT oid, pg_tablespace_location(oid) FROM pg_tablespace; [...] 16388 | /tmp/tblspc1 16389 | /tmp/tblspc2 I'd suggest a similar solution to the one we have adopted in Barman (if you don't know it: www.pgbarman.org), that is: --tablespace NAME:LOCATION [--tablespace NAME:LOCATION] I prefer this over the location on the master as this might change over time (at least more frequently than the tablespace name) and over servers. I'm a Barman user myself, so that was actually my initial thought. Unless there's some kind of hidden internal I've missed, I don't see a way to convert an OID (I only have the OID and path at this stage) to a tablespace name. This solution, even though not optimal, is a lot better than my initial one where I used the OID directly. $ pg_basebackup -Xs -D backup/data -T /tmp/tblspc1:$(pwd)/backup/t1 -T /tmp/tblspc2:$(pwd)/backup/t2 With the above example, it would become: $ pg_basebackup -Xs -D backup/data -T tblspc1:$(pwd)/backup/t1 -T tblspc2:$(pwd)/backup/t2 Yeah, that would be my favourite solution. Regards, Steeve -- Steeve Lennmark
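Either variant of the OLDDIR:NEWDIR / NAME:LOCATION syntax needs only a small splitter on the pg_basebackup side. A hypothetical C sketch (the function name is invented; real code would also need to cope with escaped colons, e.g. in Windows drive-letter paths):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Split a "NAME:LOCATION" mapping argument (as in the proposed
 * pg_basebackup -T option) at the first colon.  No escaping is
 * handled; returns false if either side is empty, the separator is
 * missing, or a side does not fit its buffer.  Illustrative sketch,
 * not the actual patch code.
 */
bool
parse_tablespace_mapping(const char *arg, char *name, size_t name_sz,
                         char *location, size_t loc_sz)
{
    const char *sep = strchr(arg, ':');

    if (sep == NULL || sep == arg || sep[1] == '\0')
        return false;

    if ((size_t) (sep - arg) >= name_sz || strlen(sep + 1) >= loc_sz)
        return false;

    memcpy(name, arg, sep - arg);
    name[sep - arg] = '\0';
    strcpy(location, sep + 1);
    return true;
}
```

With the name-based form discussed above, `name` would then be matched against the tablespace name rather than its path on the master.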
Re: [HACKERS] Turning off HOT/Cleanup sometimes
Robert Haas robertmh...@gmail.com writes: The problem with saying that we should let VACUUM do this work is the same as the problem with saying that if you're late for your Concorde flight, you should go running across the tarmac and try to catch it. The cost of dead tuples is related in a linear fashion to the rate at which pages are accessed. Not coincidentally, the number of opportunities for HOT pruning is *also* related in a linear fashion to the rate at which pages are accessed. This is why it works so well. That seems like a large oversimplification. Some (most?) of the costs of dead tuples are proportional to the rate of dead tuple creation. I grant that there are also some costs proportional to the rate at which scans visit dead tuples, but I really don't believe that the latter are dominant. So I think it's bogus to claim that the current behavior is somehow optimal. One more time: the sole reason it works the way it does now is that that was the path of least resistance back in 2007, and we never yet got around to trying to optimize that. I'm glad to see someone wanting to revisit the issue, but I don't think that we necessarily have to go as far as creating user-visible knobs in order to make it better. The rate at which vacuuming happens does not ramp up in the same way; it's limited by autovacuum cost settings (which people tend not to have set correctly, and don't adjust themselves on the fly) True, but that seems like a pretty well-defined improvement project right there (as well as an argument against user-visible knobs in general ;-)). Nasby's speculations just upthread could be useful here, too. regards, tom lane
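The autovacuum cost machinery being criticized here throttles work by accumulating a cost balance and napping once it crosses a limit. A simplified model of that loop (illustrative constants and names; PostgreSQL's actual logic lives in vacuum_delay_point() and is driven by vacuum_cost_limit / vacuum_cost_delay):

```c
/*
 * Simplified model of cost-based vacuum throttling: each page access
 * adds to a balance, and once the balance reaches the limit the
 * worker naps, capping its background I/O rate.  Sketch only; the
 * struct and function names are invented for illustration.
 */
typedef struct VacuumCostState
{
    int     balance;    /* accumulated cost since last nap */
    int     limit;      /* cf. vacuum_cost_limit, default 200 */
    int     delay_ms;   /* cf. vacuum_cost_delay */
} VacuumCostState;

/* Returns how many milliseconds the caller should sleep (0 = none). */
int
cost_delay_point(VacuumCostState *s, int page_cost)
{
    s->balance += page_cost;
    if (s->balance >= s->limit)
    {
        int naptime = s->delay_ms * s->balance / s->limit;

        s->balance = 0;
        return naptime;
    }
    return 0;
}
```

The key property Tom and Robert are pointing at: the nap frequency is fixed by the settings, so the throttle does not ramp up with foreground page-access rate the way HOT pruning does.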
Re: [HACKERS] Planning time in explain/explain analyze
On Thu, Jan 9, 2014 at 9:14 PM, Tom Lane t...@sss.pgh.pa.us wrote: In short then, I think we should just add this to EXPLAIN and be done. -1 for sticking the info into PlannedStmt or anything like that. I'm confused. I thought I was arguing to support your suggestion that the initial planning store the time in the cached plan and explain should output the time the original planning took. -- greg
Re: [HACKERS] Add CREATE support to event triggers
On 1/9/14, 11:58 AM, Alvaro Herrera wrote: Robert Haas wrote: On Wed, Jan 8, 2014 at 10:27 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Hmm. This seems like a reasonable thing to do, except that I would like the output to always be the constant, and have some other way to enable the clause or disable it. With your present boolean: so if_not_exists: {output: IF NOT EXISTS, present: true/false} Why not: if_not_exists: true/false Yeah, that's another option. If we do this, though, the expansion function would have to know that an if_not_exists element expands to IF NOT EXISTS. Maybe that's okay. Right now, the expansion function is pretty stupid, which is nice. Yeah, the source side of this will always have to understand the nuances of every command; it'd be really nice to not burden the other side with that as well. The only downside I see is a larger JSON output, but meh. Another advantage is if you really wanted to you could modify the output formatting in the JSON doc to do something radically different if so inclined... -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
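For concreteness, the two candidate shapes for the element would look roughly like this (hypothetical fragments of the deparse output, not taken from the patch):

```json
{"if_not_exists": {"output": "IF NOT EXISTS", "present": true}}

{"if_not_exists": true}
```

The first keeps the expansion function dumb — when `present` is true it just splices in `output` verbatim; the second is smaller and lets users toggle the clause with a bare boolean, but the expansion function must then know that `if_not_exists` expands to `IF NOT EXISTS`.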
Re: [HACKERS] array_length(anyarray)
On 1/9/14, 11:08 AM, Marko Tiikkaja wrote: On 1/9/14 5:44 PM, Florian Pflug wrote: On Jan 9, 2014, at 14:57, Dean Rasheed dean.a.rash...@gmail.com wrote: On 19 December 2013 08:05, Pavel Stehule pavel.steh...@gmail.com wrote: length should be irrelevant to the fact that an array starts from 1, 0 or anything else Yes, this should just return the number of elements, and 0 for an empty array. +1. Anything that complains about arrays whose lower bound isn't 1 really needs a *way* less generic name than array_length(). Problem is, if you're operating on an array which could have a lower bound that isn't 1, why would you look at the length in the first place? You can't access any elements by index, you'd need to look at array_lower(). You can't iterate over the array by index, you'd need to do array_lower() .. array_lower() + array_length(), which doesn't make sense. And then there's the myriad of stuff you can do with unnest() without actually having to look at the length. Same goes for multi-dimensional arrays: you have even fewer things you can do there with only a length. So if we give up these constraints, we also make this function completely useless. I'm generally opposed to creating code that doesn't support the full featureset of something (in this case, array_lower() <> 1). But in this case I hope we can all agree that allowing the user to set an arbitrary array lower bound was an enormous mistake. While we might not be able to ever completely remove that behavior, I find the idea of throwing an error to be highly enticing. Plus, as Marko said, this function is pretty useless for non-1-based arrays. I do agree that the name is probably too generic for this though. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
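Marko's iteration point can be made concrete: with a shifted lower bound, the length alone never yields valid indices. A small C model of the arithmetic, where `lo` and `len` stand in for array_lower() and array_length() (invented helper, for illustration only):

```c
/*
 * Model of iterating an array whose logical indices start at `lo`
 * rather than 1: valid logical indices run from lo to lo + len - 1,
 * so the length is useless without the lower bound.  Hypothetical
 * helper for illustration.
 */
int
sum_shifted(const int *storage, int lo, int len)
{
    int total = 0;

    for (int i = lo; i <= lo + len - 1; i++)
        total += storage[i - lo];   /* map logical index to 0-based storage */
    return total;
}
```

The same three elements can live at logical indices -2..0 or 1..3; only the (lower bound, length) pair tells a caller which.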
[HACKERS] Disallow arrays with non-standard lower bounds
ISTM that allowing users to pick arbitrary lower array bounds was a huge mistake. I've never seen anyone make use of it, can't think of any legitimate use cases for it, and hate the stupendous amount of extra code needed to deal with it. Obviously we can't just drop support, but what about an initdb (or hell, even configure) option to disallow arrays with a lower bound other than 1? Unfortunately we can't do this with a GUC since you can store arrays in a table. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] nested hstore patch
I moved the patch to the January commitfest (https://commitfest.postgresql.org/action/patch_view?id=1289). Oleg PS. Kudos to Teodor and his mobile phone, which he used to synchronize branches on github. On Fri, Jan 10, 2014 at 2:08 AM, Andrew Dunstan and...@dunslane.net wrote: On 01/09/2014 02:11 PM, Josh Berkus wrote: On 01/09/2014 06:12 AM, Andrew Dunstan wrote: Oleg, Please merge in the jsonb work and resubmit. See https://github.com/feodor/postgres/commits/jsonb_and_hstore I note that this repo does not apparently contain any of your latest changes. I'll go further and say that if the Hstore2 patch doesn't support JSONB for 9.4, we should postpone it to 9.5. We really don't want to get into a situation where we need an Hstore3 because we accepted an Hstore2 which needs to be rev'd for JSON. Especially since there's no good reason for the JSON changes not to be merged already. After some work by Oleg, for which I'm grateful, and a little more by me, here is a combined patch for the jsonb and nested hstore work. Outstanding issues with the jsonb stuff: * I have replicated all the json processing functions for jsonb (although not the json generating functions, such as to_json). Most of these currently work by turning the jsonb back into json and then processing as before. I am sorting out some technical issues and hope to have all of these rewritten to use the native jsonb API in a few days time. * We still need to document jsonb. That too I hope will be done quite shortly. * The jsonb regression test currently contains U+ABCD - I guess we'd better use some hex encoding or whatever for that - unlike json, the jsonb de-serializer dissolves unicode escapes.
cheers andrew
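Andrew's point that the jsonb de-serializer "dissolves" unicode escapes means input like \uABCD is converted to the actual character at parse time, so it cannot round-trip as an escape the way json (which stores the original text) allows. The hex-decoding step at the heart of that looks roughly like this (hypothetical sketch, not the patch's code):

```c
#include <stdbool.h>

/*
 * Decode the four-hex-digit body of a JSON \uXXXX escape into a code
 * point, as a de-serializer that resolves escapes must do before
 * storing the character itself.  Illustrative sketch only; the name
 * is invented and surrogate pairs are not handled here.
 */
bool
decode_json_hex4(const char *s, unsigned int *cp)
{
    unsigned int v = 0;

    for (int i = 0; i < 4; i++)
    {
        char c = s[i];

        v <<= 4;
        if (c >= '0' && c <= '9')
            v += c - '0';
        else if (c >= 'a' && c <= 'f')
            v += c - 'a' + 10;
        else if (c >= 'A' && c <= 'F')
            v += c - 'A' + 10;
        else
            return false;       /* not a hex digit */
    }
    *cp = v;
    return true;
}
```

Once the escape is resolved to a code point and re-encoded as the database encoding, the literal escape text is gone — hence the concern about U+ABCD appearing verbatim in the regression test.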
[HACKERS] proposal, patch: allow multiple plpgsql plugins
Hello, We talked about enhancing the plpgsql plugin API to support multiple active plugins. I wrote a prototype based on a plpgsql_register_plugin function instead of the rendezvous variable. There are two basic questions: a) will we still support the rendezvous variable? b) will we still support the same API? (The reference to plugin_info in the exec state is an issue, described in the patch.) Without a) and b) we will break current plugins a little more than usual - not terribly hard to fix. But without a) and b) the implementation can be significantly cleaner. Comments, notes? Regards Pavel

commit 406ee9d32dbb09385ec38bb6d89e8531cac1cd5f
Author: Pavel Stehule pavel.steh...@gooddata.com
Date: Thu Jan 9 23:32:30 2014 +0100

    initial

diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile
index 852b0c7..37d17a8 100644
--- a/src/pl/plpgsql/src/Makefile
+++ b/src/pl/plpgsql/src/Makefile
@@ -19,7 +19,7 @@ rpath =
 
 OBJS = pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o
 
-DATA = plpgsql.control plpgsql--1.0.sql plpgsql--unpackaged--1.0.sql
+DATA = plpgsql.control plpgsql--1.1.sql plpgsql--1.0--1.1.sql plpgsql--unpackaged--1.0.sql
 
 all: all-lib
 
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 3749fac..fc7158e 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -86,6 +86,21 @@ typedef struct SimpleEcontextStackEntry
 static EState *shared_simple_eval_estate = NULL;
 static SimpleEcontextStackEntry *simple_econtext_stack = NULL;
 
+/*
+ * List of pointers and info of registered plugins.
+ */
+typedef struct PluginPtrEntry
+{
+    PLpgSQL_plugin *plugin_ptr;
+    void       *plugin_info;    /* reserved for use by optional plugin */
+    struct PluginPtrEntry *next;
+} PluginPtrEntry;
+
+/*
+ * Allocated in TopMemoryContext
+ */
+static PluginPtrEntry *plugins = NULL;
+
 /************************************************************
  * Local function forward declarations
  ************************************************************/
@@ -236,6 +251,11 @@ static char *format_expr_params(PLpgSQL_execstate *estate,
 static char *format_preparedparamsdata(PLpgSQL_execstate *estate,
                            const PreparedParamsData *ppd);
 
+bool multi_plugin_func_setup = false;
+bool multi_plugin_func_beg = false;
+bool multi_plugin_func_end = false;
+bool multi_plugin_stmt_beg = false;
+bool multi_plugin_stmt_end = false;
 
 /* ----------
  * plpgsql_exec_function    Called by the call handler for
@@ -336,6 +356,39 @@ plpgsql_exec_function(PLpgSQL_function *func, FunctionCallInfo fcinfo,
     if (*plugin_ptr && (*plugin_ptr)->func_beg)
         ((*plugin_ptr)->func_beg) (&estate, func);
 
+    if (multi_plugin_func_beg)
+    {
+        PluginPtrEntry *plugin_entry;
+
+        Assert(plugins != NULL);
+
+        for (plugin_entry = plugins; plugin_entry != NULL;
+             plugin_entry = plugin_entry->next)
+        {
+            PLpgSQL_plugin *plugin_ptr = plugin_entry->plugin_ptr;
+
+            if (plugin_entry->plugin_ptr->func_beg)
+            {
+                void *plugin_info = estate.plugin_info;
+
+                /* save a plugin_info related single only plpgsql plugin */
+                PG_TRY();
+                {
+                    estate.plugin_info = plugin_entry->plugin_info;
+                    (plugin_ptr->func_beg) (&estate, func);
+                    plugin_entry->plugin_info = estate.plugin_info;
+                    estate.plugin_info = plugin_info;
+                }
+                PG_CATCH();
+                {
+                    estate.plugin_info = plugin_info;
+                    PG_RE_THROW();
+                }
+                PG_END_TRY();
+            }
+        }
+    }
+
     /*
      * Now call the toplevel block of statements
      */
@@ -484,6 +537,39 @@ plpgsql_exec_function(PLpgSQL_function *func, FunctionCallInfo fcinfo,
     if (*plugin_ptr && (*plugin_ptr)->func_end)
         ((*plugin_ptr)->func_end) (&estate, func);
 
+    if (multi_plugin_func_end)
+    {
+        PluginPtrEntry *plugin_entry;
+
+        Assert(plugins != NULL);
+
+        for (plugin_entry = plugins; plugin_entry != NULL;
+             plugin_entry = plugin_entry->next)
+        {
+            PLpgSQL_plugin *plugin_ptr = plugin_entry->plugin_ptr;
+
+            if (plugin_entry->plugin_ptr->func_end)
+            {
+                void *plugin_info = estate.plugin_info;
+
+                /* save a plugin_info related single only plpgsql plugin */
+                PG_TRY();
+                {
+                    estate.plugin_info = plugin_entry->plugin_info;
+                    (plugin_ptr->func_end) (&estate, func);
+                    plugin_entry->plugin_info = estate.plugin_info;
+                    estate.plugin_info = plugin_info;
+                }
+                PG_CATCH();
+                {
+                    estate.plugin_info = plugin_info;
+                    PG_RE_THROW();
+                }
+                PG_END_TRY();
+            }
+        }
+    }
+
     /* Clean up any leftover temporary memory */
     plpgsql_destroy_econtext(&estate);
     exec_eval_cleanup(&estate);
@@ -1393,6 +1479,39 @@ exec_stmt(PLpgSQL_execstate *estate, PLpgSQL_stmt *stmt)
     if (*plugin_ptr && (*plugin_ptr)->stmt_beg)
         ((*plugin_ptr)->stmt_beg) (estate, stmt);
 
+    if (multi_plugin_stmt_beg)
+    {
+        PluginPtrEntry *plugin_entry;
+
+        Assert(plugins != NULL);
+
+        for (plugin_entry = plugins; plugin_entry != NULL;
+             plugin_entry = plugin_entry->next)
+        {
+            PLpgSQL_plugin *plugin_ptr = plugin_entry->plugin_ptr;
+
+            if
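The core mechanism of the patch — a linked list of plugin entries, each with a private plugin_info slot that is swapped into the single estate field around every callback — can be modeled compactly. A simplified, self-contained sketch (no PG_TRY error handling; types are renamed and pared down for illustration):

```c
#include <stddef.h>

/*
 * Simplified model of multi-plugin dispatch: each list entry has a
 * private plugin_info slot, and estate->plugin_info is saved and
 * restored around each callback so plugins never see each other's
 * state.  Sketch only; the real patch wraps the swap in
 * PG_TRY/PG_CATCH so the slot is restored on error too.
 */
typedef struct Estate { void *plugin_info; } Estate;

typedef struct Plugin { void (*func_beg) (Estate *estate); } Plugin;

typedef struct PluginPtrEntry
{
    Plugin *plugin_ptr;
    void   *plugin_info;            /* private to this plugin */
    struct PluginPtrEntry *next;
} PluginPtrEntry;

static PluginPtrEntry *plugins = NULL;

void
register_plugin(PluginPtrEntry *entry, Plugin *p)
{
    entry->plugin_ptr = p;
    entry->plugin_info = NULL;
    entry->next = plugins;
    plugins = entry;
}

void
dispatch_func_beg(Estate *estate)
{
    for (PluginPtrEntry *e = plugins; e != NULL; e = e->next)
    {
        if (e->plugin_ptr->func_beg)
        {
            void *saved = estate->plugin_info;

            estate->plugin_info = e->plugin_info;   /* expose private slot */
            e->plugin_ptr->func_beg(estate);
            e->plugin_info = estate->plugin_info;   /* capture any update */
            estate->plugin_info = saved;            /* restore caller's slot */
        }
    }
}

/* Two demo plugins that each stash a pointer to their own counter. */
static int calls_a, calls_b;
static void beg_a(Estate *e) { calls_a++; e->plugin_info = &calls_a; }
static void beg_b(Estate *e) { calls_b++; e->plugin_info = &calls_b; }
static Plugin plugin_a = { beg_a };
static Plugin plugin_b = { beg_b };
```

This save/update/restore dance is exactly the awkwardness Pavel mentions with the existing single plugin_info reference in the exec state; dropping the old API would let each plugin carry its own state directly.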
Re: [HACKERS] Disallow arrays with non-standard lower bounds
On Thu, Jan 9, 2014 at 2:30 PM, Jim Nasby j...@nasby.net wrote: ISTM that allowing users to pick arbitrary lower array bounds was a huge mistake. I've never seen anyone make use of it, can't think of any legitimate use cases for it, and hate the stupendous amount of extra code needed to deal with it. I agree with this, but I think it's too late. I don't think the answer is any type of parameter. -- Peter Geoghegan
Re: [HACKERS] array_length(anyarray)
On Jan 9, 2014, at 23:26, Jim Nasby j...@nasby.net wrote: On 1/9/14, 11:08 AM, Marko Tiikkaja wrote: On 1/9/14 5:44 PM, Florian Pflug wrote: On Jan 9, 2014, at 14:57, Dean Rasheed dean.a.rash...@gmail.com wrote: On 19 December 2013 08:05, Pavel Stehule pavel.steh...@gmail.com wrote: length should be irrelevant to the fact that an array starts from 1, 0 or anything else Yes, this should just return the number of elements, and 0 for an empty array. +1. Anything that complains about arrays whose lower bound isn't 1 really needs a *way* less generic name than array_length(). Problem is, if you're operating on an array which could have a lower bound that isn't 1, why would you look at the length in the first place? You can't access any elements by index, you'd need to look at array_lower(). You can't iterate over the array by index, you'd need to do array_lower() .. array_lower() + array_length(), which doesn't make sense. And then there's the myriad of stuff you can do with unnest() without actually having to look at the length. Same goes for multi-dimensional arrays: you have even fewer things you can do there with only a length. So if we give up these constraints, we also make this function completely useless. I'm generally opposed to creating code that doesn't support the full featureset of something (in this case, array_lower() <> 1). But in this case I hope we can all agree that allowing the user to set an arbitrary array lower bound was an enormous mistake. No doubt. While we might not be able to ever completely remove that behavior, I find the idea of throwing an error to be highly enticing. Plus, as Marko said, this function is pretty useless for non-1-based arrays. That I doubt, but... I do agree that the name is probably too generic for this though. this one is actually my main complaint. The name needs to very clearly mark such a function as dealing only with a subset of all possible arrays. Otherwise we'll just add to the confusion, not avoid it.
best regards, Florian Pflug