Re: [HACKERS] new --maintenance-db options
From: pgsql-hackers-ow...@postgresql.org On Behalf Of Tom Lane

> Robert Haas <robertmh...@gmail.com> writes:
>> From pg_upgrade's perspective, it would be nice to have a flag that
>> starts the server in some mode where nobody but pg_upgrade can connect
>> to it and all connections are automatically allowed, but it's not
>> exactly clear how to implement "nobody but pg_upgrade can connect to it".
>
> The implementation I've wanted to see for some time is that you can start
> a standalone backend, but it speaks FE/BE protocol to its caller
> (preferably over pipes, so that there is no issue whatsoever of where you
> can securely put a socket or anything like that).

Couldn't it be done by following the FE/BE protocol but calling the server APIs directly at the required places? Such an implementation could be more performant than adding a communication layer, which would be beneficial for embedded databases.

> Making that happen might be a bit too much work if pg_upgrade were the
> only use case, but there are a lot of people who would like to use PG as
> an embedded database, and this might be close enough for such use-cases.

Seeing PG run as an embedded database would be interesting for many people. There is another use case for embedded databases: allowing additional remote connections as well, to monitor the operations in the database. That, however, can be done in a later version of the implementation.

With Regards,
Amit Kapila.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proof concept - access to session variables on client side
On Tue, Jun 26, 2012 at 7:06 AM, Pavel Stehule <pavel.steh...@gmail.com> wrote:
> Hello
>
> I worked on a simple patch that enables access from the server side to
> client-side data. It adds two new hooks to libpq - one for returning the
> local context, a second for setting the local context. The motivation is
> integrating the capabilities of the psql console with a stronger
> language - plpgsql. A second target is enabling the possibility of
> saving the result of some server-side process in psql. It improves the
> vars feature in psql.
>
> pavel ~/src/postgresql/src $ cat test.sql
> \echo value of external paremeter is :myvar
>
> do $$
> begin
>   -- we can take any session variable on client side
>   -- it is safe against SQL injection
>   raise notice 'external parameter accessed from plpgsql is %', hgetvar('myvar');
>   -- we can change this session variable and finish transaction
>   perform hsetvar('myvar', 'Hello, World');
> end;
> $$ language plpgsql;
>
> \echo new value of session variable is :myvar
>
> cat test.sql | psql postgres -v myvar=Hello
> value of external paremeter is Hello
> NOTICE:  external parameter accessed from plpgsql is Hello
> DO
> new value of session variable is Hello, World
>
> This is just a proof of concept - there should be better integration
> with PL languages, using a cache for reads on the server side, ...
>
> Notices?

Why not just use a custom GUC variable instead? E.g. you could have psql SET psql.myvar='Hello, World', and then you'd need no changes at all in the backend?

Maybe having a shorthand interface for accessing GUCs in psql would help in making it easier, but do we really need a whole new variable concept?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] proof concept - access to session variables on client side
2012/6/26 Magnus Hagander <mag...@hagander.net>:
> On Tue, Jun 26, 2012 at 7:06 AM, Pavel Stehule <pavel.steh...@gmail.com> wrote:
>> [proof-of-concept patch and test session trimmed]
>
> Why not just use a custom GUC variable instead? E.g. you could have psql
> SET psql.myvar='Hello, World', and then you'd need no changes at all in
> the backend?
>
> Maybe having a shorthand interface for accessing GUCs in psql would help
> in making it easier, but do we really need a whole new variable concept?

GUC variables don't help with access to psql's command-line parameters from DO PL code.

Regards
Pavel
Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework
2012-06-26 06:59 keltezéssel, Alvaro Herrera írta:
> I cleaned up the framework patch a bit. My version's attached. Mainly,
> returning false for failure in some code paths that are only going to
> have the caller elog(FATAL) is rather pointless -- it seems much better
> to just have the code itself do the elog(FATAL). In any case, I
> searched for reports of those error messages being reported in the wild
> and saw none.

OK. Cleanups are always good. One nitpick, though: your version doesn't contain timeout.h, and compilation fails because of it. :-) You might have forgotten to do "git commit -a".

> There are other things but they are in the nitpick department.
> (A reference to ->check_timeout remains that needs to be fixed too.)

Yes, it's called ->timeout_func() now.

> There is one thing that still bothers me on this whole framework patch,
> but I'm not sure it's easily fixable. Basically the API to timeout.c is
> the whole list of timeouts and their whole behaviors. If you want to
> add a new one, you can't just call into the entry points, you have to
> modify timeout.c itself (as well as timeout.h, as well as whatever code
> it is that you want to add timeouts to). This may be good enough, but I
> don't like it. I think it boils down to "proctimeout.h is cheating".
> The interface I would actually like to have is something that lets the
> modules trying to get a timeout register the timeout-checking function
> as a callback. API-wise, it would be much cleaner. However, I'm not
> really sure that code-wise it would be any cleaner at all. In fact I
> think it'd be rather cumbersome. However, if you could give that idea
> some thought, it'd be swell.

Well, I can make the registration interface similar to how LWLocks are treated, but that doesn't avoid modification of the base_timeouts array in case a new internal use case arises. Say:

	#define USER_TIMEOUTS	4

	int n_timeouts = TIMEOUT_MAX;
	static timeout_params base_timeouts[TIMEOUT_MAX + USER_TIMEOUTS];

and register_timeout() adds data after TIMEOUT_MAX.

> I haven't looked at the second patch at all yet.

This is the whole point of the first patch. But, as I said above, a new timeout source is a new internal use case - one that touches a lot of places, which cannot simply be done by registering a new timeout source.

--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de http://www.postgresql.at/
Re: [HACKERS] proof concept - access to session variables on client side
On Tue, Jun 26, 2012 at 9:50 AM, Pavel Stehule <pavel.steh...@gmail.com> wrote:
> 2012/6/26 Magnus Hagander <mag...@hagander.net>:
>> [earlier discussion trimmed]
>>
>> Why not just use a custom GUC variable instead? E.g. you could have
>> psql SET psql.myvar='Hello, World', and then you'd need no changes at
>> all in the backend?
>
> GUC variables don't help with access to psql's command-line parameters
> from DO PL code.

But with a small change to psql they could, without the need for a whole new type of variable. For example, psql could set all those variables as psql.commandlinevarname, which could then be accessed from the DO PL code just fine.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] proof concept - access to session variables on client side
2012/6/26 Magnus Hagander <mag...@hagander.net>:
> [earlier discussion trimmed]
>
> But with a small change to psql they could, without the need for a whole
> new type of variable. For example, psql could set all those variables as
> psql.commandlinevarname, which could then be accessed from the DO PL
> code just fine.

Yes, that is a possibility too. It has different issues - it can send unwanted variables - so maybe some compromise is the optimum.

Regards
Pavel
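[Editor's illustration] Magnus's GUC-based alternative already works with no backend changes at all. A minimal sketch of the idea, assuming a 9.2-era server (which accepts any two-part custom setting name such as psql.myvar; older releases needed custom_variable_classes) - psql.myvar here is just an example name, not anything psql sets today:

```sql
-- the client (or psql itself) publishes a value as a namespaced custom GUC
SET psql.myvar = 'Hello';

DO $$
BEGIN
    -- read the client-supplied value on the server side
    RAISE NOTICE 'external parameter is %', current_setting('psql.myvar');
    -- write a new value back; is_local = false means it stays visible
    -- for the rest of the session, not just this transaction
    PERFORM set_config('psql.myvar', 'Hello, World', false);
END;
$$ LANGUAGE plpgsql;

SHOW psql.myvar;    -- now reports 'Hello, World'
```

What this cannot do, per Pavel's objection, is reach back into psql's :myvar variables on its own - psql would first have to copy its command-line variables into such GUCs.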
Re: [HACKERS] REVIEW: Optimize referential integrity checks (todo item)
On Wed, Jun 20, 2012 at 2:19 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I've marked this patch committed, although in the end there was nothing
> left of it ;-)

Thank you, Dean and Tom! I'm sorry for not participating in this thread; I've been away for the past five weeks and have much catching up to do.
Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables
Hi KaiGai-san,

-----Original Message-----
From: Kohei KaiGai [mailto:kai...@kaigai.gr.jp]
Sent: Monday, June 25, 2012 9:49 PM
To: Etsuro Fujita
Cc: Robert Haas; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables

> Fujita-san,
>
> The revised patch is almost good for me. The only point I noticed is the
> check for redundant or duplicated options I pointed out in the previous
> post. So, if you agree with the attached patch, I'd like to hand this
> patch over to the committers.

OK, I agree with you. Thanks for the revision! Besides the revision, I modified check_selective_binary_conversion() to run heap_close() in the whole-row-reference case. Attached is an updated version of the patch.

Best regards,
Etsuro Fujita

> All I changed is:
>
> --- a/src/backend/commands/copy.c
> +++ b/src/backend/commands/copy.c
> @@ -122,6 +122,11 @@ typedef struct CopyStateData
> @@ -217,7 +221,7 @@ index 98bcb2f..0143d81 100644
>   		}
>  +		else if (strcmp(defel->defname, "convert_binary") == 0)
>  +		{
> -+			if (cstate->convert_binary)
> ++			if (cstate->convert_selectively)
>  +				ereport(ERROR,
>  +						(errcode(ERRCODE_SYNTAX_ERROR),
>  +						 errmsg("conflicting or redundant options")));
>
> Thanks,
>
> 2012/6/20 Etsuro Fujita <fujita.ets...@lab.ntt.co.jp>:
>> Hi KaiGai-san,
>>
>> Thank you for the review.
>>
>>> Hi Fujita-san,
>>>
>>> Could you rebase this patch towards the latest tree? It was
>>> unavailable to apply the patch cleanly.
>>
>> Sorry, I updated the patch. Please find attached an updated version of
>> the patch.
>>
>>> I looked over the patch, then noticed a few points.
>>>
>>> At ProcessCopyOptions(), defel->arg can be NIL, isn't it? If so,
>>> cstate->convert_binary is not a suitable flag to check for a redundant
>>> option. It seems to me cstate->convert_selectively is a more suitable
>>> flag to check it.
>>>
>>> +		else if (strcmp(defel->defname, "convert_binary") == 0)
>>> +		{
>>> +			if (cstate->convert_binary)
>>> +				ereport(ERROR,
>>> +						(errcode(ERRCODE_SYNTAX_ERROR),
>>> +						 errmsg("conflicting or redundant options")));
>>> +			cstate->convert_selectively = true;
>>> +			if (defel->arg == NULL || IsA(defel->arg, List))
>>> +				cstate->convert_binary = (List *) defel->arg;
>>> +			else
>>> +				ereport(ERROR,
>>> +						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>>> +						 errmsg("argument to option \"%s\" must be a list of column names",
>>> +								defel->defname)));
>>> +		}
>>
>> Yes, defel->arg can be NIL. defel->arg is a List structure listing all
>> the columns that need to be converted to binary representation, which
>> is NIL for the case where no columns need to be converted. For example,
>> defel->arg is NIL for SELECT COUNT(*). In that case, while
>> cstate->convert_selectively is set to true, no columns are converted at
>> NextCopyFrom(). The most efficient case! In short,
>> cstate->convert_selectively represents whether to do selective binary
>> conversion at NextCopyFrom(), and cstate->convert_binary represents all
>> the columns to be converted at NextCopyFrom(), which can be NIL.
>>
>>> At NextCopyFrom(), this routine computes default values if configured.
>>> In cases where these values are not referenced, it might be possible
>>> to skip unnecessary calculations. Is it unavailable to add logic to
>>> avoid constructing cstate->defmap for unreferenced columns at
>>> BeginCopyFrom()?
>>
>> I think that we don't need to add the above logic, because file_fdw
>> does BeginCopyFrom() with attnamelist = NIL, in which case
>> BeginCopyFrom() doesn't construct cstate->defmap at all.
>>
>> I fixed a bug plus some minor optimization in check_binary_conversion(),
>> which is renamed to check_selective_binary_conversion() in the updated
>> version, and now file_fdw gives up selective binary conversion in the
>> following cases:
>>
>>   a) BINARY format
>>   b) CSV/TEXT format and whole-row reference
>>   c) CSV/TEXT format and all the user attributes needed
>>
>> Best regards,
>> Etsuro Fujita
>>
>>> Thanks,
>>>
>>> 2012/5/11 Etsuro Fujita <fujita.ets...@lab.ntt.co.jp>:
>>>> -----Original Message-----
>>>> From: Robert Haas [mailto:robertmh...@gmail.com]
>>>> Sent: Friday, May 11, 2012 1:36 AM
>>>> To: Etsuro Fujita
>>>> Cc: pgsql-hackers@postgresql.org
>>>> Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV
>>>> file foreign tables
>>>>
>>>> On Tue, May 8, 2012 at 7:26 AM, Etsuro Fujita
>>>> <fujita.ets...@lab.ntt.co.jp> wrote: I would like to propose
Re: [HACKERS] Backport of fsync queue compaction
On Tue, Jun 19, 2012 at 5:39 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Jun 19, 2012 at 5:33 PM, Greg Smith <g...@2ndquadrant.com> wrote:
>> In January of 2011 Robert committed
>> 7f242d880b5b5d9642675517466d31373961cf98 to try and compact the fsync
>> queue when clients find it full. There's no visible behavior change,
>> just a substantial performance boost possible in the rare but extremely
>> bad situations where the background writer stops doing fsync
>> absorption. I've been running that in production at multiple locations
>> since practically the day it hit this mailing list, with backports all
>> the way to 8.3 being common (and straightforward to construct). I've
>> never seen a hint of a problem with this new code.
>
> I've been in favor of back-porting this for a while, so you'll get no
> argument from me. Anyone disagree?

Hearing no disagreement, I went ahead and did this. I didn't take Greg Smith's suggestion of adding a log message when/if the fsync compaction logic fails to make any headway, because (1) the proposed message didn't follow message style guidelines and I couldn't think of a better one that did, and (2) I'm not sure it's worth creating extra translation work in the back-branches for such a marginal case. We can revisit this if people feel strongly about it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework
On Tue, Jun 26, 2012 at 3:59 AM, Boszormenyi Zoltan <z...@cybertec.at> wrote:
> Well, I can make the registration interface similar to how LWLocks are
> treated, but that doesn't avoid modification of the base_timeouts array
> in case a new internal use case arises. Say:
>
> #define USER_TIMEOUTS	4
>
> int n_timeouts = TIMEOUT_MAX;
> static timeout_params base_timeouts[TIMEOUT_MAX + USER_TIMEOUTS];

Since timeouts - unlike lwlocks - do not need to touch shared memory, there's no need for a hard-coded limit here. You can just allocate the array using MemoryContextAlloc(TopMemoryContext, ...) and enlarge it as necessary. To avoid needing to modify the base_timeouts array, you can just have internal callers push their entries into the array during process startup using the same function call that an external module would use.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
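[Editor's illustration] Robert's scheme - one registration entry point shared by internal callers at startup and by external modules, with the array grown on demand instead of sized by a hard-coded limit - could look roughly like this. This is only a sketch: timeout_params, register_timeout() and the handler signature are stand-ins for whatever the real patch defines, and plain realloc stands in for MemoryContextAlloc/repalloc in TopMemoryContext.

```c
#include <assert.h>
#include <stdlib.h>

typedef void (*timeout_handler) (void);

typedef struct
{
	int				id;			/* assigned at registration time */
	timeout_handler handler;	/* called when this timeout fires */
} timeout_params;

static timeout_params *timeouts = NULL;		/* grown on demand, no fixed cap */
static int	n_timeouts = 0;
static int	max_timeouts = 0;

/* Same entry point for internal callers (at startup) and extensions. */
static int
register_timeout(timeout_handler handler)
{
	if (n_timeouts >= max_timeouts)
	{
		timeout_params *tmp;

		max_timeouts = (max_timeouts == 0) ? 8 : max_timeouts * 2;
		tmp = realloc(timeouts, max_timeouts * sizeof(timeout_params));
		if (tmp == NULL)
			abort();			/* real code would elog(FATAL) instead */
		timeouts = tmp;
	}
	timeouts[n_timeouts].id = n_timeouts;
	timeouts[n_timeouts].handler = handler;
	return n_timeouts++;
}
```

The point of the design is that core timeouts become just the first few register_timeout() calls made during process startup, so timeout.c itself no longer needs to enumerate its users.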
Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal
On Monday, June 25, 2012 08:50:54 PM Kevin Grittner wrote:
> Andres Freund <and...@2ndquadrant.com> wrote:
>>> We most particularly *don't* want DDL to replicate automatically,
>>> because the schema changes are deployed along with related software
>>> changes, and we like to pilot any changes for at least a few days.
>>> Depending on the release, the rollout may take a couple months, or we
>>> may slam it out everywhere a few days after the first pilot
>>> deployment.
>>
>> That's sensible for your use-case - but I do not think it's the
>> appropriate behaviour for anything that is somewhat out-of-the-box...
>
> Right. We currently consider the issues involved in a change and script
> it by hand. I think we want to continue to do that. The point was
> that, given the variety of timings and techniques for distributing
> schema changes, maybe it was only worth doing that automatically for
> the most constrained and controlled cases.

Agreed.

> So you could certainly punt all of this for any release as far as
> Wisconsin Courts are concerned. We need to know table and column
> names, before and after images, and some application-supplied metadata.
>
>> I am not sure we're going to get all that into 9.3.
>
> Sure, that was more related to why I was questioning how much these use
> cases even *could* integrate -- whether it even paid to *consider*
> these use cases at this point. Responses from Robert and you have
> pretty much convinced me that I was being overly pessimistic on that.

I think it's an important question to ask; otherwise we might just end up with infrastructure unusable for all our goals... Or usable but unfinished infrastructure, because it's too complex to build in a sensible time.

> One fine point regarding before and after images -- if a value doesn't
> change in an UPDATE, there's no reason to include it in both the BEFORE
> and AFTER tuple images, as long as we have the null column bitmaps --
> or some other way of distinguishing unchanged from NULL. (This could
> be especially important when the unchanged column was a 50 MB bytea.)
> It doesn't matter to me which way this is optimized -- in our
> trigger-based system we chose to keep the full BEFORE tuple and just
> show AFTER values which were distinct from the corresponding BEFORE
> values, but it would be trivial to adapt the code to the other way.

I don't think doing that is worth the trouble in the first incarnation. There is enough detail hidden in there making it non-trivial that I wouldn't worry about it until the rest of the infrastructure exists. That part of the code will definitely be version specific, so I see no problem improving on that incrementally.

> Sorry for that bout of pessimism.

Oh, no reason for that. I have some doubts about that myself, so...

--
Andres Freund		http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework
2012-06-26 13:50 keltezéssel, Robert Haas írta:
> On Tue, Jun 26, 2012 at 3:59 AM, Boszormenyi Zoltan <z...@cybertec.at> wrote:
>> Well, I can make the registration interface similar to how LWLocks are
>> treated, but that doesn't avoid modification of the base_timeouts
>> array in case a new internal use case arises.
>
> Since timeouts - unlike lwlocks - do not need to touch shared memory,
> there's no need for a hard-coded limit here. You can just allocate the
> array using MemoryContextAlloc(TopMemoryContext, ...) and enlarge it as
> necessary. To avoid needing to modify the base_timeouts array, you can
> just have internal callers push their entries into the array during
> process startup using the same function call that an external module
> would use.

I know, but it doesn't feel right to register static functionality.

--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de http://www.postgresql.at/
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics
Hi all,

I've modified the pg_stat_lwlocks patch to work with the latest PostgreSQL Git code. This patch provides:

  pg_stat_lwlocks           New system view to show lwlock statistics.
  pg_stat_get_lwlocks()     New function to retrieve lwlock statistics.
  pg_stat_reset_lwlocks()   New function to reset lwlock statistics.

Please try it out.

Regards,

2012/06/26 5:29, Satoshi Nagayasu wrote:
> Hi all,
>
> I've been working on a new system view, pg_stat_lwlocks, to observe
> LWLock, and have just completed my proof-of-concept code that works
> with version 9.1. Now, I'd like to know the possibility of this feature
> for a future release. With this patch, a DBA can easily determine a
> bottleneck around lwlocks.
>
> postgres=# SELECT * FROM pg_stat_lwlocks ORDER BY time_ms DESC LIMIT 10;
>  lwlockid | calls  | waits | time_ms
> ----------+--------+-------+---------
>        49 | 193326 | 32096 |   23688
>         8 |   3305 |   133 |    1335
>         2 |     21 |     0 |       0
>         4 | 135188 |     0 |       0
>         5 |  57935 |     0 |       0
>         6 |    141 |     0 |       0
>         7 |  24580 |     1 |       0
>         3 |   3282 |     0 |       0
>         1 |     41 |     0 |       0
>         9 |      3 |     0 |       0
> (10 rows)
>
> In this view, the 'lwlockid' column represents the LWLockId used in the
> backends. 'calls' represents how many times LWLockAcquire() was called.
> 'waits' represents how many times LWLockAcquire() needed to wait before
> lock acquisition. 'time_ms' represents how long, in total,
> LWLockAcquire() waited on a lwlock. Lwlocks that use a LWLockId range,
> such as BufMappingLock or LockMgrLock, are grouped and summed up in a
> single record; for example, lwlockid 49 in the above view represents
> LockMgrLock statistics.
>
> Now, I know there are some considerations.
>
> (1) Performance
>
> I've measured LWLock performance both with and without the patch, and
> confirmed that this patch does not affect LWLock performance at all.
>
> pgbench scores with the patch:
>   tps = 900.906658 (excluding connections establishing)
>   tps = 908.528422 (excluding connections establishing)
>   tps = 903.900977 (excluding connections establishing)
>   tps = 910.470595 (excluding connections establishing)
>   tps = 909.685396 (excluding connections establishing)
>
> pgbench scores without the patch:
>   tps = 909.096785 (excluding connections establishing)
>   tps = 894.868712 (excluding connections establishing)
>   tps = 910.074669 (excluding connections establishing)
>   tps = 904.022770 (excluding connections establishing)
>   tps = 895.673830 (excluding connections establishing)
>
> Of course, this experiment was not I/O bound, and the cache hit ratio
> was >99.9%.
>
> (2) Memory space
>
> In this patch, I added three new members to the LWLock structure as
> uint64 to collect statistics. It means that those members must be held
> in shared memory, but I'm not sure whether that's appropriate. I think
> another possible option is holding those statistics values in local
> (backend) process memory and sending them through the stats collector
> process (like other statistics values).
>
> (3) LWLock names (or labels)
>
> Now, the pg_stat_lwlocks view shows the LWLockId itself, but a LWLockId
> does not make it easy for a DBA to determine the actual lock type. So I
> want to show LWLock names (or labels), like 'WALWriteLock' or
> 'LockMgrLock' - but how should I implement it?
>
> Any comments?
>
> Regards,
> --
> Satoshi Nagayasu <sn...@uptime.jp>
> Uptime Technologies, LLC. http://www.uptime.jp

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7cc1d41..f832b45 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
+CREATE VIEW pg_stat_lwlocks AS
+    SELECT
+        S.lwlockid,
+        S.calls,
+        S.waits,
+        S.time_ms
+    FROM pg_stat_get_lwlocks() AS S;
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid AS umid,
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 95d4b37..2a2c197 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -32,6 +32,7 @@
 #include "storage/proc.h"
 #include "storage/spin.h"
+#include <sys/time.h>
 
 /* We use the ShmemLock spinlock to protect LWLockAssign */
 extern slock_t *ShmemLock;
@@ -46,6 +47,11 @@ typedef struct LWLock
 	PGPROC	   *head;		/* head of list of waiting PGPROCs */
 	PGPROC	   *tail;		/* tail of list of waiting PGPROCs */
Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal
On 25 June 2012 17:42, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
> This is clearly going to depend on the topology. You would definitely
> want to try to replicate the DDL for the case on which Simon is focused
> (which seems to me to be essentially physical replication of catalogs
> with logical replication of data changes from any machine to all
> others).

Just to remove any doubt: I'm not trying to support a single use case. The overall proposals include a variety of design patterns. Each of those covers many reasons for doing it, but they end up with the same architecture.

1) Single-master replication, with options not possible with physical
2) Multimaster
3) Many to one: data aggregation
4) Online upgrade

I don't think it will be possible to support all of those in one release; each has different challenges. 3 and 4 will not be worked on until 9.4, unless someone else is willing to work on them. That isn't meant to be harsh, just an explanation of practical reality that I hope people can accept without needing to argue it.

--
Simon Riggs		http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] patch: avoid heavyweight locking on hash metapage
On Mon, Jun 25, 2012 at 11:05 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Mon, Jun 18, 2012 at 5:42 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> Hmm. That was actually a gloss I added on existing code to try to
>> convince myself that it was safe; I don't think that the changes I
>> made make that any more or less safe than it was before.
>
> Right, sorry. I thought there was some strength reduction going on
> there as well. Thanks for the various explanations; they address my
> concerns.
>
> I see that v2 applies over v1. I've verified performance improvements
> using 8 cores with my proposed pgbench -P benchmark, with a scale that
> fits in shared_buffers. It brings it most of the way, but not quite,
> up to the btree performance.
>
> I've marked this as ready for committer.

Thanks for the review; committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework
On Tue, Jun 26, 2012 at 8:03 AM, Boszormenyi Zoltan <z...@cybertec.at> wrote:
> I know, but it doesn't feel right to register static functionality.

We do it elsewhere. The overhead is pretty minimal compared to other things we already do during startup, and avoiding the need for the array to have a fixed size seems worth it, IMHO.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] new --maintenance-db options
Amit Kapila amit.kap...@huawei.com writes: The implementation I've wanted to see for some time is that you can start a standalone backend, but it speaks FE/BE protocol to its caller (preferably over pipes, so that there is no issue whatsoever of where you can securely put a socket or anything like that). Can't it be done by following the FE/BE protocol, but calling the server APIs directly at the required places? That wouldn't be easier, nor cleaner, and it would open us up to client-induced database corruption (from failure to follow APIs, crashes in the midst of an operation, memory stomps, etc). We decided long ago that we would never support truly embedded operation in the sense of PG executing in the client's process/address space. I like the design suggested above because it has many of the good properties of an embedded database (in particular, no need to manage or contact a server) but still keeps the client code at arm's length. regards, tom lane
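Tom's proposal — a single-user backend that speaks the wire protocol to exactly one caller over inherited pipes — can be illustrated in miniature. The sketch below is purely illustrative (a newline-delimited toy protocol, not the real FE/BE protocol, and the child "backend" is a stand-in):

```python
import subprocess
import sys

# Toy "backend": reads one request per line on stdin, writes one reply
# per line on stdout. Stands in for a standalone server process that
# talks to exactly one caller.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write('ok:' + line.strip() + '\\n')\n"
    "    sys.stdout.flush()\n"
)

def run_one_query(query):
    # No socket anywhere: the only channel is the inherited pipe pair,
    # so there is no question of where to securely place a socket file.
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    proc.stdin.write(query + "\n")
    proc.stdin.flush()
    reply = proc.stdout.readline().strip()
    proc.stdin.close()       # EOF tells the "backend" to shut down
    proc.wait()
    return reply

print(run_one_query("SELECT 1"))  # stand-in request; no real SQL handling
```

Because the pipe pair dies with the two processes, "nobody but the caller can connect" falls out of the design for free.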
Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework
Robert Haas robertmh...@gmail.com writes: On Tue, Jun 26, 2012 at 8:03 AM, Boszormenyi Zoltan z...@cybertec.at wrote: I know, but it doesn't feel right to register static functionality. We do it elsewhere. The overhead is pretty minimal compared to other things we already do during startup, and avoiding the need for the array to have a fixed size seems worth it, IMHO. It's not even clear that the array needs to be dynamically resizable (yet). Compare for instance syscache invalidation callbacks, which have so far gotten along fine with a fixed-size array to hold registrations. It's foreseeable that we'll need something better someday, but that day hasn't arrived, and might never arrive. I agree with the feeling that this patch isn't done if the core timeout code has to know specifically about each usage. We have that situation already. regards, tom lane
Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal
On Mon, Jun 25, 2012 at 3:17 PM, Andres Freund and...@2ndquadrant.com wrote: I suppose the main reason we haven't done it already is that it increases the period of time during which we're using 2X the disk space. I find that an acceptable price if it's optional. Making it such doesn't seem to be a problem for me. +1. I think there is absolutely nothing wrong with doing extra things in ALTER TABLE when logical replication is enabled. We've got code that's conditional on Hot Standby being enabled in many places in the system; why should logical replication be any different? If we set the bar for logical replication at "the system can't do anything differently when logical replication is enabled" then I cheerfully submit that we are doomed. You've already made WAL format changes to support logging the pre-image of the tuple, which is a hundred times more likely to cause a performance problem than any monkeying around we might want to do in ALTER TABLE. I am deeply skeptical that we need to look inside of transactions that do full-table rewrites. But even if we do, I don't see that what I'm proposing precludes it. For example, I think we could have ALTER TABLE emit WAL records specifically for logical replication that allow us to disentangle which tuple descriptor to use at which point in the transaction. I don't see that that would even be very difficult to set up. Sorry, I was imprecise above: I have no problem doing some changes during ALTER TABLE if logical rep is enabled. I am worried though that to make that robust you would need loads of places that emit additional information:
* ALTER TABLE
* ALTER FUNCTION
* ALTER OPERATOR
* ALTER/CREATE CAST
* TRUNCATE
* CLUSTER
* ...
I have the feeling that, were we to do that, the amount of required information would be rather high and would end up being essentially the catalog. Just having an intermediate tupledesc doesn't help that much if you have e.g. record_out doing type lookups of its own.
There also is the issue you have talked about before, that a user-type might depend on values in other tables. Unless we're ready to break at least transactional behaviour for those for now... I don't see how decoding outside of the transaction is ever going to be valid? I wouldn't have a big problem declaring that as broken for now... I've been thinking about this a lot. My thinking's still evolving somewhat, but consider the following case. A user defines a type T with an input function I and an output function O. They create a table which uses type T and insert a bunch of data, which is duly parsed using I; then, they replace I with a new input function I' and O with a new output function O'. Now, clearly, if we process the inserts using the catalogs that were in effect at the time the inserts were done, we could theoretically get different output than if we use the catalogs that were in effect after the I/O functions were replaced. But is the latter output wrong, or merely different? My first thought when we started talking about this was "it's wrong", but the more I think about it, the less convinced I am... ...because it can't possibly be right to suppose that it's impossible to decode heap tuples using any catalog contents other than the ones that were in effect at the time the tuples got inserted. If that were true, then we wouldn't be able to read a table after adding or dropping a column, which of course we can. It seems to me that it might be sufficient to guarantee that we'll decode using the same *types* that were in effect at the time the inserts happened. If the user yanks the rug out from under us by changing the type definition, maybe we simply define that as a situation in which they get to keep both pieces. After all, if you replace the type definition in a way that makes sensible decoding of the table impossible, you've pretty much shot yourself in the foot whether logical replication enters the picture or not.
In the enum case, for example, we go to great pains to make sure that the table contents are always decodable not only under the current version of SnapshotNow, but also any successor version. We do that by prohibiting ALTER TYPE .. ADD VALUE from running inside a transaction block - because if we inserted a row into pg_enum and then inserted dependent rows into some user table, a rollback could leave us with rows that we can't decode. For the same reason, we don't allow ALTER TYPE .. DROP VALUE. I think that we can infer a general principle from this: while I/O functions may refer to catalog contents, they may not do so in a way that could be invalidated by subsequent commits or rollbacks. If they did, they'd be breaking the ability of subsequent SELECT statements to read the table. An interesting case that is arguably an exception to this rule is the regwhatever types, which will cheerfully output their value as an OID if it can't be decoded to text, but will bail on
Re: [HACKERS] pgsql_fdw in contrib
Harada-san, I checked your patch, and had the impression that it includes many improvements over the previous revision that I looked at in the last commit fest. However, I noticed several points to be revised or investigated. * It seems to me the expected results of the regression test are not attached, even though test cases were included. Please add them. * cleanup_connection() does not close the connection in the case when this callback was invoked due to an error under a subtransaction. It probably causes problematic behavior. See the following steps to reproduce the matter:
postgres=# BEGIN;
BEGIN
postgres=# SELECT * FROM ft3;
 a |  b
---+-----
 1 | aaa
 2 | bbb
 3 | ccc
 4 | ddd
 5 | eee
 6 | fff
(6 rows)
postgres=# SAVEPOINT sv_01;
SAVEPOINT
postgres=# SELECT * FROM ft3 WHERE 1 / (a - 4) > 0; -- should be error on remote transaction
ERROR: could not execute remote query
DETAIL: ERROR: division by zero
HINT: SELECT a, b FROM public.t1 WHERE (((1 OPERATOR(pg_catalog./) (a OPERATOR(pg_catalog.-) 4)) OPERATOR(pg_catalog.>) 0))
postgres=# ROLLBACK TO SAVEPOINT sv_01;
ROLLBACK
postgres=# SELECT * FROM ft3;
ERROR: could not execute EXPLAIN for cost estimation
DETAIL: ERROR: current transaction is aborted, commands ignored until end of transaction block
HINT: SELECT a, b FROM public.t1
Once we get an error under the remote transaction, it eventually raises an error on the local side, and cleanup_connection() should then be invoked. But it ignores the error due to the subtransaction; thus, the remote transaction, being already aborted, is kept. I'd like to suggest two remedies. 1. Connections are closed even on errors in a subtransaction. 2. Close the connection if PQexecParams() returns an error, in execute_query() prior to raising a local error. * Regarding deparseSimpleSql(), it pulls the attributes being referenced from baserestrictinfo and reltargetlist using pull_var_clause(). Would it be possible to use local_conds instead of baserestrictinfo?
We can optimize by handling only references to variables actually consumed at the remote side. All we need to transfer is the variables referenced in the targetlist and local clauses. In addition, is pull_var_clause() reasonable for listing up all the attributes referenced in both expression trees? It seems pull_varattnos() is a more useful API in this situation. * Regarding deparseRelation(), it scans simple_rte_array to fetch the RangeTblEntry with the relation-id of the target foreign table. It will not match the correct entry if the same foreign table appears in this list twice or more, as in a self-join. I'd like to recommend using simple_rte_array[baserel->relid] instead. In addition, it checks whether relkind is RELKIND_FOREIGN_TABLE or not. It seems to me this check should be an Assert(), since pgsql_fdw routines should never be called for regular relations. * Regarding deparseVar(), shouldn't it check the relkind of the relation being referenced by the Var node? * Regarding deparseBoolExpr(), the compiler raised a warning because op can be used uninitialized. * Even though it is harmless, sortConditions() is a misleading function name. How about categorizeConditions() or screeningConditions()? Thanks for your great work. 2012/6/14 Shigeru HANADA shigeru.han...@gmail.com: I'd like to propose pgsql_fdw, FDW for PostgreSQL, as a contrib module in core, again. This patch is basically a rebased version of the patches proposed in the 9.2 development cycle, and contains some additional changes such as concern about volatile variables for PG_CATCH blocks. In addition, I combined the old three patches (pgsql_fdw core functionality, push-down support, and analyze support) into one patch for ease of review. (I think these features are necessary for pgsql_fdw use.)
After the development for the 9.2 cycle, pgsql_fdw can gather statistics from the remote side by providing a sampling function to the analyze module in the backend core, use them to estimate the selectivity of WHERE clauses which will be pushed down, and retrieve query results from the remote side with a custom row processor which prevents memory exhaustion from huge result sets. It would be possible to add some more features, such as ORDER BY push-down with index information support, without changing existing APIs, but adding a relatively simple pgsql_fdw first and enhancing it later seems better. In addition, once pgsql_fdw has been merged, it would help implementing proof-of-concept SQL/MED-related features. Regards, -- Shigeru HANADA -- KaiGai Kohei kai...@kaigai.gr.jp
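KaiGai's remedy 2 above — close the connection as soon as the remote query fails, before raising the local error — can be modeled outside of C. `RemoteConn` and `execute_query` here are stand-ins for illustration, not the pgsql_fdw implementation:

```python
class RemoteConn:
    """Stand-in for a cached libpq connection to the remote server."""
    def __init__(self):
        self.closed = False

    def execute(self, sql):
        if "boom" in sql:
            raise RuntimeError("ERROR: division by zero")
        return ["row"]

    def close(self):
        self.closed = True

def execute_query(conn, sql):
    # On a remote error the remote transaction is aborted; dropping the
    # connection *before* raising locally means a later ROLLBACK TO
    # SAVEPOINT on the local side cannot find a poisoned connection
    # still cached for reuse.
    try:
        return conn.execute(sql)
    except Exception:
        conn.close()
        raise

conn = RemoteConn()
try:
    execute_query(conn, "SELECT 1 WHERE boom")
except RuntimeError:
    pass  # the local error still propagates, but the connection is gone
```

The next `SELECT * FROM ft3` then gets a fresh connection instead of one whose remote transaction is already aborted.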
Re: [HACKERS] [PATCH 01/16] Overhaul walsender wakeup handling
On Fri, Jun 22, 2012 at 12:35 PM, Andres Freund and...@2ndquadrant.com wrote: Can you elaborate on that a bit? What scenarios did you play around with, and what does "win" mean in this context? I had two machines connected locally and set up HS and my prototype between them (not at once, obviously). The patch reduced the average latency between the two nodes (measured by 'ticker' rows arriving in a table on the standby), the jitter in latency, and the amount of load I had to put on the master before the standby couldn't keep up anymore. I played with different loads:
* multiple concurrent ~50MB COPYs
* multiple concurrent ~50MB COPYs, pgbench
* pgbench
All three had a ticker running concurrently with synchronous_commit=off (because it shouldn't cause any difference in the replication pattern itself). The difference in average lag and cutoff was smallest with just pgbench running alone and biggest with COPY running alone. High jitter was most visible with just pgbench running alone, but that's likely just because the average lag was smaller. OK, that sounds pretty promising. I'd like to run a few performance tests on this just to convince myself that it doesn't lead to a significant regression in other scenarios. Assuming that doesn't turn up anything major, I'll go ahead and commit this. Can you provide a rebased version? It seems that one of the hunks in xlog.c no longer applies. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables
2012/6/26 Etsuro Fujita fujita.ets...@lab.ntt.co.jp: Hi Kaigai-san, -Original Message- From: Kohei KaiGai [mailto:kai...@kaigai.gr.jp] Sent: Monday, June 25, 2012 9:49 PM To: Etsuro Fujita Cc: Robert Haas; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables Fujita-san, The revised patch is almost good for me. The only point I noticed is the check for redundant or duplicated options I pointed out in the previous post. So, if you agree with the attached patch, I'd like to hand this patch over to the committers. OK, I agree with you. Thanks for the revision! Besides the revision, I modified check_selective_binary_conversion() to run heap_close() in the whole-row-reference case. Attached is an updated version of the patch. Sorry, I overlooked this code path. It seems fair enough to me. So, I'd like to hand this patch over to the committers. Thanks, Thanks. Best regards, Etsuro Fujita All I changed is:
index 98bcb2f..0143d81 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -122,6 +122,11 @@ typedef struct CopyStateData
@@ -217,7 +221,7 @@
 }
+ else if (strcmp(defel->defname, "convert_binary") == 0)
+ {
-+     if (cstate->convert_binary)
++     if (cstate->convert_selectively)
+         ereport(ERROR,
+                 (errcode(ERRCODE_SYNTAX_ERROR),
+                  errmsg("conflicting or redundant options")));
Thanks, 2012/6/20 Etsuro Fujita fujita.ets...@lab.ntt.co.jp: Hi KaiGai-san, Thank you for the review. -Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kohei KaiGai Sent: Wednesday, June 20, 2012 1:26 AM To: Etsuro Fujita Cc: Robert Haas; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables Hi Fujita-san, Could you rebase this patch towards the latest tree? It was not possible to apply the patch cleanly. Sorry, I updated the patch. Please find attached an updated version of the patch.
I looked over the patch, and noticed a few points. At ProcessCopyOptions(), defel->arg can be NIL, can't it? If so, cstate->convert_binary is not a suitable flag to check for a redundant option. It seems to me cstate->convert_selectively is the more suitable flag to check.
+ else if (strcmp(defel->defname, "convert_binary") == 0)
+ {
+     if (cstate->convert_binary)
+         ereport(ERROR,
+                 (errcode(ERRCODE_SYNTAX_ERROR),
+                  errmsg("conflicting or redundant options")));
+     cstate->convert_selectively = true;
+     if (defel->arg == NULL || IsA(defel->arg, List))
+         cstate->convert_binary = (List *) defel->arg;
+     else
+         ereport(ERROR,
+                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                  errmsg("argument to option \"%s\" must be a list of column names",
+                         defel->defname)));
+ }
Yes, defel->arg can be NIL. defel->arg is a List structure listing all the columns that need to be converted to binary representation, which is NIL for the case where no columns need to be converted. For example, defel->arg is NIL for SELECT COUNT(*). In this case, while cstate->convert_selectively is set to true, no columns are converted at NextCopyFrom(). Most efficient case! In short, cstate->convert_selectively represents whether to do selective binary conversion at NextCopyFrom(), and cstate->convert_binary represents all the columns to be converted at NextCopyFrom(), which can be NIL. At NextCopyFrom(), this routine computes default values if configured. In cases where these values are not referenced, it might be possible to skip unnecessary calculations. Would it be possible to add logic to avoid constructing cstate->defmap for unreferenced columns at BeginCopyFrom()? I think that we don't need to add the above logic because file_fdw does BeginCopyFrom() with attnamelist = NIL, in which case BeginCopyFrom() doesn't construct cstate->defmap at all.
I fixed a bug plus some minor optimization in check_binary_conversion() that is renamed to check_selective_binary_conversion() in the updated version, and now file_fdw gives up selective binary conversion for the following cases: a) BINARY format b) CSV/TEXT format and whole row reference c) CSV/TEXT format and all the user attributes needed Best regards, Etsuro Fujita Thanks, 2012/5/11 Etsuro Fujita fujita.ets...@lab.ntt.co.jp: -Original Message- From: Robert Haas [mailto:robertmh...@gmail.com] Sent: Friday, May 11, 2012 1:36 AM To: Etsuro Fujita Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS]
Re: [HACKERS] [PATCH 01/16] Overhaul walsender wakeup handling
On Tuesday, June 26, 2012 04:01:26 PM Robert Haas wrote: On Fri, Jun 22, 2012 at 12:35 PM, Andres Freund and...@2ndquadrant.com wrote: Can you elaborate on that a bit? What scenarios did you play around with, and what does "win" mean in this context? I had two machines connected locally and set up HS and my prototype between them (not at once, obviously). The patch reduced the average latency between the two nodes (measured by 'ticker' rows arriving in a table on the standby), the jitter in latency, and the amount of load I had to put on the master before the standby couldn't keep up anymore. I played with different loads:
* multiple concurrent ~50MB COPYs
* multiple concurrent ~50MB COPYs, pgbench
* pgbench
All three had a ticker running concurrently with synchronous_commit=off (because it shouldn't cause any difference in the replication pattern itself). The difference in average lag and cutoff was smallest with just pgbench running alone and biggest with COPY running alone. High jitter was most visible with just pgbench running alone, but that's likely just because the average lag was smaller. OK, that sounds pretty promising. I'd like to run a few performance tests on this just to convince myself that it doesn't lead to a significant regression in other scenarios. Assuming that doesn't turn up anything major, I'll go ahead and commit this. Independent testing would be great; it's definitely possible that I overlooked something, although I obviously don't think so ;). Can you provide a rebased version? It seems that one of the hunks in xlog.c no longer applies. Will do so. Not sure if I can finish it today though, I am in the midst of redoing the ilist and xlogreader patches. I guess tomorrow will suffice otherwise... Thanks!
Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Reporting WAL file containing checkpoint's REDO record in pg_controldata's result
On Wed, Mar 28, 2012 at 10:08 AM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Mar 28, 2012 at 12:30 AM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Fujii Masao's message of mar mar 27 06:40:34 -0300 2012: Anyway, should I add this patch into the next CF? Or is anyone planning to commit the patch for 9.2? I think the correct thing to do here is add to next CF, and if some committer has enough interest in getting it quickly it can be grabbed from there and committed into 9.2. Yep! I've added it to next CF. Attached is the revised version of the patch. Regards, -- Fujii Masao
Attachment: pg_controldata_walfilename_v4.patch (binary)
Re: [HACKERS] libpq compression
On Mon, Jun 25, 2012 at 4:25 PM, k...@rice.edu k...@rice.edu wrote: On Mon, Jun 25, 2012 at 09:45:26PM +0200, Florian Pflug wrote: On Jun25, 2012, at 21:21 , Dimitri Fontaine wrote: Magnus Hagander mag...@hagander.net writes: Or that it takes less code/generates cleaner code... So we're talking about some LZO things such as snappy from google, and that would be another run time dependency IIUC. I think it's time to talk about fastlz: http://fastlz.org/ http://code.google.com/p/fastlz/source/browse/trunk/fastlz.c 551 lines of C code under MIT license, works also under windows. I guess it would be easy (and safe) enough to embed in our tree should we decide to go this way. Agreed. If we extend the protocol to support compression, and not rely on SSL, then let's pick one of these LZ77-style compressors, and let's integrate it into our tree. We should then also make it possible to enable compression only for the server -> client direction. Since those types of compression are usually pretty easy to decompress, that reduces the amount of work non-libpq clients have to put in to take advantage of compression. Here is the benchmark list from the Google lz4 page:
Name            Ratio   C.speed  D.speed
LZ4 (r59)       2.084   330      915
LZO 2.05 1x_1   2.038   311      480
QuickLZ 1.5 -1  2.233   257      277
Snappy 1.0.5    2.024   227      729
LZF             2.076   197      465
FastLZ          2.030   190      420
zlib 1.2.5 -1   2.728   39       195
LZ4 HC (r66)    2.712   18       1020
zlib 1.2.5 -6   3.095   14       210
lz4 absolutely dominates on compression/decompression speed. While fastlz is faster than zlib(-1) on compression, lz4 is almost 2X faster still. At the risk of making everyone laugh at me, has anyone tested pglz? I observe that if the compression ratio and performance are good, we might consider using it for this purpose, too, which would avoid having to add dependencies. Conversely, if they are bad, and we decide to support another algorithm, we might consider also using that other algorithm, at least optionally, for the purposes for which we now use pglz.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
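Pending someone actually benchmarking pglz, the shape of such a test is easy to sketch with stdlib zlib alone. The corpus below is invented, and absolute numbers depend entirely on the data:

```python
import zlib

def ratio(data, level):
    """Raw size over compressed size, like the Ratio column above."""
    return len(data) / len(zlib.compress(data, level))

# Invented, highly repetitive corpus; protocol traffic is often similar.
data = b"SELECT abalance FROM pgbench_accounts WHERE aid = 12345;\n" * 2000

fast = ratio(data, 1)   # zlib -1: cheap, weaker compression
best = ratio(data, 6)   # zlib -6: default, stronger compression
assert zlib.decompress(zlib.compress(data, 6)) == data  # round-trip sanity

print("zlib -1: %.0fx, zlib -6: %.0fx" % (fast, best))
```

Substituting a pglz (or lz4/fastlz) binding into `ratio()` and timing both directions would answer Robert's question on data representative of the wire protocol.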
Re: [HACKERS] empty backup_label
On Sun, Jun 24, 2012 at 5:33 PM, David Kerr d...@mr-paradox.net wrote: Howdy, We're using NetApp's flexclones whenever we need to move our DB between machines. One specific case where we do that is when we're creating a new streaming replication target. The basic steps we're using are: pg_start_backup(); flex clone within the NetApp; pg_stop_backup(); The problem I'm seeing is that periodically the backup_label is empty, which means I can't start the new standby. I believe that since the NetApp stuff is all happening within the SAN, this file hasn't been fsynced to disk by the time we take the snapshot. One option would be to do a sync prior to the clone, however that seems kind of like a heavy operation, and it's slightly more complicated to script (having to have a user account on the system to sudo, rather than just connecting to the db to issue the pg_start_backup(...)). Another option is to add pg_fsync(fileno(fp)) after the fflush() when creating the file (I'm not sure if fsync implies fflush or not; if it does you could just replace it). I think this type of snapshot is fairly common, I've been doing them since 2000 with EMC, and I'm sure that most SAN vendors support it. This seems like a good idea to me. Actually, I'm wondering if we shouldn't back-patch this. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
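The fix David suggests — pg_fsync(fileno(fp)) after the fflush() — is the standard flush-then-fsync pattern, and the answer to his parenthetical is that fsync does not imply fflush: fsync only forces kernel buffers to storage, so user-space stdio buffers must be drained first. A minimal sketch of the pattern (the backup_label contents are invented):

```python
import os
import tempfile

def write_durably(path, text):
    with open(path, "w") as fp:
        fp.write(text)
        fp.flush()             # drain the stdio buffer into the kernel...
        os.fsync(fp.fileno())  # ...then force kernel buffers to storage.
        # Both steps, in this order: bytes still sitting in the
        # user-space stdio buffer are invisible to fsync.

# Invented label contents, for illustration only.
label = "START WAL LOCATION: 0/2000028 (file 000000010000000000000002)\n"
path = tempfile.mkstemp()[1]
write_durably(path, label)
```

After the fsync returns, a SAN-level snapshot taken at any later moment sees the complete file rather than an empty one.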
Re: [HACKERS] GiST subsplit question
On Wed, May 30, 2012 at 4:35 PM, Alexander Korotkov aekorot...@gmail.com wrote: So, do we demote that message to a DEBUG1? Or do we make it more clear what the authors of a specific picksplit are supposed to do to avoid that problem? Or am I misunderstanding something? +1 for demoting the message to DEBUG1. I think it shouldn't be so noisy; it just indicates something could be improved. Also, I think we definitely need to document secondary split. Right now there's no chance of understanding it without reverse engineering it from the code. I'm happy to go demote the message if we have consensus on that, but somebody else is going to need to provide the doc patch. Any takers? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] How to avoid base backup in automated failover
On Mon, Jun 4, 2012 at 9:14 AM, chinnaobi chinna...@gmail.com wrote: Recently I was writing an application to implement automated failover with this environment: two 2008 R2 servers, network area storage, asynchronous replication, WAL archive on primary enabled. Is there any way to avoid always starting the standby server from a base backup in automated failover? I see the database is growing huge. I can't keep doing a base backup every day. Please suggest a solution. The usual solution is to configure the standby as a warm or hot standby, so that logs are continuously replayed there. Then if the master dies, you only have to wait for replication to catch up before promoting. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
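For the warm/hot standby Robert describes, the standby-side setup in the 9.x era is a recovery.conf along these lines (a hedged sketch; host names and paths are placeholders):

```
# recovery.conf on the standby (PostgreSQL 9.x era; values are examples)
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
# replay archived WAL if streaming falls behind
restore_command = 'cp /archive/%f %p'
# touching this file promotes the standby
trigger_file = '/tmp/promote_standby'
```

With this in place the standby continuously replays WAL, so failover becomes a promotion rather than a restore from a fresh base backup.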
Re: [HACKERS] libpq compression
On 26-06-2012 12:23, Robert Haas wrote: At the risk of making everyone laugh at me, has anyone tested pglz? I observe that if the compression ratio and performance are good, we might consider using it for this purpose, too, which would avoid having to add dependencies. Conversely, if they are bad, and we decide to support another algorithm, we might consider also using that other algorithm, at least optionally, for the purposes for which we now use pglz. I'll remember to test it too. -- Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/ PostgreSQL: Consulting, Development, 24x7 Support and Training
Re: [HACKERS] empty backup_label
On Tue, Jun 26, 2012 at 5:24 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jun 24, 2012 at 5:33 PM, David Kerr d...@mr-paradox.net wrote: Howdy, We're using NetApp's flexclones whenever we need to move our DB between machines. One specific case where we do that is when we're creating a new streaming replication target. The basic steps we're using are: pg_start_backup(); flex clone within the NetApp; pg_stop_backup(); The problem I'm seeing is that periodically the backup_label is empty, which means I can't start the new standby. I believe that since the NetApp stuff is all happening within the SAN, this file hasn't been fsynced to disk by the time we take the snapshot. One option would be to do a sync prior to the clone, however that seems kind of like a heavy operation, and it's slightly more complicated to script (having to have a user account on the system to sudo, rather than just connecting to the db to issue the pg_start_backup(...)). Another option is to add pg_fsync(fileno(fp)) after the fflush() when creating the file (I'm not sure if fsync implies fflush or not; if it does you could just replace it). I think this type of snapshot is fairly common, I've been doing them since 2000 with EMC, and I'm sure that most SAN vendors support it. This seems like a good idea to me. Actually, I'm wondering if we shouldn't back-patch this. Thoughts? Certainly can't hurt. I guess any other files that are lost this way will be recreated by WAL recovery - or is there something else that might be at risk of similar treatment? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] libpq compression
Robert Haas robertmh...@gmail.com writes: On Mon, Jun 25, 2012 at 4:25 PM, k...@rice.edu k...@rice.edu wrote: Here is the benchmark list from the Google lz4 page:
Name            Ratio   C.speed  D.speed
LZ4 (r59)       2.084   330      915
LZO 2.05 1x_1   2.038   311      480
QuickLZ 1.5 -1  2.233   257      277
Snappy 1.0.5    2.024   227      729
LZF             2.076   197      465
FastLZ          2.030   190      420
zlib 1.2.5 -1   2.728   39       195
LZ4 HC (r66)    2.712   18       1020
zlib 1.2.5 -6   3.095   14       210
lz4 absolutely dominates on compression/decompression speed. While fastlz is faster than zlib(-1) on compression, lz4 is almost 2X faster still. At the risk of making everyone laugh at me, has anyone tested pglz? Another point here is that those Google numbers (assuming that they apply to our use-cases, a point not in evidence) utterly fail to make the argument that zlib is not the thing to use. zlib is beating all the others on compression ratio, often by 50%. Before making any comparisons, you have to make some assumptions about what the network speed is ... and unless it's pretty damn fast relative to your CPU speed, getting the better compression ratio is going to be more attractive than saving some cycles. regards, tom lane
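Tom's trade-off can be made concrete: total time is compress time plus wire time, so there is a break-even network speed below which the better ratio wins. A back-of-envelope sketch using the ratios and compression speeds from the table (the break-even formula is the only addition):

```python
def total_time(size_mb, ratio, c_speed, net):
    """Seconds to compress size_mb of raw data, then ship it.
    c_speed and net are in MB/s; decompression time is ignored."""
    return size_mb / c_speed + (size_mb / ratio) / net

def breakeven_net(r1, c1, r2, c2):
    """Network speed (MB/s) at which algorithms 1 and 2 tie.
    Below it the better ratio wins; above it the cheaper CPU wins.
    (Transfer size cancels out of the equality.)"""
    return (1 / r1 - 1 / r2) / (1 / c2 - 1 / c1)

# LZ4 (2.084x at 330 MB/s) vs zlib -6 (3.095x at 14 MB/s), from the table:
net_star = breakeven_net(2.084, 330, 3.095, 14)
print("break-even network speed: %.1f MB/s" % net_star)  # roughly 2.3
```

On these (unverified) numbers, only links slower than roughly 2.3 MB/s — around 18 Mbit/s — make zlib -6's extra ratio pay for its slow compressor, which quantifies Tom's "unless it's pretty damn fast" caveat.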
Re: [HACKERS] Schema version management
On Tue, May 22, 2012 at 11:31 PM, Joel Jacobson j...@trustly.com wrote: This is true, which means some users won't be able to use the feature, because they are using an ancient OS or have function names with slashes, hm, is it even possible to have function names with slashes? Sure. If you quote the function name, you can put anything you want in there. Note that Windows disallows a whole bunch of special characters in filenames, while UNIX-like systems tend to disallow only slash. I suppose you have a lot more experience of what postgres installations exists in the world. Do you think it's common databases have non-ascii problematic characters in object names? Is it a project policy all features of all standard tools must be useful for all users on all platforms on all databases? Or is it acceptable if some features are only useable for, say, 90% of the users? There are cases where we permit features that only work on some platforms, but it's rare. Usually, we do this only when the platform lacks some API that exists elsewhere. For example, collations and prefetching are not supported on Windows because the UNIX APIs we use don't exist there. In this case, it seems like you could work around the problem by, say, URL-escaping any characters that can't be used in an unquoted identifier. Of course that might make the file name long enough to hit the platform-specific file name limit. Not sure what to do about that. The basic idea you're proposing here has been proposed a number of times before, but it's always fallen down over questions of (1) what do do with very long object names or those containing special characters and (2) objects (like functions) for which schema+name is not a unique identifier. I don't think either of these problems ought to be a complete show-stopper. It seems to me that the trade-off is that when object names are long, contain special characters, or are overloaded, we'll have to munge the names in some way to prevent collisions. 
That could mean that the names are not 100% stable, which would possibly produce some annoyance if you're using a VCS to track changes, but maybe that's an acceptable trade-off, because it shouldn't happen very often. If we could guarantee that identifiers less than 64 characters which are not overloaded and contain no special characters requiring quoting end up in an eponymous file, I think that would be good enough to make most of our users pretty happy. In other cases, I think the point would just be to make it work (with a funny name) rather than fail.
\i /home/postgres/database/gluepay-split/public/TYPE/dblink_pkey_results.sql
\i /home/postgres/database/gluepay-split/public/TYPE/r_matchedwithdrawal.sql
\i /home/postgres/database/gluepay-split/public/TYPE/r_unapprovedwithdrawal.sql
\i /home/postgres/database/gluepay-split/public/TYPE/ukaccountvalidationchecktype.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/check_name.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/describe_entityid.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/get_linkid.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/set_address.sql
-- ... all the objects ..
\i /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workerid_fkey.sql
\i /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workerstatusid_fkey.sql
\i /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workertypeid_fkey.sql
It would be better to use \ir here rather than hard-code path names, I think. Then you'd only need to require that all the files be in the same directory, rather than requiring them to be at a certain hard-coded place in the filesystem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
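The URL-escaping workaround Robert sketches above could look something like this. This is only an illustration of the idea, not pg_dump code; the function name is made up. Characters legal in an unquoted identifier pass through unchanged, so the common case yields an eponymous file, and anything else becomes a %XX escape:

```c
#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Hypothetical helper: escape an object name into a filesystem-safe
 * file name.  Characters legal in an unquoted identifier pass through;
 * everything else becomes a %XX escape, URL-style.  `out` must have
 * room for 3 * strlen(name) + 1 bytes. */
static void
munge_object_name(const char *name, char *out)
{
    const char *p;

    for (p = name; *p; p++)
    {
        unsigned char c = (unsigned char) *p;

        if (islower(c) || isdigit(c) || c == '_')
            *out++ = (char) c;
        else
            out += sprintf(out, "%%%02X", c);
    }
    *out = '\0';
}
```

This leaves the overloading problem open: two functions differing only in argument types would still need a disambiguating suffix (say, a hash of the signature) appended to the munged name.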
Re: [HACKERS] empty backup_label
On Tue, Jun 26, 2012 at 05:33:42PM +0200, Magnus Hagander wrote: - On Tue, Jun 26, 2012 at 5:24 PM, Robert Haas robertmh...@gmail.com wrote: - On Sun, Jun 24, 2012 at 5:33 PM, David Kerr d...@mr-paradox.net wrote: - Howdy, - - We're using NetApp's flexclones whenever we need to move our DB between machines. - - One specific case where we do that is when we're creating a new streaming replication target. - - The basic steps we're using are: - pg_start_backup(); - flex clone within the netapp - pg_stop_backup(); - - The problem I'm seeing is that periodically the backup_label is empty, which means - I can't start the new standby. - - I believe that since the NetApp stuff is all happening within the SAN this file hasn't been - fsynced to disk by the time we take the snapshot. - - One option would be to do a sync prior to the clone, however that seems kind of like a - heavy operation, and it's slightly more complicated to script. (having to have a user - account on the system to sudo rather than just connecting to the db to issue the - pg_start_backup(...) ) - - Another option is to add pg_fsync(fileno(fp)) after the fflush() when creating the file (I'm not - sure if fsync implies fflush or not; if it does you could just replace it.) - - I think this type of snapshot is fairly common, I've been doing them since 2000 with EMC, - I'm sure that most SAN vendors support it. - - This seems like a good idea to me. Actually, I'm wondering if we - shouldn't back-patch this. - - Thoughts? - - Certainly can't hurt. - - I guess any other files that are lost this way will be recreated by - WAL recovery - or is there something else that might be at risk of - similar treatment? The only other file I've run into is pgstats.stat, which I think is ok to get blown away. There certainly could be others though. Dave
Re: [HACKERS] empty backup_label
On Tue, Jun 26, 2012 at 11:33 AM, Magnus Hagander mag...@hagander.net wrote: On Tue, Jun 26, 2012 at 5:24 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jun 24, 2012 at 5:33 PM, David Kerr d...@mr-paradox.net wrote: Howdy, We're using NetApp's flexclones whenever we need to move our DB between machines. One specific case where we do that is when we're creating a new streaming replication target. The basic steps we're using are: pg_start_backup(); flex clone within the netapp pg_stop_backup(); The problem I'm seeing is that periodically the backup_label is empty, which means I can't start the new standby. I believe that since the NetApp stuff is all happening within the SAN this file hasn't been fsynced to disk by the time we take the snapshot. One option would be to do a sync prior to the clone, however that seems kind of like a heavy operation, and it's slightly more complicated to script. (having to have a user account on the system to sudo rather than just connecting to the db to issue the pg_start_backup(...) ) Another option is to add pg_fsync(fileno(fp)) after the fflush() when creating the file (I'm not sure if fsync implies fflush or not; if it does you could just replace it.) I think this type of snapshot is fairly common, I've been doing them since 2000 with EMC, I'm sure that most SAN vendors support it. This seems like a good idea to me. Actually, I'm wondering if we shouldn't back-patch this. Thoughts? Certainly can't hurt. I guess any other files that are lost this way will be recreated by WAL recovery - or is there something else that might be at risk of similar treatment? I can't think of anything. pg_start_backup does a checkpoint, which in theory oughta be enough to make sure everything that matters hits the platter.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
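David's proposed fix amounts to pairing fflush() with fsync() on the label file. To answer his question in passing: fsync() does not imply fflush() -- fflush() moves data from the stdio buffer into the kernel, fsync() moves it from the kernel to stable storage -- so both calls are needed, in that order. A minimal sketch of the idea (not the actual xlog.c code; the function name is hypothetical):

```c
#include <stdio.h>
#include <unistd.h>

/* Sketch of the proposed fix: flush stdio buffers AND force the file
 * to stable storage before a SAN snapshot can be taken.  fflush()
 * drains the stdio buffer into the kernel; fsync() on the underlying
 * descriptor pushes it to the platter. */
static int
write_label_durably(const char *path, const char *contents)
{
    FILE   *fp = fopen(path, "w");

    if (fp == NULL)
        return -1;
    fputs(contents, fp);
    if (fflush(fp) != 0 || fsync(fileno(fp)) != 0)
    {
        fclose(fp);
        return -1;
    }
    return fclose(fp);
}
```

After this returns 0, a storage-level snapshot taken at any later point will contain the complete file rather than an empty one.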
Re: [HACKERS] hot standby PSQL 9.1 Windows 2008 Servers
On Tue, May 22, 2012 at 12:15 PM, chinnaobi chinna...@gmail.com wrote: You mean when the primary which is going to switch its role to standby might not have sent all the WAL records to the standby and If it is switched to standby it has more WAL records than the standby which is now serves as primary. Is it ?? Yes, that is possible. Or the standby might have received all the WAL records but not be caught up in terms of replaying them. It is actually the standby server which has to be restored from archive when it is switching to primary right .. Not the primary which is switching to standby ?? If you want to promote a standby, you can just do it (pg_ctl promote). If you have a master that you want to demote to a standby, you've got to resync it to whatever the current master is. I understand repmgr has some tooling to help automate that, although I have not played with it myself. In any event rsync can be a big help in reducing the resync time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework
Excerpts from Boszormenyi Zoltan's message of mar jun 26 03:59:06 -0400 2012: On 2012-06-26 06:59, Alvaro Herrera wrote: I cleaned up the framework patch a bit. My version's attached. Mainly, returning false for failure in some code paths that are only going to have the caller elog(FATAL) is rather pointless -- it seems much better to just have the code itself do the elog(FATAL). In any case, I searched for reports of those error messages being reported in the wild and saw none. OK. The cleanups are always good. One nitpick, though. Your version doesn't contain the timeout.h and compilation fails because of it. :-) You might have forgotten to do git commit -a. Oops. Attached. It's pretty much the same you had, except some bools are changed to void. There is one thing that still bothers me on this whole framework patch, but I'm not sure it's easily fixable. Basically the API to timeout.c is the whole list of timeouts and their whole behaviors. If you want to add a new one, you can't just call into the entry points, you have to modify timeout.c itself (as well as timeout.h as well as whatever code it is that you want to add timeouts to). This may be good enough, but I don't like it. I think it boils down to proctimeout.h is cheating. The interface I would actually like to have is something that lets the modules trying to get a timeout register the timeout-checking function as a callback. API-wise, it would be much cleaner. However, I'm not really sure that code-wise it would be any cleaner at all. In fact I think it'd be rather cumbersome. However, if you could give that idea some thought, it'd be swell. Well, I can make the registration interface similar to how LWLocks are treated, but that doesn't avoid modification of the base_timeouts array in case a new internal use case arises. Say: #define USER_TIMEOUTS 4 int n_timeouts = TIMEOUT_MAX; static timeout_params base_timeouts[TIMEOUT_MAX + USER_TIMEOUTS]; and register_timeout() adds data after TIMEOUT_MAX.
Well, I don't expect that we're going to have many external uses. My point about registration is so that internal callers use it as well. I was thinking we could do something like xact.c does, which is to have somewhere in proc.c a call like this: TimeoutRegister(DEADLOCK_TIMEOUT, 1, true, CheckDeadLock, GetCurrentTimestamp); at process startup (the magic value 1 is the priority. Maybe there's a better way to handle this). That way, timeout.c or timeout.h do not need to know anything about proc.c at all; we don't need proctimeout.h at all. The only thing it (the timeout module) needs to know is that there is a symbolic constant named DEADLOCK_TIMEOUT. Similarly for statement timeout, etc. When you call enable_timeout you first have to ensure that DEADLOCK_TIMEOUT has been registered; and if it's not, die on an Assert(). That way you ensure that there are no bugs where you try to enable a timeout that hasn't been registered. (If we later find the need to let external modules add timeouts, which I find unlikely, we can have TimeoutRegister return TimeoutName and have this value be dynamically allocated. But for now I think that would be useless complication). The difference with lwlocks is that each lwlock is just a number. The lwlock.c module doesn't need to know anything about how each lwlock is actually used. It's not the same thing here -- which is why I think it would be better to have all modules, even the hardcoded ones, use a registering interface. ... Now, it could be argued that it would be even better to have TimeoutRegister not take the TimeoutName argument, and instead generate it dynamically and pass it back to the caller -- that way you don't need the enum in timeout.h. The problem I have with that idea is that it would force the caller modules to have a global variable to keep track of that, which seems worse to me. I haven't looked at the second patch at all yet. This is the whole point of the first patch. I know that. 
But I want to get the details of the framework right before we move on to add more stuff to it. But as I said above for registering a new timeout source, it's a new internal use case. One that touches a lot of places which cannot simply be done by registering a new timeout source. Sure. That's going to be the case for any other timeout we want to add (which is why I think we don't really need dynamic timeouts). -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support 1-timeout-framework-v9a.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
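For concreteness, the registration scheme being discussed might look roughly like this -- illustrative names only, not the committed API. Each module registers its own check function at startup, so timeout.c no longer needs to know anything about proc.c, and enable_timeout can Assert that the timeout was registered:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Sketch of the registration interface discussed above.  Timeouts keep
 * symbolic identifiers, but timeout.c does not hard-code behavior:
 * each module registers its check function at process startup. */
typedef enum TimeoutName
{
    DEADLOCK_TIMEOUT,
    STATEMENT_TIMEOUT,
    MAX_TIMEOUTS = 16           /* fixed-size array, as proposed */
} TimeoutName;

typedef void (*timeout_handler) (void);

typedef struct timeout_params
{
    int             priority;
    timeout_handler handler;    /* NULL means "not registered" */
} timeout_params;

static timeout_params timeouts[MAX_TIMEOUTS];

static void
TimeoutRegister(TimeoutName id, int priority, timeout_handler handler)
{
    assert(id < MAX_TIMEOUTS);
    assert(timeouts[id].handler == NULL);   /* no double registration */
    timeouts[id].priority = priority;
    timeouts[id].handler = handler;
}

static void
enable_timeout(TimeoutName id)
{
    /* catch attempts to enable a timeout nobody registered */
    assert(id < MAX_TIMEOUTS && timeouts[id].handler != NULL);
    /* ... arm the SIGALRM machinery here ... */
}
```

A caller such as proc.c would then do TimeoutRegister(DEADLOCK_TIMEOUT, 1, CheckDeadLock) at startup, keeping the enum in timeout.h but the behavior in the owning module.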
Re: [HACKERS] [v9.3] Row-Level Security
On Thu, Jun 14, 2012 at 11:43 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: In the previous discussion, we planned to add a syntax option to clarify the command type to fire the RLS policy, such as FOR UPDATE. But current my opinion is, it is not necessary. For example, we can reference the contents of rows being updated / deleted using RETURNING clause. So, it does not make sense to have different RLS policy at UPDATE / DELETE from SELECT. I agree. That doesn't make sense, and we don't need to support it. On the other hand, I'm not 100% sure about my design to restrict rows to be updated and deleted. Similarly, it expands the target relation of UPDATE or DELETE statement into a sub-query with condition. ExecModifyTable() pulls a tuple from the sub-query, instead of regular table, so it seems to me working at least, but I didn't try all the possible cases of course. I don't think there's any reason why that shouldn't work. DELETE .. USING already allows ModifyTable to be scanning a join product, so this is not much different. Of course, here is some limitations, to keep the patch size reasonable level to review. - The permission to bypass RLS policy was under discussion. If and when we should provide a special permission to bypass RLS policy, the OR has_superuser_privilege() shall be replaced by OR has_table_privilege(tableoid, 'RLSBYPASS'). I think you're missing the point. Everyone who has commented on this issue is in favor of having some check that causes the RLS predicate *not to get added in the first place*. Adding a modified predicate is not the same thing. First, the query planner might not be smart enough to optimize away the clause even when the predicate holds, causing an unnecessary performance drain. Second, there's too much danger of being able to set a booby-trap for the superuser this way. Suppose that the RLS subsystem replaces f_malicious() by f_malicious OR has_superuser_privilege(). 
Now the superuser comes along with the nightly pg_dump run and the query optimizer executes SELECT * FROM nuts WHERE f_malicious() OR has_superuser_privilege(). The query optimizer notes that the cost of f_malicious() is very low and decides to execute that before has_superuser_privilege(). Oops. I think it's just not acceptable to handle this by clause-munging: we need to not add the clause in the first place. Comments on the patch itself: 1. Please do not abbreviate rowlevel to rowlv or RowLevel to RowLv or ROWLEVEL to ROWLV. That makes it harder to read and harder to grep. Spell it out. 2. Since the entirety of ATExecSetRowLvSecurity is conditional on whether clause != NULL, you might as well split it into two functions, one for each case. 3. The fact that you've had to hack preprocess_targetlist and adjust_appendrel_attrs_mutator suggests to me that the insertion of the generate subquery is happening at the wrong phase of the process. We don't need those special cases for views, so it seems like we shouldn't need them here, either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework
On 2012-06-26 18:12, Alvaro Herrera wrote: Excerpts from Boszormenyi Zoltan's message of mar jun 26 03:59:06 -0400 2012: On 2012-06-26 06:59, Alvaro Herrera wrote: I cleaned up the framework patch a bit. My version's attached. Mainly, returning false for failure in some code paths that are only going to have the caller elog(FATAL) is rather pointless -- it seems much better to just have the code itself do the elog(FATAL). In any case, I searched for reports of those error messages being reported in the wild and saw none. OK. The cleanups are always good. One nitpick, though. Your version doesn't contain the timeout.h and compilation fails because of it. :-) You might have forgotten to do git commit -a. Oops. Attached. It's pretty much the same you had, except some bools are changed to void. There is one thing that still bothers me on this whole framework patch, but I'm not sure it's easily fixable. Basically the API to timeout.c is the whole list of timeouts and their whole behaviors. If you want to add a new one, you can't just call into the entry points, you have to modify timeout.c itself (as well as timeout.h as well as whatever code it is that you want to add timeouts to). This may be good enough, but I don't like it. I think it boils down to proctimeout.h is cheating. The interface I would actually like to have is something that lets the modules trying to get a timeout register the timeout-checking function as a callback. API-wise, it would be much cleaner. However, I'm not really sure that code-wise it would be any cleaner at all. In fact I think it'd be rather cumbersome. However, if you could give that idea some thought, it'd be swell. Well, I can make the registration interface similar to how LWLocks are treated, but that doesn't avoid modification of the base_timeouts array in case a new internal use case arises.
Say: #define USER_TIMEOUTS 4 int n_timeouts = TIMEOUT_MAX; static timeout_params base_timeouts[TIMEOUT_MAX + USER_TIMEOUTS]; and register_timeout() adds data after TIMEOUT_MAX. Well, I don't expect that we're going to have many external uses. My point about registration is so that internal callers use it as well. I was thinking we could do something like xact.c does, which is to have somewhere in proc.c a call like this: TimeoutRegister(DEADLOCK_TIMEOUT, 1, true, CheckDeadLock, GetCurrentTimestamp); at process startup (the magic value 1 is the priority. Maybe there's a better way to handle this). That way, timeout.c or timeout.h do not need to know anything about proc.c at all; we don't need proctimeout.h at all. The only thing it (the timeout module) needs to know is that there is a symbolic constant named DEADLOCK_TIMEOUT. Similarly for statement timeout, etc. When you call enable_timeout you first have to ensure that DEADLOCK_TIMEOUT has been registered; and if it's not, die on an Assert(). That way you ensure that there are no bugs where you try to enable a timeout that hasn't been registered. Currently, TimeoutName (as an index value) and priority are the same semantically. I would also add an Assert into register_timeout() to avoid double registration of the same timeout, to prevent overriding the callback function pointer. If that's in place, the TimeoutName value may still work as is: an index into base_timeouts[]. (If we later find the need to let external modules add timeouts, which I find unlikely, we can have TimeoutRegister return TimeoutName and have this value be dynamically allocated. But for now I think that would be useless complication). The difference with lwlocks is that each lwlock is just a number. Strictly speaking, just as TimeoutName. The lwlock.c module doesn't need to know anything about how each lwlock is actually used.
It's not the same thing here -- which is why I think it would be better to have all modules, even the hardcoded ones, use a registering interface. OK. ... Now, it could be argued that it would be even better to have TimeoutRegister not take the TimeoutName argument, and instead generate it dynamically and pass it back to the caller -- that way you don't need the enum in timeout.h. This was what I had in mind at first ... The problem I have with that idea is that it would force the caller modules to have a global variable to keep track of that, which seems worse to me. ... and realized this as well. So, should I keep the enum TimeoutName? Are global variables for keeping dynamically assigned values preferred over the enum? Currently we have 5 timeout sources in total, 3 of them are used by regular backends, the remaining 2 are used by replication standby. We can have a fixed size array (say with 8 or 16 elements) for future use and this would be plenty. Opinions? I haven't looked at the second patch at all yet. This is the whole point of the first patch. I know that. But I want to get the details of the framework right before we move on to add more stuff to
Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework
Excerpts from Boszormenyi Zoltan's message of mar jun 26 12:43:34 -0400 2012: So, should I keep the enum TimeoutName? Are global variables for keeping dynamically assigned values preferred over the enum? Currently we have 5 timeout sources in total, 3 of them are used by regular backends, the remaining 2 are used by replication standby. We can have a fixed size array (say with 8 or 16 elements) for future use and this would be plenty. Opinions? My opinion is that the fixed size array is fine. I'll go set the patch waiting on author. Also, remember to review some other people's patches. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] why roll-your-own s_lock? / improving scalability
Hi, I am currently trying to understand what looks like really bad scalability of 9.1.3 on a 64core 512GB RAM system: the system runs OK when at 30% usr, but only marginal amounts of additional load seem to push it to 70% and the application becomes highly unresponsive. My current understanding basically matches the issues being addressed by various 9.2 improvements, well summarized in http://wiki.postgresql.org/images/e/e8/FOSDEM2012-Multi-CPU-performance-in-9.2.pdf An additional aspect is that, in order to address the latent risk of data loss / corruption with WBCs and async replication, we have deliberately moved the db from a similar system with WB cached storage to ssd based storage without a WBC, which, by design, has (in the best WBC case) approx. 100x higher latencies, but much higher sustained throughput. On the new system, even with 30% user acceptable load, oprofile makes apparent significant lock contention:
opreport --symbols --merge tgid -l /mnt/db1/hdd/pgsql-9.1/bin/postgres
Profiling through timer interrupt
samples  %        image name    symbol name
30240    27.9720  postgres      s_lock
5069      4.6888  postgres      GetSnapshotData
3743      3.4623  postgres      AllocSetAlloc
3167      2.9295  libc-2.12.so  strcoll_l
2662      2.4624  postgres      SearchCatCache
2495      2.3079  postgres      hash_search_with_hash_value
2143      1.9823  postgres      nocachegetattr
1860      1.7205  postgres      LWLockAcquire
1642      1.5189  postgres      base_yyparse
1604      1.4837  libc-2.12.so  __strcmp_sse42
1543      1.4273  libc-2.12.so  __strlen_sse42
1156      1.0693  libc-2.12.so  memcpy
Unfortunately I don't have profiling data for the high-load / contention condition yet, but I fear the picture will be worse and pointing in the same direction. pure speculation In particular, the _impression_ is that lock contention could also be related to I/O latencies making me fear that cases could exist where spin locks are being held while blocking on IO.
/pure speculation Looking at the code, it appears to me that the roll-your-own s_lock code cannot handle a couple of cases, for instance it will also spin when the lock holder is not running at all or blocking on IO (which could even be implicit, e.g. for a page flush). These issues have long been addressed by adaptive mutexes and futexes. Also, the s_lock code tries to be somehow adaptive using spins_per_delay (when having spun for long (but not blocked), spin even longer in future), which appears to me to have the potential of becoming highly counter-productive. Now that the scene is set, here's the simple question: Why all this? Why not simply use posix mutexes which, on modern platforms, will map to efficient implementations like adaptive mutexes or futexes? Thanks, Nils
Re: [HACKERS] PATCH: Improve DROP FUNCTION hint
On Mon, Jun 11, 2012 at 11:12 AM, Robert Haas robertmh...@gmail.com wrote: On Sat, Jun 9, 2012 at 11:42 AM, Dean Rasheed dean.a.rash...@gmail.com wrote: Hi, Attached is a small patch to improve the HINT message produced by CREATE OR REPLACE FUNCTION when the new function definition conflicts with the old definition. With this patch the hint now includes the function's name and signature as a directly pasteable SQL command. So, for example, instead of psql:functions.sql:70: ERROR: cannot change return type of existing function HINT: Use DROP FUNCTION first. it now says psql:functions.sql:70: ERROR: cannot change return type of existing function HINT: Use DROP FUNCTION foo(integer,integer) first. which saves having to open the file, find the function and then type in the DROP statement manually. +1. Committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics
To implement it, a new array can be added in the local process memory to hold lwlock statistics, and the counters can be updated both in shared memory and in the local process memory at once. Then, the session can retrieve 'per-session' statistics from the local process memory via some dedicated function. That would be way cool from a diagnostics perspective. Not sure what impact it has, though. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
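The double-bookkeeping idea could be sketched like this. Names are invented for illustration, and in a real backend the global counters would live in shared memory rather than a static array; the point is only that each wait event is counted twice, once globally and once per backend:

```c
/* Sketch of the double-bookkeeping idea: every wait updates both the
 * cluster-wide counter (stand-in for shared memory here) and a
 * backend-local copy, so a session can read "its own" numbers without
 * touching shared state. */
#define NUM_LWLOCKS 64

static long shared_wait_counts[NUM_LWLOCKS]; /* stand-in for shmem */
static long local_wait_counts[NUM_LWLOCKS];  /* per-process memory */

static void
count_lwlock_wait(int lockid)
{
    shared_wait_counts[lockid]++;   /* global view, all backends */
    local_wait_counts[lockid]++;    /* this session only */
}

static long
my_session_waits(int lockid)
{
    /* what a hypothetical per-session stats function would return */
    return local_wait_counts[lockid];
}
```

The local increment costs one cache-private store per wait, which is why the performance impact Josh wonders about should be small compared to the shared-memory update it accompanies.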
Re: [HACKERS] GiST subsplit question
On Tue, 2012-06-26 at 11:28 -0400, Robert Haas wrote: On Wed, May 30, 2012 at 4:35 PM, Alexander Korotkov aekorot...@gmail.com wrote: So, do we demote that message to a DEBUG1? Or do we make it more clear what the authors of a specific picksplit are supposed to do to avoid that problem? Or am I misunderstanding something? +1 for demoting the message to DEBUG1. I think it shouldn't be so noisy; it just indicates something could be improved. Also I think we definitely need to document secondary split. Right now there's no chance to understand it without reverse engineering it from the code. I'm happy to go demote the message if we have consensus on that, but somebody else is going to need to provide the doc patch. Any takers? I was planning to do that, but it won't be for a few days at least. If someone else wants to do it sooner, feel free. Regards, Jeff Davis
Re: [HACKERS] proof concept - access to session variables on client side
On Tue, Jun 26, 2012 at 10:12:52AM +0200, Pavel Stehule wrote: 2012/6/26 Magnus Hagander mag...@hagander.net: On Tue, Jun 26, 2012 at 9:50 AM, Pavel Stehule pavel.steh...@gmail.com wrote: 2012/6/26 Magnus Hagander mag...@hagander.net: On Tue, Jun 26, 2012 at 7:06 AM, Pavel Stehule pavel.steh...@gmail.com wrote: Hello I worked on simple patch, that enable access from server side to client side data. It add two new hooks to libpq - one for returning of local context, second for setting of local context. A motivation is integration of possibilities of psql console together with stronger language - plpgsql. Second target is enabling possibility to save a result of some server side process in psql. It improve vars feature in psql. pavel ~/src/postgresql/src $ cat test.sql \echo value of external paremeter is :myvar do $$ begin -- we can take any session variable on client side -- it is safe against to SQL injection raise notice 'external parameter accessed from plpgsql is %', hgetvar('myvar'); -- we can change this session variable and finish transaction perform hsetvar('myvar', 'Hello, World'); end; $$ language plpgsql; \echo new value of session variable is :myvar cat test.sql | psql postgres -v myvar=Hello value of external paremeter is Hello NOTICE: external parameter accessed from plpgsql is Hello DO new value of session variable is Hello, World This is just proof concept - there should be better integration with pl languages, using cache for read on server side, ... Notices? Why not just use a custom GUC variable instead? E.g. you could have psql SET psql.myvar='Hello, World', and then you'd need no changes at all in the backend? Maybe have a shorthand interface for accessing GUCs in psql would help in making it easier, but do we really need a whole new variable concept? GUC variables doesn't help with access to psql's command line parameters from DO PL code. But with a small change to psql they could, without the need for a whole new type of variable. 
For example, psql could set all those variables as psql.commandlinevarname, which could then be accessed from the DO PL code just fine. yes, it is a possibility too. It has different issues - it can send unwanted variables - Could you expand on this just a bit? Are you picturing something an attacker could somehow use, or...? Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal
On Mon, Jun 25, 2012 at 01:50:54PM -0500, Kevin Grittner wrote: Andres Freund and...@2ndquadrant.com wrote: I am not sure were going to get all that into 9.3. Sure, that was more related to why I was questioning how much these use cases even *could* integrate -- whether it even paid to *consider* these use cases at this point. Responses from Robert and you have pretty much convinced me that I was being overly pessimistic on that. One fine point regarding before and after images -- if a value doesn't change in an UPDATE, there's no reason to include it in both the BEFORE and AFTER tuple images, as long as we have the null column bitmaps -- or some other way of distinguishing unchanged from NULL. (This could be especially important when the unchanged column was a 50 MB bytea.) How about two bitmaps: one telling which columns are actually there, the other with NULLs? Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
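The two-bitmap scheme Kevin and David describe can be sketched directly -- one bitmap saying whether a column is present in the image at all, the other distinguishing NULL from a real value. This is an illustration capped at 64 columns, not actual WAL-format code:

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of the two-bitmap idea for UPDATE images: "present" says
 * whether a column is included in this before/after image at all;
 * "isnull" distinguishes NULL from a real value.  An unchanged 50 MB
 * bytea is then simply "not present" rather than serialized twice. */
#define MAX_COLS 64

typedef struct tuple_image
{
    uint64_t    present;        /* bit i set: column i included */
    uint64_t    isnull;         /* bit i set: column i is NULL */
} tuple_image;

static void
mark_column(tuple_image *img, int col, bool isnull)
{
    img->present |= UINT64_C(1) << col;
    if (isnull)
        img->isnull |= UINT64_C(1) << col;
}

static bool
column_unchanged(const tuple_image *img, int col)
{
    /* absent from the image means "same as the other image" */
    return (img->present & (UINT64_C(1) << col)) == 0;
}
```

With both bitmaps present, "unchanged", "NULL", and "real value" are all distinguishable, which is exactly the ambiguity a lone null bitmap cannot resolve.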
Re: [HACKERS] why roll-your-own s_lock? / improving scalability
On Tue, Jun 26, 2012 at 12:02 PM, Nils Goroll sl...@schokola.de wrote: Hi, I am currently trying to understand what looks like really bad scalability of 9.1.3 on a 64core 512GB RAM system: the system runs OK when at 30% usr, but only marginal amounts of additional load seem to push it to 70% and the application becomes highly unresponsive. My current understanding basically matches the issues being addressed by various 9.2 improvements, well summarized in http://wiki.postgresql.org/images/e/e8/FOSDEM2012-Multi-CPU-performance-in-9.2.pdf An additional aspect is that, in order to address the latent risk of data loss corruption with WBCs and async replication, we have deliberately moved the db from a similar system with WB cached storage to ssd based storage without a WBC, which, by design, has (in the best WBC case) approx. 100x higher latencies, but much higher sustained throughput. On the new system, even with 30% user acceptable load, oprofile makes apparent significant lock contention: opreport --symbols --merge tgid -l /mnt/db1/hdd/pgsql-9.1/bin/postgres Profiling through timer interrupt samples % image name symbol name 30240 27.9720 postgres s_lock 5069 4.6888 postgres GetSnapshotData 3743 3.4623 postgres AllocSetAlloc 3167 2.9295 libc-2.12.so strcoll_l 2662 2.4624 postgres SearchCatCache 2495 2.3079 postgres hash_search_with_hash_value 2143 1.9823 postgres nocachegetattr 1860 1.7205 postgres LWLockAcquire 1642 1.5189 postgres base_yyparse 1604 1.4837 libc-2.12.so __strcmp_sse42 1543 1.4273 libc-2.12.so __strlen_sse42 1156 1.0693 libc-2.12.so memcpy Unfortunately I don't have profiling data for the high-load / contention condition yet, but I fear the picture will be worse and pointing in the same direction. pure speculation In particular, the _impression_ is that lock contention could also be related to I/O latencies making me fear that cases could exist where spin locks are being helt while blocking on IO. 
</pure speculation> Looking at the code, it appears to me that the roll-your-own s_lock code cannot handle a couple of cases; for instance, it will also spin when the lock holder is not running at all or is blocking on I/O (which could even be implicit, e.g. for a page flush). These issues have long been addressed by adaptive mutexes and futexes. Also, the s_lock code tries to be somewhat adaptive using spins_per_delay (when having spun for long (but not blocked), spin even longer in future), which appears to me to have the potential of becoming highly counter-productive. Now that the scene is set, here's the simple question: Why all this? Why not simply use POSIX mutexes which, on modern platforms, will map to efficient implementations like adaptive mutexes or futexes? Well, that would introduce a backend dependency on pthreads, which is unpleasant. Also you'd need to feature-test via _POSIX_THREAD_PROCESS_SHARED to make sure you can mutex between processes (and configure your mutexes as such when you do). There are probably other reasons why this can't be done, but I personally don't know of any. Also, it's forbidden to do things like invoke I/O in the backend while holding only a spinlock. As to your larger point, it's an interesting assertion -- some data to back it up would help. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proof concept - access to session variables on client side
2012/6/26 David Fetter da...@fetter.org: On Tue, Jun 26, 2012 at 10:12:52AM +0200, Pavel Stehule wrote: 2012/6/26 Magnus Hagander mag...@hagander.net: On Tue, Jun 26, 2012 at 9:50 AM, Pavel Stehule pavel.steh...@gmail.com wrote: 2012/6/26 Magnus Hagander mag...@hagander.net: On Tue, Jun 26, 2012 at 7:06 AM, Pavel Stehule pavel.steh...@gmail.com wrote: Hello I worked on simple patch, that enable access from server side to client side data. It add two new hooks to libpq - one for returning of local context, second for setting of local context. A motivation is integration of possibilities of psql console together with stronger language - plpgsql. Second target is enabling possibility to save a result of some server side process in psql. It improve vars feature in psql. pavel ~/src/postgresql/src $ cat test.sql \echo value of external paremeter is :myvar do $$ begin -- we can take any session variable on client side -- it is safe against to SQL injection raise notice 'external parameter accessed from plpgsql is %', hgetvar('myvar'); -- we can change this session variable and finish transaction perform hsetvar('myvar', 'Hello, World'); end; $$ language plpgsql; \echo new value of session variable is :myvar cat test.sql | psql postgres -v myvar=Hello value of external paremeter is Hello NOTICE: external parameter accessed from plpgsql is Hello DO new value of session variable is Hello, World This is just proof concept - there should be better integration with pl languages, using cache for read on server side, ... Notices? Why not just use a custom GUC variable instead? E.g. you could have psql SET psql.myvar='Hello, World', and then you'd need no changes at all in the backend? Maybe have a shorthand interface for accessing GUCs in psql would help in making it easier, but do we really need a whole new variable concept? GUC variables doesn't help with access to psql's command line parameters from DO PL code. 
But with a small change to psql they could, without the need for a whole new type of variable. For example, psql could set all those variable as psql.commandlinevarname, which could then be accessed from the DO PL code just fine. yes, it is a possibility too. It has different issues - it can send unwanted variables - Could you expand on this just a bit? Are you picturing something an attacker could somehow use, or...? it is not a security issue - I just dislike sending the complete stack when only one variable should be used. Regards Pavel Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] why roll-your-own s_lock? / improving scalability
Nils Goroll sl...@schokola.de writes: Now that the scene is set, here's the simple question: Why all this? Why not simply use posix mutexes which, on modern platforms, will map to efficient implementations like adaptive mutexes or futexes? (1) They do not exist everywhere. (2) There is absolutely no evidence to suggest that they'd make things better. If someone cared to rectify (2), we could consider how to use them as an alternative implementation. But if you start with let's not support any platforms that don't have this feature, you're going to get a cold reception. regards, tom lane
Re: [HACKERS] why roll-your-own s_lock? / improving scalability
Hi Merlin, _POSIX_THREAD_PROCESS_SHARED sure. Also, it's forbidden to do things like invoke i/o in the backend while holding only a spinlock. As to your larger point, it's an interesting assertion -- some data to back it up would help. Let's see if I can get any. ATM I've only got indications, but no proof. Nils
Re: [HACKERS] why roll-your-own s_lock? / improving scalability
But if you start with let's not support any platforms that don't have this feature This will never be my intention. Nils
Re: [HACKERS] proof concept - access to session variables on client side
On Tue, Jun 26, 2012 at 3:05 AM, Magnus Hagander mag...@hagander.net wrote: But with a small change to psql they could, without the need for a whole new type of variable. For example, psql could set all those variable as psql.commandlinevarname, which could then be accessed from the DO PL code just fine. That's a really neat idea. merlin
Re: [HACKERS] proof concept - access to session variables on client side
2012/6/26 Merlin Moncure mmonc...@gmail.com: On Tue, Jun 26, 2012 at 3:05 AM, Magnus Hagander mag...@hagander.net wrote: But with a small change to psql they could, without the need for a whole new type of variable. For example, psql could set all those variable as psql.commandlinevarname, which could then be accessed from the DO PL code just fine. That's a really neat idea. yes, it could be a good idea - psql sends some status variables on start, so it should be a small patch Pavel merlin
[HACKERS] Posix Shared Mem patch
Robert, all: Last I checked, we had a reasonably acceptable patch to use mostly Posix Shared mem with a very small sysv ram partition. Is there anything keeping this from going into 9.3? It would eliminate a major configuration headache for our users. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] measuring spinning
On Mon, Jun 18, 2012 at 9:36 PM, Robert Haas robertmh...@gmail.com wrote: On Sat, Jun 16, 2012 at 6:25 PM, Jeff Janes jeff.ja...@gmail.com wrote: Well, this fell through the cracks, because I forgot to add it to the January CommitFest. Here it is again, rebased. This applies and builds cleanly and passes make check (under enable-cassert). No tests or docs are needed for a patch of this nature. It does what it says, and we want it. I wondered if the change in the return signature of s_lock would have an effect on performance. So I've run a series of pgbench -T 30 -P -c8 -j8, at a scale of 30 which fits in shared_buffers, using an Amazon c1.xlarge (8 cores). I ran both HEAD, and HEAD+patch (without LWLOCK_STATS in both cases), in random ordering. The patch was 0.37% slower, average 298483 selects per second patched to 299582 HEAD. The difference is probably real (p-value 0.042, one-sided) but is also pretty much negligible and could just be due to where the executable code falls in the cache lines, which could move around with other changes to the code. I found this a bit surprising, so I retested on the IBM POWER7 box at concurrencies from 1 to 64 and found some test runs faster and some test runs slower - maybe a few more faster than slower. I could do a more exact analysis, but I'm inclined to think that if there is an effect here, it's probably just in the noise, and that the real effect you measured was, as you say, the result of cache-line shifting or some other uninteresting artifact that could just as easily move back the other way on some other commit. Really, s_lock should not be getting called often enough to matter much, and ignoring the return value of that function shouldn't cost anyone much. Two suggestions: In your original email you say number of pg_usleep() calls that are required to acquire each LWLock, but nothing in the code says this.
Just reading lwlock.c I would naively assume it is reporting the number of TAS spins, not the number of spin-delays (and in fact that is what I did assume until I read your email more carefully). A comment somewhere in lwlock.c would be helpful. Yeah, or maybe I should just change the printout to say spindelay instead of spin, and rename the variables likewise. Also in lwlock.c, if (sh_acquire_counts[i] || ex_acquire_counts[i] || block_counts[i] || spin_counts[i]) I don't think we can have spins (or blocks, for that matter) unless we have some acquires to have caused them, so the last two tests in that line seem to be noise. Perhaps so, but I think it's probably safe and clear to just follow the existing code pattern. Since my suggestions are minor, should I go ahead and mark this ready for committer? If you think it should be committed, yes. So Jeff did that, and I've now committed the patch. Thanks for the review. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is needed
On Fri, Jun 22, 2012 at 10:12 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jun 19, 2012 at 5:41 AM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: I'm confused by this remark, because surely the query planner does it this way only if there's no LIMIT. When there is a LIMIT, we choose based on the startup cost plus the estimated fraction of the total cost we expect to pay based on dividing the LIMIT by the overall row count estimate. Or is this not what you're talking about? I think that Ants is pointing the way of estimating costs in choose_hashed_grouping()/choose_hashed_distinct(), ie cost_agg() for cheapest_path + hashagg, where the costs are calculated based on the total cost only of cheapest_path. I think that it might be good to do cost_agg() for the discussed case with the AGG_SORTED strategy, not the AGG_HASHED strategy. Well, Ants already made some adjustments to those functions; not sure if this means they need some more adjustment, but I don't see that there's a general problem with the costing algorithm around LIMIT. Ants, do you intend to update this patch for this CommitFest? Or at all? It seems nobody's too excited about this, so I'm not sure whether it makes sense for you to put more work on it. But please advise as to your plans. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Posix Shared Mem patch
Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012: Robert, all: Last I checked, we had a reasonably acceptable patch to use mostly Posix Shared mem with a very small sysv ram partition. Is there anything keeping this from going into 9.3? It would eliminate a major configuration headache for our users. I don't think that patch was all that reasonable. It needed work, and in any case it needs a rebase because it was pretty old. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] [v9.3] Row-Level Security
2012/6/26 Robert Haas robertmh...@gmail.com: Of course, here is some limitations, to keep the patch size reasonable level to review. - The permission to bypass RLS policy was under discussion. If and when we should provide a special permission to bypass RLS policy, the OR has_superuser_privilege() shall be replaced by OR has_table_privilege(tableoid, 'RLSBYPASS'). I think you're missing the point. Everyone who has commented on this issue is in favor of having some check that causes the RLS predicate *not to get added in the first place*. Adding a modified predicate is not the same thing. First, the query planner might not be smart enough to optimize away the clause even when the predicate holds, causing an unnecessary performance drain. Second, there's too much danger of being able to set a booby-trap for the superuser this way. Suppose that the RLS subsystem replaces f_malicious() by f_malicious OR has_superuser_privilege(). Now the superuser comes along with the nightly pg_dump run and the query optimizer executes SELECT * FROM nuts WHERE f_malicious() OR has_superuser_privilege(). The query optimizer notes that the cost of f_malicious() is very low and decides to execute that before has_superuser_privilege(). Oops. I think it's just not acceptable to handle this by clause-munging: we need to not add the clause in the first place. Here is a simple idea to avoid the second problematic scenario: assign 0 as the cost of has_superuser_privilege(). It allows keeping this function more lightweight than any possible malicious function, since CreateFunction enforces a positive value. But the first point still remains. As you pointed out before, it might be a solution to have case-handling for superusers and others in the case of the simple query protocol, which uses the same snapshot for the planner and executor stages. How should we handle the issue?
During the previous discussion, Tom mentioned an idea of saving prepared statements hashed by user-id to switch the query plan depending on the user's privilege. Even though I hesitated to buy this idea at that time, it might be worth investigating to satisfy both security and performance; it would generate multiple query plans to be chosen later at the executor stage. What is your opinion? Comments on the patch itself: 1. Please do not abbreviate rowlevel to rowlv or RowLevel to RowLv or ROWLEVEL to ROWLV. That makes it harder to read and harder to grep. Spell it out. OK, 2. Since the entirety of ATExecSetRowLvSecurity is conditional on whether clause != NULL, you might as well split it into two functions, one for each case. OK, 3. The fact that you've had to hack preprocess_targetlist and adjust_appendrel_attrs_mutator suggests to me that the insertion of the generate subquery is happening at the wrong phase of the process. We don't need those special cases for views, so it seems like we shouldn't need them here, either. The main reason why I had to patch them is special-case handling for references to system columns, which is unavailable for subqueries. But I'm not 100% sure about this implementation, so it needs more investigation. Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp
Re: [HACKERS] pg_terminate_backend for same-role
On Thu, Mar 29, 2012 at 3:04 AM, Daniel Farina dan...@heroku.com wrote: On Mon, Mar 26, 2012 at 12:14 AM, Jeff Davis pg...@j-davis.com wrote: On Tue, 2012-03-20 at 01:38 -0700, Daniel Farina wrote: On Thu, Mar 15, 2012 at 9:39 PM, Fujii Masao masao.fu...@gmail.com wrote: On Fri, Mar 16, 2012 at 8:14 AM, Daniel Farina dan...@heroku.com wrote: Parallel to pg_cancel_backend, it'd be nice to allow the user to just outright kill a backend that they own (politely, with a SIGTERM), aborting any transactions in progress, including the idle transaction, and closing the socket. +1 Here's a patch implementing the simple version, with no more guards against signal racing than have been seen previously. The more elaborate variants to close those races is being discussed in a parallel thread, but I thought I'd get this simple version out there. Review: After reading through the threads, it looks like there was no real objection to this approach -- pid recycling is not something we're concerned about. I think you're missing a doc update though, in func.sgml: For the less restrictive <function>pg_cancel_backend</>, the role of an active backend can be found from the <structfield>usename</structfield> column of the <structname>pg_stat_activity</structname> view. Also, high-availability.sgml makes reference to pg_cancel_backend(), and it might be worthwhile to add an ...and pg_terminate_backend() there. Other than that, it looks good to me. Good comments. Patch attached to address the doc issues. The only iffy thing is that the paragraph For the less restrictive... I have opted to remove in its entirety. I think the documents are already pretty clear about the same-user rule, and I'm not sure if this is the right place for a crash-course on attributes in pg_stat_activity (but maybe it is). ...and pg_terminate_backend seems exactly right. And I think now that the system post-patch doesn't have such a strange contrast between the ability to send SIGINT vs.
SIGTERM, such a contrast may not be necessary. I'm also not sure what the policy is about filling paragraphs in the manual. I filled one, which increases the sgml churn a bit. git (show|diff) --word-diff helps clean it up. I went ahead and committed this. I kinda think we should back-patch this into 9.2. It doesn't involve a catalog change, and would make the behavior consistent between the two releases, instead of changing in 9.1 and then changing again in 9.2. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [v9.3] Row-Level Security
Kohei KaiGai kai...@kaigai.gr.jp writes: 2012/6/26 Robert Haas robertmh...@gmail.com: I think you're missing the point. Everyone who has commented on this issue is in favor of having some check that causes the RLS predicate *not to get added in the first place*. Here is a simple idea to avoid the second problematic scenario; that assign 0 as cost of has_superuser_privilege(). I am not sure which part of this isn't safe isn't getting through to you. Aside from the scenarios Robert mentioned, consider the possibility that f_malicious() is marked immutable, so that the planner is likely to call it (to replace the call with its value) before it will ever think about whether has_superuser_privilege should be called first. Please just do what everybody is asking for, and create a bypass that does not require fragile, easily-broken-by-future-changes assumptions about what the planner will do with a WHERE clause. regards, tom lane
Re: [HACKERS] patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap
On Wed, May 2, 2012 at 9:01 PM, Josh Berkus j...@agliodbs.com wrote: On 5/2/12 10:20 AM, Jameison Martin wrote: Attached are the following as per various requests:

* test_results.txt: the performance benchmarking results,
* TestTrailingNull.java: the performance benchmarking code, with a few additional scenarios as per various requests
* hardinfo_report.txt: some information about the hardware and OS of the system on which the benchmarks were run, and
* postgresql.conf: the postgresql.conf used when running benchmarks. Note that the changes made to the vanilla postgresql.conf can be identified by looking for the string 'jamie' in the file I attached (there aren't that many)

Nice, thanks. I'll try some of my own tests when I get a chance; I have a really good use-case for this optimization. Josh, The CommitFest application lists you as the reviewer for this patch. Are you (I hope) planning to review it? I see you posted a follow-up email asking Tom what he had in mind. Personally, I don't think this needs incredibly complicated testing. I think you should just test a workload involving inserting and/or updating rows with lots of trailing NULL columns, and then another workload with a table of similar width that... doesn't. If we can't find a regression - or, better, we find a win in one or both cases - then I think we're done here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] tuplesort memory usage: grow_memtuples
On Sat, Mar 3, 2012 at 3:22 PM, Jeff Janes jeff.ja...@gmail.com wrote: When sorting small tuples, the memtuples array can use a substantial fraction of the total per-tuple memory used. (In the case of pass by value, it is all of it) The way it grows leads to sub-optimal memory usage. Greg, I see you signed up to review this on the 14th, but I don't see a review yet. Are you planning to post one soon? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Posix Shared Mem patch
On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012: Robert, all: Last I checked, we had a reasonably acceptable patch to use mostly Posix Shared mem with a very small sysv ram partition. Is there anything keeping this from going into 9.3? It would eliminate a major configuration headache for our users. I don't think that patch was all that reasonable. It needed work, and in any case it needs a rebase because it was pretty old. Yep, agreed. I'd like to get this fixed too, but it hasn't made it up to the top of my list of things to worry about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] proof concept - access to session variables on client side
Pavel Stehule pavel.steh...@gmail.com writes: it is not security issue - just I dislike sending complete stack, when just only one variable should be used. That's a pretty darn weak argument. If I read the patch correctly, what you're proposing involves a dynamic fetch from the client at runtime, which is going to be disastrous for performance. Quite aside from the network round trip involved, the fetch function would have to be marked volatile (since it has client-visible side-effects, not to mention that we don't know when the client might change the variable value); which would really hurt any query involving it, and probably lead to yet more round trips. Pushing over the known values once at session start (and individual values after updates) is likely to be vastly better-performant than this. Matters could be improved further by requiring variables to be sent to the server to be explicitly marked, which seems like a good idea anyway in case anybody has security concerns that they're not going to let you airily dismiss. Another thing I don't care for is the unannounced protocol extension. This feature is just not interesting enough to justify breaking client compatibility, but that's what it would do as proposed. Clients that haven't heard of this 'v' message would probably think they'd lost sync and drop the connection. (BTW, the patch doesn't seem to include the added backend source file?) regards, tom lane
Re: [HACKERS] Posix Shared Mem patch
On 6/26/12 2:13 PM, Robert Haas wrote: On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012: Robert, all: Last I checked, we had a reasonably acceptable patch to use mostly Posix Shared mem with a very small sysv ram partition. Is there anything keeping this from going into 9.3? It would eliminate a major configuration headache for our users. I don't think that patch was all that reasonable. It needed work, and in any case it needs a rebase because it was pretty old. Yep, agreed. I'd like to get this fixed too, but it hasn't made it up to the top of my list of things to worry about. Was there a post-AgentM version of the patch, which incorporated the small SysV RAM partition? I'm not finding it. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] proof concept - access to session variables on client side
Merlin Moncure mmonc...@gmail.com writes: On Tue, Jun 26, 2012 at 3:05 AM, Magnus Hagander mag...@hagander.net wrote: But with a small change to psql they could, without the need for a whole new type of variable. For example, psql could set all those variable as psql.commandlinevarname, which could then be accessed from the DO PL code just fine. That's a really neat idea. I do see a problem with this client-push idea, which is what happens if psql sends a SET and later the active transaction gets rolled back. psql does not have enough knowledge to be sure whether it lost the SET or not. It could hack things by always resending all variables after any rollback, but ugh. We could address that by inventing a non-transactional variant of SET, perhaps. Not sure it's worth the complication though --- I don't think I want to have to define how that would interact with other variants of SET in the same transaction ... Another approach would be to define such variables as being truly shared, in the spirit of last-update-wins multi-master replication. The backend sends over its values using the existing GUC_REPORT mechanism. So a rollback would cause the psql-side variable to revert as well. Not actually sure if that behavior would be more or less useful than a simpler definition, but it's worth thinking about. In this connection, there was some recent discussion in the jdbc list of wanting clients to be able to set the GUC_REPORT flag on any GUC variable, because the jdbc driver would like to track some settings we have not seen fit to mark that way. Not sure if anybody mentioned that on -hackers yet, but it's coming. If we had that ability then a shared-variable behavior like this could be built entirely on the psql side: the push part is just SET, and the pull part is GUC_REPORT. regards, tom lane
Re: [HACKERS] Posix Shared Mem patch
On Tue, Jun 26, 2012 at 2:18 PM, Josh Berkus j...@agliodbs.com wrote: On 6/26/12 2:13 PM, Robert Haas wrote: On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012: Robert, all: Last I checked, we had a reasonably acceptable patch to use mostly Posix Shared mem with a very small sysv ram partition. Is there anything keeping this from going into 9.3? It would eliminate a major configuration headache for our users. I don't think that patch was all that reasonable. It needed work, and in any case it needs a rebase because it was pretty old. Yep, agreed. I'd like to get this fixed too, but it hasn't made it up to the top of my list of things to worry about. Was there a post-AgentM version of the patch, which incorporated the small SySV RAM partition? I'm not finding it. On that, I used to be of the opinion that this is a good compromise (a small amount of interlock space, plus mostly posix shmem), but I've heard since then (I think via AgentM indirectly, but I'm not sure) that there are cases where even the small SysV segment can cause problems -- notably when other software tweaks shared memory settings on behalf of a user, but only leaves just-enough for the software being installed. This is most likely on platforms that don't have a high SysV shmem limit by default, so installers all feel the prerogative to increase the limit, but there's no great answer for how to compose a series of such installations. It only takes one installer that says whatever, I'm just catenating stuff to sysctl.conf that works for me to sabotage Postgres' ability to start. So there may be a benefit in finding a way to have no SysV memory at all. I wouldn't let perfect be the enemy of good to make progress here, but it appears this was a witnessed real problem, so it may be worth reconsidering if there is a way we can safely remove all SysV by finding an alternative to the nattach mechanic. 
-- fdr
Re: [HACKERS] Posix Shared Mem patch
On Tue, Jun 26, 2012 at 5:18 PM, Josh Berkus j...@agliodbs.com wrote: On 6/26/12 2:13 PM, Robert Haas wrote: On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012: Robert, all: Last I checked, we had a reasonably acceptable patch to use mostly Posix Shared mem with a very small sysv ram partition. Is there anything keeping this from going into 9.3? It would eliminate a major configuration headache for our users. I don't think that patch was all that reasonable. It needed work, and in any case it needs a rebase because it was pretty old. Yep, agreed. I'd like to get this fixed too, but it hasn't made it up to the top of my list of things to worry about. Was there a post-AgentM version of the patch, which incorporated the small SySV RAM partition? I'm not finding it. To my knowledge, no. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Posix Shared Mem patch
On that, I used to be of the opinion that this is a good compromise (a small amount of interlock space, plus mostly posix shmem), but I've heard since then (I think via AgentM indirectly, but I'm not sure) that there are cases where even the small SysV segment can cause problems -- notably when other software tweaks shared memory settings on behalf of a user, but only leaves just-enough for the software being installed. This is most likely on platforms that don't have a high SysV shmem limit by default, so installers all feel the prerogative to increase the limit, but there's no great answer for how to compose a series of such installations. It only takes one installer that says whatever, I'm just catenating stuff to sysctl.conf that works for me to sabotage Postgres' ability to start. Personally, I see this as rather an extreme case, and aside from AgentM himself, have never run into it before. Certainly it would be useful to not need SysV RAM at all, but it's more important to get a working patch for 9.3. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] why roll-your-own s_lock? / improving scalability
On Tue, Jun 26, 2012 at 01:46:06PM -0500, Merlin Moncure wrote: Well, that would introduce a backend dependency on pthreads, which is unpleasant. Also you'd need to feature test via _POSIX_THREAD_PROCESS_SHARED to make sure you can mutex between processes (and configure your mutexes as such when you do). There are probably other reasons why this can't be done, but I personally don't know of any. And then you have fabulous things like: https://git.reviewboard.kde.org/r/102145/ (OSX defines _POSIX_THREAD_PROCESS_SHARED but does not actually support it.) Seems not very well tested in any case. It might be worthwhile testing futexes on Linux though, they are specifically supported on any kind of shared memory (shm/mmap/fork/etc) and quite well tested. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] pg_terminate_backend for same-role
On Tue, Jun 26, 2012 at 1:58 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Mar 29, 2012 at 3:04 AM, Daniel Farina dan...@heroku.com wrote: On Mon, Mar 26, 2012 at 12:14 AM, Jeff Davis pg...@j-davis.com wrote: On Tue, 2012-03-20 at 01:38 -0700, Daniel Farina wrote: On Thu, Mar 15, 2012 at 9:39 PM, Fujii Masao masao.fu...@gmail.com wrote: On Fri, Mar 16, 2012 at 8:14 AM, Daniel Farina dan...@heroku.com wrote: Parallel to pg_cancel_backend, it'd be nice to allow the user to just outright kill a backend that they own (politely, with a SIGTERM), aborting any transactions in progress, including the idle transaction, and closing the socket. +1 Here's a patch implementing the simple version, with no more guards against signal racing than have been seen previously. The more elaborate variants to close those races are being discussed in a parallel thread, but I thought I'd get this simple version out there. Review: After reading through the threads, it looks like there was no real objection to this approach -- pid recycling is not something we're concerned about. I think you're missing a doc update though, in func.sgml: For the less restrictive <function>pg_cancel_backend</>, the role of an active backend can be found from the <structfield>usename</structfield> column of the <structname>pg_stat_activity</structname> view. Also, high-availability.sgml makes reference to pg_cancel_backend(), and it might be worthwhile to add an ...and pg_terminate_backend() there. Other than that, it looks good to me. Good comments. Patch attached to address the doc issues. The only iffy thing is that the paragraph For the less restrictive... I have opted to remove in its entirety. I think the documents are already pretty clear about the same-user rule, and I'm not sure if this is the right place for a crash-course on attributes in pg_stat_activity (but maybe it is). ...and pg_terminate_backend seems exactly right.
And I think now that the system post-patch doesn't have such a strange contrast between the ability to send SIGINT vs. SIGTERM, such a contrast may not be necessary. I'm also not sure what the policy is about filling paragraphs in the manual. I filled one, which increases the sgml churn a bit. git (show|diff) --word-diff helps clean it up. I went ahead and committed this. I kinda think we should back-patch this into 9.2. It doesn't involve a catalog change, and would make the behavior consistent between the two releases, instead of changing in 9.1 and then changing again in 9.2. Thoughts? I think that would be good. It is at the level of pain where I would backpatch from day one, but I think it would be a welcome change to people who couldn't justify gnashing their teeth and distributing a tweaked Postgres like I would. It saves me some effort, too. -- fdr -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posix Shared Mem patch
On Tue, Jun 26, 2012 at 5:44 PM, Josh Berkus j...@agliodbs.com wrote: On that, I used to be of the opinion that this is a good compromise (a small amount of interlock space, plus mostly posix shmem), but I've heard since then (I think via AgentM indirectly, but I'm not sure) that there are cases where even the small SysV segment can cause problems -- notably when other software tweaks shared memory settings on behalf of a user, but only leaves just-enough for the software being installed. This is most likely on platforms that don't have a high SysV shmem limit by default, so installers all feel the prerogative to increase the limit, but there's no great answer for how to compose a series of such installations. It only takes one installer that says whatever, I'm just catenating stuff to sysctl.conf that works for me to sabotage Postgres' ability to start. Personally, I see this as rather an extreme case, and aside from AgentM himself, have never run into it before. Certainly it would be useful to not need SysV RAM at all, but it's more important to get a working patch for 9.3. +1. I'd sort of given up on finding a solution that doesn't involve system V shmem anyway, but now that I think about it... what about using a FIFO? The man page for open on MacOS X says: [ENXIO] O_NONBLOCK and O_WRONLY are set, the file is a FIFO, and no process has it open for reading. And Linux says: ENXIO O_NONBLOCK | O_WRONLY is set, the named file is a FIFO and no process has the file open for reading. Or, the file is a device special file and no corresponding device exists. And HP/UX says: [ENXIO] O_NDELAY is set, the named file is a FIFO, O_WRONLY is set, and no process has the file open for reading. So, what about keeping a FIFO in the data directory? When the postmaster starts up, it tries to open the file with O_NONBLOCK | O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather than O_NONBLOCK). If that succeeds, it bails out.
If it fails with anything other than ENXIO, it bails out. If it fails with exactly ENXIO, then it opens the pipe with O_RDONLY and arranges to pass the file descriptor down to all of its children, so that a subsequent open will fail if it or any of its children are still alive. This might even be more reliable than what we do right now, because our current system appears not to be robust against the removal of postmaster.pid. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posix Shared Mem patch
Excerpts from Daniel Farina's message of mar jun 26 17:40:16 -0400 2012: On that, I used to be of the opinion that this is a good compromise (a small amount of interlock space, plus mostly posix shmem), but I've heard since then (I think via AgentM indirectly, but I'm not sure) that there are cases where even the small SysV segment can cause problems -- notably when other software tweaks shared memory settings on behalf of a user, but only leaves just-enough for the software being installed. This argument is what killed the original patch. If you want to get anything done *at all* I think it needs to be dropped. Changing shmem implementation is already difficult enough --- you don't need to add the requirement that the interlocking mechanism be changed simultaneously. You (or whoever else) can always work on that as a followup patch. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal
David Fetter da...@fetter.org wrote: On Mon, Jun 25, 2012 at 01:50:54PM -0500, Kevin Grittner wrote: One fine point regarding before and after images -- if a value doesn't change in an UPDATE, there's no reason to include it in both the BEFORE and AFTER tuple images, as long as we have the null column bitmaps -- or some other way of distinguishing unchanged from NULL. (This could be especially important when the unchanged column was a 50 MB bytea.) How about two bitmaps: one telling which columns are actually there, the other with NULLs? There are quite a few ways that could be done, but I suspect Álvaro's idea is best: http://archives.postgresql.org/message-id/1340654533-sup-5...@alvh.no-ip.org In any event, it sounds like Andres wants to keep it as simple as possible for the moment, and just include both tuples in their entirety. Hopefully that is something which can be revisited before the last CF. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] why roll-your-own s_lock? / improving scalability
Martijn van Oosterhout klep...@svana.org writes: And then you have fabulous things like: https://git.reviewboard.kde.org/r/102145/ (OSX defines _POSIX_THREAD_PROCESS_SHARED but does not actually support it.) Seems not very well tested in any case. It might be worthwhile testing futexes on Linux though, they are specifically supported on any kind of shared memory (shm/mmap/fork/etc) and quite well tested. Yeah, a Linux-specific replacement of spinlocks with futexes seems like a lot safer idea than let's rely on posix mutexes everywhere. It's still unproven whether it'd be an improvement, but you could expect to prove it one way or the other with a well-defined amount of testing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posix Shared Mem patch
On Tue, Jun 26, 2012 at 2:53 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Daniel Farina's message of mar jun 26 17:40:16 -0400 2012: On that, I used to be of the opinion that this is a good compromise (a small amount of interlock space, plus mostly posix shmem), but I've heard since then (I think via AgentM indirectly, but I'm not sure) that there are cases where even the small SysV segment can cause problems -- notably when other software tweaks shared memory settings on behalf of a user, but only leaves just-enough for the software being installed. This argument is what killed the original patch. If you want to get anything done *at all* I think it needs to be dropped. Changing shmem implementation is already difficult enough --- you don't need to add the requirement that the interlocking mechanism be changed simultaneously. You (or whoever else) can always work on that as a followup patch. True, but then again, I did very intentionally write: Excerpts from Daniel Farina's message of mar jun 26 17:40:16 -0400 2012: *I wouldn't let perfect be the enemy of good* to make progress here, but it appears this was a witnessed real problem, so it may be worth reconsidering if there is a way we can safely remove all SysV by finding an alternative to the nattach mechanic. (Emphasis mine). I don't think that -hackers at the time gave the zero-shmem rationale much weight (I also was not that happy about the safety mechanism of that patch), but upon more reflection (and taking into account *other* software that may mangle shmem settings) I think it's something at least worth thinking about again one more time. What killed the patch was an attachment to the deemed-less-safe strategy for avoiding bogus shmem attachments already in it, but I don't seem to recall anyone putting a whole lot of thought at the time into the zero-shmem case from what I could read on the list, because a small interlock with nattach seemed good-enough.
I'm simply suggesting that for additional benefits it may be worth thinking about getting around nattach and thus SysV shmem, especially with regard to safety, in an open-ended way. Maybe there's a solution (like Robert's FIFO suggestion?) that is not too onerous and can satisfy everyone. -- fdr -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posix Shared Mem patch
On Jun 26, 2012, at 5:44 PM, Josh Berkus wrote: On that, I used to be of the opinion that this is a good compromise (a small amount of interlock space, plus mostly posix shmem), but I've heard since then (I think via AgentM indirectly, but I'm not sure) that there are cases where even the small SysV segment can cause problems -- notably when other software tweaks shared memory settings on behalf of a user, but only leaves just-enough for the software being installed. This is most likely on platforms that don't have a high SysV shmem limit by default, so installers all feel the prerogative to increase the limit, but there's no great answer for how to compose a series of such installations. It only takes one installer that says whatever, I'm just catenating stuff to sysctl.conf that works for me to sabotage Postgres' ability to start. Personally, I see this as rather an extreme case, and aside from AgentM himself, have never run into it before. Certainly it would be useful to not need SysV RAM at all, but it's more important to get a working patch for 9.3. This can be trivially reproduced if one runs an old (SysV shared memory-based) postgresql alongside a potentially newer postgresql with a smaller SysV segment. This can occur with applications that bundle postgresql as part of the app. Cheers, M -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posix Shared Mem patch
This can be trivially reproduced if one runs an old (SysV shared memory-based) postgresql alongside a potentially newer postgresql with a smaller SysV segment. This can occur with applications that bundle postgresql as part of the app. I'm not saying it doesn't happen at all. I'm saying it's not the 80% case. So let's fix the 80% case with something we feel confident in, and then revisit the no-sysv interlock as a separate patch. That way if we can't fix the interlock issues, we still have a reduced-shmem version of Postgres. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posix Shared Mem patch
Robert Haas robertmh...@gmail.com writes: So, what about keeping a FIFO in the data directory? Hm, does that work if the data directory is on NFS? Or some other weird not-really-Unix file system? When the postmaster starts up, it tries to open the file with O_NONBLOCK | O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather than O_NONBLOCK). If that succeeds, it bails out. If it fails with anything other than ENXIO, it bails out. If it fails with exactly ENXIO, then it opens the pipe with O_RDONLY ... race condition here ... and arranges to pass the file descriptor down to all of its children, so that a subsequent open will fail if it or any of its children are still alive. This might be made to work, but that doesn't sound quite right in detail. I remember we speculated about using an fcntl lock on some file in the data directory, but that fails because child processes don't inherit fcntl locks. In the modern world, it'd be really a step forward if the lock mechanism worked on shared storage, ie a data directory on NFS or similar could be locked against all comers not just those on the same node as the original postmaster. I don't know how to do that though. In the meantime, insisting that we solve this problem before we do anything is a good recipe for ensuring that nothing happens, just like it hasn't happened for the last half dozen years. (I see Alvaro just made the same point.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posix Shared Mem patch
On Jun 26, 2012, at 6:12 PM, Daniel Farina wrote: (Emphasis mine). I don't think that -hackers at the time gave the zero-shmem rationale much weight (I also was not that happy about the safety mechanism of that patch), but upon more reflection (and taking into account *other* software that may mangle shmem settings) I think it's something at least worth thinking about again one more time. What killed the patch was an attachment to the deemed-less-safe strategy for avoiding bogus shmem attachments already in it, but I don't seem to recall anyone putting a whole lot of thought at the time into the zero-shmem case from what I could read on the list, because a small interlock with nattach seemed good-enough. I'm simply suggesting that for additional benefits it may be worth thinking about getting around nattach and thus SysV shmem, especially with regard to safety, in an open-ended way. Maybe there's a solution (like Robert's FIFO suggestion?) that is not too onerous and can satisfy everyone. I solved this via fcntl locking. I also set up gdb to break in critical regions to test the interlock and I found no flaw in the design. More eyes would be welcome, of course. https://github.com/agentm/postgres/tree/posix_shmem Cheers, M -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posix Shared Mem patch
Josh Berkus j...@agliodbs.com writes: So let's fix the 80% case with something we feel confident in, and then revisit the no-sysv interlock as a separate patch. That way if we can't fix the interlock issues, we still have a reduced-shmem version of Postgres. Yes. Insisting that we have the whole change in one patch is a good way to prevent any forward progress from happening. As Alvaro noted, there are plenty of issues to resolve without trying to change the interlock mechanism at the same time. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posix Shared Mem patch
Tom Lane t...@sss.pgh.pa.us wrote: In the meantime, insisting that we solve this problem before we do anything is a good recipe for ensuring that nothing happens, just like it hasn't happened for the last half dozen years. (I see Alvaro just made the same point.) And now so has Josh. +1 from me, too. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posix Shared Mem patch
A.M. age...@themactionfaction.com writes: This can be trivially reproduced if one runs an old (SysV shared memory-based) postgresql alongside a potentially newer postgresql with a smaller SysV segment. This can occur with applications that bundle postgresql as part of the app. I don't believe that that case is a counterexample to what's being proposed (namely, grabbing a minimum-size shmem segment, perhaps 1K). It would only fail if the old postmaster ate up *exactly* SHMMAX worth of shmem, which is not real likely. As a data point, on my Mac laptop with SHMMAX set to 32MB, 9.2 will by default eat up 31624KB, leaving more than a meg available. Sure, that isn't enough to start another old-style postmaster, but it would be plenty of room for one that only wants 1K. Even if you actively try to configure the shmem settings to exactly fill shmmax (which I concede some installation scripts might do), it's going to be hard to do because of the 8K granularity of the main knob, shared_buffers. Moreover, an installation script that did that would soon learn not to, because of the fact that we don't worry too much about changing small details of shared memory consumption in minor releases. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] experimental: replace s_lock spinlock code with pthread_mutex on linux
It's still unproven whether it'd be an improvement, but you could expect to prove it one way or the other with a well-defined amount of testing. I've hacked the code to use adaptive pthread mutexes instead of spinlocks. See attached patch. The patch is for the git head, but it can easily be applied to 9.1.3, which is what I did for my tests. This had disastrous effects on Solaris because it does not use anything similar to futexes for PTHREAD_PROCESS_SHARED mutexes (only the _PRIVATE mutexes do without syscalls for the simple case). But I was surprised to see that it works relatively well on Linux. Here's a glimpse of my results: hacked code 9.1.3: -bash-4.1$ rsync -av --delete /tmp/test_template_data/ ../data/ ; /usr/bin/time ./postgres -D ../data -p 55502 & ppid=$! ; pid=$(pgrep -P $ppid) ; sleep 15 ; ./pgbench -c 768 -t 20 -j 128 -p 55502 postgres ; kill $pid sending incremental file list ... transaction type: TPC-B (sort of) scaling factor: 10 query mode: simple number of clients: 768 number of threads: 128 number of transactions per client: 20 number of transactions actually processed: 15360/15360 tps = 476.873261 (including connections establishing) tps = 485.964355 (excluding connections establishing) LOG: received smart shutdown request LOG: autovacuum launcher shutting down -bash-4.1$ LOG: shutting down LOG: database system is shut down 210.58user 78.88system 0:50.64elapsed 571%CPU (0avgtext+0avgdata 1995968maxresident)k 0inputs+1153872outputs (0major+2464649minor)pagefaults 0swaps original code (vanilla build on amd64) 9.1.3: -bash-4.1$ rsync -av --delete /tmp/test_template_data/ ../data/ ; /usr/bin/time ./postgres -D ../data -p 55502 & ppid=$! ; pid=$(pgrep -P $ppid) ; sleep 15 ; ./pgbench -c 768 -t 20 -j 128 -p 55502 postgres ; kill $pid sending incremental file list ...
transaction type: TPC-B (sort of) scaling factor: 10 query mode: simple number of clients: 768 number of threads: 128 number of transactions per client: 20 number of transactions actually processed: 15360/15360 tps = 499.993685 (including connections establishing) tps = 510.410883 (excluding connections establishing) LOG: received smart shutdown request -bash-4.1$ LOG: autovacuum launcher shutting down LOG: shutting down LOG: database system is shut down 196.21user 71.38system 0:47.99elapsed 557%CPU (0avgtext+0avgdata 1360800maxresident)k 0inputs+1147904outputs (0major+2375965minor)pagefaults 0swaps config: -bash-4.1$ egrep '^[a-z]' /tmp/test_template_data/postgresql.conf max_connections = 1800 # (change requires restart) shared_buffers = 10GB # min 128kB temp_buffers = 64MB # min 800kB work_mem = 256MB # min 64kB, default 1MB maintenance_work_mem = 2GB # min 1MB, default 16MB bgwriter_delay = 10ms # 10-1ms between rounds bgwriter_lru_maxpages = 1000 # 0-1000 max buffers written/round bgwriter_lru_multiplier = 10.0 # 0-10.0 multiplier on buffers scanned/round wal_level = hot_standby # minimal, archive, or hot_standby wal_buffers = 64MB # min 32kB, -1 sets based on shared_buffers commit_delay = 1 # range 0-10, in microseconds datestyle = 'iso, mdy' lc_messages = 'en_US.UTF-8' # locale for system error message lc_monetary = 'en_US.UTF-8' # locale for monetary formatting lc_numeric = 'en_US.UTF-8' # locale for number formatting lc_time = 'en_US.UTF-8' # locale for time formatting default_text_search_config = 'pg_catalog.english' seq_page_cost = 1.0 # measured on an arbitrary scale random_page_cost = 1.5 # same scale as above (default: 4.0) cpu_tuple_cost = 0.005 cpu_index_tuple_cost = 0.0025 cpu_operator_cost = 0.0001 effective_cache_size = 192GB So it looks like using pthread_mutexes could at least be an option on Linux. Using futexes directly could be even cheaper.
As a side note, it looks like I have not expressed myself clearly: I did not intend to suggest to replace proven, working code (which probably is the best you can get for some platforms) with posix calls. I apologize for the provocative question. Regarding the actual production issue, I did not manage to synthetically provoke the saturation we are seeing in production using pgbench - I could not even get anywhere near the production load. So I cannot currently test if reducing the amount of spinning and waking up exactly one waiter (which is what linux/nptl pthread_mutex_unlock does) would solve/mitigate the production issue I am working on, and I'd highly appreciate any pointers in this direction. Cheers, Nils diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c index bc8d89f..a45fdf6 100644 --- a/src/backend/storage/lmgr/s_lock.c +++ b/src/backend/storage/lmgr/s_lock.c @@ -20,6 +20,8 @@
Re: [HACKERS] Posix Shared Mem patch
Excerpts from Tom Lane's message of mar jun 26 18:58:45 -0400 2012: Even if you actively try to configure the shmem settings to exactly fill shmmax (which I concede some installation scripts might do), it's going to be hard to do because of the 8K granularity of the main knob, shared_buffers. Actually it's very easy -- just try to start postmaster on a system with not enough shmmax and it will tell you how much shmem it wants. Then copy that number verbatim in the config file. This might fail on picky systems such as MacOSX that require some exact multiple or power of some other parameter, but it works fine on Linux. I think the minimum you can request, at least on Linux, is 1 byte. Moreover, an installation script that did that would soon learn not to, because of the fact that we don't worry too much about changing small details of shared memory consumption in minor releases. +1 -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] embedded list v2
Hi, To recapitulate why I think this sort of embedded list is worthwhile: * minimal memory overhead (16 bytes for double linked list heads/nodes on 64bit systems) * no additional memory allocation overhead * no additional dereference to access the contents of a list element * most modifications are completely branchless * the current dllist.h interface has double the memory overhead and much more complex manipulation operators * Multiple places in postgres have grown local single or double linked list implementations * I need it ;) Attached are three patches: 1. embedded list implementation 2. make the list implementation usable without USE_INLINE 3. convert all callers to ilist.h away from dllist.h For 1 I: a. added more comments and some introduction, some more functions b. moved the file from utils/ilist.h to lib/ilist.h c. actually included the c file with the check functions d. did *not* split it up into single/double linked list files, doesn't seem to be worth the trouble given how small ilist.(c|h) are e. did *not* try to get an interface similar to dllist.h. I don't think the old one is better and it makes the breakage more obvious should somebody else use the old implementation although I doubt it. I can be convinced to do d. and e. but I don't think they are an improvement. For 2 I used ugly macro hackery to avoid declaring every function twice, in a c file and in a header. Opinions on the state of the above patches? I did not expect any performance difference in the current usage, but just to be sure I ran the following tests: connection heavy: pgbench -n -S -p 5501 -h /tmp -U andres postgres -c 16 -j 16 -T 10 -C master: 3109 3024 3012 ilist: 3097 3033 3024 somewhat SearchCatCache heavy: pgbench -n -S -p 5501 -h /tmp -U andres postgres -T 100 -c 16 -j 1 master: 98979.453879 99554.485631 99393.587880 ilist: 98960.545559 99583.319870 99498.923273 As expected the differences are on the level of noise...
Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

From 2e9d955fbb625004061509a62ecca83fde68d027 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Sun, 6 May 2012 00:26:35 +0200
Subject: [PATCH 1/3] Add embedded list interface (header only)

Adds a single and a double linked list which can easily be embedded into other datastructures and can be used without any additional allocations.

Problematic: It requires USE_INLINE to be used. It could be remade to fall back to externally defined functions if that is not available, but that hardly seems sensible at this day and age. Besides, the speed hit would be noticeable and it's only used in new code which could be disabled on machines - given they still exist - without proper support for inline functions

 3 files changed, 509 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 2e1061e..c94297a 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global

-OBJS = dllist.o stringinfo.o
+OBJS = dllist.o stringinfo.o ilist.o

 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/ilist.c b/src/backend/lib/ilist.c
new file mode 100644
index 000..72de7a3
--- /dev/null
+++ b/src/backend/lib/ilist.c
@@ -0,0 +1,79 @@
+/*-
+ *
+ * ilist.c
+ *	  support for integrated/inline double and single linked lists
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/ilist.c
+ *
+ * NOTES
+ *
+ * This file only contains testing code for inline single/double linked
+ * lists.
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include "lib/ilist.h"
+
+#ifdef ILIST_DEBUG
+void ilist_d_check(ilist_d_head *head)
+{
+	ilist_d_node *cur;
+
+	if (!head ||
+	    !(&head->head))
+		elog(ERROR, "double linked list head is not properly initialized");
+
+	for (cur = head->head.next;
+	     cur != &head->head;
+	     cur = cur->next) {
+		if (!(cur) ||
+		    !(cur->next) ||
+		    !(cur->prev) ||
+		    !(cur->prev->next == cur) ||
+		    !(cur->next->prev == cur))
+		{
+			elog(ERROR, "double linked list is corrupted");
+		}
+	}
+
+	for (cur = head->head.prev;
+	     cur != &head->head;
+	     cur = cur->prev) {
+		if (!(cur) ||
+		    !(cur->next) ||
+		    !(cur->prev) ||
+		    !(cur->prev->next == cur) ||
+		    !(cur->next->prev == cur))
+		{
+			elog(ERROR, "double linked list is corrupted");
+		}
+	}
+}
+
+void ilist_s_check(ilist_s_head *head)
+{
+	ilist_s_node *cur;
+
+	if (!head ||
+	    !(&head->head))
+		elog(ERROR, "single
Re: [HACKERS] Posix Shared Mem patch
A.M. age...@themactionfaction.com writes: On Jun 26, 2012, at 6:12 PM, Daniel Farina wrote: I'm simply suggesting that for additional benefits it may be worth thinking about getting around nattach and thus SysV shmem, especially with regard to safety, in an open-ended way. I solved this via fcntl locking. No, you didn't, because fcntl locks aren't inherited by child processes. Too bad, because they'd be a great solution otherwise. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 08/16] Introduce the ApplyCache module which can reassemble transactions from a stream of interspersed changes
Hi Steve, On Tuesday, June 26, 2012 02:14:22 AM Steve Singer wrote: I planned to have some cutoff 'max_changes_in_memory_per_txn' value. If it has been reached for one transaction all existing changes are spilled to disk. New changes again can be kept in memory till it's reached again. Do you want max_changes_per_in_memory_txn or do you want to put a limit on the total amount of memory that the cache is able to use? How are you going to tell a DBA to tune max_changes_in_memory_per_txn? They know how much memory their system has and how much of it they can devote to the apply cache versus other things; giving them guidance that requires estimating how many open transactions they might have at a point in time and how many WAL change records each transaction generates seems like a step backwards from the progress we've been making in getting PostgreSQL to be easier to tune. The maximum number of transactions that could be open at a time is governed by max_connections on the master at the time the WAL was generated, so I don't even see how the machine processing the WAL records could autotune/guess that. It can even be significantly higher than max_connections because subtransactions are only recognizable as part of their parent transaction upon commit. I think max_changes_in_memory_per_txn will be the number of changes for now. Making memory-based accounting across multiple concurrent transactions work efficiently and correctly isn't easy. We need to support serializing the cache for crash recovery + shutdown of the receiving side as well. Depending on how we do the WAL decoding we will need it more frequently... Have you described your thoughts on crash recovery on another thread? I think I have somewhere, but given how much in flux our thoughts on decoding are I think it's not that important yet. I am thinking that this module would have to serialize some state every time it calls cache->commit() to ensure that consumers don't get invoked twice on the same transaction.
In one of the other patches I implemented it by adding the (origin_id, origin_lsn) pair to replicated commits. During recovery the startup process sets up the shared memory status up to which point we applied. If you then every now and then perform a 'logical checkpoint' writing down what the beginning LSN of the longest in-progress transaction is, you can fully recover from that point on. If the apply module is making changes to the same backend that the apply cache serializes to, then both the state for the apply cache and the changes that committed changes/transactions make will be persisted (or not persisted) together. What if I am replicating from x86 to x86_64 via an apply module that does textout conversions?

x86                Proxy                    x86_64

WAL -------------> apply cache
                   |  (proxy catalog)
                   apply module
                   textout --------------> SQL statements

How do we ensure that the commits are all visible (or not visible) on the catalog on the proxy instance used for decoding WAL, the destination database, and the state + spill files of the apply cache stay consistent in the event of a crash of either the proxy or the target? I don't think you can (unless we consider two-phase commit, and I'd rather we didn't). Can we come up with a way of avoiding the need for them to be consistent with each other? That's discussed in the "Catalog/Metadata consistency during changeset extraction from wal" thread and we haven't yet determined which solution is the best ;)

Code Review
===========
applycache.h
------------
+typedef struct ApplyCacheTupleBuf
+{
+	/* position in preallocated list */
+	ilist_s_node node;
+
+	HeapTupleData tuple;
+	HeapTupleHeaderData header;
+	char data[MaxHeapTupleSize];
+} ApplyCacheTupleBuf;

Each ApplyCacheTupleBuf will be about 8k (BLCKSZ) big no matter how big the data in the transaction is? Wouldn't workloads with inserts of lots of small rows in a transaction eat up lots of memory that is allocated but storing nothing?
The only alternative I can think of is dynamically allocating these, and I don't know what the cost/benefit of that overhead will be versus spilling to disk sooner. Dynamically allocating them totally destroys performance; I tried that. I think at some point we should have 4 or so lists of preallocated tuple bufs of different sizes and then use the smallest possible one. But I think this solution is ok in the very first version. If you allocate dynamically you also get a noticeable performance drop when you let the decoding run for a while because of fragmentation inside the memory allocator.

+/*
+ * FIXME: better name
+ */
+ApplyCacheChange *
+ApplyCacheGetChange(ApplyCache *);

How about: ApplyCacheReserveChangeStruct(..)
Re: [HACKERS] Posix Shared Mem patch
On 06/26/2012 07:30 PM, Tom Lane wrote: A.M. age...@themactionfaction.com writes: On Jun 26, 2012, at 6:12 PM, Daniel Farina wrote: I'm simply suggesting that for additional benefits it may be worth thinking about getting around nattch and thus SysV shmem, especially with regard to safety, in an open-ended way. I solved this via fcntl locking. No, you didn't, because fcntl locks aren't inherited by child processes. Too bad, because they'd be a great solution otherwise. You claimed this last time and I replied: http://archives.postgresql.org/pgsql-hackers/2011-04/msg00656.php I address this race condition by ensuring that a lock-holding violator is the postmaster or a postmaster child. If such a condition is detected, the child exits immediately without touching the shared memory. POSIX shmem is inherited via file descriptors. This is possible because the locking API allows one to request which PID violates the lock. The child expects the lock to be held and checks that the PID is the parent. If the lock is not held, that means that the postmaster is dead, so the child exits immediately. Cheers, M -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posix Shared Mem patch
On 06/26/2012 07:15 PM, Alvaro Herrera wrote: Excerpts from Tom Lane's message of mar jun 26 18:58:45 -0400 2012: Even if you actively try to configure the shmem settings to exactly fill shmmax (which I concede some installation scripts might do), it's going to be hard to do because of the 8K granularity of the main knob, shared_buffers. Actually it's very easy -- just try to start postmaster on a system with not enough shmmax and it will tell you how much shmem it wants. Then copy that number verbatim in the config file. This might fail on picky systems such as MacOSX that require some exact multiple or power of some other parameter, but it works fine on Linux. Except that we have to account for other installers. A user can install an application in the future which clobbers the value and then the original application will fail to run. The options to get the first app working are: a) to re-install the first app (potentially preventing the second app from running), b) to have the first app detect the failure and readjust the value (guessing what it should be) and potentially forcing a reboot, or c) to have the user manually adjust the value and potentially force a reboot. The failure usually gets blamed on the first application. That's why we had to nuke SysV shmem. Cheers, M -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posix Shared Mem patch
On Tue, Jun 26, 2012 at 6:20 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: So, what about keeping a FIFO in the data directory? Hm, does that work if the data directory is on NFS? Or some other weird not-really-Unix file system? I would expect NFS to work in general. We could test that. Of course, it's more than possible that there's some bizarre device out there that purports to be NFS but doesn't actually support mkfifo. It's difficult to prove a negative. When the postmaster starts up, it tries to open the file with O_NONBLOCK | O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather than O_NONBLOCK). If that succeeds, it bails out. If it fails with anything other than ENXIO, it bails out. If it fails with exactly ENXIO, then it opens the pipe with O_RDONLY ... race condition here ... Oh, if someone tries to start two postmasters at the same time? Hmm. and arranges to pass the file descriptor down to all of its children, so that a subsequent open will fail if it or any of its children are still alive. This might be made to work, but that doesn't sound quite right in detail. I remember we speculated about using an fcntl lock on some file in the data directory, but that fails because child processes don't inherit fcntl locks. In the modern world, it'd be really a step forward if the lock mechanism worked on shared storage, ie a data directory on NFS or similar could be locked against all comers not just those on the same node as the original postmaster. I don't know how to do that though. Well, I think that in theory that DOES work. But I also think it's often misconfigured. Which could also be said of NFS in general. In the meantime, insisting that we solve this problem before we do anything is a good recipe for ensuring that nothing happens, just like it hasn't happened for the last half dozen years. (I see Alvaro just made the same point.) Agreed all around. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posix Shared Mem patch
A.M. age...@themactionfaction.com writes: On 06/26/2012 07:30 PM, Tom Lane wrote: I solved this via fcntl locking. No, you didn't, because fcntl locks aren't inherited by child processes. Too bad, because they'd be a great solution otherwise. You claimed this last time and I replied: http://archives.postgresql.org/pgsql-hackers/2011-04/msg00656.php I address this race condition by ensuring that a lock-holding violator is the postmaster or a postmaster child. If such a condition is detected, the child exits immediately without touching the shared memory. POSIX shmem is inherited via file descriptors. This is possible because the locking API allows one to request which PID violates the lock. The child expects the lock to be held and checks that the PID is the parent. If the lock is not held, that means that the postmaster is dead, so the child exits immediately. OK, I went back and re-read the original patch, and I now agree that something like this is possible --- but I don't like the way you did it. The dependence on particular PIDs seems both unnecessary and risky. The key concept here seems to be that the postmaster first stakes a claim on the data directory by exclusive-locking a lock file. If successful, it reduces that lock to shared mode (which can be done atomically, according to the SUS fcntl specification), and then holds the shared lock until it exits. Spawned children will not initially have a lock, but what they can do is attempt to acquire shared lock on the lock file. If that fails, exit. If successful, *check to see that the parent postmaster is still alive* (ie, getppid() != 1). If so, the parent must have been continuously holding the lock, and the child has successfully joined the pool of shared lock holders. Otherwise bail out without having changed anything. It is the "parent is still alive" check, not any test on individual PIDs, that makes this work. There are two concrete reasons why I don't care for the GetPIDHoldingLock() way.
Firstly, the fact that you can get a blocking PID from F_GETLK isn't an essential part of the concept of file locking IMO --- it's just an incidental part of this particular API. May I remind you that the reason we're stuck on SysV shmem in the first place is that we decided to depend on an incidental part of that API, namely nattch? I would like to not require file locking to have any semantics more specific than "a process can hold an exclusive or a shared lock on a file, which is auto-released at process exit". Secondly, in an NFS world I don't believe that the returned l_pid value can be trusted for anything. If it's a PID from a different machine then it might accidentally conflict with one on our machine, or not. Reflecting on this further, it seems to me that the main remaining failure modes are (1) file locking doesn't work, or (2) idiot DBA manually removes the lock file. Both of these could be ameliorated with some refinements to the basic idea. For (1), I suggest that we tweak the startup process (only) to attempt to acquire exclusive lock on the lock file. If it succeeds, we know that file locking is broken, and we can complain. (This wouldn't help for cases where cross-machine locking is broken, but I see no practical way to detect that.) For (2), the problem really is that the proposed patch conflates the PID file with the lock file, but people are conditioned to think that PID files are removable. I suggest that we create a separate, permanently present file that serves only as the lock file and doesn't ever get modified (it need have no content other than the string "Don't remove this!"). It'd be created by initdb, not by individual postmaster runs; indeed the postmaster should fail if it doesn't find the lock file already present. The postmaster PID file should still exist with its current contents, but it would serve mostly as documentation and as server-contact information for pg_ctl; it would not be part of the data directory locking mechanism.
I wonder whether this design can be adapted to Windows? IIRC we do not have a bulletproof data directory lock scheme for Windows. It seems like this makes few enough demands on the lock mechanism that there ought to be suitable primitives available there too. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal
On Tue, Jun 26, 2012 at 05:05:27PM -0500, Kevin Grittner wrote: David Fetter da...@fetter.org wrote: On Mon, Jun 25, 2012 at 01:50:54PM -0500, Kevin Grittner wrote: One fine point regarding before and after images -- if a value doesn't change in an UPDATE, there's no reason to include it in both the BEFORE and AFTER tuple images, as long as we have the null column bitmaps -- or some other way of distinguishing unchanged from NULL. (This could be especially important when the unchanged column was a 50 MB bytea.) How about two bitmaps: one telling which columns are actually there, the other with NULLs? There are quite a few ways that could be done, but I suspect Álvaro's idea is best: http://archives.postgresql.org/message-id/1340654533-sup-5...@alvh.no-ip.org Looks great (or at least way better than mine) to me :) In any event, it sounds like Andres wants to keep it as simple as possible for the moment, and just include both tuples in their entirety. Hopefully that is something which can be revisited before the last CF. I hope so, too... Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posix Shared Mem patch
I wrote: Reflecting on this further, it seems to me that the main remaining failure modes are (1) file locking doesn't work, or (2) idiot DBA manually removes the lock file. Oh, wait, I just remembered the really fatal problem here: to quote from the SUS fcntl spec, "All locks associated with a file for a given process are removed when a file descriptor for that file is closed by that process or the process holding that file descriptor terminates." That carefully says "a file descriptor", not "the file descriptor through which the lock was acquired". Any close() referencing the lock file will do. That means that it is possible for perfectly innocent code --- for example, something that scans all files in the data directory, as say pg_basebackup might do --- to cause a backend process to lose its lock. When we looked at this before, it seemed like a showstopper. Even if we carefully taught every directory-scanning loop in postgres not to touch the lock file, we cannot expect that for instance a pl/perl function wouldn't accidentally break things. And 99.999% of the time nobody would notice ... it would just be that last 0.001% of people that would be screwed. Still, this discussion has yielded a useful advance, which is that we now see how we might safely make use of lock mechanisms that don't inherit across fork(). We just need something less broken than fcntl(). regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables
Hi Kaigai-san, -Original Message- From: Kohei KaiGai [mailto:kai...@kaigai.gr.jp] Sent: Tuesday, June 26, 2012 11:05 PM To: Etsuro Fujita Cc: Robert Haas; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables 2012/6/26 Etsuro Fujita fujita.ets...@lab.ntt.co.jp: Hi Kaigai-san, -Original Message- From: Kohei KaiGai [mailto:kai...@kaigai.gr.jp] Sent: Monday, June 25, 2012 9:49 PM To: Etsuro Fujita Cc: Robert Haas; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables Fujita-san, The revised patch is almost good for me. Only point I noticed is the check for redundant or duplicated options I pointed out on the previous post. So, if you agree with the attached patch, I'd like to hand over this patch for committers. OK, I agree with you. Thanks for the revision! Besides the revision, I modified check_selective_binary_conversion() to run heap_close() in the whole-row-reference case. Attached is an updated version of the patch. Sorry, I overlooked this code path. No, it's my fault. So, I'd like to hand over this patch for committers. Thanks, Best regards, Etsuro Fujita All I changed is:

--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
index 98bcb2f..0143d81 100644
@@ -122,6 +122,11 @@ typedef struct CopyStateData
@@ -217,7 +221,7 @@
 }
 + else if (strcmp(defel->defname, "convert_binary") == 0)
 + {
-+     if (cstate->convert_binary)
++     if (cstate->convert_selectively)
 +         ereport(ERROR,
 +                 (errcode(ERRCODE_SYNTAX_ERROR),
 +                  errmsg("conflicting or redundant options")));

Thanks, 2012/6/20 Etsuro Fujita fujita.ets...@lab.ntt.co.jp: Hi KaiGai-san, Thank you for the review.
-Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kohei KaiGai Sent: Wednesday, June 20, 2012 1:26 AM To: Etsuro Fujita Cc: Robert Haas; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables Hi Fujita-san, Could you rebase this patch towards the latest tree? It was unavailable to apply the patch cleanly. Sorry, I updated the patch. Please find attached an updated version of the patch. I looked over the patch, then noticed a few points. At ProcessCopyOptions(), defel->arg can be NIL, isn't it? If so, cstate->convert_binary is not a suitable flag to check redundant option. It seems to me cstate->convert_selectively is a more suitable flag to check it.

+ else if (strcmp(defel->defname, "convert_binary") == 0)
+ {
+     if (cstate->convert_binary)
+         ereport(ERROR,
+                 (errcode(ERRCODE_SYNTAX_ERROR),
+                  errmsg("conflicting or redundant options")));
+     cstate->convert_selectively = true;
+     if (defel->arg == NULL || IsA(defel->arg, List))
+         cstate->convert_binary = (List *) defel->arg;
+     else
+         ereport(ERROR,
+                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                  errmsg("argument to option \"%s\" must be a list of column names",
+                         defel->defname)));
+ }

Yes, defel->arg can be NIL. defel->arg is a List structure listing all the columns needed to be converted to binary representation, which is NIL for the case where no columns are needed to be converted. For example, defel->arg is NIL for SELECT COUNT(*). In this case, while cstate->convert_selectively is set to true, no columns are converted at NextCopyFrom(). Most efficient case! In short, cstate->convert_selectively represents whether to do selective binary conversion at NextCopyFrom(), and cstate->convert_binary represents all the columns to be converted at NextCopyFrom(), which can be NIL. At NextCopyFrom(), this routine computes default values if configured.
In case when these values are not referenced, it might be possible to skip unnecessary calculations. Is it unavailable to add logic to avoid constructing cstate->defmap on unreferenced columns at BeginCopyFrom()? I think that we don't need to add the above logic because file_fdw does BeginCopyFrom() with attnamelist = NIL, in which case BeginCopyFrom() doesn't construct cstate->defmap at all. I fixed a bug plus some minor optimization in check_binary_conversion() that is renamed to check_selective_binary_conversion() in the updated version, and now file_fdw gives up selective binary conversion for the
Re: [HACKERS] Posix Shared Mem patch
On Tue, Jun 26, 2012 at 6:25 PM, Tom Lane t...@sss.pgh.pa.us wrote: Josh Berkus j...@agliodbs.com writes: So let's fix the 80% case with something we feel confident in, and then revisit the no-sysv interlock as a separate patch. That way if we can't fix the interlock issues, we still have a reduced-shmem version of Postgres. Yes. Insisting that we have the whole change in one patch is a good way to prevent any forward progress from happening. As Alvaro noted, there are plenty of issues to resolve without trying to change the interlock mechanism at the same time. So, here's a patch. Instead of using POSIX shmem, I just took the expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS memory. The sysv shm is still allocated, but it's just a copy of PGShmemHeader; the real shared memory is the anonymous block. This won't work if EXEC_BACKEND is defined so it just falls back on straight sysv shm in that case. There are obviously some portability issues here - this is documented not to work on Linux <= 2.4, but it's not clear whether it fails with some suitable error code or just pretends to work and does the wrong thing. I tested that it does compile and work on both Linux 3.2.6 and MacOS X 10.6.8. And the comments probably need work and... who knows what else is wrong. But, thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company anonymous-shmem.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Optimizer Path Candidates difference in 9.1.3 and 9.2 beta1
Hi, hackers. I modified the code in add_path() a bit so that the query path candidates inside pathlist will not be removed and all new paths will be added into the pathlist; thus all path candidates are kept in pathlist. I then tested a four-relation query. In 9.1.3, I can see thousands of candidates in the final RelOptInfo, and some of them are even bushy trees. But in 9.2 beta1, which I forked from github, there are no such bushy trees and only about 50 join paths in total, which should match the requirement of the System R algorithm. Is there any modification regarding the System R algorithm in the new release? And is something wrong in the algorithm in 9.1.3? Thanks. Best Regards, Huang Qi Victor, Computer Science of National University of Singapore
Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is needed
Hi, -Original Message- From: Robert Haas [mailto:robertmh...@gmail.com] Sent: Wednesday, June 27, 2012 5:09 AM To: Etsuro Fujita Cc: Ants Aasma; Jay Levitt; Tom Lane; PostgreSQL-development; Francois Deliege Subject: Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is needed On Fri, Jun 22, 2012 at 10:12 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jun 19, 2012 at 5:41 AM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: I'm confused by this remark, because surely the query planner does it this way only if there's no LIMIT. When there is a LIMIT, we choose based on the startup cost plus the estimated fraction of the total cost we expect to pay based on dividing the LIMIT by the overall row count estimate. Or is this not what you're talking about? I think that Ants is pointing the way of estimating costs in choose_hashed_grouping()/choose_hashed_distinct(), ie cost_agg() for cheapest_path + hashagg, where the costs are calculated based on the total cost only of cheapest_path. I think that it might be good to do cost_agg() for the discussed case with the AGG_SORTED strategy, not the AGG_HASHED strategy. Well, Ants already made some adjustments to those functions; not sure if this means they need some more adjustment, but I don't see that there's a general problem with the costing algorithm around LIMIT. Ants, do you intend to update this patch for this CommitFest? Or at all? It seems nobody's too excited about this, so I'm not sure whether it makes sense for you to put more work on it. But please advise as to your plans. Please excuse my slow response, I would also like to know your plan. Best regards, Etsuro Fujita Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Posix Shared Mem patch
Robert Haas robertmh...@gmail.com writes: So, here's a patch. Instead of using POSIX shmem, I just took the expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS memory. The sysv shm is still allocated, but it's just a copy of PGShmemHeader; the real shared memory is the anonymous block. This won't work if EXEC_BACKEND is defined so it just falls back on straight sysv shm in that case. Um. I hadn't thought about the EXEC_BACKEND interaction, but that seems like a bit of a showstopper. I would not like to give up the ability to debug EXEC_BACKEND mode on Unixen. Would Posix shmem help with that at all? Why did you choose not to use the Posix API, anyway? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Optimizer Path Candidates difference in 9.1.3 and 9.2 beta1
Qi Huang huangq...@hotmail.com writes: Hi, hackers. I modified the code in add_path() a bit so that the query path candidates inside pathlist will not be removed and all new paths will be added into the pathlist; thus all path candidates are kept in pathlist. I then tested a four-relation query. In 9.1.3, I can see thousands of candidates in the final RelOptInfo, and some of them are even bushy trees. But in 9.2 beta1, which I forked from github, there are no such bushy trees and only about 50 join paths in total, which should match the requirement of the System R algorithm. Is there any modification regarding the System R algorithm in the new release? And is something wrong in the algorithm in 9.1.3? Thanks. [ shrug... ] When you're not showing us exactly what you did, it's hard to answer that for sure. But there is some prefiltering logic in joinpath.c that you might have to lobotomize too if you want to keep known-inferior join paths. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Optimizer Path Candidates difference in 9.1.3 and 9.2 beta1
[ shrug... ] When you're not showing us exactly what you did, it's hard to answer that for sure. But there is some prefiltering logic in joinpath.c that you might have to lobotomize too if you want to keep known-inferior join paths. regards, tom lane

Thanks, Tom. Below is what I did to the code in add_path(). I commented out the remove_old section, and also made sure the new path is always accepted. Then in make_one_rel(), after the call to rel = make_rel_from_joinlist(root, joinlist);, I added pprint(rel) on the next line. Checking the RelOptInfo printed out in the logfile, I can see some bushy trees in 9.1.3.

	/*
	 * Loop to check proposed new path against old paths.  Note it is possible
	 * for more than one old path to be tossed out because new_path dominates
	 * it.
	 *
	 * We can't use foreach here because the loop body may delete the current
	 * list cell.
	 */
	p1_prev = NULL;
	for (p1 = list_head(parent_rel->pathlist); p1 != NULL; p1 = p1_next)
	{
		...

		/*
		 * Remove current element from pathlist if dominated by new.
		 */
		//if (remove_old)
		//{
		//	parent_rel->pathlist = list_delete_cell(parent_rel->pathlist,
		//											p1, p1_prev);

			/*
			 * Delete the data pointed-to by the deleted cell, if possible
			 */
		//	if (!IsA(old_path, IndexPath))
		//		pfree(old_path);
			/* p1_prev does not advance */
		//}
		//else
		//{
			/* new belongs after this old path if it has cost >= old's */
			if (costcmp >= 0)
				insert_after = p1;
			/* p1_prev advances */
			p1_prev = p1;
		//}

		/*
		 * If we found an old path that dominates new_path, we can quit
		 * scanning the pathlist; we will not add new_path, and we assume
		 * new_path cannot dominate any other elements of the pathlist.
		 */
		if (!accept_new)
			break;
	}

	//if (accept_new)
	//{
		/* Accept the new path: insert it at proper place in pathlist */
		if (insert_after)
			lappend_cell(parent_rel->pathlist, insert_after, new_path);
		else
			parent_rel->pathlist = lcons(new_path, parent_rel->pathlist);
	//}
	//else
	//{
		/* Reject and recycle the new path */
	//	if (!IsA(new_path, IndexPath))
	//		pfree(new_path);
	//}

Best Regards, Huang Qi Victor, Computer Science of National University of Singapore