Re: [HACKERS] BRIN index and aborted transaction
Alvaro Herrera wrote: Tatsuo Ishii wrote: When a transaction aborts, it seems a BRIN index leaves summary data which is no longer valid. Is this expected behavior? I guess the answer is yes, because it does not affect the correctness of a query result, but I would like to make sure. You're right, that is not rolled back (just like any other index type, actually). Let me clarify this a bit. Summarization normally takes place during vacuum (or upon the brin_summarize_new_values() function being called on the table). If the INSERT adds tuples to a page in a range that has already been summarized, then the summary tuple for that page range will be updated to cover the to-be-aborted tuples. On the other hand, if the INSERT adds tuples to a page that is not summarized, there is no summary tuple to update; and the subsequent vacuum will remove those tuples before summarizing the range, so they will not appear in the summary tuple. -- Álvaro Herrera  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
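[Editor's note: the behavior described above — summaries widened by aborted tuples stay valid because BRIN is lossy — can be illustrated with a toy model. This is a hedged Python sketch, not PostgreSQL internals; the names and data layout are invented for illustration.]

```python
# Toy model of a BRIN-style index: one lossy [min, max] summary per block
# range. A summary widened by an aborted transaction's tuples can only cause
# false positives (extra heap pages scanned and rechecked), never wrong rows.

RANGE_SIZE = 4  # heap pages per block range (like pages_per_range)

def summarize(heap):
    """Build per-range min/max summaries over all tuples, committed or not."""
    summaries = {}
    for page_no, page in enumerate(heap):
        rng = page_no // RANGE_SIZE
        for value, committed in page:
            lo, hi = summaries.get(rng, (value, value))
            summaries[rng] = (min(lo, value), max(hi, value))
    return summaries

def brin_scan(heap, summaries, target):
    """Scan only ranges whose summary might contain target; recheck tuples."""
    hits = []
    for rng, (lo, hi) in summaries.items():
        if lo <= target <= hi:  # lossy check: may be a false positive
            for page in heap[rng * RANGE_SIZE:(rng + 1) * RANGE_SIZE]:
                hits.extend(v for v, committed in page
                            if committed and v == target)
    return hits

# Range 0 holds committed values 1..8; an aborted INSERT of 999 widened it.
heap = [[(1, True), (2, True)], [(3, True), (999, False)],
        [(5, True), (6, True)], [(7, True), (8, True)]]
summaries = summarize(heap)
assert summaries[0] == (1, 999)              # summary covers the aborted tuple
assert brin_scan(heap, summaries, 999) == [] # but no wrong rows are returned
assert brin_scan(heap, summaries, 5) == [5]  # committed rows are still found
```

The widened summary merely makes the range scan less selective; the heap recheck filters out anything invisible, which is why leaving the stale summary in place is safe.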
Re: [HACKERS] Relation extension scalability
Hi, I, every now and then, spent a bit of time making this more efficient over the last few weeks. I had a bit of a problem reproducing the problems I'd seen in production on physical hardware (I found EC2 too variable to benchmark this), but luckily 2ndQuadrant today allowed me access to their four-socket machine[1] of the AXLE project. Thanks Simon and Tomas!

First, some mostly juicy numbers. My benchmark was a parallel COPY into a single WAL-logged target table: CREATE TABLE data(data text);

The source data has been generated with

narrow: COPY (select g.i::text FROM generate_series(1, 1) g(i)) TO '/tmp/copybinary' WITH BINARY;
wide:   COPY (select repeat(random()::text, 10) FROM generate_series(1, 1) g(i)) TO '/tmp/copybinarywide' WITH BINARY;

Between every test I ran a TRUNCATE data; CHECKPOINT;. For each number of clients I ran pgbench for 70 seconds. I'd previously determined using -P 1 that the numbers are fairly stable. Longer runs would have been nice, but then I'd not have finished in time.
shared_buffers = 48GB, narrow table contents:

client  tps after:   tps before:
1        180.255577   210.125143
2        338.231058   391.875088
4        638.814300   405.243901
8       1126.852233   370.922271
16      1242.363623   498.487008
32      1229.648854   484.477042
48      1223.288397   468.127943
64      1198.007422   438.238119
96      1201.501278   370.556354
128     1198.554929   288.213032
196     1189.603398   193.841993
256     1144.082291   191.293781
512      643.323675   200.782105

shared_buffers = 1GB, narrow table contents:

client  tps after:   tps before:
1        191.137410   210.787214
2        351.293017   384.086634
4        649.800991   420.703149
8       1103.770749   355.947915
16      1287.192256   489.050768
32      1226.329585   464.936427
48      1187.266489   443.386440
64      1182.698974   402.251258
96      1208.315983   331.290851
128     1183.469635   269.250601
196     1202.847382   202.788617
256     1177.924515   190.876852
512      572.457773   192.413191

shared_buffers = 48GB, wide table contents:

client  tps after:   tps before:
1         59.685215    68.445331
2        102.034688   103.210277
4        179.434065    78.982315
8        222.613727    76.195353
16       232.162484    77.520265
32       231.979136    71.654421
48       231.981216    64.730114
64       230.955979    57.444215
96       228.016910    56.324725
128      227.693947    45.701038
196      227.410386    37.138537
256      224.626948    35.265530
512      105.356439    34.397636

shared_buffers = 1GB, wide table contents: (ran out of patience)

Note that the peak performance with the patch is significantly better, but there's currently a noticeable regression in single-threaded performance. That undoubtedly needs to be addressed.

So, to get to the actual meat: my goal was to essentially get rid of the exclusive lock over relation extension altogether. I think I found a way to do that which addresses the concerns raised in this thread. The new algorithm basically is:

1) Acquire a victim buffer, clean it, and mark it as pinned
2) Get the current size of the relation, save it into blockno
3) Try to insert an entry into the buffer table for blockno
4) If the page is already in the buffer table, increment blockno by 1, goto 3)
5) Try to read the page.
In most cases it'll not yet exist. But the page might concurrently have been written out by another backend and removed from shared buffers already. If it already exists, goto 1)
6) Zero out the page on disk.

I think this does handle the concurrency issues. This patch very clearly is in the POC stage. But I do think the approach is generally sound. I'd like to see some comments before deciding whether to carry on.

Greetings, Andres Freund

PS: Yes, I know that the precision in the benchmark numbers isn't warranted, but I'm too lazy to truncate them.

[1] [10:28:11 PM] Tomas Vondra:
4x Intel Xeon E5-4620 Eight Core 2.2GHz processors (Sandy Bridge-EP generation; each core handles 2 threads, so 16 threads total per socket)
256GB (16x16GB) ECC REG System Validated Memory (1333 MHz)
2x 250GB SATA 2.5" Enterprise Level HDs (RAID 1, ~250GB)
17x 600GB SATA 2.5" Solid State HDs (RAID 0, ~10TB)
LSI MegaRAID 9271-8iCC controller and CacheVault Kit (1GB cache)
2x Nvidia Tesla K20 Active GPU Cards (GK110GL)

From fc095897a6f4207d384559a095f80a36cf49648c Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Sun, 29 Mar 2015 20:55:32 +0200
Subject: [PATCH] WIP: Saner heap extension.

---
 src/backend/access/heap/hio.c       |  86
 src/backend/commands/vacuumlazy.c   |  39 ++--
 src/backend/storage/buffer/bufmgr.c | 377 ++--
 src/backend/storage/smgr/md.c       |  62 ++
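[Editor's note: the six-step protocol described above can be sketched as a toy model. This is a hedged Python illustration of the idea only — the names (BufferTable, extend_relation) are invented, and the real patch works in C against PostgreSQL's buffer mapping table, not a dict.]

```python
# Toy model of the proposed extension protocol: instead of serializing on a
# relation extension lock, each backend claims a block number by inserting
# it into the shared buffer mapping table; on conflict it retries with
# blockno + 1. Steps 5-6 (reading/zeroing the page on disk) are omitted.
import threading

class BufferTable:
    def __init__(self):
        self._lock = threading.Lock()   # stands in for a mapping partition lock
        self.entries = {}

    def try_insert(self, blockno, owner):
        """Atomically insert blockno; False if another backend got it first."""
        with self._lock:
            if blockno in self.entries:
                return False
            self.entries[blockno] = owner
            return True

def extend_relation(table, current_size, owner):
    blockno = current_size()                      # step 2: current relation size
    while not table.try_insert(blockno, owner):   # steps 3-4: insert or bump
        blockno += 1
    return blockno

table = BufferTable()
nblocks = 8                                       # pretend the relation has 8 blocks
claimed = []

def worker(owner):
    claimed.append(extend_relation(table, lambda: nblocks, owner))

threads = [threading.Thread(target=worker, args=(n,)) for n in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every concurrent extender got a distinct block at or past the old end,
# with no single extension lock serializing the whole operation.
assert sorted(claimed) == list(range(nblocks, nblocks + 16))
```

The key property is that the conflict-and-bump loop replaces the exclusive extension lock: contention only costs a retry on the next block number rather than queueing all extenders behind one lock holder.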
Re: [HACKERS] Relation extension scalability
Hi,

Eeek, the attached patch included a trivial last-minute screwup (dereferencing bistate unconditionally...). Fixed version attached.

Andres

From fc095897a6f4207d384559a095f80a36cf49648c Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Sun, 29 Mar 2015 20:55:32 +0200
Subject: [PATCH] WIP: Saner heap extension.

---
 src/backend/access/heap/hio.c       |  86
 src/backend/commands/vacuumlazy.c   |  39 ++--
 src/backend/storage/buffer/bufmgr.c | 377 ++--
 src/backend/storage/smgr/md.c       |  62 ++
 src/backend/storage/smgr/smgr.c     |  20 +-
 src/include/storage/buf_internals.h |   1 +
 src/include/storage/bufmgr.h        |   1 +
 src/include/storage/smgr.h          |   7 +-
 8 files changed, 417 insertions(+), 176 deletions(-)

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..b47f9fe 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -15,6 +15,8 @@
 
 #include "postgres.h"
 
+#include "miscadmin.h"
+
 #include "access/heapam.h"
 #include "access/hio.h"
 #include "access/htup_details.h"
@@ -237,7 +239,6 @@ RelationGetBufferForTuple(Relation relation, Size len,
 				saveFreeSpace;
 	BlockNumber targetBlock,
 				otherBlock;
-	bool		needLock;
 
 	len = MAXALIGN(len);		/* be conservative */
@@ -433,63 +434,50 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	/*
 	 * Have to extend the relation.
 	 *
-	 * We have to use a lock to ensure no one else is extending the rel at the
-	 * same time, else we will both try to initialize the same new page. We
-	 * can skip locking for new or temp relations, however, since no one else
-	 * could be accessing them.
+	 * To avoid, as it used to be the case, holding the extension lock during
+	 * victim buffer search for the new buffer, we extend the relation here
+	 * instead of relying on bufmgr.c. We still have to hold the extension
+	 * lock to prevent a race between two backends initializing the same page.
 	 */
-	needLock = !RELATION_IS_LOCAL(relation);
-
-	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	while (true)
+	{
+		buffer = ExtendRelation(relation, MAIN_FORKNUM, bistate->strategy);
 
-	/*
-	 * XXX This does an lseek - rather expensive - but at the moment it is the
-	 * only way to accurately determine how many blocks are in a relation. Is
-	 * it worth keeping an accurate file length in shared memory someplace,
-	 * rather than relying on the kernel to do it for us?
-	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
+		if (otherBuffer != InvalidBuffer)
+			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
 
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+		/*
+		 * Now acquire lock on the new page.
+		 */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
-	/*
-	 * Now acquire lock on the new page.
-	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
 
-	/*
-	 * Release the file-extension lock; it's now OK for someone else to extend
-	 * the relation some more. Note that we cannot release this lock before
-	 * we have buffer lock on the new page, or we risk a race condition
-	 * against vacuumlazy.c --- see comments therein.
-	 */
-	if (needLock)
-		UnlockRelationForExtension(relation, ExclusiveLock);
+		/*
+		 * While unlikely, it's possible that another backend managed to
+		 * initialize the page and use up the free space till we got the
+		 * exclusive lock. That'd require the page to be vacuumed (to be put
+		 * on the free space list) and then be used; possible but fairly
+		 * unlikely in practice. If it happens and there's not enough space,
+		 * just retry.
+		 */
+		if (PageIsNew(page))
+		{
+			PageInit(page, BLCKSZ, 0);
 
-	/*
-	 * We need to initialize the empty new page.
 Double-check that it really
-	 * is empty (this should never happen, but if it does we don't want to
-	 * risk wiping out valid data).
-	 */
-	page = BufferGetPage(buffer);
+			Assert(len <= PageGetHeapFreeSpace(page));
+			break;
+		}
+		else if (len <= PageGetHeapFreeSpace(page))
+			break;
 
-	if (!PageIsNew(page))
-		elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
-			 BufferGetBlockNumber(buffer),
-			 RelationGetRelationName(relation));
+		if (otherBuffer != InvalidBuffer)
+			LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
 
-	PageInit(page, BufferGetPageSize(buffer), 0);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+		ReleaseBuffer(buffer);
 
-	if (len > PageGetHeapFreeSpace(page))
-	{
-		/* We should not get here given the test at the top */
-		elog(PANIC, "tuple is too big: size %zu", len);
+		CHECK_FOR_INTERRUPTS();
 	}
 
 	/*
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..896731c 100644
--- a/src/backend/commands/vacuumlazy.c
+++
Re: [HACKERS] pg_dump quietly ignore missing tables - is it bug?
2015-07-19 20:54 GMT+02:00 Pavel Stehule pavel.steh...@gmail.com: Hi I am sending an updated version. It implements a new long option --strict-names. If this option is used, then for any entered name (without any wildcard char) at least one object must be found. This option has no impact on patterns (names with wildcard chars). When this option is not used, the behavior is 100% the same as before (with the same number of SQL queries for the -t option). It is based on Oleksandr's documentation (with a lightly modified philosophy), and cleans up my previous patch. A test for wildcard existence via strcspn(cell->val, "?*") cannot be used, because it doesn't account for quotes (but the replacement is only a few lines). There is a possibility to remove the wildcard-char test and require at least one match for patterns too. But I am thinking that when somebody explicitly uses a wildcard, they accept the possibility of an empty result. Another variant is making the --strict-names behavior the default (and implementing a negative option like --disable-strict-names or similar). Regards Pavel 2015-07-09 22:48 GMT+02:00 Pavel Stehule pavel.steh...@gmail.com: 2015-07-08 5:36 GMT+02:00 Fujii Masao masao.fu...@gmail.com: On Sat, May 23, 2015 at 1:41 AM, Pavel Stehule pavel.steh...@gmail.com wrote: 2015-05-22 18:34 GMT+02:00 Tom Lane t...@sss.pgh.pa.us: Oleksandr Shulgin oleksandr.shul...@zalando.de writes: I think this is a bit over-engineered (apart from the fact that processSQLNamePattern is also used in two dozen places in psql/describe.c and all of them must be touched for this patch to compile). Also, the new --table-if-exists option seems to be doing what the old --table did, and I'm not really sure I understand what --table does now. I'm pretty sure we had agreed *not* to change the default behavior of -t. I propose instead to add a separate new option --strict-include, without argument, that only controls the behavior when an include pattern didn't find any table (or schema).
If we do it as a separate option, then it necessarily changes the behavior for *each* -t switch in the call. Can anyone show a common use-case where that's no good, and you need separate behavior for each of several -t switches? If not, I like the simplicity of this approach. (Perhaps the switch name could use some bikeshedding, though.) It is near to one proposal: implement only a new long option --required-table. There is no updated version of the patch, so I marked this patch as Waiting on Author. Tomorrow I'll return to this topic. One very simple question is, doesn't the -n option have a very similar problem? Also what about -N, -T and --exclude-table-data? Not sure if we need to handle them in a similar way to what you proposed. Hard to say - I understand your motivation, and it can signal some inconsistency in the environment, but it does not have the same negative impact as the same problem with the -t and -n options. This is maybe a place for a warning message, with the possibility to disable that warning. But maybe it is premature optimization? The next step would be the introduction of a strictness option - the default can be equivalent to the current behavior, "safe" can check only tables required for the dump, "strict" can check all. Isn't it sufficient to only emit the warning message to stderr if there is no table matching the pattern specified in -t? I prefer raising an error in this case. 1. I think missing tables for a dump signal an important inconsistency in the environment, and that is an important bug. 2. The warning is not simply detected (and processed) in bash scripts. Regards Pavel Regards, -- Fujii Masao
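[Editor's note: the thread above points out that a plain strcspn(cell->val, "?*") test misfires when the ? or * sits inside a double-quoted identifier. A hedged Python sketch of a quote-aware "does this -t argument use wildcards?" test — this is illustrative only, not pg_dump's actual code.]

```python
# Decide whether a pg_dump name argument contains real wildcards, treating
# ? and * inside double-quoted identifiers as literal characters. Under
# --strict-names semantics, plain names must match at least one object,
# while genuine wildcard patterns are allowed to match nothing.
def has_wildcard(pattern):
    """True if pattern contains ? or * outside double-quoted identifiers."""
    in_quotes = False
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == '"':
            # "" inside a quoted identifier is an escaped quote, not a close
            if in_quotes and i + 1 < len(pattern) and pattern[i + 1] == '"':
                i += 1
            else:
                in_quotes = not in_quotes
        elif c in '?*' and not in_quotes:
            return True
        i += 1
    return False

assert has_wildcard('foo*')                  # real wildcard: empty result OK
assert not has_wildcard('"odd?name"')        # quoted ? is part of the name
assert not has_wildcard('plain_table')       # plain name: strictness applies
assert has_wildcard('"odd"."t*"') is False   # quoted * is literal too
```

This is the kind of replacement the email alludes to when it says the fix is "only a few lines": a single pass tracking quote state instead of a flat character-class scan.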
Re: [HACKERS] Implementation of global temporary tables?
Pavel, All: Just to be clear, the idea of a global temp table is that the table def is available to all users, but the data is private to each session? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 07/17/2015 04:36 PM, Jim Nasby wrote: On 7/16/15 12:40 PM, Robert Haas wrote: They may well be 2-3 times as long. Why is that a negative? In my opinion, brevity makes things easier to read and understand. We also don't support multi-line GUCs, so if your configuration takes 140 characters, you're going to have a very long line in your postgresql.conf (and in your pg_settings output, etc.) Brevity goes both ways, but I don't think that's the real problem here; it's the lack of multi-line support. The JSON that's been proposed makes you work really hard to track what level of nesting you're at, while every alternative format I've seen is terse enough to be very clear on a single line. I will point out that the proposed non-JSON syntax does not offer any ability to name consensus/priority groups. I believe that being able to name groups is vital to managing any complex synch rep, but if we add names it will make the non-JSON syntax less compact. I'm guessing it'd be really ugly/hard to support at least this GUC being multi-line? Yes. Mind you, multi-line GUCs would be useful otherwise, but we don't want to hinge this feature on making that work. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 19 July 2015 at 21:16, Tom Lane t...@sss.pgh.pa.us wrote: Josh Berkus j...@agliodbs.com writes: On 07/17/2015 04:36 PM, Jim Nasby wrote: I'm guessing it'd be really ugly/hard to support at least this GUC being multi-line? Mind you, multi-line GUCs would be useful otherwise, but we don't want to hinge this feature on making that work. I'm pretty sure that changing the GUC parser to allow quoted strings to continue across lines would be trivial. Agreed. The problem with it is not that it's hard, it's that omitting a closing quote mark would then result in the entire file being syntactically broken, with the error message(s) almost certainly pointing somewhere else than where the actual mistake is. That depends upon how we specify line continuation. If we do it with starting and ending quotes, then we would have the problem you suggest. If we required each new continuation line to start with a \ then it wouldn't (or similar). Or perhaps it gets its own file even, an idea raised before. Do we really want such a global reduction in friendliness to make this feature easier? Clearly not, but we must first decide whether that is how we characterise the decision. synchronous_standby_names= is already 25 characters, so that leaves 115 characters - are they always single-byte chars? It's not black and white for me that JSON necessarily requires 115 chars whereas other ways never will. What we are discussing is expanding an existing parameter to include more information. If Josh gets some of the things he's been asking for, then the format will bloat further. It doesn't take much for me to believe it might expand further still, so my view from the discussion is that we'll likely need to expand beyond 115 chars one day whatever format we choose. I'm personally ambivalent about the exact format we choose; I care much more about the feature than the syntax, always.
My contribution so far was to summarise what I thought was the majority opinion, and to challenge the thought that JSON had no discernible benefit. If the majority view is different, I have no problem there. Clusters of 20 or more standby nodes are reasonably common, so those limits do seem a little small. Synchronous commit behavior is far from being the only cluster metadata we need to record. I'm thinking now that this illustrates that this is the wrong way altogether and we should just be storing cluster metadata in database tables, which is what was discussed and agreed at the BDR meeting at PGCon. -- Simon Riggs  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
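[Editor's note: for concreteness, here is a hedged Python sketch of parsing one candidate compact (non-JSON) form for N synchronous standbys, roughly "k (s1, s2, ...)". The grammar here is a guess for illustration only — the actual syntax was still under debate in this thread and no format had been settled.]

```python
# Illustrative parser for a hypothetical compact sync-rep spec:
#   "2 (london, paris, nyc)"  -> wait for 2 acks from the listed standbys
#   "london, paris"           -> plain list, treated as k = 1 (priority style)
import re

def parse_sync_spec(spec):
    """Parse 'k (a, b, c)' into (k, [names]); a bare list means k = 1."""
    m = re.fullmatch(r'\s*(\d+)\s*\(\s*([^)]*)\s*\)\s*', spec)
    if m:
        k = int(m.group(1))
        names = [s.strip() for s in m.group(2).split(',') if s.strip()]
    else:
        k, names = 1, [s.strip() for s in spec.split(',') if s.strip()]
    if k > len(names):
        raise ValueError(
            "cannot wait for %d acks from %d standbys" % (k, len(names)))
    return k, names

assert parse_sync_spec('2 (london, paris, nyc)') == (2, ['london', 'paris', 'nyc'])
assert parse_sync_spec('london, paris') == (1, ['london', 'paris'])
```

Even this toy grammar shows the trade-off discussed upthread: the compact form stays comfortably on one line, but adding named groups or nesting would push it toward the multi-line territory where JSON (and the GUC parser's single-line limitation) becomes the issue.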
Re: [HACKERS] TABLESAMPLE patch is really in pretty sad shape
On 19 July 2015 at 21:56, Tom Lane t...@sss.pgh.pa.us wrote: Since I'm not observing any movement Apologies if nothing visible was occurring. Petr and I had arranged to meet and discuss Mon/Tues. -- Simon Riggs  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Arguable RLS security bug, EvalPlanQual() paranoia
On Mon, Jun 1, 2015 at 12:29 AM, Peter Geoghegan p...@heroku.com wrote: If you're using another well known MVCC database system that has RLS, I imagine when this happens the attacker similarly waits on the conflicting (privileged) xact to finish (in my example in the patch, Bob's xact). However, unlike with the Postgres READ COMMITTED mode, Mallory would then have her malicious UPDATE statement entirely rolled back, and her statement would acquire an entirely new MVCC snapshot, to be used by the USING security barrier qual (and everything else) from scratch. This other system would then re-run her UPDATE with the new MVCC snapshot. This would repeat until Mallory's UPDATE statement completes without encountering any concurrent UPDATEs/DELETEs to her would-be affected rows. In general, with this other database system, an UPDATE must run to completion without violating MVCC, even in READ COMMITTED mode. For that reason, I think we can take no comfort from the presumption that this flexibility in USING security barrier quals (allowing subqueries, etc) works securely in this other system. (I actually didn't check this out, but I imagine it's true). Where are we on this? Discussion during pgCon with Heikki and Andres led me to believe that the issue is acceptable. The issue can be documented to help ensure that user expectation is in line with actual user-visible behavior. Unfortunately, I think that that will be a clunky documentation patch. -- Peter Geoghegan
Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore
On Thu, Jul 16, 2015 at 8:49 PM, Tom Lane t...@sss.pgh.pa.us wrote: Meh. I don't like the assumption that non-GCC compilers will be smart enough to optimize away the useless-to-them if() tests this adds. Please refactor that so that there is exactly 0 new code when the intrinsic doesn't exist. I imagined that there was some value in copying the GCC intrinsic's behavior, and actually evaluating the addr expression even in the event of no platform support. On reflection, I suppose that that isn't actually a particularly useful property for Postgres. There will only ever be a handful of callers. The attached revision does not rely on such optimization occurring on platforms that lack __builtin_prefetch(). This allowed me to decouple availability from actual use, in the style of posix_fadvise(), so that one can manually disable memory prefetching within pg_config_manual.h. Clang is compatible with the __builtin_prefetch() intrinsic, FWIW. I'm not sure if it's worth trying to make the wrapper portable across a variety of supported compilers. If we were to attempt it, we would not be the first. I note that ICC's memref_control has an identical interface to __builtin_prefetch(). -- Peter Geoghegan

From 357e3b65c4f1511746a529e41fe180f0c7668d70 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan peter.geoghega...@gmail.com
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH] Prefetch from memtuples array in tuplesort

Testing shows that prefetching the tuple proper of a slightly later SortTuple in the memtuples array during each of many sequential, in-logical-order SortTuple fetches speeds up various sorting-intense operations considerably. For example, B-Tree index builds are accelerated as leaf pages are created from the memtuples array. (i.e. the operation following actually performing the sort, but before a tuplesort_end() call is made as a B-Tree spool is destroyed.)
Similarly, ordered set aggregates (all cases except the datumsort case with a pass-by-value type), and regular heap tuplesorts benefit to about the same degree. The optimization is only used when sorts fit in memory, though.

Also, prefetch a few places ahead within the analogous fetching point in tuplestore.c. This appears to offer similar benefits in certain cases. For example, queries involving large common table expressions significantly benefit.
---
 config/c-compiler.m4                | 17 +
 configure                           | 30 ++
 configure.in                        |  1 +
 src/backend/utils/sort/tuplesort.c  | 20
 src/backend/utils/sort/tuplestore.c | 13 +
 src/include/c.h                     | 14 ++
 src/include/pg_config.h.in          |  3 +++
 src/include/pg_config.h.win32       |  3 +++
 src/include/pg_config_manual.h      | 10 ++
 9 files changed, 111 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 050bfa5..5776201 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -287,6 +287,23 @@ fi])# PGAC_C_BUILTIN_UNREACHABLE
 
 
+# PGAC_C_BUILTIN_PREFETCH
+# -----------------------
+# Check if the C compiler understands __builtin_prefetch(),
+# and define HAVE__BUILTIN_PREFETCH if so.
+AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
+[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
+[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],
+[int i = 0;__builtin_prefetch(&i, 0, 3);])],
+[pgac_cv__builtin_prefetch=yes],
+[pgac_cv__builtin_prefetch=no])])
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
+          [Define to 1 if your compiler understands __builtin_prefetch.])
+fi])# PGAC_C_BUILTIN_PREFETCH
+
+
+
 # PGAC_C_VA_ARGS
 # --------------
 # Check if the C compiler understands C99-style variadic macros,
diff --git a/configure b/configure
index 2176d65..5fdd6bd 100755
--- a/configure
+++ b/configure
@@ -11283,6 +11283,36 @@ if test x"$pgac_cv__builtin_unreachable" = xyes ; then
 
 $as_echo "#define HAVE__BUILTIN_UNREACHABLE 1" >>confdefs.h
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int i = 0;__builtin_prefetch(&i, 0, 3);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_PREFETCH 1" >>confdefs.h
+
+fi
 
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __VA_ARGS__" >&5
 $as_echo_n "checking for __VA_ARGS__... " >&6; }
 if ${pgac_cv__va_args+:} false; then :
diff
Re: [HACKERS] Refactoring speculative insertion with unique indexes a little
On Thu, Jul 2, 2015 at 2:58 PM, Peter Geoghegan p...@heroku.com wrote: As with row locking, with insertion, if there is a conflict, any outcome (UPDATE or INSERT) is then possible. Where are we on this? It would be nice to get this out of the way before a 9.5 beta is put out. Thanks -- Peter Geoghegan
Re: [HACKERS] pg_dump quietly ignore missing tables - is it bug?
2015-07-19 21:08 GMT+02:00 Pavel Stehule pavel.steh...@gmail.com: 2015-07-19 20:54 GMT+02:00 Pavel Stehule pavel.steh...@gmail.com: Hi I am sending an updated version. It implements a new long option --strict-names. If this option is used, then for any entered name (without any wildcard char) at least one object must be found. This option has no impact on patterns (names with wildcard chars). When this option is not used, the behavior is 100% the same as before (with the same number of SQL queries for the -t option). It is based on Oleksandr's documentation (with a lightly modified philosophy), and cleans up my previous patch. A test for wildcard existence via strcspn(cell->val, "?*") cannot be used, because it doesn't account for quotes (but the replacement is only a few lines). There is a possibility to remove the wildcard-char test and require at least one match for patterns too. But I am thinking that when somebody explicitly uses a wildcard, they accept the possibility of an empty result. Another variant is making the --strict-names behavior the default (and implementing a negative option like --disable-strict-names or similar). Note: originally I thought we would have to fix this and change the default behavior. But with a special option, we don't need to. This option appearing in the help output signals to users that something here is risky. Pavel Regards Pavel 2015-07-09 22:48 GMT+02:00 Pavel Stehule pavel.steh...@gmail.com: 2015-07-08 5:36 GMT+02:00 Fujii Masao masao.fu...@gmail.com: On Sat, May 23, 2015 at 1:41 AM, Pavel Stehule pavel.steh...@gmail.com wrote: 2015-05-22 18:34 GMT+02:00 Tom Lane t...@sss.pgh.pa.us: Oleksandr Shulgin oleksandr.shul...@zalando.de writes: I think this is a bit over-engineered (apart from the fact that processSQLNamePattern is also used in two dozen places in psql/describe.c and all of them must be touched for this patch to compile). Also, the new --table-if-exists option seems to be doing what the old --table did, and I'm not really sure I understand what --table does now.
I'm pretty sure we had agreed *not* to change the default behavior of -t. I propose instead to add a separate new option --strict-include, without argument, that only controls the behavior when an include pattern didn't find any table (or schema). If we do it as a separate option, then it necessarily changes the behavior for *each* -t switch in the call. Can anyone show a common use-case where that's no good, and you need separate behavior for each of several -t switches? If not, I like the simplicity of this approach. (Perhaps the switch name could use some bikeshedding, though.) It is near to one proposal: implement only a new long option --required-table. There is no updated version of the patch, so I marked this patch as Waiting on Author. Tomorrow I'll return to this topic. One very simple question is, doesn't the -n option have a very similar problem? Also what about -N, -T and --exclude-table-data? Not sure if we need to handle them in a similar way to what you proposed. Hard to say - I understand your motivation, and it can signal some inconsistency in the environment, but it does not have the same negative impact as the same problem with the -t and -n options. This is maybe a place for a warning message, with the possibility to disable that warning. But maybe it is premature optimization? The next step would be the introduction of a strictness option - the default can be equivalent to the current behavior, "safe" can check only tables required for the dump, "strict" can check all. Isn't it sufficient to only emit the warning message to stderr if there is no table matching the pattern specified in -t? I prefer raising an error in this case. 1. I think missing tables for a dump signal an important inconsistency in the environment, and that is an important bug. 2. The warning is not simply detected (and processed) in bash scripts. Regards Pavel Regards, -- Fujii Masao
Re: [HACKERS] Implementation of global temporary tables?
2015-07-19 21:39 GMT+02:00 Josh Berkus j...@agliodbs.com: Pavel, All: Just to be clear, the idea of a global temp table is that the table def is available to all users, but the data is private to each session? yes. Pavel -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] pg_dump quietly ignore missing tables - is it bug?
Hi I am sending an updated version. It implements a new long option --strict-names. If this option is used, then for any entered name (without any wildcard char) at least one object must be found. This option has no impact on patterns (names with wildcard chars). When this option is not used, the behavior is 100% the same as before (with the same number of SQL queries for the -t option). It is based on Oleksandr's documentation (with a lightly modified philosophy), and cleans up my previous patch. A test for wildcard existence via strcspn(cell->val, "?*") cannot be used, because it doesn't account for quotes (but the replacement is only a few lines). There is a possibility to remove the wildcard-char test and require at least one match for patterns too. But I am thinking that when somebody explicitly uses a wildcard, they accept the possibility of an empty result. Regards Pavel 2015-07-09 22:48 GMT+02:00 Pavel Stehule pavel.steh...@gmail.com: 2015-07-08 5:36 GMT+02:00 Fujii Masao masao.fu...@gmail.com: On Sat, May 23, 2015 at 1:41 AM, Pavel Stehule pavel.steh...@gmail.com wrote: 2015-05-22 18:34 GMT+02:00 Tom Lane t...@sss.pgh.pa.us: Oleksandr Shulgin oleksandr.shul...@zalando.de writes: I think this is a bit over-engineered (apart from the fact that processSQLNamePattern is also used in two dozen places in psql/describe.c and all of them must be touched for this patch to compile). Also, the new --table-if-exists option seems to be doing what the old --table did, and I'm not really sure I understand what --table does now. I'm pretty sure we had agreed *not* to change the default behavior of -t. I propose instead to add a separate new option --strict-include, without argument, that only controls the behavior when an include pattern didn't find any table (or schema). If we do it as a separate option, then it necessarily changes the behavior for *each* -t switch in the call. Can anyone show a common use-case where that's no good, and you need separate behavior for each of several -t switches?
If not, I like the simplicity of this approach. (Perhaps the switch name could use some bikeshedding, though.) It is close to an earlier proposal to implement only a new long option --required-table. There is no updated version of the patch, so I marked this patch as Waiting on Author. Tomorrow I'll return to this topic. One very simple question is, doesn't the -n option have a very similar problem? Also what about -N, -T and --exclude-table-data? Not sure if we need to handle them in a similar way as you proposed. Hard to say - I understand your motivation, and it can signal some inconsistency in the environment, but it does not have the same negative impact as the same problem with the -t and -n options. Maybe this is a place for a warning message, with a possibility to disable the warning. But maybe that is premature optimization? Another way is the introduction of a strictness option - default can be equivalent to the current behavior, safe can check only tables required for the dump, strict can check all. Isn't it sufficient to only emit a warning message to stderr if there is no table matching the pattern specified in -t? I prefer raising an error in this case. 1. I think that missing tables for a dump signal an important inconsistency in the environment, and it is an important bug. 2. A warning is not easily detected (and processed) in bash scripts. Regards Pavel Regards, -- Fujii Masao commit 2f54f7ea786744540d9176cecc086cfbf32e7695 Author: Pavel Stehule pavel.steh...@gooddata.com Date: Sun Jul 19 20:30:32 2015 +0200 initial
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 7467e86..423757d 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -493,6 +493,22 @@ PostgreSQL documentation
     </varlistentry>

     <varlistentry>
+      <term><option>--strict-names</option></term>
+      <listitem>
+       <para>
+        Require that table and/or schema without wildcard chars match
+        at least one entity each. Without any entity in the database
+        to be dumped, an error message is printed and dump is aborted.
+       </para>
+       <para>
+        This option has no effect on the exclude table and schema patterns
+        (and also <option>--exclude-table-data</option>): not matching any entities
+        isn't considered an error.
+       </para>
+      </listitem>
+     </varlistentry>
+
     <varlistentry>
      <term><option>-t <replaceable class="parameter">table</replaceable></option></term>
      <term><option>--table=<replaceable class="parameter">table</replaceable></option></term>
      <listitem>
diff --git a/src/bin/pg_dump/dumputils.c b/src/bin/pg_dump/dumputils.c
index d7506e1..598df0b 100644
--- a/src/bin/pg_dump/dumputils.c
+++ b/src/bin/pg_dump/dumputils.c
@@ -955,9 +955,9 @@ AddAcl(PQExpBuffer aclbuf, const char *keyword, const char *subname)

 /*
- * processSQLNamePattern
+ * processSQLName
  *
- * Scan a wildcard-pattern string and generate
Re: [HACKERS] Bug in bttext_abbrev_convert()
On 2015-07-19 12:35:46 -0400, Peter Eisentraut wrote: I would like to do that, but the current host has no more capacity and the hoster is complaining, so I need to look for a new hosting solution before I can expand what is being built. Perhaps this could be moved into a virtual machine hosted by the postgres project infrastructure? What kind of resources do you currently have and what would you need? Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers - take 2
Josh Berkus j...@agliodbs.com writes: On 07/17/2015 04:36 PM, Jim Nasby wrote: I'm guessing it'd be really ugly/hard to support at least this GUC being multi-line? Mind you, multi-line GUCs would be useful otherwise, but we don't want to hinge this feature on making that work. I'm pretty sure that changing the GUC parser to allow quoted strings to continue across lines would be trivial. The problem with it is not that it's hard, it's that omitting a closing quote mark would then result in the entire file being syntactically broken, with the error message(s) almost certainly pointing somewhere else than where the actual mistake is. Do we really want such a global reduction in friendliness to make this feature easier? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TABLESAMPLE patch is really in pretty sad shape
Since I'm not observing any movement on the key question of redesigning the tablesample method API, and I think that's something that's absolutely necessary to fix for 9.5, attached is an attempt to respecify the API. I haven't actually written any code, but I've written a tsmapi.h file modeled on fdwapi.h, and rewritten tablesample-method.sgml to match; those two files are attached. Some notes: * This is predicated on the assumption that we'll get rid of the pg_tablesample_method catalog and instead have a single FDW-style handler function for each sample method. That fixes the problem with the contrib modules being broken in DROP/pg_upgrade cases. * I got rid of the TableSampleDesc struct altogether in favor of giving the execution functions access to the whole SampleScanState executor state node. If we add support for filtering at the join level, filtering in indexscan nodes, etc, those would become separate sets of API functions in my view; there is no need to pretend that this set of API functions works for anything except the SampleScan case. This way is more parallel to the FDW precedent, too. In particular it lets tablesample code get at the executor's EState, which the old API did not, but which might be necessary for some scenarios. * You might have expected me to move the tsmseqscan and tsmpagemode flags into the TsmRoutine struct, but instead this API puts equivalent flags into the SampleScanState struct. The reason for that is that it lets the settings be determined at runtime after inspecting the TABLESAMPLE parameters, which I think would be useful. For example, whether to use the bulkread strategy should probably depend on what the sampling percentage is. * I got rid of the ExamineTuple function altogether. As I said before, I think what that is mostly doing is encouraging people to do unsound things. 
But in any case, there is no need for it because NextSampleTuple *can* look at the HeapScanDesc state if it really wants to: that lets it identify visible tuples, or even inspect the tuple contents. In the attached, I documented the visible-tuples aspect but did not suggest examining tuple contents. * As written, this still allows TABLESAMPLE parameters to have null values, but I'm pretty strongly tempted to get rid of that: remove the paramisnull[] argument and make the core code reject null values. I can't see any benefit in allowing null values that would justify the extra code and risks-of-omission involved in making every tablesample method check this for itself. * I specified that omitting NextSampleBlock is allowed and causes the core code to do a standard seqscan, including syncscan support, which is a behavior that's impossible with the current API. If we fix the bernoulli code to have history-independent sampling behavior, I see no reason that syncscan shouldn't be enabled for it. Barring objections, I'll press forward with turning this into code over the next few days. regards, tom lane

/*-------------------------------------------------------------------------
 *
 * tsmapi.h
 *	  API for tablesample methods
 *
 * Copyright (c) 2015, PostgreSQL Global Development Group
 *
 * src/include/access/tsmapi.h
 *
 *-------------------------------------------------------------------------
 */
#ifndef TSMAPI_H
#define TSMAPI_H

#include "nodes/execnodes.h"
#include "nodes/relation.h"

/*
 * Callback function signatures --- see tablesample-method.sgml for more info.
 */
typedef void (*SampleScanCost_function) (PlannerInfo *root,
										 RelOptInfo *baserel,
										 List *paramexprs,
										 BlockNumber *pages,
										 double *tuples);

typedef void (*BeginSampleScan_function) (SampleScanState *node,
										  int eflags,
										  uint32 seed,
										  Datum *params,
										  bool *paramisnull,
										  int nparams);

typedef BlockNumber (*NextSampleBlock_function) (SampleScanState *node);

typedef OffsetNumber (*NextSampleTuple_function) (SampleScanState *node,
												  BlockNumber blockno,
												  OffsetNumber maxoffset);

typedef void (*ReScanSampleScan_function) (SampleScanState *node);

typedef void (*EndSampleScan_function) (SampleScanState *node);

/*
 * TsmRoutine is the struct returned by a tablesample method's handler
 * function. It provides pointers to the callback functions needed by the
 * planner and executor, as well as additional information about the method.
 *
 * More function pointers are likely to be added in the future.
 * Therefore it's recommended that the handler initialize the struct with
 * makeNode(TsmRoutine) so that all fields are set to NULL. This will
 * ensure that no fields are accidentally left undefined.
 */
typedef struct TsmRoutine
{
	NodeTag		type;

	/* List of datatype OIDs for the arguments of the TABLESAMPLE clause */
	List	   *parameterTypes;

	/* Can method produce repeatable samples across, or even within, queries? */
	bool		repeatable_across_queries;
	bool
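The FDW-style handler pattern the draft header relies on — a zero-initialized struct of function pointers, where a NULL optional callback (here NextSampleBlock) makes the core code fall back to a standard seqscan — can be mocked in a few lines of plain C. All names below are illustrative stand-ins, not PostgreSQL code:

```c
#include <stdlib.h>

/* Illustrative mock of a handler-returned callback table */
typedef struct MockTsmRoutine
{
	/* optional: NULL means the core does a standard sequential scan */
	int			(*NextSampleBlock) (int nblocks);
	/* required callback */
	int			(*NextSampleTuple) (int blockno, int maxoffset);
} MockTsmRoutine;

/* A trivial method: sample the first tuple on each visited page */
static int
first_tuple(int blockno, int maxoffset)
{
	return maxoffset > 0 ? 1 : 0;	/* offsets are 1-based, 0 = no tuple */
}

/*
 * Handler: allocate zeroed so that any callback we don't fill in stays
 * NULL, mirroring the makeNode(TsmRoutine) recommendation above.
 */
static MockTsmRoutine *
mock_handler(void)
{
	MockTsmRoutine *tsm = calloc(1, sizeof(MockTsmRoutine));

	tsm->NextSampleTuple = first_tuple;
	/* NextSampleBlock intentionally left NULL => plain seqscan order */
	return tsm;
}
```

The executor-side convention is then a simple NULL test: if the handler filled in NextSampleBlock, call it for block selection, otherwise walk the relation sequentially.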
Re: [HACKERS] security labels on databases are bad for dump restore
On Wed, Jul 15, 2015 at 11:08:53AM +0200, Andres Freund wrote: On 2015-07-15 12:04:40 +0300, Alvaro Herrera wrote: Andres Freund wrote: One thing worth mentioning is that arguably the problem is caused by the fact that we're here emitting database level information in pg_dump, normally only done for dumpall. Consistency with existing practice would indeed have pg_dump ignore pg_shseclabel and have pg_dumpall reproduce its entries. ... the reason for which is probably the lack of CURRENT_DATABASE as a database specifier. It might make sense to add the rest of database-level information to pg_dump output, when we get that. I'm not sure. I mean, it's not that an odd idea to assign a label to a database and then restore data into it, and expect the explicitly assigned label to survive. Not sure if there actually is a good behaviour either way here :/ In a greenfield, I would make pg_dump --create reproduce pertinent entries from datacl, pg_db_role_setting, pg_shseclabel and pg_shdescription. I would make non-creating pg_dump reproduce none of those. Moreover, I would enable --create by default. Restoring into a user-provided shell database is specialized compared to reproducing a database from scratch. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] xlc 12.1 miscompiles 32-bit ginCompareItemPointers()
In a 32-bit build, xlc 12.1 for AIX miscompiles three inline expansions of ginCompareItemPointers(), all in ginget.c (line numbers per commit 9aa6634): 739 Assert(!ItemPointerIsValid(&entry->curItem) || 740 ginCompareItemPointers(&entry->curItem, &advancePast) <= 0); 847 } while (ginCompareItemPointers(&entry->curItem, &advancePast) <= 0); 915 if (ginCompareItemPointers(&key->curItem, &advancePast) > 0) For one of the arguments, instead of computing hi << 32 | lo << 16 | posid it computes (lo << 16) << 32 | lo << 16 | posid PostgreSQL 9.4, which introduced the inline ginCompareItemPointers(), is the oldest version affected. The problem remained invisible until my recent commit 43d89a2; the quiet inline configure test had been failing, leaving PG_USE_INLINE unset. I tried some workarounds. Introducing intermediate variables or moving parts of the calculation down into other inline functions did not help. The code compiles fine when not inlined. Changing hi << 32 to hi << 33 worked (we need just 48 of the 64 bits), but it presumably reduces performance on ABIs where the current bit shifts boil down to a no-op. I propose to expand the gin_private.h #ifdef PG_USE_INLINE test to exclude xlc 32-bit configurations. The last 32-bit AIX kernel exited support on 2012-04-30. There's little reason to prefer 32-bit PostgreSQL under a 64-bit kernel, so that configuration can do without the latest GIN optimization. One alternative would be to distill a configure-time test detecting the bug, so fixed xlc versions reacquire the optimization. Another alternative is to change the bit layout within the uint64; for example, we could use the most-significant 48 bits instead of the least-significant 48 bits. I dislike the idea of a niche configuration driving that choice. Thanks, nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
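The packing Noah describes is easy to state in a standalone sketch (plain C, mirroring the uint64 comparison key that ginCompareItemPointers() builds from an item pointer's block-hi, block-lo and position fields; the second function shows what the miscompiled code effectively computed for one argument):

```c
#include <stdint.h>

/* Correct 48-bit comparison key: hi << 32 | lo << 16 | posid */
static uint64_t
pack_item_pointer(uint16_t hi, uint16_t lo, uint16_t posid)
{
	return (uint64_t) hi << 32 | (uint64_t) lo << 16 | (uint64_t) posid;
}

/*
 * The xlc miscompilation replaced the hi field of one argument with
 * (lo << 16), i.e. it computed (lo << 16) << 32 | lo << 16 | posid.
 */
static uint64_t
pack_item_pointer_miscompiled(uint16_t lo, uint16_t posid)
{
	return ((uint64_t) lo << 16) << 32 | (uint64_t) lo << 16 | (uint64_t) posid;
}
```

Because the three fields occupy disjoint bit ranges, an unsigned comparison of the packed keys orders item pointers correctly; the buggy variant agrees with the correct one only when hi happens to equal lo << 16 (e.g. both zero), which is why the miscompilation corrupts comparisons rather than failing outright.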
Re: [HACKERS] xlc 12.1 miscompiles 32-bit ginCompareItemPointers()
On July 19, 2015 9:50:33 PM GMT+02:00, Noah Misch n...@leadboat.com wrote: In a 32-bit build, xlc 12.1 for AIX miscompiles three inline expansions of ginCompareItemPointers(), all in ginget.c (line numbers per commit 9aa6634): 739 Assert(!ItemPointerIsValid(&entry->curItem) || 740 ginCompareItemPointers(&entry->curItem, &advancePast) <= 0); 847 } while (ginCompareItemPointers(&entry->curItem, &advancePast) <= 0); 915 if (ginCompareItemPointers(&key->curItem, &advancePast) > 0) For one of the arguments, instead of computing hi << 32 | lo << 16 | posid it computes (lo << 16) << 32 | lo << 16 | posid PostgreSQL 9.4, which introduced the inline ginCompareItemPointers(), is the oldest version affected. The problem remained invisible until my recent commit 43d89a2; the quiet inline configure test had been failing, leaving PG_USE_INLINE unset. I tried some workarounds. Introducing intermediate variables or moving parts of the calculation down into other inline functions did not help. The code compiles fine when not inlined. Changing hi << 32 to hi << 33 worked (we need just 48 of the 64 bits), but it presumably reduces performance on ABIs where the current bit shifts boil down to a no-op. I propose to expand the gin_private.h #ifdef PG_USE_INLINE test to exclude xlc 32-bit configurations. The last 32-bit AIX kernel exited support on 2012-04-30. I vote to simply error out in that case then. Trying to fix individual compiler bugs in a niche OS sounds like a bad idea. Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On 2015-07-19 11:28:25 -0400, Tom Lane wrote: Andres Freund and...@anarazel.de writes: So, to get to the actual meat: My goal was to essentially get rid of an exclusive lock over relation extension altogether. I think I found a way to do that which addresses the concerns made in this thread. The new algorithm basically is:
1) Acquire victim buffer, clean it, and mark it as pinned
2) Get the current size of the relation, save it into blockno
3) Try to insert an entry into the buffer table for blockno
4) If the page is already in the buffer table, increment blockno by 1, goto 3)
5) Try to read the page. In most cases it'll not yet exist. But the page might concurrently have been written by another backend and removed from shared buffers already. If already existing, goto 1)
6) Zero out the page on disk.
I think this does handle the concurrency issues. The need for (5) kind of destroys my faith in this really being safe: it says there are non-obvious race conditions here. It's not simple, I agree. I'm doubtful that a significantly simpler approach exists. For instance, what about this scenario:
* Session 1 tries to extend the file, allocates buffer for page 42 (so it's now between steps 4 and 5).
* Session 2 tries to extend the file, sees buffer for 42 exists, allocates buffer for page 43 (so it's also now between 4 and 5).
* Session 2 tries to read page 43, sees it's not there, and writes out page 43 with zeroes (now it's done).
* Session 1 tries to read page 42, sees it's there and zero-filled (not because anybody wrote it, but because holes in files read as 0).
At this point session 1 will go and create page 44, won't it, and you just wasted a page. My local code now recognizes that case and uses the page. We just need to do a PageIsNew() check. Also, the file is likely to end up badly physically fragmented when the skipped pages are finally filled in. 
One of the good things about the relation extension lock is that the kernel sees the file as being extended strictly sequentially, which it should handle fairly well as far as filesystem layout goes. This way might end up creating a mess on-disk. I don't think that'll actually happen with any recent filesystems. Pretty much all of them do delayed allocation. But it definitely is a concern with older filesystems. I've just measured and with ext4 the number of extents per segment in a 300GB relation doesn't show a significant difference when compared between the existing and new code. We could try to address this by optionally using posix_fallocate() to do the actual extension - then there'll not be sparse regions, but actually allocated disk blocks. Perhaps even more to the point, you've added a read() kernel call that was previously not there at all, without having removed either the lseek() or the write(). Perhaps that scales better when what you're measuring is saturation conditions on a many-core machine, but I have a very hard time believing that it's not a significant net loss under less-contended conditions. Yes, this has me worried too. I'm inclined to think that a better solution in the long run is to keep the relation extension lock but find a way to extend files more than one page per lock acquisition. I doubt that'll help as much. As long as you have to search and write out buffers under an exclusive lock that'll be painful. You might be able to make that an infrequent occurrence by extending in larger amounts and entering the returned pages into the FSM, but you'll have rather noticeable latency increases every time that happens. And not just in the extending relation - all the other relations will wait for the one doing the extending. We could move that into some background process, but at that point things have gotten seriously complex. The more radical solution would be to have some place in memory that'd store the current number of blocks. 
Then all the extension specific locking we'd need would be around incrementing that. But how and where to store that isn't easy. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
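A toy, single-threaded model of steps 2-4 above (the collision handling on the buffer table) may make the retry logic clearer. This is only a sketch of the idea under simplified assumptions, not the actual bufmgr code:

```c
#include <stdbool.h>

#define MAX_BLOCKS 64

/* Stand-in for the shared buffer table: is an entry present for a block? */
static bool buffer_table[MAX_BLOCKS];

/*
 * Model of steps 2-4: start from the current relation size and keep
 * incrementing blockno until the buffer-table insertion succeeds.  Two
 * backends that both read the same "current size" end up claiming
 * adjacent blocks instead of colliding.  Steps 1, 5 and 6 (victim
 * buffer, read-back check, zeroing on disk) are elided here.
 */
static int
claim_new_block(int current_size)
{
	int			blockno = current_size;

	while (buffer_table[blockno])
		blockno++;				/* step 4: entry already exists, try next */
	buffer_table[blockno] = true;	/* step 3: insertion succeeded */
	return blockno;
}
```

In the real algorithm the insertion into the buffer table is what serializes concurrent extenders, and step 5's read-back exists precisely because a claimed block may already have been written and evicted by another backend.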
Re: [HACKERS] xlc 12.1 miscompiles 32-bit ginCompareItemPointers()
Andres Freund and...@anarazel.de writes: On July 19, 2015 9:50:33 PM GMT+02:00, Noah Misch n...@leadboat.com wrote: I propose to expand the gin_private.h #ifdef PG_USE_INLINE test to exclude xlc 32-bit configurations. The last 32-bit AIX kernel exited support on 2012-04-30. I vote to simply error out in that case then. Trying to fix individual compiler bugs in a niche OS sounds like a bad idea. I think I'm with Andres --- are there really enough users of this configuration to justify working around such a bug? More: if the compiler does have a bug like that, how much confidence can we have, really, that there are no other miscompiled places and won't be any in the future? If we are going to support this configuration, I think what the patch ought to do is disable PG_USE_INLINE globally when this compiler is detected. That would revert the behavior to what it was before 43d89a2. We have absolutely no field experience justifying the assumption that a narrower fix would be safe. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
Andres Freund and...@anarazel.de writes: So, to get to the actual meat: My goal was to essentially get rid of an exclusive lock over relation extension altogether. I think I found a way to do that which addresses the concerns made in this thread. The new algorithm basically is:
1) Acquire victim buffer, clean it, and mark it as pinned
2) Get the current size of the relation, save it into blockno
3) Try to insert an entry into the buffer table for blockno
4) If the page is already in the buffer table, increment blockno by 1, goto 3)
5) Try to read the page. In most cases it'll not yet exist. But the page might concurrently have been written by another backend and removed from shared buffers already. If already existing, goto 1)
6) Zero out the page on disk.
I think this does handle the concurrency issues. The need for (5) kind of destroys my faith in this really being safe: it says there are non-obvious race conditions here. For instance, what about this scenario:
* Session 1 tries to extend the file, allocates buffer for page 42 (so it's now between steps 4 and 5).
* Session 2 tries to extend the file, sees buffer for 42 exists, allocates buffer for page 43 (so it's also now between 4 and 5).
* Session 2 tries to read page 43, sees it's not there, and writes out page 43 with zeroes (now it's done).
* Session 1 tries to read page 42, sees it's there and zero-filled (not because anybody wrote it, but because holes in files read as 0).
At this point session 1 will go and create page 44, won't it, and you just wasted a page. Now we do have mechanisms for reclaiming such pages but they may not kick in until VACUUM, so you could end up with a whole lot of table bloat. Also, the file is likely to end up badly physically fragmented when the skipped pages are finally filled in. One of the good things about the relation extension lock is that the kernel sees the file as being extended strictly sequentially, which it should handle fairly well as far as filesystem layout goes. 
This way might end up creating a mess on-disk. Perhaps even more to the point, you've added a read() kernel call that was previously not there at all, without having removed either the lseek() or the write(). Perhaps that scales better when what you're measuring is saturation conditions on a many-core machine, but I have a very hard time believing that it's not a significant net loss under less-contended conditions. I'm inclined to think that a better solution in the long run is to keep the relation extension lock but find a way to extend files more than one page per lock acquisition. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On 2015-07-19 11:56:47 -0400, Tom Lane wrote: Andres Freund and...@anarazel.de writes: On 2015-07-19 11:28:25 -0400, Tom Lane wrote: At this point session 1 will go and create page 44, won't it, and you just wasted a page. My local code now recognizes that case and uses the page. We just need to do a PageIsNew() check. Er, what? How can you tell whether an all-zero page was or was not just written by another session? The check is only done while holding the io lock on the relevant page (we have to hold that anyway), after reading it in ourselves, just before setting BM_VALID. As we can only get to that point when there wasn't any other entry for the page in the buffer table, that guarantees that no other backend is currently expanding into that page. Others might wait to read it, but those'll wait behind the IO lock. The situation the read() protects us against is that two backends try to extend to the same block, but after one of them succeeded the buffer is written out and reused for an independent page. So there is no in-memory state telling the slower backend that that page has already been used. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in bttext_abbrev_convert()
On 7/6/15 3:52 PM, Alvaro Herrera wrote: Kevin Grittner wrote: Alvaro Herrera alvhe...@2ndquadrant.com wrote: Peter Geoghegan wrote: It would be nice to always have an HTML report from gcov available on the internet. That would be something useful to automate, IMV. http://pgci.eisentraut.org/jenkins/job/postgresql_master_coverage/Coverage/ What would it take to get something like that which uses the check-world target instead of just the check target? I suggest CC'ing Peter as a first measure. I already suggested this (or something similar) to him months ago. I would like to do that, but the current host has no more capacity and the hoster is complaining, so I need to look for a new hosting solution before I can expand what is being built. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] LWLock deadlock and gdb advice
On Thu, Jul 16, 2015 at 12:03 AM, Jeff Janes jeff.ja...@gmail.com wrote: On Wed, Jul 15, 2015 at 8:44 AM, Heikki Linnakangas hlinn...@iki.fi wrote: Both. Here's the patch. Previously, LWLockAcquireWithVar set the variable associated with the lock atomically with acquiring it. Before the lwlock-scalability changes, that was straightforward because you held the spinlock anyway, but it's a lot harder/expensive now. So I changed the way acquiring a lock with a variable works. There is now a separate flag, LW_FLAG_VAR_SET, which indicates that the current lock holder has updated the variable. The LWLockAcquireWithVar function is gone - you now just use LWLockAcquire(), which always clears the LW_FLAG_VAR_SET flag, and you can call LWLockUpdateVar() after that if you want to set the variable immediately. LWLockWaitForVar() always waits if the flag is not set, i.e. it will not return regardless of the variable's value, if the current lock-holder has not updated it yet. I ran this for a while without casserts and it seems to work. But with casserts, I get failures in the autovac process on the GIN index. I don't see how this is related to the LWLock issue, but I didn't see it without your patch. Perhaps the system just didn't survive long enough to uncover it without the patch (although it shows up pretty quickly). It could just be an overzealous Assert, since the casserts off didn't show problems. bt and bt full are shown below. Cheers, Jeff
#0  0x003dcb632625 in raise () from /lib64/libc.so.6
#1  0x003dcb633e05 in abort () from /lib64/libc.so.6
#2  0x00930b7a in ExceptionalCondition (
    conditionName=0x9a1440 "!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))",
    errorType=0x9a12bc "FailedAssertion", fileName=0x9a12b0 "ginvacuum.c",
    lineNumber=713) at assert.c:54
#3  0x004947cf in ginvacuumcleanup (fcinfo=0x7fffee073a90) at ginvacuum.c:713
It now looks like this *is* unrelated to the LWLock issue. 
The assert that it is tripping over was added just recently (302ac7f27197855afa8c) and so I had not been testing under its presence until now. It looks like it is finding all-zero pages (index extended but then a crash before initializing the page?) and it doesn't like them.
(gdb) f 3
(gdb) p *(char[8192]*)(page)
$11 = '\000' <repeats 8191 times>
Presumably before this assert, such pages would just be permanently orphaned. Cheers, Jeff
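The all-zero pages Jeff found are exactly what a PageIsNew()-style check detects: a page that was extended but never initialized reads back as all zeroes, so its pd_upper field is 0. A minimal standalone illustration follows; the struct here is a rough stand-in, not PostgreSQL's exact PageHeaderData layout:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MY_BLCKSZ 8192

/* Rough stand-in for the start of a PostgreSQL page header */
typedef struct MiniPageHeader
{
	uint64_t	pd_lsn;
	uint16_t	pd_checksum;
	uint16_t	pd_flags;
	uint16_t	pd_lower;
	uint16_t	pd_upper;		/* 0 on a never-initialized page */
	uint16_t	pd_special;
	uint16_t	pd_pagesize_version;
} MiniPageHeader;

/* PageIsNew()-style check: an all-zero page necessarily has pd_upper == 0 */
static bool
page_is_new(const char *page)
{
	MiniPageHeader hdr;

	memcpy(&hdr, page, sizeof(hdr));	/* avoid alignment assumptions */
	return hdr.pd_upper == 0;
}
```

Any properly initialized page sets pd_upper to a nonzero offset, so the zero test cleanly separates "crash left an uninitialized page" from "page in use" - which is why vacuum code can tolerate such pages rather than assert on them.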
Re: [HACKERS] creating extension including dependencies
On Sat, Jul 18, 2015 at 8:00 AM, Petr Jelinek p...@2ndquadrant.com wrote: On 2015-07-15 06:07, Michael Paquier wrote: On Fri, Jul 10, 2015 at 11:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@anarazel.de writes: On July 10, 2015 4:16:59 PM GMT+02:00, Tom Lane t...@sss.pgh.pa.us wrote: Would that propagate down through multiple levels of CASCADE? (Although I'm not sure it would be sensible for a non-relocatable extension to depend on a relocatable one, so maybe the need doesn't arise in practice.) I'd say so. I was thinking of adding a flag that allows passing a schema to a non-relocatable extension. That'd then be passed down. I agree that it's unlikely to be used often. Yeah, I was visualizing it slightly differently, as a separate internal-only option schema_if_needed, but it works out to the same thing either way. I added a DEFAULT SCHEMA option into CREATE EXTENSION which behaves like SCHEMA but only for extensions that don't specify their schema, and is mutually exclusive with plain SCHEMA. The DEFAULT SCHEMA propagates when CASCADE is used while the SCHEMA option does not propagate. I'd like to hear opinions about this behavior. It would be possible to propagate SCHEMA as DEFAULT SCHEMA but that might have surprising results (installing dependencies in the same schema as the current extension). Hm... First, the current patch has a surprising behavior because it fails to create an extension in cascade when creation is attempted on a custom schema:
=# create schema po;
CREATE SCHEMA
=# create extension hstore_plperl with schema po cascade;
NOTICE: 0: installing required extension hstore
LOCATION: CreateExtension, extension.c:1483
NOTICE: 0: installing required extension plperl
LOCATION: CreateExtension, extension.c:1483
ERROR: 42704: type hstore does not exist
LOCATION: typenameType, parse_type.c:258
Time: 30.071 ms
To facilitate the life of users, I think that the parent extensions should be created in the same schema as their child by default. 
In this case hstore should be created in po, not public. And actually by looking at the code I can guess that when DEFAULT SCHEMA is used and a non-relocatable extension with a schema defined is created in cascade, this will error out as well. That's not really user-friendly. Hence, as a schema can only be specified in a control file for a non-relocatable extension, I think that the schema name propagated to the root extensions should be the one specified in the control file of the child if it is defined there, instead of the one specified by the user (imagine for example an extension created in cascade that requires adminpack, an extension that can only be deployed in pg_catalog), and have the root node use its own schema if it has one in its control file by default. For example in this case:
foo1 (WITH SCHEMA hoge) -> foo2 (schema = pg_catalog) -> foo2_1
                        |
                        |-- foo2_2 -> foo3 (no schema)
With this command: CREATE EXTENSION foo1 WITH SCHEMA hoge; foo3 is created on schema po. foo2, as well as its required dependencies foo2_1 and foo2_2, are created on pg_catalog. Now DEFAULT SCHEMA is trying to achieve: Hey, I want to define foo2_1 and foo2_2 on schema hoge. This may be worth achieving (though IMO foo1 should have direct dependencies on foo2_1 and foo2_2), but I think that we should decide what default behavior we want, and implement the additional behavior in a second patch, separate from the patch of this thread. Personally I am more a fan of propagating the schema of the child, and not of its grand-child, to parent extensions by default, but I am not alone here :) The list of contrib modules excluded from the build in the case of MSVC needs to include test_extensions ($contrib_excludes in src/tools/msvc/Mkvcbuild.pm) or the build on Windows using MSVC will fail. commit_ts does that for example. Done, hopefully correctly, I don't have access to MSVC. That's done correctly. 
+	/*
+	 * Propagate the CASCADE option and add current extenssion
+	 * to the list for checking the cyclic dependencies.
+	 */
s/extenssion/extension/
+	/* Check for cyclic dependency between extension. */
s/extension/extensions/
psql tab completion should be completed with CASCADE. See the part with WITH SCHEMA in tab-complete.c. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] LWLock deadlock and gdb advice
On Wed, Jul 15, 2015 at 8:44 AM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/30/2015 11:24 PM, Andres Freund wrote: On 2015-06-30 22:19:02 +0300, Heikki Linnakangas wrote: Hm. Right. A recheck of the value after the queuing should be sufficient to fix? That's how we deal with the exact same scenarios for the normal lock state, so that shouldn't be very hard to add. Yeah. It's probably more efficient to not release the spinlock between checking the value and LWLockQueueSelf() though. Right now LWLockQueueSelf() takes spinlocks etc itself, and is used that way in a bunch of callsites... So that'd be harder. Additionally I'm planning to get rid of the spinlocks around queuing (they show up as bottlenecks in contended workloads), so it seems more future proof not to assume that either way. On top of that I think we should, when available (or using the same type of fallback as for 32bit?), use 64 bit atomics for the var anyway. I'll take a stab at fixing this tomorrow. Thanks! Do you mean both or just the second issue? Both. Here's the patch. Previously, LWLockAcquireWithVar set the variable associated with the lock atomically with acquiring it. Before the lwlock-scalability changes, that was straightforward because you held the spinlock anyway, but it's a lot harder/expensive now. So I changed the way acquiring a lock with a variable works. There is now a separate flag, LW_FLAG_VAR_SET, which indicates that the current lock holder has updated the variable. The LWLockAcquireWithVar function is gone - you now just use LWLockAcquire(), which always clears the LW_FLAG_VAR_SET flag, and you can call LWLockUpdateVar() after that if you want to set the variable immediately. LWLockWaitForVar() always waits if the flag is not set, i.e. it will not return regardless of the variable's value, if the current lock-holder has not updated it yet. This passes make check, but I haven't done any testing beyond that. Does this look sane to you? 
After applying this patch to commit fdf28853ae6a397497b79f, it has survived testing long enough to convince me that this fixes the problem. Cheers, Jeff
Re: [HACKERS] optimizing vacuum truncation scans
On 16 July 2015 at 06:51, Haribabu Kommi kommi.harib...@gmail.com wrote: I marked the patch as ready for committer. The most recent results seem to indicate that the prefetch isn't worth pursuing, but the vm test is. Can someone repeat the perf tests on something larger so we can see when/if there is a benefit? Jeff, can you add detailed comments to explain the theory of operation of these patches? The patches add the code, but don't say why. We will forget, so let's put the detail in there now please. Thanks. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Parallel Seq Scan
On Fri, Jul 17, 2015 at 1:22 PM, Haribabu Kommi kommi.harib...@gmail.com wrote: On Thu, Jul 16, 2015 at 1:10 PM, Amit Kapila amit.kapil...@gmail.com wrote: Thanks, I will fix this in the next version of the patch. I am posting in this thread as I am not sure whether it needs a separate thread or not. I went through the code and found that the newly added Funnel node is tightly coupled with partial seq scan; in order to add many more parallel plans alongside parallel seq scan, we need to remove the integration of this node with partial seq scan. This assumption is wrong: the Funnel node can execute any node beneath it (refer ExecFunnel -> funnel_getnext -> ExecProcNode; similarly you can see exec_parallel_stmt). Yes, the nodes currently supported under a Funnel node are limited (partial seq scan, Result), for reasons mentioned upthread: readfuncs.c doesn't yet support reading the Plan nodes a worker backend needs in order to read the PlannedStmt. Of course we can add them, but as we are supporting parallelism only for a limited set of nodes, I have not enhanced readfuncs.c. In general, though, the basic infrastructure is designed in such a way that it can support other nodes beneath it. To achieve the same, I have the following ideas. Execution: the Funnel execution varies based on the plan node below it. 1) partial scan - Funnel does the local scan as well and returns the tuples. 2) partial agg - Funnel does the merging of aggregate results and returns the final result. Basically, Funnel will execute any node beneath it; the Funnel node itself is not responsible for doing a local scan or any form of consolidation of results. As of now, it has these three basic properties: - Has one child, runs multiple copies of it in parallel. - Combines the results into a single tuple stream. - Can run the child itself if no workers are available. Any other better ideas to achieve the same?
Refer to slides 16-19 of the Parallel Sequential Scan presentation at PGCon https://www.pgcon.org/2015/schedule/events/785.en.html I don't have a very clear idea of the best way to transform the nodes in the optimizer, but I think we can figure that out later, unless a majority of people see that as a blocking factor. Thanks for looking into the patch! With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
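The three Funnel properties listed above (one child, multiple parallel copies, results merged into one tuple stream, leader runs the child itself when no workers are available) can be sketched as a toy Python model. This is a hypothetical illustration of the described behaviour, not PostgreSQL code; threads stand in for worker backends and a queue stands in for the shared tuple queues.

```python
import queue
import threading

class Funnel:
    """Toy model: gather tuples from N parallel copies of one child plan."""

    def __init__(self, make_child, nworkers):
        self.make_child = make_child  # factory producing a fresh child iterator
        self.nworkers = nworkers

    def execute(self):
        if self.nworkers == 0:
            # No workers available: the leader runs the child plan itself.
            yield from self.make_child()
            return

        q = queue.Queue()
        done = object()  # sentinel marking one worker's completion

        def worker():
            # Each worker runs its own copy of the child plan and ships
            # its tuples into the shared queue.
            for tup in self.make_child():
                q.put(tup)
            q.put(done)

        for _ in range(self.nworkers):
            threading.Thread(target=worker, daemon=True).start()

        # Funnel itself does no scanning or consolidation: it only merges
        # the workers' outputs into a single tuple stream.
        finished = 0
        while finished < self.nworkers:
            item = q.get()
            if item is done:
                finished += 1
            else:
                yield item
```

With two workers each scanning the same three-row "relation", the funnel yields six tuples in some interleaved order; with zero workers it degrades to a plain local scan.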
Re: [HACKERS] Implementation of global temporary tables?
On 20/07/15 15:00, Pavel Stehule wrote: 2015-07-19 21:39 GMT+02:00 Josh Berkus j...@agliodbs.com: Pavel, All: Just to be clear, the idea of a global temp table is that the table def is available to all users, but the data is private to each session? yes. Pavel -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com Just wondering... Would it be difficult to add the ability for one user to share the contents with a list of named other users (roles)? -Gavin
Re: [HACKERS] Implementation of global temporary tables?
2015-07-20 5:33 GMT+02:00 Gavin Flower gavinflo...@archidevsys.co.nz: On 20/07/15 15:00, Pavel Stehule wrote: 2015-07-19 21:39 GMT+02:00 Josh Berkus j...@agliodbs.com: Pavel, All: Just to be clear, the idea of a global temp table is that the table def is available to all users, but the data is private to each session? yes. Pavel -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com Just wondering... Would it be difficult to add the ability for one user to share the contents with a list of named other users (roles)? It is probably possible, but not for temporary data: temporary data live in process-local memory, so they are not accessible from other sessions. Sharing table contents would need: 1. some infrastructure to record the sharing, i.e. who can share what with whom; 2. a cleanup policy, since temporary data are cleaned up at end of transaction or end of session; 3. the data to be stored in shared memory instead of process memory. So it is possible, but it would be a partially different feature. -Gavin
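The semantics under discussion (shared table definition, session-private contents, cleanup at session end) can be illustrated with a toy Python model. This is purely a sketch of the concept; the class and its methods are invented for illustration and say nothing about how PostgreSQL would actually implement global temporary tables.

```python
class GlobalTempTable:
    """Toy model: one shared definition, per-session private contents."""

    def __init__(self, name, columns):
        # The definition (name, columns) is the shared, catalog-like part,
        # visible to every session.
        self.name = name
        self.columns = columns
        # The contents are keyed by session: each session sees only its
        # own rows, mimicking process-local temporary data.
        self._rows = {}

    def insert(self, session_id, row):
        self._rows.setdefault(session_id, []).append(row)

    def select(self, session_id):
        # A session can read only the rows it inserted itself.
        return list(self._rows.get(session_id, []))

    def end_session(self, session_id):
        # Temporary data is discarded when the session ends.
        self._rows.pop(session_id, None)
```

Gavin's question amounts to asking whether select() could consult other sessions' buckets for a list of granted roles, which, as Pavel notes, would require the data to live somewhere all sessions can reach.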
Re: [HACKERS] Relation extension scalability
Andres Freund and...@anarazel.de writes: On 2015-07-19 11:28:25 -0400, Tom Lane wrote: At this point session 1 will go and create page 44, won't it, and you just wasted a page. My local code now recognizes that case and uses the page. We just need to do a PageIsNew() check. Er, what? How can you tell whether an all-zero page was or was not just written by another session? regards, tom lane
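The reuse idea Andres describes rests on the fact that a freshly extended page is all zeroes, which PostgreSQL's PageIsNew() macro detects by testing whether pd_upper in the page header is zero. The following is a simplified standalone Python sketch of that check, not PostgreSQL code; the 14-byte offset of pd_upper within PageHeaderData and the helper names are assumptions for illustration.

```python
PAGE_SIZE = 8192      # default PostgreSQL block size
PD_UPPER_OFF = 14     # assumed offset of pd_upper within the page header

def extend(relation):
    """Extend the relation by one page; a new page is all zeroes."""
    relation.append(bytearray(PAGE_SIZE))
    return len(relation) - 1

def page_is_new(page):
    # Mirrors the PageIsNew() idea: a page whose pd_upper is zero has
    # never been initialized, so another session's freshly created page
    # can be recognized and reused instead of extending again.
    return page[PD_UPPER_OFF:PD_UPPER_OFF + 2] == b"\x00\x00"

def init_page(page):
    # Stand-in for PageInit(): set pd_upper to a nonzero value, after
    # which the page no longer looks "new".
    page[PD_UPPER_OFF:PD_UPPER_OFF + 2] = PAGE_SIZE.to_bytes(2, "little")
```

Tom's objection is about exactly the gap this sketch glosses over: an all-zero page found past the expected end could have been created (but not yet initialized) by another session moments earlier, so "new" alone does not say whose page it is.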