Re: [HACKERS] Proposal : Parallel Merge Join
On Fri, Feb 24, 2017 at 4:49 PM, Dilip Kumar wrote:
> On Fri, Feb 24, 2017 at 3:04 PM, Amit Kapila wrote:
>> What advantage do you see for considering such a path when the cost of
>> innerpath is more than cheapest_total_inner?  Remember the more paths
>> we try to consider, the more time we spend in the planner.  By any
>> chance are you able to generate any query where it will give benefit
>> by considering costlier innerpath?
>
> Changed

It seems you have forgotten to change in the below check:

+ if (innerpath != NULL &&
+     innerpath->parallel_safe &&
+     (cheapest_startup_inner == NULL ||
+      cheapest_startup_inner->parallel_safe == false ||
+      compare_path_costs(innerpath, cheapest_startup_inner,
+                         STARTUP_COST) < 0))

+ /* Found a cheap (or even-cheaper) sorted parallel safer path */

typo /safer/safe

Note - Change the patch status in CF app (to Needs Review) whenever you
post a new patch.  I could see that your other patch also needs a
similar update in CF app.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Feb 24, 2017 at 9:47 PM, Robert Haas wrote:
>
> And there are no bugs, right?  :-)

Yeah yeah, absolutely nothing. Just like any other feature committed to
Postgres so far ;-)

I need to polish the chain conversion patch a bit and also add missing
support for redo, hash indexes etc. Support for hash indexes will need
overloading of ip_posid bits in the index tuple (since there are no free
bits left in hash tuples). I plan to work on that next and submit a
fully functional patch, hopefully before the commit-fest starts.

(I have mentioned the idea of overloading ip_posid bits a few times now
and haven't heard any objection so far. Well, that could either mean
that nobody has read those emails seriously or there is general
acceptance of that idea.. I am assuming the latter :-))

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)
2017-02-25 1:40 GMT+09:00 Claudio Freire:
> On Fri, Feb 24, 2017 at 1:24 PM, Robert Haas wrote:
>> On Fri, Feb 24, 2017 at 7:29 PM, Kohei KaiGai wrote:
>>> This attached patch re-designed the previous v2 according to Robert's
>>> comment.
>>> All I had to do is putting the entrypoint for ForeignScan/CustomScan
>>> at ExecShutdownNode, because it is already modified to call the child
>>> node first, earlier than ExecShutdownGather which releases the DSM
>>> segment.
>>
>> Aside from the documentation, which needs some work, this looks fine
>> to me on a quick read.
>
> LGTM too.
>
> You may want to clarify in the docs when the hook is called in
> relation to other hooks too.

I tried to add a description of how custom-scan callbacks perform under
the executor, and roughly when they are invoked. However, it is
fundamentally not easy for most people, because it assumes the FDW/CSP
author understands the overall behavior of the optimizer / executor, not
only the API specifications.

Thanks,
--
KaiGai Kohei

parallel-finish-fdw_csp.v4.patch
Description: Binary data
Re: [HACKERS] GUC for cleanup indexes threshold.
On Sat, Feb 25, 2017 at 3:40 AM, Peter Geoghegan wrote:
> On Fri, Feb 24, 2017 at 9:26 AM, Robert Haas wrote:
>> I think this thread is pretty short on evidence that would let us make
>> a smart decision about what to do here.  I see three possibilities.
>> The first is that this patch is a good idea whether we do something
>> about the issue of half-dead pages or not.  The second is that this
>> patch is a good idea if we do something about the issue of half-dead
>> pages but a bad idea if we don't.  The third is that this patch is a
>> bad idea whether or not we do anything about the issue of half-dead
>> pages.

+1. I think we can track the stats from IndexBulkDeleteResult->pages_free
to see the impact of the patch.

> Half-dead pages are not really relevant to this discussion, AFAICT. I
> think that both you and Simon mean "recyclable" pages.

Yes, I think so, and I think that is just confusion about terminology.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
[HACKERS] Oblivious nested loop join algorithm
Hi all,

Is the current implementation of nested loop join in Postgres
(memory-trace) oblivious? Why or why not?
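For readers unfamiliar with the term: roughly speaking, an algorithm is (memory-trace) oblivious if the sequence of memory addresses it touches depends only on the input size, not on the input values. The sketch below is not PostgreSQL's actual executor code (see nodeNestloop.c for that); it is a minimal standalone version of the textbook nested loop join the question refers to, with illustrative names:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Textbook nested loop join on two integer arrays, counting matches.
 *
 * Note the access pattern: every (outer[i], inner[j]) pair is read in a
 * fixed order that depends only on nouter and ninner, never on the
 * stored values -- only the result depends on the data.  Whether that
 * property survives in the real executor (tuple emission, rescans,
 * materialization, etc.) is exactly what the question asks about.
 */
static int
nested_loop_join_count(const int *outer, size_t nouter,
					   const int *inner, size_t ninner)
{
	int			nmatches = 0;
	size_t		i,
				j;

	for (i = 0; i < nouter; i++)
		for (j = 0; j < ninner; j++)
			if (outer[i] == inner[j])
				nmatches++;

	return nmatches;
}
```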
Re: [HACKERS] Logical replication existing data copy
On 25/02/17 00:39, Erik Rijkers wrote:
> On 2017-02-25 00:08, Petr Jelinek wrote:
>>
>> There is now a lot of fixes for existing code that this patch depends
>> on. Hopefully some of the fixes get committed soonish.
>
> Indeed - could you look over the below list of 8 patches; is it correct
> and in the right (apply) order?
>
> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
> 0002-Fix-after-trigger-execution-in-logical-replication.patch
> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
> snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
> snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
> snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
> snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch

Yes, that should be correct.

--
Petr Jelinek
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Logical replication existing data copy
On 2017-02-25 00:08, Petr Jelinek wrote:
> There is now a lot of fixes for existing code that this patch depends
> on. Hopefully some of the fixes get committed soonish.

Indeed - could you look over the below list of 8 patches; is it correct
and in the right (apply) order?

0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch
0001-Logical-replication-support-for-initial-data-copy-v6.patch

(they do apply & compile like this...)
Re: [HACKERS] Poor memory context performance in large hash joins
On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> > Concretely, something like the attached.  This passes regression tests
> > but I've not pushed on it any harder than that.
>
> Heh, I'd just gotten something that didn't immediately crash anymore ;)
>
> Running your patch against Jeff's test-case, verified before that I
> could easily reproduce the O(N^2) cost.

Oh, that didn't take as long as I was afraid (optimized/non-assert build):

postgres[26268][1]=# SET work_mem = '13GB';
SET
Time: 2.591 ms
postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 from foobar t where t.titleid=foobar2.titleid);
┌───────┐
│ count │
├───────┤
│     0 │
└───────┘
(1 row)

Time: 268043.710 ms (04:28.044)

Profile:
  13.49%  postgres      [.] ExecHashTableInsert
  11.16%  postgres      [.] BufFileRead
   9.20%  postgres      [.] ExecHashIncreaseNumBatches
   9.19%  libc-2.24.so  [.] __memmove_avx_unaligned_erms
   6.74%  postgres      [.] dense_alloc.isra.0
   5.53%  postgres      [.] ExecStoreMinimalTuple
   5.14%  postgres      [.] ExecHashJoinGetSavedTuple.isra.0
   4.68%  postgres      [.] AllocSetAlloc
   4.12%  postgres      [.] AllocSetFree
   2.31%  postgres      [.] palloc
   2.06%  [kernel]      [k] free_pcppages_bulk

I'd previously run the test for 30min, with something like 99% of the
profile being in AllocSetFree().

- Andres
Re: [HACKERS] Poor memory context performance in large hash joins
On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> Concretely, something like the attached.  This passes regression tests
> but I've not pushed on it any harder than that.

Heh, I'd just gotten something that didn't immediately crash anymore ;)

Running your patch against Jeff's test-case, verified before that I
could easily reproduce the O(N^2) cost.

- Andres
Re: [HACKERS] Logical replication existing data copy
On 25/02/17 00:05, Erik Rijkers wrote:
> On 2017-02-24 22:58, Petr Jelinek wrote:
>> On 23/02/17 01:41, Petr Jelinek wrote:
>>> On 23/02/17 01:02, Erik Rijkers wrote:
>>>> On 2017-02-22 18:13, Erik Rijkers wrote:
>>>>> On 2017-02-22 14:48, Erik Rijkers wrote:
>>>>>> On 2017-02-22 13:03, Petr Jelinek wrote:
>>>>>>
>>>>>>> 0001-Skip-unnecessary-snapshot-builds.patch
>>>>>>> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
>>>>>>> 0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
>>>>>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
>>>>>>> 0002-Fix-after-trigger-execution-in-logical-replication.patch
>>>>>>> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
>>>>>>> 0001-Logical-replication-support-for-initial-data-copy-v5.patch
>>>>>>
>>>>>> It works well now, or at least my particular test case seems now
>>>>>> solved.
>>>>>
>>>>> Cried victory too early, I'm afraid.
>>>>> I got into a 'hung' state while repeating pgbench_derail2.sh.
>>>>
>>>> Below is some state.  I notice that master pg_stat_replication.state
>>>> is 'startup'.  Maybe I should only start the test after that state
>>>> has changed.  Any of the other possible values (catchup, streaming)
>>>> would be OK, I would think.
>>>
>>> I think that's a known issue (see comment in tablesync.c about hanging
>>> forever). I think I may have fixed it locally.
>>>
>>> I will submit a patch once I fixed the other snapshot issue (I managed
>>> to reproduce it as well, although very rarely, so it's rather hard to
>>> test).
>>
>> Hi,
>>
>> Here it is. But check also the snapbuild related thread for updated
>> patches related to that (the issue you had with this not copying all
>> rows is yet another pre-existing Postgres bug).
>
> The four earlier snapbuild patches apply cleanly, but
> then I get errors while applying
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch:
>
> patching file src/test/regress/expected/sanity_check.out
> (Stripping trailing CRs from patch.)
> patching file src/test/regress/expected/subscription.out
> Hunk #2 FAILED at 25.
> 1 out of 2 hunks FAILED -- saving rejects to file
> src/test/regress/expected/subscription.out.rej
> (Stripping trailing CRs from patch.)
> patching file src/test/regress/sql/object_address.sql
> (Stripping trailing CRs from patch.)
> patching file src/test/regress/sql/subscription.sql
> (Stripping trailing CRs from patch.)
> patching file src/test/subscription/t/001_rep_changes.pl
> Hunk #9 succeeded at 175 with fuzz 2.
> Hunk #10 succeeded at 193 (offset -9 lines).
> (Stripping trailing CRs from patch.)
> patching file src/test/subscription/t/002_types.pl
> (Stripping trailing CRs from patch.)
> can't find file to patch at input line 4296
> Perhaps you used the wrong -p or --strip option?
> The text leading up to this was:
> --
> |diff --git a/src/test/subscription/t/003_constraints.pl
> b/src/test/subscription/t/003_constraints.pl
> |index 17d4565..9543b91 100644
> |--- a/src/test/subscription/t/003_constraints.pl
> |+++ b/src/test/subscription/t/003_constraints.pl
> --
> File to patch:
>
> Can you have a look?

Hi,

I think you are missing the other fixes (this specific error says that
you didn't apply
0002-Fix-after-trigger-execution-in-logical-replication.patch).

There is now a lot of fixes for existing code that this patch depends
on. Hopefully some of the fixes get committed soonish.

--
Petr Jelinek
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Logical replication existing data copy
On 2017-02-24 22:58, Petr Jelinek wrote:
> On 23/02/17 01:41, Petr Jelinek wrote:
>> On 23/02/17 01:02, Erik Rijkers wrote:
>>> On 2017-02-22 18:13, Erik Rijkers wrote:
>>>> On 2017-02-22 14:48, Erik Rijkers wrote:
>>>>> On 2017-02-22 13:03, Petr Jelinek wrote:
>>>>>
>>>>>> 0001-Skip-unnecessary-snapshot-builds.patch
>>>>>> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
>>>>>> 0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
>>>>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
>>>>>> 0002-Fix-after-trigger-execution-in-logical-replication.patch
>>>>>> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
>>>>>> 0001-Logical-replication-support-for-initial-data-copy-v5.patch
>>>>>
>>>>> It works well now, or at least my particular test case seems now
>>>>> solved.
>>>>
>>>> Cried victory too early, I'm afraid.
>>>> I got into a 'hung' state while repeating pgbench_derail2.sh.
>>>
>>> Below is some state.  I notice that master pg_stat_replication.state
>>> is 'startup'.  Maybe I should only start the test after that state
>>> has changed.  Any of the other possible values (catchup, streaming)
>>> would be OK, I would think.
>>
>> I think that's a known issue (see comment in tablesync.c about hanging
>> forever). I think I may have fixed it locally.
>>
>> I will submit a patch once I fixed the other snapshot issue (I managed
>> to reproduce it as well, although very rarely, so it's rather hard to
>> test).
>
> Hi,
>
> Here it is. But check also the snapbuild related thread for updated
> patches related to that (the issue you had with this not copying all
> rows is yet another pre-existing Postgres bug).

The four earlier snapbuild patches apply cleanly, but
then I get errors while applying
0001-Logical-replication-support-for-initial-data-copy-v6.patch:

patching file src/test/regress/expected/sanity_check.out
(Stripping trailing CRs from patch.)
patching file src/test/regress/expected/subscription.out
Hunk #2 FAILED at 25.
1 out of 2 hunks FAILED -- saving rejects to file
src/test/regress/expected/subscription.out.rej
(Stripping trailing CRs from patch.)
patching file src/test/regress/sql/object_address.sql
(Stripping trailing CRs from patch.)
patching file src/test/regress/sql/subscription.sql
(Stripping trailing CRs from patch.)
patching file src/test/subscription/t/001_rep_changes.pl
Hunk #9 succeeded at 175 with fuzz 2.
Hunk #10 succeeded at 193 (offset -9 lines).
(Stripping trailing CRs from patch.)
patching file src/test/subscription/t/002_types.pl
(Stripping trailing CRs from patch.)
can't find file to patch at input line 4296
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--
|diff --git a/src/test/subscription/t/003_constraints.pl
b/src/test/subscription/t/003_constraints.pl
|index 17d4565..9543b91 100644
|--- a/src/test/subscription/t/003_constraints.pl
|+++ b/src/test/subscription/t/003_constraints.pl
--
File to patch:

Can you have a look?

thanks,

Erik Rijkers
Re: [HACKERS] Poor memory context performance in large hash joins
Jeff Janes writes:
> On Fri, Feb 24, 2017 at 10:59 AM, Tom Lane wrote:
>> Uh, what?  In a doubly-linked list, you can remove an element in O(1)
>> time, you don't need any searching.

> Currently it is walking the chain to identify which block holds the chunk
> in the first place, not just to get the pointer to the previous block. But
> I see that that could be fixed by pointer arithmetic once there is a reason
> to fix it. Which there isn't a reason to as long as you need to walk the
> chain to get the prev pointer anyway. Like this?:
> targetblock = (AllocBlock) (((char*)chunk) - ALLOC_BLOCKHDRSZ);

Right.  We're just turning around the existing address calculation.
It'd be a good idea to provide some cross-check that we found a
sane-looking block header, but that's not that hard --- we know what
ought to be in aset, freeptr, and endptr for a single-chunk block's
header, and that seems like enough of a crosscheck to me.

Concretely, something like the attached.  This passes regression tests
but I've not pushed on it any harder than that.

One argument against this is that it adds some nonzero amount of
overhead to block management; but considering that we are calling
malloc or realloc or free alongside any one of these manipulations,
that overhead should be pretty negligible.  I'm a bit more worried
about increasing the alloc block header size by 8 bytes, but that
would only really matter with a very small allocChunkLimit.

			regards, tom lane

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 4dfc3ec..c250bae 100644
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
*************** typedef AllocSetContext *AllocSet;
*** 201,207 ****
  typedef struct AllocBlockData
  {
  	AllocSet	aset;			/* aset that owns this block */
! 	AllocBlock	next;			/* next block in aset's blocks list */
  	char	   *freeptr;		/* start of free space in this block */
  	char	   *endptr;			/* end of space in this block */
  } AllocBlockData;
--- 201,208 ----
  typedef struct AllocBlockData
  {
  	AllocSet	aset;			/* aset that owns this block */
! 	AllocBlock	prev;			/* prev block in aset's blocks list, if any */
! 	AllocBlock	next;			/* next block in aset's blocks list, if any */
  	char	   *freeptr;		/* start of free space in this block */
  	char	   *endptr;			/* end of space in this block */
  } AllocBlockData;
*************** AllocSetContextCreate(MemoryContext pare
*** 522,528 ****
--- 523,532 ----
  		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
+ 		block->prev = NULL;
  		block->next = set->blocks;
+ 		if (block->next)
+ 			block->next->prev = block;
  		set->blocks = block;
  		/* Mark block as not to be released at reset time */
  		set->keeper = block;
*************** AllocSetReset(MemoryContext context)
*** 604,609 ****
--- 608,614 ----
  			VALGRIND_MAKE_MEM_NOACCESS(datastart, block->freeptr - datastart);
  #endif
  			block->freeptr = datastart;
+ 			block->prev = NULL;
  			block->next = NULL;
  		}
  		else
*************** AllocSetAlloc(MemoryContext context, Siz
*** 710,725 ****
  #endif

  		/*
! 		 * Stick the new block underneath the active allocation block, so that
! 		 * we don't lose the use of the space remaining therein.
  		 */
  		if (set->blocks != NULL)
  		{
  			block->next = set->blocks->next;
  			set->blocks->next = block;
  		}
  		else
  		{
  			block->next = NULL;
  			set->blocks = block;
  		}
--- 715,734 ----
  #endif

  		/*
! 		 * Stick the new block underneath the active allocation block, if any,
! 		 * so that we don't lose the use of the space remaining therein.
  		 */
  		if (set->blocks != NULL)
  		{
+ 			block->prev = set->blocks;
  			block->next = set->blocks->next;
+ 			if (block->next)
+ 				block->next->prev = block;
  			set->blocks->next = block;
  		}
  		else
  		{
+ 			block->prev = NULL;
  			block->next = NULL;
  			set->blocks = block;
  		}
*************** AllocSetAlloc(MemoryContext context, Siz
*** 900,906 ****
--- 909,918 ----
  		VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
  								   blksize - ALLOC_BLOCKHDRSZ);

+ 		block->prev = NULL;
  		block->next = set->blocks;
+ 		if (block->next)
+ 			block->next->prev = block;
  		set->blocks = block;
  	}
*************** AllocSetFree(MemoryContext context, void
*** 960,988 ****
  	{
  		/*
  		 * Big chunks are certain to have been allocated as single-chunk
! 		 * blocks.  Find the containing block and return it to malloc().
  		 */
! 		AllocBlock	block = set->blocks;
! 		AllocBlock	prevblock = NULL;

! 		while (block != NULL)
! 		{
! 			if (chunk == (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ))
! 				break;
! 			prevblock = block;
! 			block = block->next;
! 		}
! 		if (block == NULL)
  			elog(ERROR, "could not find block containing chunk %p", chunk);
  		/* let's just make sure chunk is the only one in
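The two ideas the patch combines, recovering the block header from the chunk pointer by arithmetic and unlinking a block in O(1) via a prev link, can be sketched in isolation. This is not the actual aset.c code; all names and the stripped-down header layout below are illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a block header with a doubly-linked list. */
typedef struct Block
{
	struct Block *prev;
	struct Block *next;
} Block;

#define BLOCKHDRSZ sizeof(Block)

/*
 * Recover the containing block header from a chunk pointer: a
 * single-chunk block's payload starts immediately after its header,
 * so pointer arithmetic replaces the old walk of the whole chain.
 */
static Block *
block_of(void *chunk)
{
	return (Block *) (((char *) chunk) - BLOCKHDRSZ);
}

/* Unlink block from the list whose head is *head: O(1), no search. */
static void
unlink_block(Block **head, Block *block)
{
	if (block->prev)
		block->prev->next = block->next;
	else
		*head = block->next;
	if (block->next)
		block->next->prev = block->prev;
}
```

The cross-check Tom mentions (validating aset, freeptr, and endptr before trusting the computed header) would slot into block_of before returning.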
Re: [HACKERS] btree_gin and btree_gist for enums
On 02/24/2017 02:55 PM, Andrew Dunstan wrote:
> On 02/24/2017 11:02 AM, Tom Lane wrote:
>>> I don't know what to call it either. In my test I used
>>> CallerContextFunctionCall2 - not sure if that's quite right, but should
>>> be close.
>> CallerInfo?  CallerFInfo?  Or we could spell out CallerFmgrInfo but
>> that seems a bit verbose.
>
> I'll go with CallerFInfoFunctionCall2 etc.
>
> In the btree_gist system the calls to the routines like enum_cmp are
> buried about three levels deep. I'm thinking I'll just pass the flinfo
> down the stack and just call these routines at the bottom level.

It's occurred to me that we could reduce the code clutter in fmgr.c a
bit by turning the DirectFunctionCall{n}Coll functions into macros
calling these functions and passing NULL as the flinfo param.

cheers

andrew

--
Andrew Dunstan
https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] GUC for cleanup indexes threshold.
On Fri, Feb 24, 2017 at 9:26 AM, Robert Haas wrote:
> I think this thread is pretty short on evidence that would let us make
> a smart decision about what to do here.  I see three possibilities.
> The first is that this patch is a good idea whether we do something
> about the issue of half-dead pages or not.  The second is that this
> patch is a good idea if we do something about the issue of half-dead
> pages but a bad idea if we don't.  The third is that this patch is a
> bad idea whether or not we do anything about the issue of half-dead
> pages.

Half-dead pages are not really relevant to this discussion, AFAICT. I
think that both you and Simon mean "recyclable" pages.

There are several levels of indirection involved here, to keep the
locking very granular, so it gets tricky to talk about.

B-Tree page deletion is like a page split in reverse. It has a symmetry
with page splits, which have two phases (atomic operations). There are
also two phases for deletion, the first of which leaves the target page
without a downlink in its parent, and marks it half dead. By the end of
the first phase, there are still sibling pointers, so an index scan can
land on them before the second phase of deletion begins -- they can
visit a half-dead page before such time as the second phase of deletion
begins, where the sibling link goes away. So, the sibling link isn't
stale as such, but the page is still morally dead. (Second phase is
where we remove even the sibling links, and declare it fully dead.)

Even though there are two phases of deletion, the second still occurs
immediately after the first within VACUUM. The need to have two phases
is hard to explain, so I won't try, but it suffices to say that VACUUM
does not actually ever leave a page half dead unless there is a hard
crash.

Recall that VACUUMing of a B-Tree is performed sequentially, so blocks
can be recycled without needing to be found via a downlink or sibling
link by VACUUM. What is at issue here, then, is VACUUM's degree of
"eagerness" around putting *fully* dead B-Tree pages in the FSM for
recycling. The interlock with RecentGlobalXmin is what makes it
impossible for VACUUM to generally fully delete pages, *as well as*
mark them as recyclable (put them in the FSM) all at once.

Maybe you get this already, since, as I said, the terminology is tricky
in this area, but I can't tell.

--
Peter Geoghegan
Re: [HACKERS] PATCH: two slab-like memory allocators
Hi,

On 2017-02-21 01:43:46 +0100, Tomas Vondra wrote:
> Attached is v9 of this patch series. This addresses most of the points
> raised in the review, namely:

Cool, thanks.

> 3) get rid of the block-level bitmap tracking free chunks
>
> Instead of the bitmap, I've used a simple singly-linked list, using int32
> chunk indexes. Perhaps it could use the slist instead, but I'm not quite
> sure MAXALIGN is guaranteed to be greater than pointer.

I'm pretty sure it's guaranteed to be >= sizeof(void*).  But this seems
ok, so ...

Attached are changes I made on the path to committing the patch. These
look large enough that I wanted to give you a chance to comment:

- The AllocFreeInfo changes aren't correct for 32bit systems with
  64bit longs (like linux, unlike windows) - using %zu is better
  (z is code for size_t sized)
- Split off the reorderbuffer changes
- Removed SlabBlock/SlabChunk / renamed SlabBlockData
- +#ifdef MEMORY_CONTEXT_CHECKING
  +#define SLAB_CHUNK_USED \
  +	(offsetof(SlabChunkData, requested_size) + sizeof(Size))
  +#else
  doesn't work anymore, because requested_size isn't a top-level field.
  I first redefined it as (without surrounding ifdef)
  #define SLAB_CHUNK_USED \
  	(offsetof(SlabChunkData, header) + sizeof(StandardChunkHeader))
  but I'm not really sure there's a whole lot of point in the define
  rather than just using sizeof() on the whole thing - there's no
  trailing padding?
- SLAB_CHUNK_PUBLIC and SLAB_BLOCKHDRSZ are unused?
- renamed 'set' variables (and comments) to slab.
- used castNode(SlabContext, context) instead of manual casts
- I rephrased
  + *
  + * We cache indexes of the first empty chunk on each block (firstFreeChunk),
  + * and freelist index for blocks with least free chunks (minFreeChunks), so
  + * that we don't have to search the freelist and block on every SlabAlloc()
  + * call, which is quite expensive.
  so it's not referencing firstFreeChunk anymore, since that seems to
  make less sense now that firstFreeChunk is essentially just the head
  of the list of free chunks.
- added typedefs.list entries and pgindented slab.c
- "mark the chunk as unused (zero the bit)" referenced non-existent
  bitmap afaics.
- valgrind was triggering on
  block->firstFreeChunk = *(int32 *) SlabChunkGetPointer(chunk);
  because that was previously marked as NOACCESS (via wipe_mem).
  Explicitly marking as DEFINED solves that.
- removed
  * XXX Perhaps we should not be gentle at all and simply fails in all cases,
  * to eliminate the (mostly pointless) uncertainty.
- you'd included
  MemoryContext tup_context;
  in 0002, but it's not really useful yet (and the comments above it in
  reorderbuffer.h don't apply)
- SlabFreeInfo/SlabAllocInfo didn't actually compile w/ HAVE_ALLOCINFO
  defined (pre StandardChunkHeader def)
- some minor stuff.

I'm kinda inclined to drop SlabFreeInfo/SlabAllocInfo.

I've not yet looked a lot at the next type of context - I want to get
this much committed first...

I plan to give this another pass sometime this weekend and then push
soon.

- Andres

From cec3f8372137d2392ff7ac0ab1b2db11fc96e8b3 Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Thu, 23 Feb 2017 22:35:44 -0800
Subject: [PATCH 1/3] Make useful infrastructure from aset.c generally
 available.

An upcoming patch introduces a new type of memory context. To avoid
duplicating debugging infrastructure with aset.c move useful pieces to
memdebug.[ch].

While touching aset.c, fix printf format AllocFree* debug macros.

Author: Tomas Vondra
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/b3b2245c-b37a-e1e5-ebc4-857c914bc...@2ndquadrant.com
---
 src/backend/utils/mmgr/Makefile   |   2 +-
 src/backend/utils/mmgr/aset.c     | 115 +-
 src/backend/utils/mmgr/memdebug.c |  93 ++
 src/include/utils/memdebug.h      |  48
 4 files changed, 144 insertions(+), 114 deletions(-)
 create mode 100644 src/backend/utils/mmgr/memdebug.c

diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index 1842bae386..fc5f793b7f 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/utils/mmgr
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global

-OBJS = aset.o dsa.o freepage.o mcxt.o portalmem.o
+OBJS = aset.o dsa.o freepage.o mcxt.o memdebug.o portalmem.o

 include $(top_srcdir)/src/backend/common.mk

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 4dfc3ec260..8056c00ae4 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -41,46 +41,6 @@
  * chunks as chunks.  Anything "large" is passed off to malloc().  Change
  * the number of freelists to change the small/large boundary.
  *
- *
- * About CLOBBER_FREED_MEMORY:
- *
- * If this symbol is defined, all freed memory is overwritten with 0x7F's.
- *
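The free-chunk tracking scheme under discussion, a singly-linked freelist threaded through the free chunks themselves using int32 indexes instead of a per-block bitmap, can be sketched standalone. The names and layout below are illustrative, not the actual slab.c definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE		 16		/* must be >= sizeof(int32_t) */
#define CHUNKS_PER_BLOCK 8

/*
 * Each free chunk's first bytes are reused to store the index of the
 * next free chunk in the block; -1 terminates the list.  Only the head
 * index needs dedicated storage in the block header.
 */
typedef struct SlabBlockSketch
{
	int32_t		firstFreeChunk; /* head of in-block freelist, or -1 */
	char		data[CHUNKS_PER_BLOCK * CHUNK_SIZE];
} SlabBlockSketch;

static void *
chunk_ptr(SlabBlockSketch *block, int32_t idx)
{
	return block->data + (size_t) idx * CHUNK_SIZE;
}

static void
slab_block_init(SlabBlockSketch *block)
{
	int32_t		i;

	/* Thread all chunks onto the freelist: 0 -> 1 -> ... -> n-1 -> -1 */
	for (i = 0; i < CHUNKS_PER_BLOCK; i++)
		*(int32_t *) chunk_ptr(block, i) =
			(i + 1 < CHUNKS_PER_BLOCK) ? i + 1 : -1;
	block->firstFreeChunk = 0;
}

static int32_t
slab_block_alloc(SlabBlockSketch *block)
{
	int32_t		idx = block->firstFreeChunk;

	if (idx >= 0)				/* pop head of freelist */
		block->firstFreeChunk = *(int32_t *) chunk_ptr(block, idx);
	return idx;					/* -1 means block is full */
}

static void
slab_block_free(SlabBlockSketch *block, int32_t idx)
{
	/* push chunk back onto head of freelist */
	*(int32_t *) chunk_ptr(block, idx) = block->firstFreeChunk;
	block->firstFreeChunk = idx;
}
```

This also shows why valgrind complained in the review above: reading the next-free index dereferences memory inside a freed chunk, which has to be explicitly marked accessible first.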
Re: [HACKERS] snapbuild woes
On 22/02/17 03:05, Petr Jelinek wrote: > On 13/12/16 00:38, Petr Jelinek wrote: >> On 12/12/16 23:33, Andres Freund wrote: >>> On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote: On 12/12/16 22:42, Andres Freund wrote: > Hi, > > On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote: >> Hi, >> First one is outright bug, which has to do with how we track running >> transactions. What snapbuild basically does while doing initial snapshot >> is read the xl_running_xacts record, store the list of running txes and >> then wait until they all finish. The problem with this is that >> xl_running_xacts does not ensure that it only logs transactions that are >> actually still running (to avoid locking PGPROC) so there might be xids >> in xl_running_xacts that already committed before it was logged. > > I don't think that's actually true? Notice how LogStandbySnapshot() > only releases the lock *after* the LogCurrentRunningXacts() iff > wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you > observed must actually be a bit more complex :( > Hmm, interesting, I did see the transaction commit in the WAL before the xl_running_xacts that contained the xid as running. I only seen it on production system though, didn't really manage to easily reproduce it locally. >>> >>> I suspect the reason for that is that RecordTransactionCommit() doesn't >>> conflict with ProcArrayLock in the first place - only >>> ProcArrayEndTransaction() does. So they're still running in the PGPROC >>> sense, just not the crash-recovery sense... >>> >> >> That looks like reasonable explanation. BTW I realized my patch needs >> bit more work, currently it will break the actual snapshot as it behaves >> same as if the xl_running_xacts was empty which is not correct AFAICS. >> > > Hi, > > I got to work on this again. Unfortunately I haven't found solution that > I would be very happy with. 
What I did is in case we read > xl_running_xacts which has all transactions we track finished, we start > tracking from that new xl_running_xacts again with the difference that > we clean up the running transactions based on previously seen committed > ones. That means that on busy server we may wait for multiple > xl_running_xacts rather than just one, but at least we have chance to > finish unlike with current coding which basically waits for empty > xl_running_xacts. I also removed the additional locking for logical > wal_level in LogStandbySnapshot() since it does not work. Not hearing any opposition to this idea so I decided to polish this and also optimize it a bit. That being said, thanks to testing from Erik Rijkers I've identified one more bug in how we do the initial snapshot. Apparently we don't reserve the global xmin when we start building the initial exported snapshot for a slot (we only reserver catalog_xmin which is fine for logical decoding but not for the exported snapshot) so the VACUUM and heap pruning will happily delete old versions of rows that are still needed by anybody trying to use that exported snapshot. Attached are updated versions of patches: 0001 - Fixes the above mentioned global xmin tracking issues. Needs to be backported all the way to 9.4 0002 - Removes use of the logical decoding saved snapshots for initial exported snapshot since those snapshots only work for catalogs and not user data. Also needs to be backported all the way to 9.4. 0003 - Changes handling of the xl_running_xacts in initial snapshot build to what I wrote above and removes the extra locking from LogStandbySnapshot introduced by logical decoding. 0004 - Improves performance of initial snapshot building by skipping catalog snapshot build for transactions that don't do catalog changes. The 0001 and 0002 are bug fixes because without them the exported snapshots are basically corrupted. 
The 0003 and 0004 are performance improvements, but since on busy servers the snapshot export might otherwise never finish, they address rather serious performance issues. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services From 67a44702ff146756b33e8d15e91a02f5d9e86792 Mon Sep 17 00:00:00 2001 From: Petr Jelinek Date: Fri, 24 Feb 2017 21:39:03 +0100 Subject: [PATCH 1/4] Reserve global xmin for create slot snapshot export Otherwise VACUUM or pruning might remove tuples still needed by the exported snapshot. --- src/backend/replication/logical/logical.c | 31 +++ 1 file changed, 27 insertions(+), 4 deletions(-) diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c index 5529ac8..9062244 100644 --- a/src/backend/replication/logical/logical.c +++ b/src/backend/replication/logical/logical.c @@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin, * the slot machinery about the new limit. Once that's done the
[HACKERS] Re: proposal - psql: possibility to specify sort for describe commands, when size is printed
Hi 2017-02-23 12:17 GMT+01:00 Pavel Stehule: > Hi > > Currently it is not possible to control the sort columns for \d* commands. > Usually schema and table name are used. A really frequent task is collecting the > biggest objects in the database. "\dt+, \di+" shows the necessary information, but > not in a practical order. > > Instead of introducing some additional flags to the backslash commands, I > propose a special psql variable that can be used to specify the order > used when some plus command is used. > > something like > > set EXTENDED_DESCRIBE_SORT size_desc > \dt+ > \l+ > \di+ > > Possible variants: schema_table, table_schema, size_desc, size_asc > > Comments, notes? > here is a patch Regards Pavel > > Regards > > Pavel > diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index ae58708aae..b4dfd1f71c 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -3495,6 +3495,18 @@ bar +VERBOSE_SORT + + +This variable can be set to the values schema_name, +name_schema, size_asc, or +size_desc to control the ordering used by the +describe commands. 
+ + + + + VERBOSITY diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c index e2e4cbcc08..7ae5992b90 100644 --- a/src/bin/psql/describe.c +++ b/src/bin/psql/describe.c @@ -263,7 +263,18 @@ describeTablespaces(const char *pattern, bool verbose) NULL, "spcname", NULL, NULL); - appendPQExpBufferStr(&buf, "ORDER BY 1;"); + + if (verbose && pset.sversion >= 90200) + { + if (pset.verbose_sort == PSQL_SORT_SIZE_ASC) + appendPQExpBufferStr(&buf, "ORDER BY pg_catalog.pg_tablespace_size(oid), 1;"); + else if (pset.verbose_sort == PSQL_SORT_SIZE_DESC) + appendPQExpBufferStr(&buf, "ORDER BY pg_catalog.pg_tablespace_size(oid) DESC, 1;"); + else + appendPQExpBufferStr(&buf, "ORDER BY 1;"); + } + else + appendPQExpBufferStr(&buf, "ORDER BY 1;"); res = PSQLexec(buf.data); termPQExpBuffer(&buf); @@ -822,7 +833,21 @@ listAllDbs(const char *pattern, bool verbose) processSQLNamePattern(pset.db, &buf, pattern, false, false, NULL, "d.datname", NULL, NULL); - appendPQExpBufferStr(&buf, "ORDER BY 1;"); + if (verbose && pset.sversion >= 80200) + { + if (pset.verbose_sort == PSQL_SORT_SIZE_ASC) + appendPQExpBuffer(&buf, + "ORDER BY CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')\n" + " THEN pg_catalog.pg_database_size(d.datname) END ASC, 1;\n"); + else if (pset.verbose_sort == PSQL_SORT_SIZE_DESC) + appendPQExpBuffer(&buf, + "ORDER BY CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')\n" + " THEN pg_catalog.pg_database_size(d.datname) END DESC, 1;\n"); + else + appendPQExpBufferStr(&buf, "ORDER BY 1;"); + } + else + appendPQExpBufferStr(&buf, "ORDER BY 1;"); res = PSQLexec(buf.data); termPQExpBuffer(&buf); if (!res) @@ -3258,7 +3283,29 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys "n.nspname", "c.relname", NULL, "pg_catalog.pg_table_is_visible(c.oid)"); - appendPQExpBufferStr(&buf, "ORDER BY 1,2;"); + if (verbose && pset.sversion >= 80100) + { + if (pset.verbose_sort == PSQL_SORT_SCHEMA_NAME) + appendPQExpBufferStr(&buf, "ORDER BY 1,2;"); + else if (pset.verbose_sort 
== PSQL_SORT_NAME_SCHEMA) + appendPQExpBufferStr(&buf, "ORDER BY 2,1;"); + else + { + if (pset.sversion >= 9) +appendPQExpBufferStr(&buf, + "ORDER BY pg_catalog.pg_table_size(c.oid) "); + else +appendPQExpBufferStr(&buf, + "ORDER BY pg_catalog.pg_relation_size(c.oid) "); + + if (pset.verbose_sort == PSQL_SORT_SIZE_DESC) +appendPQExpBufferStr(&buf, "DESC, 1,2;"); + else +appendPQExpBufferStr(&buf, "ASC, 1,2;"); + } + } + else + appendPQExpBufferStr(&buf, "ORDER BY 1,2;"); res = PSQLexec(buf.data); termPQExpBuffer(&buf); diff --git a/src/bin/psql/help.c b/src/bin/psql/help.c index 3e3cab4941..09c1a49413 100644 --- a/src/bin/psql/help.c +++ b/src/bin/psql/help.c @@ -327,7 +327,7 @@ helpVariables(unsigned short int pager) * Windows builds currently print one more line than non-Windows builds. * Using the larger number is fine. */ - output = PageOutput(88, pager ? &(pset.popt.topt) : NULL); + output = PageOutput(90, pager ? &(pset.popt.topt) : NULL); fprintf(output, _("List of specially treated variables\n\n")); @@ -364,6 +364,8 @@ helpVariables(unsigned short int pager) fprintf(output, _(" SINGLESTEP single-step mode (same as -s option)\n")); fprintf(output, _(" USER the currently connected database user\n")); fprintf(output, _(" VERBOSITY controls verbosity of error reports [default, verbose, terse]\n")); + fprintf(output, _(" VERBOSE_SORT controls sort of result in verbose mode\n" + " [schema_name, name_schema, size_desc, size_asc]\n")); fprintf(output, _("\nDisplay
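For reference, a session using the patch as posted would look something like this (illustrative sketch only; note the variable name in the patch is VERBOSE_SORT, not the EXTENDED_DESCRIBE_SORT of the original proposal):

```
\set VERBOSE_SORT size_desc
\dt+        -- tables listed biggest-first
\l+         -- databases listed biggest-first
\set VERBOSE_SORT schema_name
\dt+        -- back to the default schema, name ordering
```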
Re: [HACKERS] Checksums by default?
On Fri, Feb 24, 2017 at 10:30 PM, Bruce Momjian wrote: > On Fri, Feb 24, 2017 at 10:09:50PM +0200, Ants Aasma wrote: >> On Fri, Feb 24, 2017 at 9:37 PM, Bruce Momjian wrote: >> > Oh, that's why we will hopefully eventually change the page checksum >> > algorithm to use the special CRC32 instruction, and set a new checksum >> > version --- got it. I assume there is currently no compile-time way to >> > do this. >> >> Using CRC32 as implemented now for the WAL would be significantly >> slower than what we have now due to instruction latency. Even the best >> theoretical implementation using the CRC32 instruction would still be >> about the same speed as what we have now. I haven't seen anybody >> working on swapping out the current algorithm. And I don't really see >> a reason to, it would introduce a load of headaches for no real gain. > > Uh, I am confused. I thought you said we were leaving some performance > on the table. What is that? I thought CRC32 was SSE4.1. Why is CRC32 > good for the WAL but bad for the page checksums? What about the WAL > page images? The page checksum algorithm was designed to take advantage of CPUs that provide vectorized 32-bit integer multiplication. On x86 this was introduced with the SSE4.1 extensions. This means that by default we can't take advantage of the design. The code is written in a way that compiler auto-vectorization works on it, so only the appropriate compilation flags are needed to compile a version that does use vector instructions. However, to enable it in generic builds, a runtime switch between different levels of vectorization support is needed. This is what is leaving the performance on the table. The page checksum algorithm we have is extremely fast, memcpy fast. Even without vectorization it is right up there with MurmurHash3 and xxHash. With vectorization it's 4x faster. And it works this fast on most modern CPUs, not only Intel. 
The downside is that it only works well for large blocks, and only for a fixed power-of-2 size with the current implementation. WAL page images have the page hole removed, so they can't easily take advantage of this. That said, I haven't really seen either the hardware-accelerated CRC32 calculation or the non-vectorized page checksum take a noticeable amount of time in real-world workloads. The benchmarks presented in this thread seem to corroborate this observation. Regards, Ants Aasma -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
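The runtime switch Ants describes can be sketched in a few lines of C. Everything below is illustrative rather than PostgreSQL's actual checksum code: the mixing loop is a stand-in FNV-style hash with four parallel lanes (the real page checksum's structure differs), and the dispatch relies on GCC/Clang's `__builtin_cpu_supports` builtin plus the `target` function attribute, which lets one function be compiled with SSE4.1 enabled so the auto-vectorizer may use it there while the rest of the build stays generic:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*checksum_fn) (const uint32_t *data, size_t n);

/* Baseline version: compiled with whatever the default -march allows. */
static uint32_t
checksum_generic(const uint32_t *data, size_t n)
{
	uint32_t	s[4] = {0, 0, 0, 0};

	/* four independent lanes so the loop is vectorizable at all */
	for (size_t i = 0; i + 4 <= n; i += 4)
		for (int j = 0; j < 4; j++)
			s[j] = (s[j] ^ data[i + j]) * 0x01000193u;
	return s[0] ^ s[1] ^ s[2] ^ s[3];
}

#if defined(__x86_64__) && defined(__GNUC__)
/* Same loop, but SSE4.1 is enabled for this one function only, so the
 * compiler may emit vector integer multiplies here without requiring
 * -msse4.1 for the whole build. */
__attribute__((target("sse4.1")))
static uint32_t
checksum_sse41(const uint32_t *data, size_t n)
{
	uint32_t	s[4] = {0, 0, 0, 0};

	for (size_t i = 0; i + 4 <= n; i += 4)
		for (int j = 0; j < 4; j++)
			s[j] = (s[j] ^ data[i + j]) * 0x01000193u;
	return s[0] ^ s[1] ^ s[2] ^ s[3];
}
#endif

/* Runtime dispatch, analogous to what is already done for the CRC32C
 * WAL checksum: probe the CPU once and pick an implementation. */
static checksum_fn
choose_checksum(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
	if (__builtin_cpu_supports("sse4.1"))
		return checksum_sse41;
#endif
	return checksum_generic;
}
```

Both versions must of course produce identical results; only the generated code differs, so flipping the runtime switch can never change on-disk checksums.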
Re: [HACKERS] WIP: About CMake v2
On Wed, Feb 8, 2017 at 04:52:19PM -0500, Tom Lane wrote: > Peter Eisentraut writes: > > On 2/8/17 6:21 AM, Yuriy Zhuravlev wrote: > >> Supporting two build systems is not a big deal really. I have been working > >> on this for the past year without any big troubles. > >> Also we have a second Perl build system... > > > The perl/msvc build system pulls in information from the makefiles. So > > when you add a file or something basic like that, you don't have to > > update it. So it's really more like 1.5 build systems. > > Really it's more like 1.1 build systems, in that the MSVC scripts do that > just well enough that you *usually* don't have to think about them. But > then when they fail, and you have to figure out why, it can be a pain. If cmake isn't going to be able to query the Makefiles and adjust to changes we make there, changing our Windows build system from MSVC to cmake takes us from maintaining 1.1 build systems to two build systems, and I don't think anyone wants that. If we go to cmake, I think we need to agree we will eventually _only_ use cmake. I don't think having two build systems we have to maintain is better than 1.1 build systems where we can mostly ignore 0.1 of that. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums by default?
On Fri, Feb 24, 2017 at 10:09:50PM +0200, Ants Aasma wrote: > On Fri, Feb 24, 2017 at 9:37 PM, Bruce Momjian wrote: > > Oh, that's why we will hopefully eventually change the page checksum > > algorithm to use the special CRC32 instruction, and set a new checksum > > version --- got it. I assume there is currently no compile-time way to > > do this. > > Using CRC32 as implemented now for the WAL would be significantly > slower than what we have now due to instruction latency. Even the best > theoretical implementation using the CRC32 instruction would still be > about the same speed as what we have now. I haven't seen anybody > working on swapping out the current algorithm. And I don't really see > a reason to, it would introduce a load of headaches for no real gain. Uh, I am confused. I thought you said we were leaving some performance on the table. What is that? I thought CRC32 was SSE4.1. Why is CRC32 good for the WAL but bad for the page checksums? What about the WAL page images? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] case_preservation_and_insensitivity = on
On 25/02/17 09:02, Jim Nasby wrote: On 2/24/17 12:28 AM, Robert Haas wrote: On Thu, Feb 23, 2017 at 6:59 PM, Tom Lane wrote: I think these are straw-man arguments, really. Consider the actual use case for such a feature: it's for porting some application that was not written against Postgres to begin with. I'm not sure that's totally true. I think at least some requests for this feature are aimed at satisfying somebody's sense of aesthetics. If I had $1 for every time I had to chase someone away from using camelcase I'd be able to sponsor a key at the next conference. And honestly I'd actually like to be able to use camelcase and still get easy-to-read output from \d & co. IOW, this is definitely NOT driven just by porting efforts. I think the only reason we don't hear more requests about it is that people (grudgingly) just muddle on without it. I'd love to be able to successfully use camelcase for things like variable and table names in pg, without having to quote everything - but I never felt it worthwhile to ask for it. Cheers, Gavin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] Temporal query processing with range types
On 2/24/17 6:40 AM, Peter Moser wrote: Do you think it is better to remove the syntax for ranges expressed in different columns? It's not that hard to construct a range type on-the-fly from 2 columns, so (without having looked at the patch or really followed the thread) I would think the answer is yes. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea_output output of base64
On 2/24/17 7:44 AM, Kenneth Marshall wrote: Like David suggests, if you want compact, run it through lz4/gzip/lzop... for a much better size return. Speaking of which: any bytea where you care about this is likely to live in an already-compressed state in TOAST. ISTM it would be valuable if we had a way to just spit out the raw compressed data (or a text-safe version of that), at least for COPY's purposes... -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FYI: git worktrees as replacement for "rsync the CVSROOT"
On 2/24/17 10:24 AM, Tom Lane wrote: Andrew Dunstan writes: On 02/24/2017 02:36 AM, Craig Ringer wrote: On 16 January 2017 at 05:01, Jim Nasby wrote: git worktree add ../9.6 REL9_6_STABLE Does this do anything different from the git contrib script git-new-workdir that I have been using for quite a long while? I think it's basically a more formally supported version of the contrib script. They may have fixed some of the hackier aspects of the contrib script --- I mind in particular the fact that you need to disable git's auto-gc activity when you use git-new-workdir, but I don't see any such restriction in the git-worktree man page. Haven't tried to switch over myself, but maybe I will at some point. One thing to be aware of that I discovered: you may not have 2 checkouts of the same branch, something that is possible with what's currently documented on the wiki. Since the only pain in the wiki workflow is setting up a new branch (which I've scripted, attached) I've pretty much given up on using worktrees. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) #!/bin/sh if [ $# -ne 1 ]; then echo Error exit 1 fi branch=REL`echo $1 | tr . _`_STABLE mkdir i/$1 git clone --reference postgresql.git -b $branch git://git.postgresql.org/git/postgresql.git $1 cd $1 ln -s ../i/$1 i cd .git/info ln -sf ../../../git-info-exclude exclude -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums by default?
On Fri, Feb 24, 2017 at 9:49 PM, Jim Nasby wrote: > On 2/24/17 12:30 PM, Tomas Vondra wrote: >> >> In any case, we can't just build x86-64 packages with compile-time >> SSE4.1 checks. > > > Dumb question... since we're already discussing llvm for the executor, would > that potentially be an option here? AIUI that also opens the possibility of > using the GPU as well. Just transferring the block to the GPU would be slower than what we have now. Theoretically LLVM could be used to JIT the checksum calculation, but just precompiling a couple of versions and switching between them at runtime would be simpler and would give the same speedup. Regards, Ants Aasma -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GUC for cleanup indexes threshold.
On 2/24/17 11:26 AM, Robert Haas wrote: I think we need to come up with some set of tests to figure out what actually works well in practice here. Theories are a good starting point, but good vacuum behavior is really important, and a patch that changes it ought to be backed up by at least some experimental evidence. I think something else worth considering is that if we had some method of mapping heap TIDs back to indexes then a lot (all?) of these problems would go away. 10+ years ago the idea of keeping such a mapping would probably have been untenable, but with resource forks and how much cheaper storage is, maybe that's no longer the case. For btree I think this could be done by keeping a second btree ordered by ctid that points either to index entries or even just to whole index pages. At ~20 bytes per entry, even a 1B-row index would take ~20GB. Page splits are obviously a big issue. Maybe it's safe to update the ctid map for every item that gets moved when a split happens. Would a ctid map work for other indexes as well? -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums by default?
On Fri, Feb 24, 2017 at 9:37 PM, Bruce Momjian wrote: > Oh, that's why we will hopefully eventually change the page checksum > algorithm to use the special CRC32 instruction, and set a new checksum > version --- got it. I assume there is currently no compile-time way to > do this. Using CRC32 as implemented now for the WAL would be significantly slower than what we have now due to instruction latency. Even the best theoretical implementation using the CRC32 instruction would still be about the same speed as what we have now. I haven't seen anybody working on swapping out the current algorithm. And I don't really see a reason to, it would introduce a load of headaches for no real gain. Regards, Ants Aasma -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums by default?
On Fri, Feb 24, 2017 at 10:02 PM, Bruce Momjian wrote: > Uh, as far as I know, the best you are going to get from llvm is > standard assembly, while the SSE4.1 instructions use special assembly > instructions, so they would be faster, and in a way they are a GPU built > into CPUs. Both LLVM and GCC are capable of compiling the code that we have to a vectorized loop using SSE4.1 or AVX2 instructions given the proper compilation flags. This is exactly what was giving the speedup in the test I showed in my e-mail. Regards, Ants Aasma -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
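For the curious, "the proper compilation flags" here means something along these lines (illustrative command lines only; the file name checksum.c is a placeholder):

```
# let the auto-vectorizer use SSE4.1 / AVX2 for the whole file
gcc -O3 -msse4.1 -c checksum.c -o checksum_sse41.o
gcc -O3 -mavx2   -c checksum.c -o checksum_avx2.o
```

The catch, as discussed upthread, is that a binary built this way will fault on CPUs lacking those extensions, which is why a runtime dispatch between precompiled variants is the safer packaging approach.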
Re: [HACKERS] case_preservation_and_insensitivity = on
On 2/24/17 12:28 AM, Robert Haas wrote: On Thu, Feb 23, 2017 at 6:59 PM, Tom Lane wrote: I think these are straw-man arguments, really. Consider the actual use case for such a feature: it's for porting some application that was not written against Postgres to begin with. I'm not sure that's totally true. I think at least some requests for this feature are aimed at satisfying somebody's sense of aesthetics. If I had $1 for every time I had to chase someone away from using camelcase I'd be able to sponsor a key at the next conference. And honestly I'd actually like to be able to use camelcase and still get easy-to-read output from \d & co. IOW, this is definitely NOT driven just by porting efforts. I think the only reason we don't hear more requests about it is that people (grudgingly) just muddle on without it. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums by default?
On Fri, Feb 24, 2017 at 01:49:07PM -0600, Jim Nasby wrote: > On 2/24/17 12:30 PM, Tomas Vondra wrote: > >In any case, we can't just build x86-64 packages with compile-time > >SSE4.1 checks. > > Dumb question... since we're already discussing llvm for the executor, would > that potentially be an option here? AIUI that also opens the possibility of > using the GPU as well. Uh, as far as I know, the best you are going to get from llvm is standard assembly, while the SSE4.1 instructions use special assembly instructions, so they would be faster, and in a way they are a GPU built into CPUs. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Poor memory context performance in large hash joins
On Fri, Feb 24, 2017 at 10:59 AM, Tom Lane wrote: > Jeff Janes writes: > > On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane wrote: > >> Maybe it's time to convert that to a doubly-linked list. > > > I don't think that would help. You would need some heuristic to guess > > whether the chunk you are looking for is near the front, or near the end. > > Uh, what? In a doubly-linked list, you can remove an element in O(1) > time, you don't need any searching. It basically becomes > item->prev->next = item->next; > item->next->prev = item->prev; > modulo possible special cases for the head and tail elements. > Currently it is walking the chain to identify which block holds the chunk in the first place, not just to get the pointer to the previous block. But I see that that could be fixed by pointer arithmetic once there is a reason to fix it. Which there isn't a reason to as long as you need to walk the chain to get the prev pointer anyway. Like this? targetblock = (AllocBlock) (((char *) chunk) - ALLOC_BLOCKHDRSZ); > >> Although if the > >> hash code is producing a whole lot of requests that are only a bit bigger > >> than the separate-block threshold, I'd say It's Doing It Wrong. It should > >> learn to aggregate them into larger requests. > > > Right now it is using compiled-in 32KB chunks. Should it use something > > like max(32kb,work_mem/128) instead? > > I'd say it should double the size of the request each time. That's what > we do in most places. > I thought the design goal there was that the space in old chunks could get re-used into the new chunks in a reasonably fine-grained way. If the last chunk contains just over half of all the data, that couldn't happen. Cheers, Jeff
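Putting the two ideas together (a doubly linked block list plus the pointer arithmetic above) gives a minimal self-contained sketch like the one below. All names here (BigBlock, big_alloc, big_free) are invented for illustration; PostgreSQL's real aset.c block header and chunk layout carry more fields:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal model of an allocator that keeps oversized allocations as
 * dedicated blocks on a doubly linked list.  The header sits immediately
 * before the user-visible chunk, so free can recover it with pointer
 * arithmetic and unlink the block in O(1), with no list walk. */
typedef struct BigBlock
{
	struct BigBlock *prev;
	struct BigBlock *next;
	size_t		size;
} BigBlock;

#define BLOCKHDRSZ	sizeof(BigBlock)

typedef struct
{
	BigBlock   *head;
} BlockList;

static void *
big_alloc(BlockList *list, size_t size)
{
	BigBlock   *blk = malloc(BLOCKHDRSZ + size);

	if (blk == NULL)
		return NULL;
	blk->size = size;
	blk->prev = NULL;
	blk->next = list->head;
	if (list->head)
		list->head->prev = blk;
	list->head = blk;
	return (char *) blk + BLOCKHDRSZ;	/* user-visible chunk */
}

static void
big_free(BlockList *list, void *chunk)
{
	/* Recover the header, as in Jeff's targetblock computation. */
	BigBlock   *blk = (BigBlock *) ((char *) chunk - BLOCKHDRSZ);

	/* O(1) unlink, with the head special case Tom mentions. */
	if (blk->prev)
		blk->prev->next = blk->next;
	else
		list->head = blk->next;
	if (blk->next)
		blk->next->prev = blk->prev;
	free(blk);
}
```

With this layout, freeing an oversized chunk touches only the block's own header and its two neighbors, so the cost no longer grows with the number of blocks on the list — exactly the N^2 walk the hash-join case was hitting.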
Re: [HACKERS] case_preservation_and_insensitivity = on
On 2/24/17 11:34 AM, Joel Jacobson wrote: SELECT SomeCol, OtherCol, FooCol, BarCol, MyCol, ExtraCol, LastCol INTO _SomeCol, _OtherCol, _FooCol, _BarCol, _MyCol, _ExtraCol, _LastCol FROM Foo WHERE Bar = 'Baz'; This is to avoid typos that are then visually easy to spot, thanks to all chars being aligned. Why not just use a record or the table composite? I'll commonly do stuff like: DECLARE r record BEGIN SELECT INTO STRICT r blah, foo, bar, baz FROM pirate ; IF r.blah THEN RAISE 'Yaaar!' END IF; ... (Well, to be honest I always try to write pirate apps in plR... ;P) -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] btree_gin and btree_gist for enums
On 02/24/2017 11:02 AM, Tom Lane wrote: > Andrew Dunstan writes: >> On 02/23/2017 04:41 PM, Tom Lane wrote: >>> The reason this is kind of scary is that it's just blithely assuming >>> that the function won't look at the *other* fields of the FmgrInfo. >>> If it did, it would likely get very confused, since those fields >>> would be describing the GIN support function, not the function we're >>> calling. >>> >>> We could alternatively have this trampoline function set up a fresh, local >>> FmgrInfo struct that it zeroes except for copying fn_extra and fn_mcxt >>> from the caller's struct, and then it copies fn_extra back again on the >>> way out. That's a few more cycles but it would be safer, I think; if the >>> function tried to look at the other fields such as fn_oid it would see >>> obviously bogus data. >> Do we want one or both of these? I'm prepared to code up a patch to >> fmgr.[ch] to provide them. > On reflection I'm not sure that the double-copy approach is all that much > safer than just passing down the caller's flinfo pointer. Most of the > time it would be better, but suppose that the callee updates fn_extra > and then throws elog(ERROR) --- the outcome would be different, probably > creating a leak in fn_mcxt. Maybe this would still be okay, because > perhaps that FmgrInfo is never used again, but I don't think we can assume > that for the case at hand. > > At this point I'd be inclined to just document that the called function > should only use fn_extra/fn_mcxt. Fair enough. Will do it that way. > >> I don't know what to call it either. In my test I used >> CallerContextFunctionCall2 - not sure if that's quite right, but should >> be close. > CallerInfo? CallerFInfo? Or we could spell out CallerFmgrInfo but > that seems a bit verbose. > > I'll go with CallerFInfoFunctionCall2 etc. In the btree_gist system the calls to the routines like enum_cmp are buried about three levels deep. 
I'm thinking I'll just pass the flinfo down the stack and just call these routines at the bottom level. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums by default?
On 2/24/17 12:30 PM, Tomas Vondra wrote: In any case, we can't just build x86-64 packages with compile-time SSE4.1 checks. Dumb question... since we're already discussing llvm for the executor, would that potentially be an option here? AIUI that also opens the possibility of using the GPU as well. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums by default?
On Fri, Feb 24, 2017 at 08:31:09PM +0200, Ants Aasma wrote: > >> We looked at that when picking the algorithm. At that point it seemed > >> that CRC CPU instructions were not universal enough to rely on them. > >> The algorithm we ended up on was designed to be fast on SIMD hardware. > >> Unfortunately on x86-64 that required SSE4.1 integer instructions, so > >> with default compiles there is a lot of performance left on the table. A > >> low hanging fruit would be to do CPU detection like the CRC case and > >> enable a SSE4.1 optimized variant when those instructions are > >> available. IIRC it was actually a lot faster than the naive hardware > >> CRC that is used for WAL and about on par with interleaved CRC. > > > > Uh, I thought we already did compile-time testing for SSE4.1 and used them > > if present. Why do you say "with default compiles there is a lot of > > performance left on the table?" > > Compile-time checks don't help because the compiled binary could be > run on a different host that does not have SSE4.1 (as extremely > unlikely as that is at this point in time). A runtime check is done for Right. > WAL checksums that use a special CRC32 instruction. Block checksums > predate that and use a different algorithm that was picked because it > could be accelerated with vectorized execution on non-Intel > architectures. We just never got around to adding runtime checks for > the architecture to enable this speedup. Oh, that's why we will hopefully eventually change the page checksum algorithm to use the special CRC32 instruction, and set a new checksum version --- got it. I assume there is currently no compile-time way to do this. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Poor memory context performance in large hash joins
Jeff Janes writes: > On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane wrote: >> Maybe it's time to convert that to a doubly-linked list. > I don't think that would help. You would need some heuristic to guess > whether the chunk you are looking for is near the front, or near the end. Uh, what? In a doubly-linked list, you can remove an element in O(1) time, you don't need any searching. It basically becomes item->prev->next = item->next; item->next->prev = item->prev; modulo possible special cases for the head and tail elements. >> Although if the >> hash code is producing a whole lot of requests that are only a bit bigger >> than the separate-block threshold, I'd say It's Doing It Wrong. It should >> learn to aggregate them into larger requests. > Right now it is using compiled-in 32KB chunks. Should it use something > like max(32kb,work_mem/128) instead? I'd say it should double the size of the request each time. That's what we do in most places. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Poor memory context performance in large hash joins
On Thu, Feb 23, 2017 at 10:47 PM, Andres Freund wrote: > On 2017-02-23 17:28:26 -0500, Tom Lane wrote: > > Jeff Janes writes: > > > The number of new chunks can be almost as large as the number of old > > > chunks, especially if there is a very popular value. The problem is that > > > every time an old chunk is freed, the code in aset.c around line 968 has to > > > walk over all the newly allocated chunks in the linked list before it can > > > find the old one being freed. This is an N^2 operation, and I think it has > > > horrible CPU cache hit rates as well. > > > > Maybe it's time to convert that to a doubly-linked list. > > Yes, I do think so. Given that we only have that for full blocks, not > for small chunks, the cost seems negligible. > > That would also, partially, address the performance issue > http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com > addresses, in a more realistically backpatchable manner. > > Jeff, do you have a handy demonstrator? > Not exactly "handy", as it takes about 45 minutes to set up and uses >50GB of disk and 16GB of RAM, but here is a demonstration. create table foobar as select CASE when random()<(420.0/540.0) then 1 else floor(random()*88)::int END as titleid from generate_series(1,54000); create table foobar2 as select distinct titleid from foobar ; analyze; set work_mem TO "13GB"; select count(*) from foobar2 where not exists (select 1 from foobar t where t.titleid=foobar2.titleid); This will run for effectively forever, unless it gets killed/dies due to OOM first. If I have other things consuming some of my 16GB of RAM, then it gets OOM (which is not a complaint: it is as one should expect given that I told it work_mem was 13GB). If I have no other demands on RAM, then I can't tell if it would eventually OOM or not because of how long it would take to get that far. 
This is inspired by the thread https://www.postgresql.org/message-id/flat/CACw4T0p4Lzd6VpwptxgPgoTMh2dEKTQBGu7NTaJ1%2BA0PRx1BGg%40mail.gmail.com#cacw4t0p4lzd6vpwptxgpgotmh2dektqbgu7ntaj1+a0prx1...@mail.gmail.com

I ran into this while evaluating Tom's responding patch, but you don't need to apply that patch to run this example and see the effect.

I don't have an example of a case that demonstrates the problem in the absence of a degenerate hash bucket. I think there should be non-degenerate cases that trigger it, but I haven't been able to produce one yet.

>> Although if the hash code is producing a whole lot of requests that are
>> only a bit bigger than the separate-block threshold, I'd say It's Doing
>> It Wrong. It should learn to aggregate them into larger requests.
>
> That's probably right, but we can't really address that in the
> back-branches. And to me this sounds like something we should address
> in the branches, not just in master. Even if we'd also fix the
> hash-aggregation logic, I think such an O(n^2) behaviour in the
> allocator is a bad idea in general, and we should fix it anyway.

I don't know how important a back-patch would be. This is a toy case for me. Presumably not for David Hinkle, except he doesn't want a hash join in the first place, while I want one that finishes in a reasonable time. It might depend on what happens to Tom's OOM patch. It would be great if the allocator was made bullet-proof, but I don't think adding reverse links (or anything else back-patchable) is going to be enough to do that.

Cheers,

Jeff
Re: [HACKERS] Checksums by default?
On Fri, Feb 24, 2017 at 7:47 PM, Bruce Momjian wrote:
> On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
>>> It might be worth looking into using the CRC CPU instruction to reduce this
>>> overhead, like we do for the WAL checksums. Since that is a different
>>> algorithm it would be a compatibility break and we would need to support the
>>> old algorithm for upgraded clusters..
>>
>> We looked at that when picking the algorithm. At that point it seemed
>> that CRC CPU instructions were not universal enough to rely on them.
>> The algorithm we ended up on was designed to be fast on SIMD hardware.
>> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
>> with default compiles there is a lot of performance left on the table. A
>> low-hanging fruit would be to do CPU detection like the CRC case and
>> enable an SSE4.1-optimized variant when those instructions are
>> available. IIRC it was actually a lot faster than the naive hardware
>> CRC that is used for WAL and about on par with interleaved CRC.
>
> Uh, I thought it already did compile-time testing for SSE4.1 and used them
> if present. Why do you say "with default compiles there is a lot of
> performance left on the table"?

Compile-time checks don't help because the compiled binary could be run on a different host that does not have SSE4.1 (as extremely unlikely as that is at this point in time). A runtime check is done for WAL checksums, which use a special CRC32 instruction. Block checksums predate that and use a different algorithm that was picked because it could be accelerated with vectorized execution on non-Intel architectures. We just never got around to adding runtime checks for the architecture to enable this speedup.

The attached test runs 1M iterations of the checksum about 3x faster when compiled with SSE4.1 and vectorization, 4x if AVX2 is added into the mix.
Test: gcc $CFLAGS -Isrc/include -DN=100 testchecksum.c -o testchecksum && time ./testchecksum

Results:
CFLAGS="-O2": 2.364s
CFLAGS="-O2 -msse4.1 -ftree-vectorize": 0.752s
CFLAGS="-O2 -mavx2 -ftree-vectorize": 0.552s

That 0.552s is 15GB/s per core on a 3-year-old laptop.

Regards,
Ants Aasma

#include "postgres.h"
#include "storage/checksum_impl.h"

int
main(void)
{
	char	page[8192] = {0};
	uint32	i, sum = 0;

	for (i = 0; i < N; i++)
		sum ^= pg_checksum_page(page, i);

	return 0;
}
Re: [HACKERS] Checksums by default?
On 02/24/2017 06:47 PM, Bruce Momjian wrote:
> On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
>>> It might be worth looking into using the CRC CPU instruction to reduce
>>> this overhead, like we do for the WAL checksums. Since that is a
>>> different algorithm it would be a compatibility break and we would need
>>> to support the old algorithm for upgraded clusters..
>>
>> We looked at that when picking the algorithm. At that point it seemed
>> that CRC CPU instructions were not universal enough to rely on them.
>> The algorithm we ended up on was designed to be fast on SIMD hardware.
>> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
>> with default compiles there is a lot of performance left on the table. A
>> low-hanging fruit would be to do CPU detection like the CRC case and
>> enable an SSE4.1-optimized variant when those instructions are
>> available. IIRC it was actually a lot faster than the naive hardware
>> CRC that is used for WAL and about on par with interleaved CRC.
>
> Uh, I thought it already did compile-time testing for SSE4.1 and used them
> if present. Why do you say "with default compiles there is a lot of
> performance left on the table"?

Compile-time is not enough - we build binary packages that may then be installed on machines without the SSE4.1 instructions available.

On Intel this may not be a huge issue - the first microarchitecture with SSE4.1 was "Nehalem", announced in 2007, so we're only left with very old boxes based on "Intel Core" (and perhaps the even older P6). On AMD, it's a bit worse - the first microarchitecture with SSE4.1 was Bulldozer (late 2011). So quite a few CPUs out there, even if most people use Intel.

In any case, we can't just build x86-64 packages with compile-time SSE4.1 checks.
regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Poor memory context performance in large hash joins
On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane wrote:
> Jeff Janes writes:
>> The number of new chunks can be almost as large as the number of old
>> chunks, especially if there is a very popular value. The problem is that
>> every time an old chunk is freed, the code in aset.c around line 968 has to
>> walk over all the newly allocated chunks in the linked list before it can
>> find the old one being freed. This is an N^2 operation, and I think it has
>> horrible CPU cache hit rates as well.
>
> Maybe it's time to convert that to a doubly-linked list.

I don't think that would help. You would need some heuristic to guess whether the chunk you are looking for is near the front, or near the end. And in this case, the desired chunk starts out at the front, and then keeps moving down the list with each iteration as new things are added to the front, until near the end of ExecHashIncreaseNumBatches it is freeing them from near the end. But in between, it is freeing them from the middle, so I don't think a doubly-linked list would change it from N^2, just lower the constant, even if you always knew which end to start at. Or am I misunderstanding what the implications of a doubly-linked list are?

What would really help here is if it remembered the next pointer of the just-freed chunk, and started the scan from that location the next time, cycling around to the head pointer if it doesn't find anything. But I don't think that is a very general solution.

Or, if you could pass a flag when creating the context telling it whether memory will be freed mostly-LIFO or mostly-FIFO, and have it use a stack or a queue accordingly.

> Although if the hash code is producing a whole lot of requests that are
> only a bit bigger than the separate-block threshold, I'd say It's Doing
> It Wrong. It should learn to aggregate them into larger requests.

Right now it is using compiled-in 32KB chunks. Should it use something like max(32KB, work_mem/128) instead?

Cheers,

Jeff
Re: [HACKERS] Checksums by default?
On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
>> It might be worth looking into using the CRC CPU instruction to reduce this
>> overhead, like we do for the WAL checksums. Since that is a different
>> algorithm it would be a compatibility break and we would need to support the
>> old algorithm for upgraded clusters..
>
> We looked at that when picking the algorithm. At that point it seemed
> that CRC CPU instructions were not universal enough to rely on them.
> The algorithm we ended up on was designed to be fast on SIMD hardware.
> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
> with default compiles there is a lot of performance left on the table. A
> low-hanging fruit would be to do CPU detection like the CRC case and
> enable an SSE4.1-optimized variant when those instructions are
> available. IIRC it was actually a lot faster than the naive hardware
> CRC that is used for WAL and about on par with interleaved CRC.

Uh, I thought it already did compile-time testing for SSE4.1 and used them if present. Why do you say "with default compiles there is a lot of performance left on the table"?

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Re: [HACKERS] case_preservation_and_insensitivity = on
On Thu, Feb 23, 2017 at 8:04 AM, Robert Haas wrote:
> It doesn't sound like a good solution to me, because there can be SQL
> code inside stored procedures that clients never see.

In our code base, we use CamelCase in all PL/pgSQL functions, both for columns and variables, e.g.

SELECT UserID INTO _UserID FROM Users WHERE Username = 'foo';

Here, it's not a problem that the column name is e.g. "userid", since the case-insensitive feature makes it work.

What type of case problem do you foresee for stored procedures? I've only experienced the case-folding to be a problem outside of SPs, since the casing *is* preserved in the PL/pgSQL source code (since it's stored as-is, without any modifications).

What *would* be a problem, though, is if in a future PL/pgSQL 3, a PL/pgSQL query like

SELECT UserID FROM Users WHERE Username = 'foo';

would automatically export the column "UserID" to the current scope as a PL/pgSQL 3 variable named "userid", since then you would actually want the value of the userid column to be exported to a variable named "UserID".

Such a feature would be nice, since a very common code pattern in PL/pgSQL is to just have lots of meaningless identical lists of columns and then an identical list of variables with the same names as the columns. When the list is short, it's not a problem, but when selecting lots of columns, it gets ugly. What I usually end up with is to align the columns and variables on two rows, e.g.:

SELECT SomeCol, OtherCol, FooCol, BarCol, MyCol, ExtraCol, LastCol
INTO  _SomeCol, _OtherCol, _FooCol, _BarCol, _MyCol, _ExtraCol, _LastCol
FROM Foo WHERE Bar = 'Baz';

This is to avoid typos that are then visually easy to spot, thanks to all chars being aligned.
Imagine if, thanks to case-preservation, you could simply do:

SELECT SomeCol, OtherCol, FooCol, BarCol, MyCol, ExtraCol, LastCol
FROM Foo WHERE Bar = 'Baz';

and all the columns would be exported to the variables SomeCol, OtherCol, FooCol, BarCol, MyCol, ExtraCol, LastCol, instead of somecol, othercol, foocol, barcol, mycol, extracol, lastcol. This would be a huge win in avoiding unnecessary code repetition.

Then of course, if you want a column Foo to instead be exported to Bar, you simply do "SELECT Foo AS Bar".

Thoughts?
Re: [HACKERS] GUC for cleanup indexes threshold.
On Fri, Feb 24, 2017 at 8:49 AM, Amit Kapila wrote:
>> IIUC, I think that we need to have the number of half-dead pages in the meta
>> page.
>
> Don't you think we need to consider backward compatibility if we want
> to do that?

Yeah, that would be an on-disk format break.

I think this thread is pretty short on evidence that would let us make a smart decision about what to do here. I see three possibilities. The first is that this patch is a good idea whether we do something about the issue of half-dead pages or not. The second is that this patch is a good idea if we do something about the issue of half-dead pages but a bad idea if we don't. The third is that this patch is a bad idea whether or not we do anything about the issue of half-dead pages.

Unfortunately, we have no evidence at all that would let us figure out which of those three things is true. The original post didn't include any relevant benchmarks or test results. Simon's reply, which raised the issue of half-dead pages, didn't include any benchmarks or test results either. In fact, in neither place were any tests suggested, even hypothetically, which would help us decide what to do. I had a hunch when I saw this thread that it was a good idea, and Simon has a hunch that this btree page recycling thing needs to be fixed first, and he might be right. Or we might both be wrong. Or ... who knows, really?

I think we need to come up with some set of tests to figure out what actually works well in practice here. Theories are a good starting point, but good vacuum behavior is really important, and a patch that changes it ought to be backed up by at least some experimental evidence.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Make subquery alias optional in FROM clause
On Fri, Feb 24, 2017 at 9:35 AM, David Fetter wrote:
>> => SELECT "?column"? FROM (select 1+1 as "?column?", 1+1) AS x;
>> ERROR: 42703: column "?column" does not exist
>> LINE 2: SELECT "?column"? FROM (select 1+1 as "?column?", 1+1) AS x;
>> ^
>> HINT: Perhaps you meant to reference the column "x.?column?" or the
>> column "x.?column?".

This is indirectly pointing out the duplication, since the hint is specifying the exact same name twice...

I don't know how far comparing apples and oranges gets us here... and the assignment of names to expression columns lacking aliases is a bit smarter than given credit for here - e.g., it uses the name of the function in a simple function call expression. There is no risk of naming conflicts in pre-existing queries.

I say we do something like pg_subquery_n, and make it known that the value for "n" will be chosen independent of names already present in the query. We've recently reserved the pg_ prefix for roles; we might as well leverage that. These names need only be available for internal needs; as a user I'd expect it to be noted as an implementation detail that should not be relied upon. Whether it needs to get exposed for technical reasons (e.g., dump/restore and explain) I do not know.

David J.
Re: [HACKERS] Hash support for grouping sets
> "Thom" == Thom Brownwrites: Thom> This doesn't apply cleanly to latest master. Could you please Thom> post a rebased patch? Sure. -- Andrew (irc:RhodiumToad) diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index c9e0a3e..480a07e 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -996,6 +996,10 @@ ExplainNode(PlanState *planstate, List *ancestors, pname = "HashAggregate"; strategy = "Hashed"; break; + case AGG_MIXED: + pname = "MixedAggregate"; + strategy = "Mixed"; + break; default: pname = "Aggregate ???"; strategy = "???"; @@ -1924,6 +1928,19 @@ show_grouping_set_keys(PlanState *planstate, ListCell *lc; List *gsets = aggnode->groupingSets; AttrNumber *keycols = aggnode->grpColIdx; + const char *keyname; + const char *keysetname; + + if (aggnode->aggstrategy == AGG_HASHED || aggnode->aggstrategy == AGG_MIXED) + { + keyname = "Hash Key"; + keysetname = "Hash Keys"; + } + else + { + keyname = "Group Key"; + keysetname = "Group Keys"; + } ExplainOpenGroup("Grouping Set", NULL, true, es); @@ -1938,7 +1955,7 @@ show_grouping_set_keys(PlanState *planstate, es->indent++; } - ExplainOpenGroup("Group Keys", "Group Keys", false, es); + ExplainOpenGroup(keysetname, keysetname, false, es); foreach(lc, gsets) { @@ -1962,12 +1979,12 @@ show_grouping_set_keys(PlanState *planstate, } if (!result && es->format == EXPLAIN_FORMAT_TEXT) - ExplainPropertyText("Group Key", "()", es); + ExplainPropertyText(keyname, "()", es); else - ExplainPropertyListNested("Group Key", result, es); + ExplainPropertyListNested(keyname, result, es); } - ExplainCloseGroup("Group Keys", "Group Keys", false, es); + ExplainCloseGroup(keysetname, keysetname, false, es); if (sortnode && es->format == EXPLAIN_FORMAT_TEXT) es->indent--; diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c index aa08152..f059560 100644 --- a/src/backend/executor/nodeAgg.c +++ b/src/backend/executor/nodeAgg.c @@ -122,12 +122,19 @@ * specific). 
* * Where more complex grouping sets are used, we break them down into - * "phases", where each phase has a different sort order. During each - * phase but the last, the input tuples are additionally stored in a - * tuplesort which is keyed to the next phase's sort order; during each - * phase but the first, the input tuples are drawn from the previously - * sorted data. (The sorting of the data for the first phase is handled by - * the planner, as it might be satisfied by underlying nodes.) + * "phases", where each phase has a different sort order (except phase 0 + * which is reserved for hashing). During each phase but the last, the + * input tuples are additionally stored in a tuplesort which is keyed to the + * next phase's sort order; during each phase but the first, the input + * tuples are drawn from the previously sorted data. (The sorting of the + * data for the first phase is handled by the planner, as it might be + * satisfied by underlying nodes.) + * + * Hashing can be mixed with sorted grouping. To do this, we have an + * AGG_MIXED strategy that populates the hashtables during the first sorted + * phase, and switches to reading them out after completing all sort phases. + * We can also support AGG_HASHED with multiple hash tables and no sorting + * at all. * * From the perspective of aggregate transition and final functions, the * only issue regarding grouping sets is this: a single call site (flinfo) @@ -139,8 +146,6 @@ * sensitive to the grouping set for which the aggregate function is * currently being called. * - * TODO: AGG_HASHED doesn't support multiple grouping sets yet. 
- * * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * @@ -432,6 +437,7 @@ typedef struct AggStatePerGroupData */ typedef struct AggStatePerPhaseData { + AggStrategy aggstrategy; /* strategy for this phase */ int numsets; /* number of grouping sets (or 0) */ int *gset_lengths; /* lengths of grouping sets */ Bitmapset **grouped_cols; /* column groupings for rollup */ @@ -440,7 +446,31 @@ typedef struct AggStatePerPhaseData Sort *sortnode; /* Sort node for input ordering for phase */ } AggStatePerPhaseData; +/* + * AggStatePerHashData - per-hashtable state + * + * When doing grouping sets with hashing, we have one of these for each + * grouping set. (When doing hashing without grouping sets, we have just one of + * them.) + */ + +typedef struct AggStatePerHashData +{ + TupleHashTable hashtable; /* hash table with one entry per group */ + TupleHashIterator hashiter; /* for iterating through hash table */ + TupleTableSlot *hashslot; /* slot for loading hash table */ + FmgrInfo *hashfunctions; /*
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Feb 24, 2017 at 02:14:23PM +0530, Pavan Deolasee wrote:
> On Thu, Feb 23, 2017 at 11:53 PM, Bruce Momjian wrote:
>> On Thu, Feb 23, 2017 at 03:03:39PM -0300, Alvaro Herrera wrote:
>>> Bruce Momjian wrote:
>>>> As I remember, WARM only allows a single index-column change in the
>>>> chain. Why are you seeing such a large performance improvement? I
>>>> would have thought it would be that high if we allowed an unlimited
>>>> number of index changes in the chain.
>>>
>>> The second update in a chain creates another non-warm-updated tuple, so
>>> the third update can be a warm update again, and so on.
>>
>> Right, before this patch they would be two independent HOT chains. It
>> still seems like an unexpectedly-high performance win. Are two
>> independent HOT chains that much more expensive than joining them via
>> WARM?
>
> In these tests, there are zero HOT updates, since every update modifies some
> index column. With WARM, we could reduce regular updates to half, even when
> we allow only one WARM update per chain (the chain really has a single tuple
> for this discussion). IOW approximately half the updates insert a new index
> entry in *every* index, and half insert a new index entry *only* in the
> affected index. That itself does a good bit for performance.
>
> So to answer your question: yes, joining two HOT chains via WARM is much
> cheaper because it results in creating new index entries just for affected
> indexes.

OK, all my questions have been answered, including the use of flag bits.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)
On Fri, Feb 24, 2017 at 1:24 PM, Robert Haas wrote:
> On Fri, Feb 24, 2017 at 7:29 PM, Kohei KaiGai wrote:
>> This attached patch re-designed the previous v2 according to Robert's
>> comment. All I had to do is put the entrypoint for ForeignScan/CustomScan
>> at ExecShutdownNode, because it is already modified to call the child node
>> first, earlier than ExecShutdownGather, which releases the DSM segment.
>
> Aside from the documentation, which needs some work, this looks fine
> to me on a quick read.

LGTM too. You may want to clarify in the docs when the hook is called in relation to other hooks too.
Re: [HACKERS] Make subquery alias optional in FROM clause
On Thu, Feb 23, 2017 at 01:27:29PM +0000, Greg Stark wrote:
> On 22 February 2017 at 15:08, Tom Lane wrote:
>> Indeed. When I wrote the comment you're referring to, quite a few years
>> ago now, I thought that popular demand might force us to allow omitted
>> aliases. But the demand never materialized. At this point it seems
>> clear to me that there isn't really good reason to exceed the spec here.
>> It just encourages people to write unportable SQL code.
>
> Oh my. This bothers me all the time. I always assumed the reason it
> was like this was because the grammar would be ambiguous without it
> and it would require extreme measures to hack the grammar to work. If
> it's this easy I would totally be for it.
>
> Offhand I think there are plenty of solutions for the problem of
> inventing names and I suspect any of them would work fine:
>
> 1) Don't assign a name -- I would guess this would require some
> adjustments in the rule deparsing (i.e. views).
>
> 2) Assign a name but add a flag indicating the name is autogenerated
> and shouldn't be used for resolving references and shouldn't be
> dumped. Then it shouldn't really matter if there's a conflict since
> the name is only used for things like error messages, not resolving
> references.
>
> 3) Thumb through all the names in the query and pick one that doesn't
> conflict.
>
> For what it's worth, while it wouldn't be a *bad* thing to avoid
> conflicts, I think this is being held to an inconsistent standard here.
> It's not like there aren't similar situations elsewhere in the
> codebase where we just don't worry about this kind of thing:
>
> => SELECT "?column"? FROM (select 1+1 as "?column?", 1+1) AS x;
> ERROR: 42703: column "?column" does not exist
> LINE 2: SELECT "?column"? FROM (select 1+1 as "?column?", 1+1) AS x;
> ^
> HINT: Perhaps you meant to reference the column "x.?column?" or the
> column "x.?column?".
That's because you transposed the two characters after "column" in your target list:

XX SELECT "?column"? FROM (select 1+1 as "?column?", 1+1) AS x;
   SELECT "?column?" FROM (select 1+1 as "?column?", 1+1) AS x;

This is what you get when you do the second, which I'm assuming is what you meant to do:

ERROR: column reference "?column?" is ambiguous
LINE 1: SELECT "?column?" FROM (select 1+1 as "?column?", 1+1) AS x;

Best,

David.

--
David Fetter http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] FYI: git worktrees as replacement for "rsync the CVSROOT"
Andrew Dunstan writes:
> On 02/24/2017 02:36 AM, Craig Ringer wrote:
>> On 16 January 2017 at 05:01, Jim Nasby wrote:
>>> git worktree add ../9.6 REL9_6_STABLE
> Does this do anything different from the git contrib script
> git-new-workdir that I have been using for quite a long while?

I think it's basically a more formally supported version of the contrib script. They may have fixed some of the hackier aspects of the contrib script --- I mind in particular the fact that you need to disable git's auto-gc activity when you use git-new-workdir, but I don't see any such restriction in the git-worktree man page. Haven't tried to switch over myself, but maybe I will at some point.

regards, tom lane
Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)
On Fri, Feb 24, 2017 at 7:29 PM, Kohei KaiGai wrote:
> This attached patch re-designed the previous v2 according to Robert's comment.
> All I had to do is put the entrypoint for ForeignScan/CustomScan at
> ExecShutdownNode, because it is already modified to call the child node first,
> earlier than ExecShutdownGather, which releases the DSM segment.

Aside from the documentation, which needs some work, this looks fine to me on a quick read.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] utility commands benefiting from parallel plan
On Fri, Feb 24, 2017 at 11:43 AM, Haribabu Kommi wrote:
> Here I attached an implementation patch that allows
> utility statements that have queries underneath, such as
> CREATE TABLE AS, CREATE MATERIALIZED VIEW,
> and REFRESH commands, to benefit from a parallel plan.
>
> These write operations are not performed concurrently by the
> parallel workers, but the underlying query that is used by
> these operations is eligible for parallel plans.
>
> Currently the write operations are implemented for the
> tuple dest types DestIntoRel and DestTransientRel.
>
> Currently I am evaluating other write operations that can
> benefit from parallelism without side effects in enabling them.
>
> Comments?

I think a lot more work than this will be needed. See:

https://www.postgresql.org/message-id/CA+TgmoZC5ft_t9uQWSO5_1vU6H8oVyD=zyuLvRnJqTN==fv...@mail.gmail.com

...and the discussion which followed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Make subquery alias optional in FROM clause
Craig Ringer writes:
> Personally I think we need to generate one, if nothing else for error
> messages where we try to emit qualified names of columns.

Also for EXPLAIN, where there has to be a way to name everything.

> But I don't see that the name needs to be anything we can refer to
> elsewhere or anything faintly sane to type. Something like:
> ""

-1. "Make it ugly as sin and then pretend that nobody could conflict with it" is neither formally correct nor nice to look at in the contexts where people have to look at it.

I'm for something along the lines of "subquery_n" where we simply keep incrementing n until we find a name that is not present in the query already. This is basically what ruleutils.c does now when it has to cons up a unique table alias, which it must do in cases involving inheritance.

regards, tom lane
Re: [HACKERS] safer node casting
On 2/24/17 10:54, Tom Lane wrote:
> Andres Freund writes:
>> Those aren't actually equivalent, because of the !nodeptr. IsA() crashes
>> for NULL pointers, but the new code won't. Which means 9ba8a9ce4548b et
>> al actually weakened some asserts.
>
>> Should we perhaps have one NULL-accepting version (castNodeNull?) and
>> one that separately asserts that ptr != NULL?
>
> -1 ... if you're going to use something in a way that requires it not to
> be null, your code will crash quite efficiently on a null, with or
> without an assert. I don't think we need the extra cognitive burden of
> two distinct macros for this.

I think we should just add some Assert(thepointer) where necessary.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Feb 24, 2017 at 4:06 PM, Pavan Deolasee wrote:
>> Wow, OK. In my view, that makes the chain conversion code pretty much
>> essential, because if you had WARM without chain conversion then the
>> visibility map gets more or less irrevocably less effective over time,
>> which sounds terrible.
>
> Yes. I decided to complete the chain conversion patch when I realised that IOS
> will otherwise become completely useless if a large percentage of rows are
> updated just once. So I agree. It's not an optional patch and should get in
> with the main WARM patch.

Right, and it's not just index-only scans. VACUUM gets permanently more expensive, too, which is probably a much worse problem.

>> But it sounds to me like even with the chain
>> conversion, it might take multiple vacuum passes before all visibility
>> map bits are set, which isn't such a great property (thus e.g.
>> fdf9e21196a6f58c6021c967dc5776a16190f295).
>
> The chain conversion algorithm first converts the chains during vacuum and
> then checks if the page can be set all-visible. So I'm not sure why it would
> take multiple vacuums before a page is set all-visible. The commit you quote
> was written to ensure that we make another attempt to set the page
> all-visible after all dead tuples are removed from the page. Similarly, we
> will convert all WARM chains to HOT chains and then check for all-visibility
> of the page.

OK, that sounds good. And there are no bugs, right? :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] btree_gin and btree_gist for enums
Andrew Dunstan writes: > On 02/23/2017 04:41 PM, Tom Lane wrote: >> The reason this is kind of scary is that it's just blithely assuming >> that the function won't look at the *other* fields of the FmgrInfo. >> If it did, it would likely get very confused, since those fields >> would be describing the GIN support function, not the function we're >> calling. >> >> We could alternatively have this trampoline function set up a fresh, local >> FmgrInfo struct that it zeroes except for copying fn_extra and fn_mcxt >> from the caller's struct, and then it copies fn_extra back again on the >> way out. That's a few more cycles but it would be safer, I think; if the >> function tried to look at the other fields such as fn_oid it would see >> obviously bogus data. > Do we want one or both of these? I'm prepared to code up a patch to > fmgr.[ch] to provide them. On reflection I'm not sure that the double-copy approach is all that much safer than just passing down the caller's flinfo pointer. Most of the time it would be better, but suppose that the callee updates fn_extra and then throws elog(ERROR) --- the outcome would be different, probably creating a leak in fn_mcxt. Maybe this would still be okay, because perhaps that FmgrInfo is never used again, but I don't think we can assume that for the case at hand. At this point I'd be inclined to just document that the called function should only use fn_extra/fn_mcxt. > I don't know what to call it either. In my test I used > CallerContextFunctionCall2 - not sure if that's quite right, but should > be close. CallerInfo? CallerFInfo? Or we could spell out CallerFmgrInfo but that seems a bit verbose. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] safer node casting
Andres Freund writes: > Those aren't actually equivalent, because of the !nodeptr. IsA() crashes > for NULL pointers, but the new code won't. Which means 9ba8a9ce4548b et > al actually weakened some asserts. > Should we perhaps have one NULL accepting version (castNodeNull?) and > one that separately asserts that ptr != NULL? -1 ... if you're going to use something in a way that requires it not to be null, your code will crash quite efficiently on a null, with or without an assert. I don't think we need the extra cognitive burden of two distinct macros for this. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] utility commands benefiting from parallel plan
On Fri, Feb 24, 2017 at 11:43 AM, Haribabu Kommi wrote: > Here I attached an implementation patch that allows > utility statements that have queries underneath such as > CREATE TABLE AS, CREATE MATERIALIZED VIEW > and REFRESH commands to benefit from parallel plan. > > These write operations are not performed concurrently by the > parallel workers, but the underlying query that is used by > these operations is eligible for parallel plans. > > Currently the write operations are implemented for the > tuple dest types DestIntoRel and DestTransientRel. > > Currently I am evaluating other write operations that can > benefit from parallelism without side effects in enabling them. The idea looks good to me. Since we are already modifying heap_prepare_insert, I am thinking that we can as well enable queries like "insert into .. select from .." with minor modification? - * For now, parallel operations are required to be strictly read-only. - * Unlike heap_update() and heap_delete(), an insert should never create a - * combo CID, so it might be possible to relax this restriction, but not - * without more thought and testing. + * For now, parallel operations are required to be strictly read-only in + * parallel worker. This statement is still not true, we cannot do heap_update in the leader even though workers are doing the read-only operation (update with select). We can change the comments so that they appear more specific to insert, I think. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] UPDATE of partition key
On Friday, February 24, 2017, Simon Riggs wrote: > > 2. I know that DB2 handles this by having the user specify WITH ROW > MOVEMENT to explicitly indicate they accept the issue and want update > to work even with that. We could have an explicit option to allow > that. This appears to be the only way we could avoid silent errors for > foreign table partitions. > > This does, however, make the partitioning very non-transparent to every update query just because it is remotely possible a partition-moving update might occur concurrently. I dislike an error. I'd say that making partitioning "just work" here is material for another patch. In this one an update of the partition key can be documented as shorthand for delete-returning-insert with all the limitations that go with that. If someone acceptably solves the ctid following logic later it can be committed - I'm assuming there would be no complaints about making things just work in a case where they only sorta worked. David J.
Re: [HACKERS] btree_gin and btree_gist for enums
On 02/23/2017 04:41 PM, Tom Lane wrote: > Andrew Dunstan writes: >> I'm not entirely sure how I should replace those DirectFunctionCall2 calls. > Basically what we want is for the called function to be able to use the > fcinfo->flinfo->fn_extra and fcinfo->flinfo->fn_mcxt fields of the > FmgrInfo struct that GIN has set up for the btree_gin support function. > > The fast but somewhat scary way to do it would just be to pass through > the flinfo pointer as-is. Imagine that fmgr.c grows a set of functions > like > > Datum > DontKnowWhatToCallThisFunctionCall2(PGFunction func, > FmgrInfo *flinfo, Oid collation, > Datum arg1, Datum arg2) > { > FunctionCallInfoData fcinfo; > Datum result; > > InitFunctionCallInfoData(fcinfo, flinfo, 2, collation, NULL, NULL); > > fcinfo.arg[0] = arg1; > fcinfo.arg[1] = arg2; > fcinfo.argnull[0] = false; > fcinfo.argnull[1] = false; > > result = (*func) (&fcinfo); > > /* Check for null result, since caller is clearly not expecting one */ > if (fcinfo.isnull) > elog(ERROR, "function %p returned NULL", (void *) func); > > return result; > } > > and then you'd just pass through the fcinfo->flinfo you got. > > The reason this is kind of scary is that it's just blithely assuming > that the function won't look at the *other* fields of the FmgrInfo. > If it did, it would likely get very confused, since those fields > would be describing the GIN support function, not the function we're > calling. > > We could alternatively have this trampoline function set up a fresh, local > FmgrInfo struct that it zeroes except for copying fn_extra and fn_mcxt > from the caller's struct, and then it copies fn_extra back again on the > way out. That's a few more cycles but it would be safer, I think; if the > function tried to look at the other fields such as fn_oid it would see > obviously bogus data. > > Do we want one or both of these? I'm prepared to code up a patch to fmgr.[ch] to provide them. I don't know what to call it either. 
In my test I used CallerContextFunctionCall2 - not sure if that's quite right, but should be close. The technique is somewhat similar to what we do in plperl.c where we fake up a function call for handling DO blocks. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums by default?
On Thu, Feb 23, 2017 at 10:37 PM, Bruce Momjian wrote: > On Sat, Jan 21, 2017 at 12:46:05PM -0500, Stephen Frost wrote: > > * Petr Jelinek (petr.jeli...@2ndquadrant.com) wrote: > > > As we don't know what the performance impact is (there was no benchmark done > > > on reasonably current code base) I really don't understand how you can > > > judge if it's worth it or not. > > > > Because I see having checksums as, frankly, something we always should > > have had (as most other databases do, for good reason...) and because > > they will hopefully prevent data loss. I'm willing to give up a fair > > bit to minimize the risk of losing data. > > Do these other databases do checksums because they don't do > full_page_writes? They just detect torn pages rather than repair them > like we do? > Torn page detection is usually/often done by other means than checksums. I don't think those are necessarily related. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)
Hello, This attached patch re-designed the previous v2 according to Robert's comment. All I had to do was put an entrypoint for ForeignScan/CustomScan at ExecShutdownNode, because it is already modified to call the child node first, earlier than ExecShutdownGather, which releases the DSM segment. Thanks, 2017-02-20 9:25 GMT+09:00 Kouhei Kaigai: >> -Original Message- >> From: pgsql-hackers-ow...@postgresql.org >> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas >> Sent: Monday, February 20, 2017 2:20 AM >> To: Kaigai Kouhei(海外 浩平) >> Cc: Claudio Freire ; Amit Kapila >> ; pgsql-hackers >> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside >> ExecEndGather) >> >> On Fri, Feb 17, 2017 at 12:46 PM, Kouhei Kaigai >> wrote: >> > The attached patch is the revised one. >> > >> > Invocation of Exec(Foreign|Custom)ParallelFinish was moved to >> > ExecParallelRetrieveInstrumentation() not to walk on the plan-state >> > tree twice. >> > One (hypothetical) downside is, FDW/CSP can retrieve its own run-time >> > statistics only when query is executed under EXPLAIN ANALYZE. >> > >> > This enhancement allows FDW/CSP to collect its specific run-time >> > statistics more than Instrumentation, then show them as output of >> > EXPLAIN. My expected examples are GPU's kernel execution time, DMA >> > transfer ratio and so on. These statistics will never appear in the >> > Instrumentation structure, however, can be a hot-point of performance >> > bottleneck if CustomScan works on background workers. >> >> Would gather_shutdown_children_first.patch from >> https://www.postgresql.org/message-id/CAFiTN-s5KuRuDrQCEpiHHzmVf7JTtbn >> b8eb10c-6aywjdxb...@mail.gmail.com >> help with this problem also? Suppose we did that, and then also added an >> ExecShutdownCustom method. Then you'd definitely be able to get control >> before the DSM went away, either from ExecEndNode() or ExecShutdownNode(). >> > Ah, yes, I couldn't find any problem around the above approach. 
> ExecShutdownGather() can be called by either ExecShutdownNode() or > ExecEndGather(). This patch allows CSP/FDW to have an entrypoint > prior to the release of the DSM. > > Thanks, > > PG-Strom Project / NEC OSS Promotion Center > KaiGai Kohei > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- KaiGai Kohei parallel-finish-fdw_csp.v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bytea_output output of base64
On Thu, Feb 23, 2017 at 03:52:46PM -0800, David Fetter wrote: > On Thu, Feb 23, 2017 at 05:55:37PM -0500, Bruce Momjian wrote: > > On Thu, Feb 23, 2017 at 04:08:58PM -0500, Tom Lane wrote: > > > Bruce Momjian writes: > > > > Is there a reason we don't support base64 as a bytea_output output > > > > option, except that no one has implemented it? > > > > > > How about "we already have one too many bytea output formats"? > > > I don't think forcing code to try to support still another one > > > is a great thing ... especially not if it couldn't be reliably > > > distinguished from the hex format. > > > > Is there a reason we chose hex over base64? > > Whether there was or not, there's not a compelling reason now to break > people's software. When people want compression, methods a LOT more > effective than base64 are common. Gzip, for example. > > Best, > David. First, hex encoding is very simple to perform. Second, most applications have routines to handle it trivially. And third, base64 encoding has some padding constraints that can complicate its processing. As David suggests, if you want compact output, run it through lz4/gzip/lzop... for a much better size reduction. Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FYI: git worktrees as replacement for "rsync the CVSROOT"
On 02/24/2017 02:36 AM, Craig Ringer wrote: > On 16 January 2017 at 05:01, Jim Nasby wrote: >> Not sure how many people still use [1], as referenced by our git wiki[2], >> but it appears git worktrees are a viable replacement for that technique. In >> short, if you're already in your checkout: >> >> git worktree add ../9.6 REL9_6_STABLE >> >> would give you a checkout of 9.6 in the ../9.6 directory. >> >> BTW, I learned about this from this "git year in review" article[3]. > Looks handy enough to merit adding to the Pg developer FAQ. Please? > > It looks cleaner than my current approach of doing a local clone or > re-cloning from upstream with a local repo as a --reference. > Does this do anything different from the git contrib script git-new-workdir that I have been using for quite a long while? Essentially it symlinks a bunch of things from the old workdir to the new one. I copied the technique in the buildfarm's git_use_workdirs feature. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] Temporal query processing with range types
2017-02-22 19:43 GMT+01:00 Peter Eisentraut: > On 2/16/17 07:41, Robert Haas wrote: >> Also, it sounds like all of this is intended to work with ranges that >> are stored in different columns rather than with PostgreSQL's built-in >> range types. > > Yeah, that should certainly be changed. Our syntax supports PostgreSQL's built-in range types and ranges that are stored in different columns. For instance, for range types an ALIGN query would look like this: SELECT * FROM (r ALIGN s ON q WITH (r.t, s.t)) c ... and for ranges in different columns like this: SELECT * FROM (r ALIGN s ON q WITH (r.ts, r.te, s.ts, s.te)) c ... where r and s are input relations, q can be any join qualifier, and r.t, s.t, r.ts, r.te, s.ts, and s.te can be any column name. The latter represent the valid time intervals, that is time point start, and time point end of each tuple for each input relation. These can be defined as four scalars, or two half-open, i.e., [), range typed values. It would reduce the size of our patch and simplify the overall structure, if we would remove the possibility to express valid time start points and end points in different columns. Do you think it is better to remove the syntax for ranges expressed in different columns? However, internally we still need to split the range types into two separate points, because NORMALIZE does not make a distinction between start and end timepoints while grouping, therefore we have only one timepoint attribute there (i.e., P1), which is the union of start and end timepoints (see executor/nodeTemporalAdjustment.c). A second constraint is, that we support currently only half-open intervals, that is, any interval definition (open/closed/half-open) different from the PostgreSQL's default, i.e., [), leads to undefined results. Best regards, Anton, Johann, Michael, Peter -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] safer node casting
Hi, I was about to add a few more uses of castNode, which made me think. You proposed replacing: On 2016-12-31 12:08:22 -0500, Peter Eisentraut wrote: > There is a common coding pattern that goes like this: > > RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc); > Assert(IsA(rinfo, RestrictInfo)); with > +#define castNode(_type_,nodeptr) (AssertMacro(!nodeptr || > IsA(nodeptr,_type_)), (_type_ *)(nodeptr)) (now an inline function, but that's besides my point) Those aren't actually equivalent, because of the !nodeptr. IsA() crashes for NULL pointers, but the new code won't. Which means 9ba8a9ce4548b et al actually weakened some asserts. Should we perhaps have one NULL accepting version (castNodeNull?) and one that separately asserts that ptr != NULL? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] increasing the default WAL segment size
On Fri, Feb 24, 2017 at 12:47 PM, Beena Emerson wrote: > > Hello, > > The recent commit c29aff959dc64f7321062e7f33d8c6ec23db53d has again changed > the code and the second patch cannot be applied cleanly. Please find > attached the rebased 02 patch. 01 patch is the same. > I've done an initial review of the patch. The objective of the patch is to make the wal-segsize an initdb-time parameter instead of a compile-time parameter. The patch introduces the following three different techniques to expose the XLogSegSize to different modules: 1. Directly read XLogSegSize from the control file This is used by default, i.e., StartupXLOG() and looks good to me. 2. Run the SHOW wal_segment_size command to fetch and set the XLogSegSize + if (!RetrieveXLogSegSize(conn)) + disconnect_and_exit(1); + You need the same logic in pg_receivewal.c as well. 3. Retrieve the XLogSegSize by reading the file size of WAL files + if (private.inpath != NULL) + sprintf(full_path, "%s/%s", private.inpath, fname); + else + strcpy(full_path, fname); + + stat(full_path, &fst); + + if (!IsValidXLogSegSize(fst.st_size)) + { + fprintf(stderr, + _("%s: file size %d is invalid \n"), + progname, (int) fst.st_size); + + return EXIT_FAILURE; + + } + + XLogSegSize = (int) fst.st_size; I see a couple of issues with this approach: * You should check the return value of stat() before going ahead. Something like, if (stat(filename, &fst) < 0) error "file doesn't exist" * You're considering any WAL file with a power of 2 as valid. Suppose, the correct WAL seg size is 64mb. For some reason, the server generated a 16mb invalid WAL file(maybe it crashed while creating the WAL file). Your code seems to treat this as a valid file which I think is incorrect. Do you agree with that? Is it possible to unify these different techniques of reading XLogSegSize in a generalized function with a proper documentation describing the scope and limitations of each approach? 
-- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keep ECPG comment for log_min_duration_statement
> As I said in the first e-mail, there are some restrictions on comment > position in my implementation. I am concerned that an error will > occur in .pgc files users made in an old version. > So, this feature should be a new option. I see, this makes sense. > When the pre-compiler (ECPG) converts C with embedded SQL to normal C > code, gram.y is used for syntactic analysis. I need to change gram.y > for comments in SQL statements. > But I have not come up with a better idea that leaves gram.y unaffected. > If you are interested in my implementation in detail, please check > the [WIP] patch I attached. I'm not sure we would want to change the backend parser for something only used in ecpg. Actually I'm pretty sure we don't. I can see two possible solutions. One would be to replace the parser rules. Please see parse.pl for details. Some rules are already replaced by ecpg specific ones. However, the more rules we replace the more manual syncing effort we need for changes in gram.y. The other option I can see, albeit without looking into details, is allowing all comments and then testing it for the right syntax after parsing. This could potentially also solve the above mentioned option problem. Michael -- Michael Meskes Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org) Meskes at (Debian|Postgresql) dot Org Jabber: michael at xmpp dot meskes dot org VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal : Parallel Merge Join
On Fri, Feb 24, 2017 at 3:04 PM, Amit Kapila wrote: > What advantage do you see for considering such a path when the cost of > innerpath is more than cheapest_total_inner? Remember the more paths > we try to consider, the more time we spend in the planner. By any > chance are you able to generate any query where it will give benefit > by considering costlier innerpath? Changed > > 2. > +static void generate_parallel_mergejoin_paths(PlannerInfo *root, > + RelOptInfo *joinrel, > + RelOptInfo *innerrel, > + Path *outerpath, > + JoinType jointype, > + JoinPathExtraData *extra, > + Path *inner_cheapest_total, > + List *merge_pathkeys); > > It is better to name this function as > generate_partial_mergejoin_paths() as we are generating only partial > paths in this function and accordingly change the comment on top of > the function. I see that you might be naming it based on > consider_parallel_*, however, I think it is better to use partial in > the name as that is what we are doing inside that function. Also, I > think this function has removed/changed some handling related to > unique outer and full joins, so it is better to mention that in the > function comments, something like "unlike above function, this > function doesn't expect to handle join types JOIN_UNIQUE_OUTER or > JOIN_FULL" and add Assert for both of those types. Done > > 3. > A test case is still missing. Added -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com parallel_mergejoin_v6.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal for changes to recovery.conf API
On 12 January 2017 at 13:34, Peter Eisentraut wrote: > On 1/11/17 5:27 AM, Simon Riggs wrote: >> The main area of "design doubt" remains the implementation of the >> recovery_target parameter set. Are we happy with the user interface >> choices in the patch, given the understanding that the situation was >> more complex than at first thought? > > Could you summarize the current proposal(s)? > > Personally, I don't immediately see the need to change anything from the > parameter names that I currently see in recovery.conf.sample. New patch version implementing everything you requested, incl docs and tap tests. The patch as offered here is what I've been asked to do by everybody as well as I can do it. I'm very happy to receive comments and to rework the design based upon further feedback. I'm not completely convinced this is a great design, so I'm happy to hear input. pg_basebackup -R is the main wrinkle. The timeline handling has a bug at present that I'm working on, but I'm not worried it constitutes a major problem. Obviously it will be fixed before commit, but the patch needs more discussion now/yesterday. All parameters are set at PGC_POSTMASTER for now. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services newRecoveryAPI.v102f.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Feb 24, 2017 at 3:42 PM, Robert Haas wrote: > > > Wow, OK. In my view, that makes the chain conversion code pretty much > essential, because if you had WARM without chain conversion then the > visibility map gets more or less irrevocably less effective over time, > which sounds terrible. Yes. I decided to complete the chain conversion patch when I realised that IOS will otherwise become completely useless if a large percentage of rows are updated just once. So I agree. It's not an optional patch and should get in with the main WARM patch. > But it sounds to me like even with the chain > conversion, it might take multiple vacuum passes before all visibility > map bits are set, which isn't such a great property (thus e.g. > fdf9e21196a6f58c6021c967dc5776a16190f295). > > The chain conversion algorithm first converts the chains during vacuum and then checks if the page can be set all-visible. So I'm not sure why it would take multiple vacuums before a page is set all-visible. The commit you quote was written to ensure that we make another attempt to set the page all-visible after all dead tuples are removed from the page. Similarly, we will convert all WARM chains to HOT chains and then check for all-visibility of the page. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Proposal : Parallel Merge Join
On Fri, Feb 24, 2017 at 3:42 PM, Dilip Kumar wrote: > On Fri, Feb 24, 2017 at 3:04 PM, Amit Kapila wrote: > > Thanks for the comments. > >> What advantage do you see for considering such a path when the cost of >> innerpath is more than cheapest_total_inner? Remember the more paths >> we try to consider, the more time we spend in the planner. By any >> chance are you able to generate any query where it will give benefit >> by considering costlier innerpath? > > If inner_cheapest_total path is not parallel safe then I am trying to > find the cheapest parallel safe path and generate partial merge join. > Not sure if we can just ignore the cheapest inner path because it is not parallel safe. It is also possible that you pick a costly inner path just because it is parallel safe and then, later on, you have to discard it because the overall cost of partial merge join is much more. > I agree that it will consider one extra path while considering every > partial merge join path > Hmm, AFAICS, it will consider two extra paths per sort key (the loop around considering such paths is up to num_sortkeys) > (I think with this addition we will not get > parallel merge path for any more TPCH queries). I feel in general > we can get some better plan by this especially when gather is the > cheapest path for inner relation (which is not parallel safe) but at > the cost of considering some extra paths. I agree in some cases it could be better, but I think benefits are not completely clear, so probably we can leave it as of now and if later anyone comes with a clear use case or can see the benefits of such a path then we can include it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Make subquery alias optional in FROM clause
On Fri, Feb 24, 2017 at 1:04 PM, Craig Ringer wrote: > On 23 February 2017 at 22:20, Tom Lane wrote: >>> * Don't force/generate an alias at all. >> >>> I've no idea for this yet and Tom already was concerned what this might >>> break. There are several places in the transform phase where the >>> refnames are required (e.g. isLockedRefname()). >> >> Yeah. This would be cleaner in some sense but also a lot more delicate. >> Not sure it's worth the trouble. > > Personally I think we need to generate one, if nothing else for error > messages where we try to emit qualified names of columns. > > But I don't see that the name needs to be anything we can refer to > elsewhere or anything faintly sane to type. Something like: > > "" > > in line with our current generation of refcursor names. Isn't that a terribly unfriendly thing to include in an error message? I'd much rather see the column qualified with whatever the alias name is inside the subquery than see it qualified with some internally generated name that's completely meaningless. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] UPDATE of partition key
On Fri, Feb 24, 2017 at 3:24 PM, Simon Riggs wrote: > I'd give the view that we cannot silently ignore this issue, bearing > in mind the point that we're expecting partitioned tables to behave > exactly like normal tables. At the risk of repeating myself, I don't expect that, and I don't think it's a reasonable expectation. It's reasonable to expect partitioning to be notably better than inheritance (which I think it already is) and to provide a good base for future work (which I think it does), but I think getting them to behave exactly like normal tables (except for the things we want to be different) will take another ten years of development work. > In my understanding the issue is that UPDATEs will fail to update a > row when a valid row exists in the case where a row moved between > partitions; that behaviour will be different to a standard table. Right, when at READ COMMITTED and EvalPlanQual would have happened otherwise. > It is of course very good that we have something ready for this > release and can make a choice of what to do. > > Thoughts > > 1. Reuse the tuple state HEAP_MOVED_OFF which IIRC represents > almost exactly the same thing. An UPDATE which gets to a > HEAP_MOVED_OFF tuple will know to re-find the tuple via the partition > metadata, or I might be persuaded that in-this-release it is > acceptable to fail when this occurs with an ERROR and a retryable > SQLCODE, since the UPDATE will succeed on next execution. I've got my doubts about whether we can make that bit work that way, considering that we still support pg_upgrade (possibly in multiple steps) from old releases that had VACUUM FULL. We really ought to put some work into reclaiming those old bits, but there's probably no time for that in v10. > 2. I know that DB2 handles this by having the user specify WITH ROW > MOVEMENT to explicitly indicate they accept the issue and want update > to work even with that. We could have an explicit option to allow > that. 
> This appears to be the only way we could avoid silent errors for > foreign table partitions. Yeah, that's a thought. We could give people a choice between (a) updates that cause rows to move between partitions just fail and (b) such updates work but with EPQ-related deficiencies. I had previously thought that, given those two choices, everybody would like (b) better than (a), but maybe not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Monitoring roles patch
Per the discussion at https://www.postgresql.org/message-id/CA%2BOCxoyYxO%2BJmzv2Micj4uAaQdAi6nq0w25BPQgLLxsrvTmREw%40mail.gmail.com, attached is a patch that implements the following: - Adds a default role called pg_monitor - Gives members of the pg_monitor role full access to: pg_ls_logdir() and pg_ls_waldir() pg_stat_* views and functions pg_tablespace_size() and pg_database_size() Contrib modules: pg_buffercache, pg_freespacemap, pgrowlocks, pg_stat_statements, pgstattuple and pg_visibility (but NOT pg_truncate_visibility_map() ) - Adds a default role called pg_read_all_gucs - Allows members of pg_read_all_gucs to, well, read all GUCs - Grants pg_read_all_gucs to pg_monitor Note that updates to contrib modules followed the strategy recently used in changes to pgstattuple following discussion here, in which the installation SQL script is left at the prior version, and an update script is added and default version number bumped to match that of the upgrade script. Patch includes doc updates, and is dependent on my pg_ls_logdir() and pg_ls_waldir() patch (https://www.postgresql.org/message-id/CA+OCxow-X=D2fWdKy+HP+vQ1LtrgbsYQ=cshzzbqyft5joy...@mail.gmail.com). 
-- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile index 497dbeb229..18f7a87452 100644 --- a/contrib/pg_buffercache/Makefile +++ b/contrib/pg_buffercache/Makefile @@ -4,8 +4,9 @@ MODULE_big = pg_buffercache OBJS = pg_buffercache_pages.o $(WIN32RES) EXTENSION = pg_buffercache -DATA = pg_buffercache--1.2.sql pg_buffercache--1.1--1.2.sql \ - pg_buffercache--1.0--1.1.sql pg_buffercache--unpackaged--1.0.sql +DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \ + pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \ + pg_buffercache--unpackaged--1.0.sql PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time" ifdef USE_PGXS diff --git a/contrib/pg_buffercache/pg_buffercache--1.2--1.3.sql b/contrib/pg_buffercache/pg_buffercache--1.2--1.3.sql new file mode 100644 index 00..b37ef0112e --- /dev/null +++ b/contrib/pg_buffercache/pg_buffercache--1.2--1.3.sql @@ -0,0 +1,7 @@ +/* contrib/pg_buffercache/pg_buffercache--1.2--1.3.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.3'" to load this file. 
\quit + +GRANT EXECUTE ON FUNCTION pg_buffercache_pages() TO pg_monitor; +GRANT SELECT ON pg_buffercache TO pg_monitor; diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control index a4d664f3fa..8c060ae9ab 100644 --- a/contrib/pg_buffercache/pg_buffercache.control +++ b/contrib/pg_buffercache/pg_buffercache.control @@ -1,5 +1,5 @@ # pg_buffercache extension comment = 'examine the shared buffer cache' -default_version = '1.2' +default_version = '1.3' module_pathname = '$libdir/pg_buffercache' relocatable = true diff --git a/contrib/pg_freespacemap/Makefile b/contrib/pg_freespacemap/Makefile index 7bc0e9555d..0a2f000ec6 100644 --- a/contrib/pg_freespacemap/Makefile +++ b/contrib/pg_freespacemap/Makefile @@ -4,8 +4,8 @@ MODULE_big = pg_freespacemap OBJS = pg_freespacemap.o $(WIN32RES) EXTENSION = pg_freespacemap -DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.0--1.1.sql \ - pg_freespacemap--unpackaged--1.0.sql +DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.1--1.2.sql \ + pg_freespacemap--1.0--1.1.sql pg_freespacemap--unpackaged--1.0.sql PGFILEDESC = "pg_freespacemap - monitoring of free space map" ifdef USE_PGXS diff --git a/contrib/pg_freespacemap/pg_freespacemap--1.1--1.2.sql b/contrib/pg_freespacemap/pg_freespacemap--1.1--1.2.sql new file mode 100644 index 00..490bb3bf46 --- /dev/null +++ b/contrib/pg_freespacemap/pg_freespacemap--1.1--1.2.sql @@ -0,0 +1,7 @@ +/* contrib/pg_freespacemap/pg_freespacemap--1.1--1.2.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pg_freespacemap UPDATE TO '1.2'" to load this file. 
\quit + +GRANT EXECUTE ON FUNCTION pg_freespace(regclass, bigint) TO pg_monitor; +GRANT EXECUTE ON FUNCTION pg_freespace(regclass) TO pg_monitor; diff --git a/contrib/pg_freespacemap/pg_freespacemap.control b/contrib/pg_freespacemap/pg_freespacemap.control index 764db30d18..ac8fc5050a 100644 --- a/contrib/pg_freespacemap/pg_freespacemap.control +++ b/contrib/pg_freespacemap/pg_freespacemap.control @@ -1,5 +1,5 @@ # pg_freespacemap extension comment = 'examine the free space map (FSM)' -default_version = '1.1' +default_version = '1.2' module_pathname = '$libdir/pg_freespacemap' relocatable = true diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile index
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Feb 24, 2017 at 3:31 PM, Pavan Deolasee wrote: > On Fri, Feb 24, 2017 at 3:23 PM, Robert Haas wrote: >> I don't immediately see how this will work with index-only scans. If >> the tuple is HOT updated several times, HOT-pruned back to a single >> version, and then the page is all-visible, the index entries are >> guaranteed to agree with the remaining tuple, so it's fine to believe >> the data in the index tuple. But with WARM, that would no longer be >> true, unless you have some trick for that... > > Well the trick is to not allow index-only scans on such pages by not marking > them all-visible. That's why when a tuple is WARM updated, we carry that > information in the subsequent versions even when later updates are HOT > updates. The chain conversion algorithm will handle this by clearing those > bits and thus allowing index-only scans again. Wow, OK. In my view, that makes the chain conversion code pretty much essential, because if you had WARM without chain conversion then the visibility map gets more or less irrevocably less effective over time, which sounds terrible. But it sounds to me like even with the chain conversion, it might take multiple vacuum passes before all visibility map bits are set, which isn't such a great property (thus e.g. fdf9e21196a6f58c6021c967dc5776a16190f295). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal : Parallel Merge Join
On Fri, Feb 24, 2017 at 3:04 PM, Amit Kapila wrote: Thanks for the comments. > What advantage do you see for considering such a path when the cost of > innerpath is more than cheapest_total_inner? Remember the more paths > we try to consider, the more time we spend in the planner. By any > chance are you able to generate any query where it will give benefit > by considering costlier innerpath? If the inner_cheapest_total path is not parallel safe, then I try to find the cheapest parallel-safe path and generate a partial merge join with it. I agree that this considers one extra path for every partial merge join path considered (I don't think this addition will produce a parallel merge path for any more TPCH queries). I feel that in general we can get a better plan this way, especially when a Gather is the cheapest path for the inner relation (which is not parallel safe), at the cost of considering some extra paths. What are your thoughts on this? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
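The selection rule being discussed, fall back to the cheapest parallel-safe inner path when the cheapest total inner path is not parallel safe, can be sketched in a few lines of Python (names invented for illustration; this is not the actual planner code in joinpath.c):

```python
# Toy sketch (invented names, not the actual planner code): pick the inner
# path for a partial merge join.  Prefer the cheapest total inner path, but
# if it is not parallel safe, fall back to the cheapest parallel-safe one.
def choose_inner_path(paths):
    cheapest_total = min(paths, key=lambda p: p["total_cost"])
    if cheapest_total["parallel_safe"]:
        return cheapest_total
    safe = [p for p in paths if p["parallel_safe"]]
    return min(safe, key=lambda p: p["total_cost"]) if safe else None

# The case Dilip describes: a Gather is cheapest overall, but a Gather
# cannot sit below another parallel node, so the costlier parallel-safe
# path is chosen instead.
paths = [
    {"name": "gather_path", "total_cost": 90.0, "parallel_safe": False},
    {"name": "index_path", "total_cost": 120.0, "parallel_safe": True},
]
```

The extra planning cost Amit worries about comes from considering this second-cheapest path in addition to the cheapest one.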
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Feb 24, 2017 at 3:23 PM, Robert Haas wrote: > > I don't immediately see how this will work with index-only scans. If > the tuple is HOT updated several times, HOT-pruned back to a single > version, and then the page is all-visible, the index entries are > guaranteed to agree with the remaining tuple, so it's fine to believe > the data in the index tuple. But with WARM, that would no longer be > true, unless you have some trick for that... Well, the trick is to not allow index-only scans on such pages by not marking them all-visible. That's why when a tuple is WARM updated, we carry that information in the subsequent versions even when later updates are HOT updates. The chain conversion algorithm will handle this by clearing those bits and thus allowing index-only scans again. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] Making clausesel.c Smarter
I've stumbled over an interesting query out in the wild where the query was testing whether a nullable timestamptz column was earlier than some point in time, and also checking the column IS NOT NULL. The usage pattern here was that records which required processing had this timestamp set to something other than NULL; a worker would come along and process those, and UPDATE the record to NULL to mark the fact that it was now processed. So what we were left with was a table with a small number of rows with a value in this timestamp column, and an ever-growing number of rows with a NULL value. A highly simplified version of the query was checking for records that required processing before some date, say: SELECT * FROM a WHERE ts < '2017-02-24' AND ts IS NOT NULL; Of course, the ts IS NOT NULL is not required here, but I can understand how someone might make the mistake of adding it. The simple solution to the problem was to have that null test removed, but that seemingly innocent little mistake caused some pain due to very slow-running queries which held on to a nested loop plan 33 times longer than they should have. Basically what was happening here is that clauselist_selectivity() was applying the selectivity with the ~0.97 null_fact from pg_statistic over the top of the already-applied estimate on the number of rows below the constant timestamp. Since the diagnosis of this problem was not instant, and some amount of pain was suffered before the solution was found, I wondered if it might be worth smartening up the planner a bit in this area. We're already pretty good at handling conditions like: SELECT * FROM a WHERE x < 10 AND x < 1; where we'll effectively ignore the x < 10 estimate since x < 1 is more restrictive, so if we just build on that ability a bit we could get enough code to cover the above case. I've attached a draft patch which improves the situation here. 
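The double-counting is easy to see with back-of-envelope arithmetic (a sketch only, with numbers chosen to mirror the test case below, not the planner's actual code path):

```python
# Illustration of the estimation problem: for "ts <= const AND ts IS NOT
# NULL", multiplying the two selectivities as if independent double-counts
# NULLs, because the range clause already excludes them (NULL never
# satisfies "<=").
nrows = 1_000_000
nullfrac = 0.999              # ~99.9% of rows are NULL
range_sel = 733 / 1_000_000   # selectivity already computed for ts <= const

naive_estimate = nrows * range_sel * (1.0 - nullfrac)  # clauses treated as independent
smart_estimate = nrows * range_sel                     # IS NOT NULL is implied, so skipped
```

The naive product comes out below one row (clamped up to rows=1, as in the master plan), while recognizing that the range clause implies IS NOT NULL recovers the rows=733 estimate the patch produces.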
Given the test case:

create table ts (ts timestamptz);
insert into ts
  select case when x%1000=0 then '2017-01-01'::date + (x::text || ' sec')::interval
         else null end
  from generate_series(1,1000000) x(x);
analyze ts;
explain analyze select count(*) from ts where ts <= '2017-02-01'::timestamptz and ts is not null;

With the patch we get:

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=15938.83..15938.84 rows=1 width=8) (actual time=101.003..101.003 rows=1 loops=1)
   ->  Seq Scan on ts  (cost=0.00..15937.00 rows=733 width=0) (actual time=0.184..100.868 rows=1000 loops=1)
         Filter: ((ts IS NOT NULL) AND (ts <= '2017-02-01 00:00:00+13'::timestamp with time zone))
         Rows Removed by Filter: 999000
 Planning time: 0.153 ms
 Execution time: 101.063 ms

Whereas master gives us:

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=15937.00..15937.01 rows=1 width=8) (actual time=119.256..119.256 rows=1 loops=1)
   ->  Seq Scan on ts  (cost=0.00..15937.00 rows=1 width=0) (actual time=0.172..119.062 rows=1000 loops=1)
         Filter: ((ts IS NOT NULL) AND (ts <= '2017-02-01 00:00:00+13'::timestamp with time zone))
         Rows Removed by Filter: 999000
 Planning time: 0.851 ms
 Execution time: 119.401 ms

A side effect of this is that we're now able to better detect impossible cases such as:

postgres=# explain analyze select count(*) from ts where ts <= '2017-02-01'::timestamptz and ts is null;

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=15937.00..15937.01 rows=1 width=8) (actual time=135.012..135.012 rows=1 loops=1)
   ->  Seq Scan on ts  (cost=0.00..15937.00 rows=1 width=0) (actual time=135.007..135.007 rows=0 loops=1)
         Filter: ((ts IS NULL) AND (ts <= '2017-02-01 00:00:00+13'::timestamp with time zone))
         Rows Removed by Filter: 1000000
 Planning time: 0.067 ms
 Execution time: 135.050 ms
(6 rows)

Master is not able to see the impossibility of this case:

postgres=# explain analyze select count(*) from ts where ts <= '2017-02-01'::timestamptz and ts is null;

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=15938.83..15938.84 rows=1 width=8) (actual time=131.681..131.681 rows=1 loops=1)
   ->  Seq Scan on ts  (cost=0.00..15937.00 rows=733 width=0) (actual time=131.676..131.676 rows=0 loops=1)
         Filter: ((ts IS NULL) AND (ts <= '2017-02-01 00:00:00+13'::timestamp with time zone))
         Rows Removed by Filter: 1000000
 Planning time: 0.090 ms
 Execution time: 131.719 ms
(6 rows)

Now one thing I was unsure of in the patch was this "Other clauses" concept
Re: [HACKERS] UPDATE of partition key
On 24 February 2017 at 07:02, Robert Haas wrote: > On Mon, Feb 20, 2017 at 2:58 PM, Amit Khandekar > wrote: >> I am inclined to at least have some option for the user to decide the >> behaviour. In the future we can even consider support for walking >> through the ctid chain across multiple relfilenodes. But till then, we >> need to decide what default behaviour to keep. My inclination is more >> towards erroring out in an unfortunate event where there is an UPDATE >> while the row-movement is happening. One option is to not get into >> finding whether the DELETE was part of partition row-movement or it >> was indeed a DELETE, and always error out the UPDATE when >> heap_update() returns HeapTupleUpdated, but only if the table is a >> leaf partition. But this obviously will cause annoyance because of >> chances of getting such errors when there are concurrent updates and >> deletes in the same partition. But we can keep a table-level option >> for determining whether to error out or silently lose the UPDATE. > > I'm still a fan of the "do nothing and just document that this is a > weirdness of partitioned tables" approach, because implementing > something will be complicated, will ensure that this misses this > release if not the next one, and may not be any better for users. But > probably we need to get some more opinions from other people, since I > can imagine people being pretty unhappy if the consensus happens to be > at odds with my own preferences. I'd give the view that we cannot silently ignore this issue, bearing in mind the point that we're expecting partitioned tables to behave exactly like normal tables. In my understanding the issue is that UPDATEs will fail to update a row when a valid row exists in the case where a row moved between partitions; that behaviour will be different to a standard table. It is of course very good that we have something ready for this release and can make a choice of what to do. Thoughts 1. 
Reuse the tuple state HEAP_MOVED_OFF which IIRC represents almost exactly the same thing. An UPDATE which gets to a HEAP_MOVED_OFF tuple will know to re-find the tuple via the partition metadata, or I might be persuaded that in-this-release it is acceptable to fail when this occurs with an ERROR and a retryable SQLCODE, since the UPDATE will succeed on next execution. 2. I know that DB2 handles this by having the user specify WITH ROW MOVEMENT to explicitly indicate they accept the issue and want update to work even with that. We could have an explicit option to allow that. This appears to be the only way we could avoid silent errors for foreign table partitions. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Feb 24, 2017 at 2:42 PM, Pavan Deolasee wrote:
> Let's take an example. Say, we have a table (a int, b int, c text) and two
> indexes on first two columns.
>
>          H                   W                   H
> (1, 100, 'foo') -> (1, 100, 'bar') --> (1, 200, 'bar') -> (1, 200, 'foo')
>
> The first update will be a HOT update, the second update will be a WARM
> update and the third update will again be a HOT update. The first and third
> update do not create any new index entry, though the second update will
> create a new index entry in the second index. Any further WARM updates to
> this chain are not allowed, but further HOT updates are ok.
>
> If all but the last version become DEAD, HOT-prune will remove all of them
> and turn the first line pointer into a REDIRECT line pointer.

So, when you do the WARM update, the new index entries still point at the original root, which they don't match, not the version where that new value first appeared? I don't immediately see how this will work with index-only scans. If the tuple is HOT updated several times, HOT-pruned back to a single version, and then the page is all-visible, the index entries are guaranteed to agree with the remaining tuple, so it's fine to believe the data in the index tuple. But with WARM, that would no longer be true, unless you have some trick for that... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
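As an editorial aside, the HOT/WARM classification in Pavan's quoted example can be replayed with a toy model (purely illustrative; names and the one-WARM-per-chain flag are simplifications, not the patch's actual data structures):

```python
# Toy model of update classification for a table (a, b, c) with indexes on
# a and b: no indexed column changed => HOT; an indexed column changed and
# the chain has not been WARM-updated yet => WARM.
INDEXED = {"a", "b"}

def classify(old, new, chain_already_warm):
    changed = {col for col in old if old[col] != new[col]}
    if not (changed & INDEXED):
        return "HOT"
    return "non-HOT" if chain_already_warm else "WARM"

# Pavan's chain: (1,100,'foo') -> (1,100,'bar') -> (1,200,'bar') -> (1,200,'foo')
versions = [
    {"a": 1, "b": 100, "c": "foo"},
    {"a": 1, "b": 100, "c": "bar"},
    {"a": 1, "b": 200, "c": "bar"},
    {"a": 1, "b": 200, "c": "foo"},
]
labels, warm_used = [], False
for old, new in zip(versions, versions[1:]):
    label = classify(old, new, warm_used)
    warm_used = warm_used or label == "WARM"
    labels.append(label)
```

Replaying the chain reproduces the H, W, H labels in the example: only the second update touches an indexed column (b), so only it needs a new index entry.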
Re: [HACKERS] Proposal : Parallel Merge Join
On Tue, Feb 14, 2017 at 5:22 PM, Dilip Kumar wrote: > On Tue, Feb 14, 2017 at 12:25 PM, Amit Kapila wrote: > Apart from this, there was one small problem in an earlier version > which I have corrected in this. > > + /* Consider only parallel safe inner path */ > + if (innerpath != NULL && > + innerpath->parallel_safe && > + (cheapest_total_inner == NULL || > + cheapest_total_inner->parallel_safe == false || > + compare_path_costs(innerpath, cheapest_total_inner, > + TOTAL_COST) < 0)) > > In this comparison, we were only checking if innerpath is cheaper than > the cheapest_total_inner then generate path with this new inner path > as well. Now, I have added one more check if cheapest_total_inner was > not parallel safe then also consider a path with this new inner > (provided this inner is parallel safe). 1. What advantage do you see for considering such a path when the cost of innerpath is more than cheapest_total_inner? Remember the more paths we try to consider, the more time we spend in the planner. By any chance are you able to generate any query where it will give benefit by considering costlier innerpath? 2. +static void generate_parallel_mergejoin_paths(PlannerInfo *root, + RelOptInfo *joinrel, + RelOptInfo *innerrel, + Path *outerpath, + JoinType jointype, + JoinPathExtraData *extra, + Path *inner_cheapest_total, + List *merge_pathkeys); It is better to name this function as generate_partial_mergejoin_paths() as we are generating only partial paths in this function and accordingly change the comment on top of the function. I see that you might be naming it based on consider_parallel_*, however, I think it is better to use partial in the name as that is what we are doing inside that function. 
Also, I think this function has removed/changed some handling related to unique outer and full joins, so it is better to mention that in the function comments, something like "unlike above function, this function doesn't expect to handle join types JOIN_UNIQUE_OUTER or JOIN_FULL" and add Assert for both of those types. 3. A test case is still missing. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Keep ECPG comment for log_min_duration_statement
Hi, Michael wrote: > The reason for not keeping the comments in the statement was simply to > make the parser simpler. If you added this feature, do we still see a > reason for keeping the old version? Or in other words, shouldn't we > make the "enable-parse-comment" version the default without a new > option? Thank you for your feedback! As I said in the first e-mail, there are some restrictions on comment position in my implementation. I am concerned that errors would occur in .pgc files users wrote with older versions, so this feature should be a new option. When the pre-compiler (ECPG) converts C with embedded SQL into normal C code, gram.y is used for syntactic analysis. I needed to change gram.y to handle comments in SQL statements, but I have not come up with a better idea that leaves gram.y unaffected. If you are interested in the details of my implementation, please check the attached [WIP] patch. I would appreciate any ideas or opinions. Regards, Okano Naoki Fujitsu Attachment: [WIP]enable-parse-comment.patch
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Feb 24, 2017 at 12:28 AM, Alvaro Herrera wrote: > Bruce Momjian wrote: > > On Thu, Feb 23, 2017 at 03:45:24PM -0300, Alvaro Herrera wrote: > > > > > and potentially trim the first HOT chain as those tuples become > > > > invisible. > > > > > > That can already happen even without WARM, no? > > > > Uh, the point is that with WARM those four early tuples can be removed > > via a prune, rather than requiring a VACUUM. Without WARM, the fourth > > tuple can't be removed until the index is cleared by VACUUM. > > I *think* that the WARM-updated one cannot be pruned either, because > it's pointed to by at least one index (otherwise it'd have been a HOT > update). The ones prior to that can be removed either way. No, even the WARM-updated one can be pruned, and if there are further HOT updates, those can be pruned too. All indexes, and even multiple pointers from the same index, always point to the root of the WARM chain, and that line pointer does not go away unless the entire chain becomes dead. The only material difference between HOT and WARM is that since there are two index pointers from the same index to the same root line pointer, we must do a recheck. But HOT-pruning and all such things remain the same. Let's take an example. Say, we have a table (a int, b int, c text) and two indexes on the first two columns.

         H                   W                   H
(1, 100, 'foo') -> (1, 100, 'bar') --> (1, 200, 'bar') -> (1, 200, 'foo')

The first update will be a HOT update, the second update will be a WARM update and the third update will again be a HOT update. The first and third update do not create any new index entry, though the second update will create a new index entry in the second index. Any further WARM updates to this chain are not allowed, but further HOT updates are ok. If all but the last version become DEAD, HOT-prune will remove all of them and turn the first line pointer into a REDIRECT line pointer. 
At this point, the first index has one index pointer and the second index has two index pointers, but all pointing to the same root line pointer, which has now become a REDIRECT line pointer:

Redirect o---> (1, 200, 'foo')

> I think the part you want (be able to prune the WARM updated tuple) is > part of what Pavan calls "turning the WARM chain into a HOT chain", so > not part of the initial patch. Pavan can explain this part better, and > also set me straight in case I'm wrong in the above :-) Umm.. it's a bit different. Without chain conversion, we still don't allow further WARM updates to the above chain because that might create a third index pointer and our recheck logic can't cope with duplicate scans. HOT updates are allowed though. The latest patch that I proposed will handle this case and convert such chains into regular HOT-pruned chains. To do that, we must remove the duplicate (and now wrong) index pointer to the chain. Once we do that and change the state on the heap tuple, we can once again do a WARM update to this tuple. Note that in this example the chain has just one tuple, which will be the case typically, but the algorithm can deal with the case where there are multiple tuples but with matching index keys. Hope this helps. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
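The prune-and-redirect step described above (all-but-last versions dead, chain collapsed, root turned into a redirect) can be modelled in a few lines. This is a toy, not the heap_page_prune logic, with invented field names:

```python
# Toy HOT-prune: dead leading versions of a chain are removed, and the root
# line pointer becomes a REDIRECT aimed at the surviving tuple.  The index
# pointers keep targeting the root, which is why a recheck is needed after
# a WARM update.
def hot_prune(chain):
    live = [t for t in chain if not t["dead"]]
    if not live:
        return {"root": "DEAD", "tuples": []}
    return {"root": ("REDIRECT", live[0]["tid"]), "tuples": live}

chain = [
    {"tid": 1, "tuple": (1, 100, "foo"), "dead": True},
    {"tid": 2, "tuple": (1, 100, "bar"), "dead": True},
    {"tid": 3, "tuple": (1, 200, "bar"), "dead": True},
    {"tid": 4, "tuple": (1, 200, "foo"), "dead": False},
]
pruned = hot_prune(chain)
```

Running it on the four-version example chain leaves only (1, 200, 'foo'), reachable through the redirect, matching the diagram above.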
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Fri, Feb 24, 2017 at 2:13 PM, Robert Haas wrote: > On Thu, Feb 23, 2017 at 9:21 PM, Bruce Momjian wrote: > > > > > I have what might be a stupid question. As I remember, WARM only allows > > a single index-column change in the chain. Why are you seeing such a > > large performance improvement? I would have thought it would be that > > high if we allowed an unlimited number of index changes in the chain. > I'm not sure how the test case is set up. If the table has multiple > indexes, each on a different column, and only one of the indexes is > updated, then you figure to win because now the other indexes need > less maintenance (and get less bloated). If you have only a single > index, then I don't see how WARM can be any better than HOT, but maybe > I just don't understand the situation. That's correct. If you have just one index and the UPDATE modifies the indexed column, the UPDATE won't be a WARM update and the patch gives you no benefit. OTOH if the UPDATE doesn't modify any indexed columns, then it will be a HOT update and again the patch gives you no benefit. It might be worthwhile to see if the patch causes any regression in these scenarios, though I think it will be minimal or zero. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Feb 23, 2017 at 11:53 PM, Bruce Momjian wrote: > On Thu, Feb 23, 2017 at 03:03:39PM -0300, Alvaro Herrera wrote: > > Bruce Momjian wrote: > > > > > As I remember, WARM only allows > > > a single index-column change in the chain. Why are you seeing such a > > > large performance improvement? I would have thought it would be that > > > high if we allowed an unlimited number of index changes in the chain. > > > > The second update in a chain creates another non-warm-updated tuple, so > > the third update can be a warm update again, and so on. > > Right, before this patch they would be two independent HOT chains. It > still seems like an unexpectedly-high performance win. Are two > independent HOT chains that much more expensive than joining them via > WARM? In these tests, there are zero HOT updates, since every update modifies some index column. With WARM, we could reduce regular updates by half, even when we allow only one WARM update per chain (the chain really has a single tuple for this discussion). IOW approximately half the updates insert a new index entry in *every* index and half insert a new index entry *only* in the affected index. That itself does a good bit for performance. So to answer your question: yes, joining two HOT chains via WARM is much cheaper because it results in creating new index entries just for affected indexes. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
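Pavan's "half the updates" arithmetic can be made concrete with a back-of-envelope model (an editorial illustration under stated assumptions: every update modifies exactly one of k indexed columns, and WARM and regular updates alternate within each chain):

```python
# Back-of-envelope count of new index entries created by N single-tuple
# updates on a table with k indexes, when every update changes an indexed
# column (zero HOT updates, as in the tests being discussed).
def index_entries(n_updates, k_indexes, warm=False):
    if not warm:
        # Without WARM, every update inserts a new entry into every index.
        return n_updates * k_indexes
    # With WARM, roughly every other update in a chain is WARM (one new
    # entry, in the affected index only); the rest are regular updates.
    warm_updates = n_updates // 2
    return warm_updates * 1 + (n_updates - warm_updates) * k_indexes

N, K = 1_000_000, 2
```

With two indexes this model cuts index insertions from 2M to 1.5M for a million updates, and the saving grows with the number of indexes, which is consistent with the direction (though not the exact magnitude) of the reported benchmark win.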
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Feb 23, 2017 at 9:21 PM, Bruce Momjian wrote: > On Tue, Jan 31, 2017 at 04:52:39PM +0530, Pavan Deolasee wrote: >> The other critical bug I found, which unfortunately exists in the master too, >> is the index corruption during CIC. The patch includes the same fix that I've >> proposed on the other thread. With these changes, WARM stress is running fine >> for last 24 hours on a decently powerful box. Multiple CREATE/DROP INDEX >> cycles >> and updates via different indexed columns, with a mix of FOR SHARE/UPDATE and >> rollbacks did not produce any consistency issues. A side note: while >> performance measurement wasn't a goal of stress tests, WARM has done about >> 67% >> more transaction than master in 24 hour period (95M in master vs 156M in WARM >> to be precise on a 30GB table including indexes). I believe the numbers would >> be far better had the test not dropping and recreating the indexes, thus >> effectively cleaning up all index bloats. Also the table is small enough to >> fit >> in the shared buffers. I'll rerun these tests with much larger scale factor >> and >> without dropping indexes. > > Thanks for setting up the test harness. I know it is hard but > in this case it has found an existing bug and given good performance > numbers. :-) > > I have what might be a stupid question. As I remember, WARM only allows > a single index-column change in the chain. Why are you seeing such a > large performance improvement? I would have thought it would be that > high if we allowed an unlimited number of index changes in the chain. I'm not sure how the test case is set up. If the table has multiple indexes, each on a different column, and only one of the indexes is updated, then you figure to win because now the other indexes need less maintenance (and get less bloated). If you have only a single index, then I don't see how WARM can be any better than HOT, but maybe I just don't understand the situation. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Thu, Feb 23, 2017 at 11:30 PM, Bruce Momjian wrote: > On Wed, Feb 1, 2017 at 10:46:45AM +0530, Pavan Deolasee wrote: > > > contains a WARM tuple. Alternate ideas/suggestions and review of > the > > design > > > are welcome! > > > > t_infomask2 contains one last unused bit, > > > > > > Umm, WARM is using 2 unused bits from t_infomask2. You mean there is > another > > free bit after that too? > > We are obviously going to use several heap or item pointer bits for > WARM, and once we do that it is going to be hard to undo that. Pavan, > are you saying you could do more with WARM if you had more bits? Are we > sure we have given you all the bits we can? Do we want to commit to a > lesser feature because the bits are not available? The btree implementation is as complete as I would like (there are a few TODOs, but no show stoppers), at least for the first release. There is a free bit in the btree index tuple header that I could use for chain conversion. In the heap tuples, I can reuse HEAP_MOVED_OFF because that bit will only be set along with the HEAP_WARM_TUPLE bit. Since none of the upgraded clusters can have the HEAP_WARM_TUPLE bit set, I think we are safe. WARM currently also supports hash indexes, but there is no free bit left in the hash index tuple header. I think I can work around that by using a bit from ip_posid (not yet implemented/tested, but seems doable). IMHO if we can do that, i.e. support btree and hash indexes to start with, we should be good to go for the first release. We can try to support other popular index AMs in the subsequent release. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
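The ip_posid idea relies on the offset-number field being 16 bits wide while legal offsets never need the high bits, leaving them free for flags. A hypothetical sketch of the mechanics (the flag name and bit position are invented here, not the patch's actual layout):

```python
# Hypothetical illustration of overloading ip_posid bits: steal the top bit
# of a 16-bit offset number as a per-index-tuple flag, since valid offsets
# on a page stay far below 2**15.
FLAG_WARM = 1 << 15          # invented flag bit, for illustration only
POSID_MASK = FLAG_WARM - 1   # low 15 bits hold the real offset

def set_warm(posid):
    return posid | FLAG_WARM

def is_warm(posid):
    return bool(posid & FLAG_WARM)

def raw_posid(posid):
    return posid & POSID_MASK

flagged = set_warm(42)
```

The key invariant is that masking recovers the original offset exactly, so readers unaware of the flag just need to apply the mask before using the value.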
Re: [HACKERS] Add checklist item for psql completion to commitfest review
On Fri, Feb 24, 2017 at 2:28 AM, Jim Nasby wrote: > I've never messed with completion so I don't know how hard it is, but my > impression is that it gets added after the fact not because of any > intentional decisions but because people simply forget about it. ISTM it > would be more efficient of community resources to deal with completion in > the original patch, unless there's some reason not to. > > IOW, no, don't make it a hard requirement, but don't omit it simply through > forgetfulness. Because our biggest problem around here is that it's too easy to get patches committed, so we should add some more requirements? I think it's great to include tab completion support in initial patches if people are willing to do that, but I'm not prepared to insist on it. Anyway, it's not just a yes-or-no thing; it's pretty common that people go back later and add more tab completion to things that already have some tab completion. So if we adopt the rule you are proposing, then the next question will be whether a given patch has ENOUGH tab completion. Ugh. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] UPDATE of partition key
On Fri, Feb 24, 2017 at 1:18 PM, David G. Johnston wrote: > For my own sanity - the move update would complete successfully and break > every ctid chain that it touches. Any update lined up behind it in the lock > queue would discover their target record has been deleted and would > experience whatever behavior their isolation level dictates for such a > situation. So multi-partition update queries will fail to update some > records if they happen to move between partitions even if they would > otherwise match the query's predicate. Right. That's the behavior for which I am advocating, on the grounds that it's the simplest to implement and if we all agree on something else more complicated later, we can do it then. > Is there any difference in behavior between this and a SQL writeable CTE > performing the same thing via delete-returning-insert? Not to my knowledge. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] bytea_output output of base64
It undoubtedly would make pg_dump smaller, though I'm not sure how much that's worth since if you care at all about that you'll gzip it. But, the other thing it might do is speed up COPY, especially on input. Some performance tests of that might be interesting. For what it is worth: Ascii85 (aka Base85) is used in PDF by Adobe, with a prefix "<~" and a suffix "~>". It codes 4 bytes as 5 ASCII characters, i.e. a 25% size overhead. There is also Z85, which avoids some special characters: backslash, single quote, double quote. -- Fabien.
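As a side note for readers, Python's standard library happens to implement both flavours Fabien mentions, which makes the 25% figure easy to verify:

```python
import base64

# 32 input bytes (a multiple of 4, so no padding considerations).
raw = bytes(range(32))

b85 = base64.b85encode(raw)              # RFC 1924-style Base85 alphabet
a85 = base64.a85encode(raw, adobe=True)  # Adobe Ascii85, wrapped in <~ ... ~>

# 4 bytes -> 5 characters: 25% size overhead, vs ~33% for base64.
```

Both encodings round-trip losslessly; only the alphabet and framing differ.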