RE: Recovery performance of standby for multiple concurrent truncates on large tables
From: 'Andres Freund' [mailto:and...@anarazel.de]
> I'm continuing to work on it, but unfortunately there's a couple
> projects that have higher priority atm :(. I'm doubtful I can have a
> patchset in a committable shape for v12, but I'm pretty sure I'll have
> it in a shape good enough to make progress towards v13. Sorry :(

We'd like to do it for PG 12, if Kirk's current proposal doesn't find a
good landing point. Could you tell us roughly how many lines of code you
expect, and which parts would be difficult? Is there any software you're
referring to (e.g. Linux's fsync, MySQL, etc.)?

Regards
Takayuki Tsunakawa
RE: Recovery performance of standby for multiple concurrent truncates on large tables
From: Robert Haas [mailto:robertmh...@gmail.com]
> It's not clear to me whether it would be worth the overhead of doing
> something like this.

Quite frankly, it's not really clear to me, either.

> Making relation drops faster at the cost of
> making buffer cleaning slower could be a loser.

The purpose is not to make relation drops faster (on the primary), but
to keep the failover time within 10 seconds. I don't really know how
crucial that requirement is, but I feel it would be good for PostgreSQL
to be able to guarantee a shorter failover time.

Regards
Takayuki Tsunakawa
Re: Temporary tables prevent autovacuum, leading to XID wraparound
At Mon, 30 Jul 2018 16:59:16 +0900, Michael Paquier wrote in <20180730075916.gb2...@paquier.xyz>
> On Fri, Jul 27, 2018 at 08:27:26AM +, Tsunakawa, Takayuki wrote:
> > I don't have a strong opinion, but I wonder which of namespace.c or
> > autovacuum.c is suitable, because isTempNamespaceInUse is specific to
> > autovacuum.
>
> I think that there is also a point in allowing other backends to use it
> as well, so I left it in namespace.c. I have been doing more testing
> with this patch today. In order to catch code paths where this is

+1 for namespace.c from me.

> triggered I have for example added an infinite loop in do_autovacuum
> when finding a temp table which exits once a given on-disk file is
> found. This lets plenty of time to attach autovacuum to a debugger, and
> play with other sessions in parallel, so I have checked the
> transactional assumption this patch relied on, and tested down to v10 as
> that's where removal of orphaned temp relations has been made more
> aggressive. This patch can also be applied cleanly there.
>
> The check on MyDatabaseId does not actually matter much as the same
> temporary namespace OID would get reused only after an OID wraparound
> with a backend using a different database, but I left it anyway as it's
> more consistent to me to check for both database and namespace, and
> added a sanity check to make sure that the routine gets called only by
> a process connected to a database.
>
> I am going to be out of town for a couple of days, and the next minor
> release planned is close by, so I am not pushing that today, but I'd
> like to do so after the next minor release if there are no objections.
> Attached is a patch with a proposal of commit message.

"int backendID" is left in autovacuum.c as an unused variable.

+ * decide if a temporary file entry can be marked as orphaned or not. Even
+ * if this is an atomic operation, this can be safely done lock-less as no

"Even if this is *not* an atomic operation" ?

+ * Mark the proc entry as owning this namespace which autovacuum uses to
+ * decide if a temporary file entry can be marked as orphaned or not. Even
+ * if this is an atomic operation, this can be safely done lock-less as no
+ * temporary relations associated to it can be seen yet by autovacuum when
+ * scanning pg_class.
+ */
+	MyProc->tempNamespaceId = namespaceId;

The comment looks wrong. After a crash with temp tables, pg_class
contains orphaned temp relations, and the namespace can belong to
reconnected backends, especially low-numbered ones. So autovacuum may
read the shared variable while the reconnected backend is writing it.
The only harmful misbehavior is a false "unused" judgement. That
happens only when the same namespace ID is being reused, but in that
case the temp namespace has already been cleaned up by
InitTempTableNamespace. In the case of a false "used" judgement, later
autovacuum rounds will do the cleanup correctly. In short, I think this
can safely be done lock-less, but the reason is not that there are no
concurrent reads, but that transient reads do no harm (currently..).

| * Mark the proc entry as owning this namespace which autovacuum uses to
| * decide if a temporary file entry can be marked as orphaned or not. Even
| * if this is not an atomic operation, this can be safely done lock-less as
| * a transient read doesn't let autovacuum go wrong.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
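The "transient reads do no harm" property can be illustrated with a tiny standalone C model (this is not PostgreSQL code; the array, helper names, and OID values below are invented for the sketch). A stale read only shifts the answer between "in use" and "unused", and both outcomes are recoverable: a false "in use" merely defers cleanup to a later autovacuum round, and the false "unused" case is safe for the reason discussed above.

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-ins; the real code reads MyProc->tempNamespaceId. */
typedef unsigned int Oid;
#define INVALID_OID  0
#define MAX_BACKENDS 4

/* One slot per backend, modeling each PGPROC's advertised temp namespace. */
static Oid proc_temp_namespace[MAX_BACKENDS];

/* Backend side: advertise (or clear) ownership of a temp namespace. */
static void
set_temp_namespace(int backend, Oid nsp)
{
	proc_temp_namespace[backend] = nsp;
}

/*
 * Autovacuum side: lock-less check.  If the value read is transient or
 * stale, the worst benign outcome is a deferred cleanup; it never makes
 * autovacuum drop a namespace that a backend is actively using once the
 * writer's store is visible.
 */
static bool
temp_namespace_in_use(Oid nsp)
{
	for (int i = 0; i < MAX_BACKENDS; i++)
		if (proc_temp_namespace[i] == nsp)
			return true;
	return false;
}
```

The single-threaded model obviously cannot show the race itself; it only captures the decision logic whose transient outcomes the argument above claims are harmless.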
Re: Reduce amount of WAL generated by CREATE INDEX for gist, gin and sp-gist
With the consent of Anastasia I will be improving this patch further.
The attachment contains the next version of the patch set.

11.07.2018 00:03, Heikki Linnakangas wrote:
> On 28/02/18 18:03, Anastasia Lubennikova wrote:
>> Implementation is based on generic_xlog.
>
> Why? I think we should just add a log_relation() function in
> xloginsert.c directly, alongside log_newpage_buffer().

I have some arguments for keeping this functionality in the
generic_xlog module:
1. The xloginsert.c functions work at a low level of abstraction, using
buffers and pages.
2. The code using the generic_xlog service functions looks more compact
and safe.

> This makes the assumption that all the pages in these indexes used the
> standard page layout. I think that's a valid assumption, but needs at
> least a comment on that. And perhaps an Assert, to check that
> pd_lower/upper look sane.

Done.

> As a further optimization, would it be a win to WAL-log multiple pages
> in each record?

In this version of the patch we use a simple optimization: pack
XLR_NORMAL_MAX_BLOCK_ID block pieces into each WAL record.

> This leaves the XLOG_*_CREATE_INDEX WAL record types unused, BTW.

Done.

> - Heikki

Benchmarks:
---
Test: pgbench -f gin-WAL-test.sql -t 5:
---
master:
Latency average: 27696.299 ms
WAL size: 2.66 GB
patched:
Latency average: 22812.103 ms
WAL size: 1.23 GB

Test: pgbench -f gist-WAL-test.sql -t 5:
master:
Latency average: 19928.284 ms
WAL size: 1.25 GB
patched:
Latency average: 18175.064 ms
WAL size: 0.63 GB

Test: pgbench -f spgist-WAL-test.sql -t 5:
--
master:
Latency average: 11529.384 ms
WAL size: 1.07 GB
patched:
Latency average: 9330.828 ms
WAL size: 0.6 GB

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

>From db400ce9532536da36812dbf0456e756a0ea4724 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov"
Date: Tue, 31 Jul 2018 07:22:17 +0500
Subject: [PATCH 1/4] Relation-into-WAL-function

---
 src/backend/access/transam/generic_xlog.c | 62 +++
 src/include/access/generic_xlog.h         |  3 ++
 2 files changed, 65 insertions(+)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index ce023548ae..8397b58ee7 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -80,6 +80,7 @@ static void computeRegionDelta(PageData *pageData,
 static void computeDelta(PageData *pageData, Page curpage, Page targetpage);
 static void applyPageRedo(Page page, const char *delta, Size deltaSize);
+static void standard_page_layout_check(Buffer buf);

 /*
  * Write next fragment into pageData's delta.
@@ -545,3 +546,64 @@ generic_mask(char *page, BlockNumber blkno)

 	mask_unused_space(page);
 }
+
+/*
+ * Check page layout.
+ * Caller must lock the buffer
+ */
+static void
+standard_page_layout_check(Buffer buf)
+{
+	PageHeader	ph = (PageHeader) BufferGetPage(buf);
+
+	Assert((ph->pd_lower >= SizeOfPageHeaderData) &&
+		   (ph->pd_lower <= ph->pd_upper) &&
+		   (ph->pd_upper <= ph->pd_special) &&
+		   (ph->pd_special <= BLCKSZ) &&
+		   (ph->pd_special == MAXALIGN(ph->pd_special)));
+}
+
+/*
+ * Function to write generic xlog for every existing block of a relation.
+ * Caller is responsible for locking the relation exclusively.
+ */
+void
+generic_log_relation(Relation rel)
+{
+	BlockNumber	blkno;
+	BlockNumber	nblocks;
+
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	elog(DEBUG2, "generic_log_relation '%s', nblocks %u BEGIN.",
+		 RelationGetRelationName(rel), nblocks);
+
+	for (blkno = 0; blkno < nblocks; )
+	{
+		GenericXLogState *state;
+		Buffer		buffer[MAX_GENERIC_XLOG_PAGES];
+		int			counter,
+					blocks_pack;
+
+		CHECK_FOR_INTERRUPTS();
+
+		blocks_pack = ((nblocks - blkno) < MAX_GENERIC_XLOG_PAGES) ?
+					  (nblocks - blkno) : MAX_GENERIC_XLOG_PAGES;
+
+		state = GenericXLogStart(rel);
+
+		for (counter = 0; counter < blocks_pack; counter++)
+		{
+			buffer[counter] = ReadBuffer(rel, blkno++);
+			standard_page_layout_check(buffer[counter]);
+			LockBuffer(buffer[counter], BUFFER_LOCK_EXCLUSIVE);
+			GenericXLogRegisterBuffer(state, buffer[counter], GENERIC_XLOG_FULL_IMAGE);
+		}
+
+		GenericXLogFinish(state);
+
+		for (counter = 0; counter < blocks_pack; counter++)
+			UnlockReleaseBuffer(buffer[counter]);
+	}
+	elog(DEBUG2, "generic_log_relation '%s' END.", RelationGetRelationName(rel));
+}
diff --git a/src/include/access/generic_xlog.h b/src/include/access/generic_xlog.h
index b23e1f684b..1f4b3b7030 100644
--- a/src/include/access/generic_xlog.h
+++ b/src/include/access/generic_xlog.h
@@ -42,4 +42,7 @@ extern const char *generic_identify(uint8 info);
 extern void generic_desc(StringInfo buf, XLogReaderState *record);
 extern void generic_mask(char *pagedata, BlockNumber blkno);

+/* other utils */
+extern void generic_log_relation(Relation rel);
+
 #endif
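The batching arithmetic in generic_log_relation (pack at most a fixed number of blocks per WAL record, with a smaller final batch) can be checked in isolation with a small standalone model. The helper names below are invented for this sketch, and the constant value 4 stands in for the real MAX_GENERIC_XLOG_PAGES:

```c
#include <assert.h>

#define MAX_GENERIC_XLOG_PAGES 4	/* stand-in value for this sketch */

/*
 * How many blocks the next record should cover, given the current
 * position and total block count: the remainder of the relation,
 * capped at MAX_GENERIC_XLOG_PAGES (same ternary as in the patch).
 */
static int
blocks_pack(unsigned int blkno, unsigned int nblocks)
{
	unsigned int remaining = nblocks - blkno;

	return (remaining < MAX_GENERIC_XLOG_PAGES) ?
		(int) remaining : MAX_GENERIC_XLOG_PAGES;
}

/* Count how many WAL records a relation of nblocks blocks needs. */
static int
records_needed(unsigned int nblocks)
{
	int			records = 0;

	for (unsigned int blkno = 0; blkno < nblocks; )
	{
		int			pack = blocks_pack(blkno, nblocks);

		/* every batch is non-empty and within the cap */
		assert(pack > 0 && pack <= MAX_GENERIC_XLOG_PAGES);
		blkno += pack;
		records++;
	}
	return records;
}
```

With the cap at 4, a 10-block relation is logged as batches of 4, 4, and 2, i.e. three records.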
Re: Explain buffers wrong counter with parallel plans
On Tue, Jul 31, 2018 at 12:58 AM, Andres Freund wrote:
> Hi,
>
> I'm not an expert in the area of the code, but here's a review anyway. I
> did not read through the entire thread.
>
> I think we should try to get this fixed soon, to make some progress
> towards release-ability. Or just declare it to be entirely unrelated to
> the release, and remove it from the open items list; not an unreasonable
> choice either. This has been an open item for quite a while...

I am okay either way, but I can work on getting it committed for this
release.

>> diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
>> index 60aaa822b7e..ac22bedf0e2 100644
>> --- a/src/backend/executor/execParallel.c
>> +++ b/src/backend/executor/execParallel.c
>> @@ -979,9 +979,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
>> 	/* Report workers' query for monitoring purposes */
>> 	pgstat_report_activity(STATE_RUNNING, debug_query_string);
>>
>> -	/* Prepare to track buffer usage during query execution. */
>> -	InstrStartParallelQuery();
>> -
>> 	/* Attach to the dynamic shared memory area. */
>> 	area_space = shm_toc_lookup(toc, PARALLEL_KEY_DSA, false);
>> 	area = dsa_attach_in_place(area_space, seg);
>> @@ -993,6 +990,9 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
>> 	queryDesc->planstate->state->es_query_dsa = area;
>> 	ExecParallelInitializeWorker(queryDesc->planstate, toc);
>>
>> +	/* Prepare to track buffer usage during query execution. */
>> +	InstrStartParallelQuery();
>> +
>> 	/* Run the plan */
>> 	ExecutorRun(queryDesc, ForwardScanDirection, 0L, true);
>
> Isn't this an independent change? And one with potentially negative
> side effects? I think there's some arguments for changing this (and some
> against),

This is to ensure that buffer usage stats are tracked consistently in
the master and worker backends. In the master backend, we don't track
similar stats for the ExecutorStart phase. Also, it would help us give
the same buffer usage stats for the parallel and non-parallel versions
of the query.

> but I think it's a bad idea to do so in the same patch.

Agreed, we can do this part separately.

>> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
>> index 36d2914249c..a0d49ec0fba 100644
>> --- a/src/backend/executor/execProcnode.c
>> +++ b/src/backend/executor/execProcnode.c
>> @@ -737,6 +737,13 @@ ExecShutdownNode(PlanState *node)
>>
>> 	planstate_tree_walker(node, ExecShutdownNode, NULL);
>>
>> +	/*
>> +	 * Allow instrumentation to count stats collected during shutdown for
>> +	 * nodes that are executed atleast once.
>> +	 */
>> +	if (node->instrument && node->instrument->running)
>> +		InstrStartNode(node->instrument);
>> +
>> 	switch (nodeTag(node))
>> 	{
>> 		case T_GatherState:
>> @@ -755,5 +762,12 @@ ExecShutdownNode(PlanState *node)
>> 			break;
>> 	}
>>
>> +	/*
>> +	 * Allow instrumentation to count stats collected during shutdown for
>> +	 * nodes that are executed atleast once.
>> +	 */
>> +	if (node->instrument && node->instrument->running)
>> +		InstrStopNode(node->instrument, 0);
>> +
>> 	return false;
>> }
>
> Duplicating the comment doesn't seem like a good idea. Just say
> something like "/* see comment for InstrStartNode above */".

Okay, will change in the next version.

>> diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
>> index ac5a2ff0e60..cf1851e235f 100644
>> --- a/src/backend/executor/nodeLimit.c
>> +++ b/src/backend/executor/nodeLimit.c
>> @@ -134,6 +134,8 @@ ExecLimit(PlanState *pstate)
>> 				node->position - node->offset >= node->count)
>> 			{
>> 				node->lstate = LIMIT_WINDOWEND;
>> +				/* Allow nodes to release or shut down resources. */
>> +				(void) ExecShutdownNode(outerPlan);
>> 				return NULL;
>> 			}
>
> Um, is this actually correct? What if somebody rewinds afterwards?
> That won't happen for parallel query currently, but architecturally I
> don't think we can do so unconditionally?

Currently, this is done for forward scans (basically under
ScanDirectionIsForward); is that enough, and if not, what kind of
additional check do you think we need here? I am not sure at this stage
how we will design it for backward scans, but I think we already do this
at other places in the code, see the below code in ExecutePlan.

ExecutePlan()
{
..
	if (numberTuples && numberTuples == current_tuple_count)
	{
		/* Allow nodes to release or shut down resources. */
		(void) ExecShutdownNode(planstate);
		break;
	}
..
}

--
With Regards,
Amit Kapila.
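The effect of the `running` guard in the execProcnode.c hunks can be modeled with a small standalone simplification of the Instrumentation API (all types and names below are invented; this is not the executor code). A node that never executed keeps zeroed counters through shutdown, while a node that ran at least once also accumulates shutdown-time stats:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented, drastically simplified stand-in for struct Instrumentation. */
typedef struct
{
	bool		running;	/* has the node been started at least once? */
	double		ntuples;	/* accumulated counter of interest */
} ModelInstr;

static void
model_instr_start(ModelInstr *instr)
{
	instr->running = true;
}

static void
model_instr_stop(ModelInstr *instr, double ntuples)
{
	instr->ntuples += ntuples;
}

/*
 * Shutdown-time accounting, mirroring the patch's shape: wrap the
 * shutdown work in start/stop only when the node has already run, so
 * never-executed nodes keep zeroed counters.
 */
static void
model_shutdown_node(ModelInstr *instr, double stats_during_shutdown)
{
	if (instr && instr->running)
	{
		model_instr_start(instr);
		/* ... node-specific shutdown work would happen here ... */
		model_instr_stop(instr, stats_during_shutdown);
	}
}
```

The real InstrStartNode/InstrStopNode also handle timing and buffer usage deltas; the model only keeps the one counter needed to show the guard's behavior.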
pg_default_acl missing 'n' case in doc
While looking at ACL prettyprinting, I noticed that the
"pg_default_acl" documentation does not say anything about type 'n' for
schema (namespace), which seems to be supported according to "\ddp" and
the catalog code. Here is a small addition documenting that 'n' is
allowed for schemas.

--
Fabien.

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index fffb79f713..3bb48d4ccf 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2850,7 +2850,8 @@ SCRAM-SHA-256$iteration count:
        r = relation (table, view),
        S = sequence,
        f = function,
-       T = type
+       T = type,
+       n = schema
Re: partition tree inspection functions
Hi Jesper,

On 2018/07/30 22:21, Jesper Pedersen wrote:
> On 07/30/2018 05:21 AM, Amit Langote wrote:
>>> As 0 is a valid return value for root nodes I think we should use -1
>>> instead for these cases.
>>
>> Makes sense, changed to be that way.
>
> Looks good, the documentation for pg_partition_level could be expanded
> to describe the -1 scenario.

Done.

> New status: Ready for Committer

Thanks. Updated patch attached.

Regards,
Amit

>From 71ab5dbcbc1985a3b6450c2bce41639e99f232da Mon Sep 17 00:00:00 2001
From: amit
Date: Tue, 16 Jan 2018 19:02:13 +0900
Subject: [PATCH v8] Add assorted partition reporting functions

---
 doc/src/sgml/func.sgml                       |  89 ++
 src/backend/catalog/partition.c              | 244 ++-
 src/backend/utils/cache/lsyscache.c          |  22 +++
 src/include/catalog/pg_proc.dat              |  48 ++
 src/include/utils/lsyscache.h                |   1 +
 src/test/regress/expected/partition_info.out | 204 ++
 src/test/regress/parallel_schedule           |   2 +-
 src/test/regress/serial_schedule             |   1 +
 src/test/regress/sql/partition_info.sql      |  91 ++
 9 files changed, 697 insertions(+), 5 deletions(-)
 create mode 100644 src/test/regress/expected/partition_info.out
 create mode 100644 src/test/regress/sql/partition_info.sql

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index edc9be92a6..326f4dbe58 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19995,6 +19995,95 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
     The function returns the number of new collation objects it created.

+   Partitioning Information Functions
+
+   Name / Return Type / Description
+
+   pg_partition_parent(regclass) / regclass /
+     get parent if table is a partition, NULL otherwise
+   pg_partition_root_parent(regclass) / regclass /
+     get topmost parent of a partition within partition tree
+   pg_partition_level(regclass, regclass) / regclass /
+     get level of a partition within partition tree with respect to
+     given parent
+   pg_partition_level(regclass) / regclass /
+     get level of a partition within partition tree with respect to
+     topmost root parent
+   pg_partition_children(regclass, bool) / setof regclass /
+     get partitions of a table; only immediate partitions are returned,
+     unless all tables in the partition tree, including itself and
+     partitions of lower levels, are requested by passing true for
+     second argument
+   pg_partition_children(regclass) / setof regclass /
+     shorthand for pg_partition_children(..., false)
+   pg_partition_leaf_children(regclass) / setof regclass /
+     get all leaf partitions of a given table
+
+   If the table passed to pg_partition_root_parent is not a partition,
+   it is returned as its own root parent. If the table passed for the
+   second argument of pg_partition_level is not the same as the first
+   argument or not an ancestor of the table in the first argument, -1
+   is returned as the result.
+
+   To check the total size of the data contained in the measurement
+   table described in , use the following query:
+
+select pg_size_pretty(sum(pg_relation_size(p))) as total_size from pg_partition_children('measurement', true) p;
+ total_size
+
+ 24 kB
+(1 row)
+
+   One could have used pg_partition_leaf_children in this case and got
+   the same result, as shown below:
+
+select pg_size_pretty(sum(pg_relation_size(p))) as total_size from pg_partition_leaf_children('measurement') p;
+ total_size
+
+ 24 kB
+(1 row)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 558022647c..d90aaf8bb8 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -23,13 +23,16 @@
 #include "catalog/partition.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_partitioned_table.h"
+#include "funcapi.h"
 #include "nodes/makefuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/prep.h"
 #include "optimizer/var.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
+#include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
 #include "utils/partcache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
@@ -38,16 +41,14 @@ static Oid get_partition_parent_worker(Relation inhRel, Oid relid);
 static void get_partition_ancestors_worker(Relation inhRel, Oid relid, List **ancestors);
+static Oid get_partition_root_parent(Oid relid);
Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.
(2018/07/31 4:06), Andres Freund wrote:
> On 2018-07-20 08:38:09 -0400, Robert Haas wrote:
>> I'm going to study this some more now, but I really think this is
>> going in the wrong direction.
>
> We're going to have to get somewhere on this topic soon. This thread
> has been open for nearly half a year, and we're late in the beta phase
> now. What can we do to get to an agreement soon?

I'd like to propose an updated patch as proposed in [1].

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/5B59BA55.30200%40lab.ntt.co.jp
Re: [HACKERS] Parallel Append implementation
On Tue, Jul 31, 2018 at 5:05 AM, Robert Haas wrote:
> New version attached.

Looks good to me.

--
Thomas Munro
http://www.enterprisedb.com
Re: Making "COPY partitioned_table FROM" faster
On 31 July 2018 at 06:21, Peter Eisentraut wrote:
> On 30/07/2018 15:26, David Rowley wrote:
>>> - Add some tests. The if (nBufferedTuples > 0) that flushes the
>>> tuples when the partition changes is not currently exercised.
>>
>> That seems like a good idea. In fact, it uncovered a bug around
>> ConvertPartitionTupleSlot() freeing the previously stored tuple out
>> of the slot, which resulted in a crash. I didn't notice before
>> because my test had previously not required any tuple conversions.
>
> I think we need to think of a better place to put that temporary file,
> and clean it up properly afterwards. I'm not sure whether we have
> existing uses like that.

Ideally, I could have written the test in copy2.sql, but since I had to
insert over 1000 rows to trigger the multi-insert code, copy from stdin
was not practical in the .sql file. Instead, I just followed the lead
of copy.source and used the same temp file style as the previous test.
It also leaves that file lying around, it just happens to be smaller.
I've updated the test to truncate the table again and perform another
COPY TO to empty the file. I didn't see any way to remove the file, due
to the lack of a standard way of removing files in the OS. \! rm ... is
not going to work on Windows, for example.

I also failed to realise previously how the .source files work. I see
there are input and output directories. I'd missed updating the output
one in the v5 delta patch.

> Also, maybe the test should check afterwards that the right count of
> rows ended up in each partition?

Agreed. I've added that.

The attached v6 delta replaces the v5 delta and should be applied on
top of the full v4 patch. I'm using deltas in the hope that it's easier
to review the few changes than to review the entire patch again.

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

v6_multiinsert_copy_delta.patch
Description: Binary data
Re: make installcheck-world in a clean environment
I wrote:
> The original complaint about ecpg remains; I'm not terribly excited
> about messing with that.

Well, I got curious as to why we were seeing such a weird error, and
eventually traced it to this stuff in ecpg/test/Makefile.regress:

# Standard way to invoke the ecpg preprocessor
ECPG = ../../preproc/ecpg --regression -I$(srcdir)/../../include -I$(srcdir)

# Files that most or all ecpg preprocessor test outputs depend on
ECPG_TEST_DEPENDENCIES = ../../preproc/ecpg$(X) \
	$(srcdir)/../regression.h \
	$(srcdir)/../../include/sqlca.h \
	$(srcdir)/../../include/sqlda.h \
	$(srcdir)/../../include/sqltypes.h \
	$(srcdir)/../../include/sql3types.h

%.c: %.pgc $(ECPG_TEST_DEPENDENCIES)
	$(ECPG) -o $@ $<

Now, the fine gmake manual quoth:

    `%' in a prerequisite of a pattern rule stands for the same stem
    that was matched by the `%' in the target. In order for the
    pattern rule to apply, its target pattern must match the file name
    under consideration and all of its prerequisites (after pattern
    substitution) must name files that exist or can be made.

So the problem is that (after make clean) ../../preproc/ecpg doesn't
exist, and the Makefiles under test/ have no idea how to build it, and
thus this pattern rule is inapplicable. Thus you end up getting "No
rule to make target" errors.

While it seems straightforward to fix this for "make check" scenarios
--- just go make ecpg before trying to make the tests --- I think these
rules are very surprising for "make installcheck" cases. You'd expect
"make installcheck" to test the installed ecpg, as indeed Alexander
pointed out at the start of the thread. But it's not doing that.
Alexander's USE_INSTALLED_ASSETS patch attempted to fix that, and I now
see the point of it, but it still seems pretty hacky and invasive (why
should an ecpg-only problem touch stuff outside ecpg?). Also, unless
I'm still missing something, it doesn't fix the problem with "make
clean check": ecpg won't have been built before we try to build the
test programs.

I'd suggest trying to simplify the USE_INSTALLED_ASSETS patch so it
doesn't muck with the rules for building pg_regress. I don't find that
to be very relevant to the problem.

regards, tom lane
Re: Recovery performance of standby for multiple concurrent truncates on large tables
On Tue, Jul 31, 2018 at 8:01 AM, Robert Haas wrote:
> On Mon, Jul 30, 2018 at 1:22 AM, Jamison, Kirk wrote:
>> 1. Because the multiple scans of the whole shared buffer per
>> concurrent truncate/drop table were the cause of the time-consuming
>> behavior, DURING the failover process while WAL is being applied, we
>> temporarily delay the scanning and invalidating of shared buffers. At
>> the same time, we remember the relations/relfilenodes (of
>> dropped/truncated tables) by adding them to a hash table called the
>> "skip list".
>> 2. After WAL is applied, the checkpointer (or bgwriter) scans the
>> shared buffer only ONCE, compares the pages against the skip list,
>> and invalidates the relevant pages. After deleting the relevant pages
>> in shared memory, they will not be written back to disk.
>>
>> Assuming the theory works, this design will only affect the behavior
>> of the checkpointer (or maybe bgwriter) during the recovery process /
>> failover. Any feedback, thoughts?
>
> How would this work if a relfilenode number that belonged to an old
> relation got recycled for a new relation?
>
> I think something like this could be made to work -- both on the
> master and the standby, and not just while waiting for a failover --
> if we did something like this:
>
> (1) Limit the number of deferred drops to a reasonably small number
> (one cache line? 1kB?).
> (2) Background writer regularly does scans to clean out everything
> from the deferred-drop list.
> (3) If a foreground process finds the deferred-drop list is full when
> it needs to add to it, it forces a clean-out of the list contents +
> whatever new stuff it has in a single pass.
> (4) If we are about to generate a relfilenode that's in the list, we
> either force a clean-out or generate a different relfilenode instead.
> (5) Every buffer cleaning operation checks the deferred-drop list and
> invalidates without write-out if the buffer is found.
>
> It's not clear to me whether it would be worth the overhead of doing
> something like this. Making relation drops faster at the cost of
> making buffer cleaning slower could be a loser.

Perhaps you could make it a bit more incremental, avoiding the full
clean-out scans (your points 2 and 3) while still allowing the
background writer to bear the brunt in the best case? Something like
this: the zombie relfilenode tracker could be fixed-size and
self-cleaning. Entries could be dropped when the CLOCK hand completes a
full rotation since the entry was created, and if the tracker is full
when you're trying to insert a new entry, you can scan the buffer pool
until you've caused one entry (the oldest, by
clock-hand-at-time-of-insertion) to be removed to make space (so the
worst case is a full pass, but hopefully we can drop one entry before
then).

The main argument I can think of against that idea is that it is tied
to the current buffer eviction strategy, using its buffer index-based
clock hand as a horizon. Extracting a useful horizon for algorithms
like ARC or CAR would probably need more thought, because their 'hands'
jump around following next pointers.

Anyway, it's a lot of complexity, and it falls back to worst cases like
today, and can also transfer work to innocent foreground processes. I
see why Andres says we should just get a better data structure so we
can make the guy doing the dropping pay for it up front, but more
efficiently. I suspect we may also want an ordered data structure for
write clustering purposes one day.

--
Thomas Munro
http://www.enterprisedb.com
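The fixed-size, self-cleaning tracker sketched above can be modeled in standalone C (all names, sizes, and the flat-array layout below are invented for illustration; a real implementation would live in shared memory and use the actual buffer-eviction clock hand). Entries record the clock-hand position at insertion and expire once the hand completes a full rotation; inserting into a full table displaces the oldest entry, standing in for "scan the buffer pool until one entry can be removed":

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TRACKER_SIZE 4			/* invented, deliberately tiny */
#define NBUFFERS     16			/* modeled buffer-pool size */

typedef struct
{
	bool		used;
	uint32_t	relfilenode;	/* zombie relfilenode being tracked */
	uint64_t	hand_at_insert; /* absolute clock-hand position at insert */
} ZombieEntry;

static ZombieEntry tracker[TRACKER_SIZE];
static uint64_t clock_hand;		/* absolute; buffer index = hand % NBUFFERS */

/* Drop entries the clock hand has fully swept past (one full rotation). */
static void
expire_entries(void)
{
	for (int i = 0; i < TRACKER_SIZE; i++)
		if (tracker[i].used &&
			clock_hand - tracker[i].hand_at_insert >= NBUFFERS)
			tracker[i].used = false;
}

/* Buffer-cleaning side: is this relfilenode on the deferred-drop list? */
static bool
is_zombie(uint32_t relfilenode)
{
	expire_entries();
	for (int i = 0; i < TRACKER_SIZE; i++)
		if (tracker[i].used && tracker[i].relfilenode == relfilenode)
			return true;
	return false;
}

/*
 * Remember a dropped relfilenode.  If the tracker is full, evict the
 * entry with the lowest hand-at-insertion (the oldest), modeling the
 * worst-case work the dropping backend would have to do.
 */
static void
remember_zombie(uint32_t relfilenode)
{
	int			slot = -1;

	expire_entries();
	for (int i = 0; i < TRACKER_SIZE; i++)
		if (!tracker[i].used)
		{
			slot = i;
			break;
		}
	if (slot < 0)
	{
		slot = 0;
		for (int i = 1; i < TRACKER_SIZE; i++)
			if (tracker[i].hand_at_insert < tracker[slot].hand_at_insert)
				slot = i;
	}
	tracker[slot].used = true;
	tracker[slot].relfilenode = relfilenode;
	tracker[slot].hand_at_insert = clock_hand;
}

/* Advance the modeled eviction clock by one buffer. */
static void
tick(void)
{
	clock_hand++;
}
```

The model deliberately omits the part where the full-tracker path scans real buffers; it only demonstrates the expiry-by-rotation and displacement behavior that bounds the structure's size.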
Re: FailedAssertion on partprune
On 2018-Jul-25, David Rowley wrote:
> Thinking again about the patch I submitted upthread; I wonder if it's
> actually possible to support pruning with Jamie's query. Without
> looking at the code, I don't quite see the reason that the
> sub-partitioned table wouldn't be correctly pruned by the run-time
> pruning code. It could just be a matter of removing the failing
> Assert(). I'll do a bit more testing and confirm.

Not looking at the code right now either, but after removing that
assert and then removing the TABLESAMPLE clause, the query returns
identical results with and without pruning, so maybe you're right. No
time for further looking now.

(SELECT 'Jaime' <> 'Jamie' COLLATE es_EC)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Avoid extra Sort nodes between WindowAggs when sorting can be reused
> On 27 Jul 2018, at 21:12, Alexander Kuzmenkov wrote:
> Thanks for the update.

Thank you for reviewing and hacking!

> On 07/25/2018 01:37 AM, Daniel Gustafsson wrote:
>>> Hmm, this is missing the eqop fields of SortGroupClause. I haven't
>>> tested yet but does the similar degradation happen if two
>>> SortGroupClauses have different eqop and the same other values?
>>
>> I wasn’t able to construct a case showing this, but I also think
>> you’re right. Do you have an idea of a query that can trigger a
>> regression? The attached patch adds a stab at this, but I’m not sure
>> if it’s the right approach.
>
> To trigger that, in your test example you could order by empno::int8
> for one window, and by empno::int2 for another. But I don't think you
> have to compare eqop here, because if eqop differs, sortop will
> differ too. I removed the comparison from the patch.

Right, that makes sense.

> I also clarified (I hope) the comments, and did the optimization I
> mentioned earlier: using an array instead of a list for active
> clauses. Please see the attached v6.

Thanks, looks good.

> Otherwise I think the patch is good.

Cool, thanks for reviewing!

cheers ./daniel
Re: negative bitmapset member not allowed Error with partition pruning
On 2018-Jul-27, David Rowley wrote:
> On 27 July 2018 at 15:14, Tom Lane wrote:
>> Well, my thinking is that it helps nobody if call sites have to have
>> explicit workarounds for a totally-arbitrary refusal to handle edge
>> cases in the primitive functions. I do not think that is good
>> software design. If you want to have assertions that particular call
>> sites are specifying nonempty ranges, put those in the call sites
>> where it's important. But as-is, this seems like, say, defining
>> foreach() to blow up on an empty list.
>
> Okay, that's a fair point. I agree, adding Asserts at the current
> call sites seems better.

Given the discussion, I pushed two commits. The first makes
bms_add_range return the input bms if the range is empty; it also adds
Rajkumar's test case, which I verified to reproduce the bug and which
passes (for me) with the bms_add_range change. The second commit
includes the proposed asserts, but not the change to avoid calling
bms_add_range when the range is empty.

Thanks!

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: make installcheck-world in a clean environment
Alexander Lakhin writes:
> 14.07.2018 13:57, Peter Eisentraut wrote:
>> On 06.07.18 09:45, Alexander Lakhin wrote:
>>> ./configure --enable-tap-tests
>>> make install
>>> make install -C contrib
>>> chown -R postgres:postgres /usr/local/pgsql/
>>> /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
>>> /usr/local/pgsql/bin/pg_ctl start -l logfile -D /usr/local/pgsql/data
>>> make clean
>>> # Also you can just install binary packages to get the same state.
>>>
>>> make installcheck-world
>>> # This check fails.

I remain pretty skeptical that this is a sensible way to proceed,
especially not if what you're testing is installed binary packages.
You're creating yet *another* hazard for version-skew-like problems,
namely that there's no certainty that you gave configure arguments
that're compatible with what the installed packages used. But having
said that ...

>> For me, this fails at
>>
>> In file included from ../../src/include/postgres.h:47,
>> from chklocale.c:17:
>> ../../src/include/utils/elog.h:71:10: fatal error: utils/errcodes.h:
>> No such file or directory
>>
>> That's probably fixable. But it's different from what you had
>> initially reported.

> This error wasn't present in master at the time of the initial report
> (in April). As "git bisect" shows, such errors appeared after 372728b.
> So in REL_10_STABLE (or in master before 372728b) you could "make
> installcheck" (but not installcheck-world) in a clean environment.

... I think this was actually induced by 3b8f6e75f, for which we
already had to make some adjustments to ensure that the generated
headers got built when needed. This seems like another such case, so I
stuck in a couple more submake-generated-headers dependencies to fix
it.

The original complaint about ecpg remains; I'm not terribly excited
about messing with that.

regards, tom lane
RE: GSOC 2018 Project - A New Sorting Routine
Hey Tomas! Sorry to bother but it would be great if we can get the test results this week. Regards, Kefan From: Tomas Vondra Sent: July 24, 2018 8:16 AM To: Kefan Yang Cc: Andrey Borodin; Peter Geoghegan; alvhe...@2ndquadrant.com; PostgreSQL Hackers Subject: Re: GSOC 2018 Project - A New Sorting Routine On 07/24/2018 12:21 AM, Kefan Yang wrote: > Hi Tomas! > > I did a few tests on my own Linux machine, but the problem is that my > resources on AWS(CPU, RAM and even Disk space) are very limited. I > considered establishing virtual machine on my own PC but the performance > is even worse. > > My original patch has two main optimizations: (1) switch to heap sort > when depth limit exceeded (2) check whether the array is presorted only > once at the beginning. Now I want to test these optimizations > separately. On AWS EC2 instance, regressions on CREATE INDEX cases seems > to be less significant if we use (1) only, but I can only test up to > 10 records and 512MB memory using your scripts. > > So would you mind re-running the tests using the two patches I provided > in the attachment? That will be very helpful > I can do that, but it'll have to wait a couple of days. I'm currently using the boxes for some other tests. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Segfault logical replication PG 10.4
Thank you Alvaro :) > On 30 Jul 2018, at 22:33, Alvaro Herrera wrote: > > On 2018-Jul-28, Alvaro Herrera wrote: > >> Aha, I see, thanks. Here's a complete fix with included testcase. In >> an unpatched assert-enabled build, this crashes with >> >> TRAP: FailedAssertion("!(ActiveSnapshotSet())", File: >> "/pgsql/source/REL_10_STABLE/src/backend/tcop/postgres.c", Line: 788) >> >> Will push on Monday. > > Pushed after changing the constraint in the test case to be less silly. > Thanks for the report and diagnosis. > > -- > Álvaro Herrera https://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Making "COPY partitioned_table FROM" faster
On Mon, Jul 30, 2018 at 11:21 AM, Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 30/07/2018 15:26, David Rowley wrote: > >> - Add some tests. The if (nBufferedTuples > 0) that flushes the tuples > >> when the partition changes is not currently exercised. > > > > That seems like a good idea. In fact, it uncovered a bug around > > ConvertPartitionTupleSlot() freeing the previously stored tuple out > > the slot which resulted in a crash. I didn't notice before because my > > test had previously not required any tuple conversions. > > I think we need to think of a better place to put that temporary file, > and clean it up properly afterwards. I'm not sure whether we have > existing uses like that. > > Also, maybe the test should check afterwards that the right count of > rows ended up in each partition? > > Yea, I actually would suggest changing the data inserted in the third insert statement to have 'Three' in the third column: insert into parted_copytest select x,1,'One' from generate_series(1011,1020) x; And then this check: select count(*) from parted_copytest group by a, b, c;
Re: Segfault logical replication PG 10.4
On 2018-Jul-28, Alvaro Herrera wrote: > Aha, I see, thanks. Here's a complete fix with included testcase. In > an unpatched assert-enabled build, this crashes with > > TRAP: FailedAssertion("!(ActiveSnapshotSet())", File: > "/pgsql/source/REL_10_STABLE/src/backend/tcop/postgres.c", Line: 788) > > Will push on Monday. Pushed after changing the constraint in the test case to be less silly. Thanks for the report and diagnosis. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: documentation about explicit locking
On 20/07/2018 02:30, Amit Langote wrote: > We don't explicitly mention what locks we take on system catalogs > elsewhere, but CREATE COLLATION is different from other commands, so I'm > fine with adding more details as you suggest, so updated the text. committed the documentation change -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Recovery performance of standby for multiple concurrent truncates on large tables
Hi, On 2018-07-30 05:22:48 +, Jamison, Kirk wrote: > BTW, are there any updates whether the community will push through > anytime soon regarding the buffer mapping implementation you > mentioned? I'm continuing to work on it, but unfortunately there's a couple projects that have higher priority atm :(. I'm doubtful I can have a patchset in a committable shape for v12, but I'm pretty sure I'll have it in a shape good enough to make progress towards v13. Sorry :( Greetings, Andres Freund
Re: Recovery performance of standby for multiple concurrent truncates on large tables
Hi, On 2018-07-30 16:01:53 -0400, Robert Haas wrote: > (1) Limit the number of deferred drops to a reasonably small number > (one cache line? 1kB?). Yea, you'd have to, because we'd frequently need to check it, and it'd need to be in shared memory. But that'd still leave us to regress to O(n^2) as soon as a transaction goes over that limit - which I don't think is that infrequent... Greetings, Andres Freund
Re: Recovery performance of standby for multiple concurrent truncates on large tables
On Mon, Jul 30, 2018 at 1:22 AM, Jamison, Kirk wrote: > 1. Because the multiple scans of the whole shared buffer per concurrent > truncate/drop table was the cause of the time-consuming behavior, DURING the > failover process while WAL is being applied, we temporarily delay the scanning > and invalidating of shared buffers. At the same time, we remember the > relations/relfilenodes (of dropped/truncated tables) by adding them to a hash > table called the "skip list". > 2. After WAL is applied, the checkpointer (or bg writer) scans the shared buffer > only ONCE, compares the pages against the skip list, and invalidates the > relevant pages. After deleting the relevant pages in shared memory, they > will not be written back to disk. > > Assuming the theory works, this design will only affect the behavior of > the checkpointer (or maybe bg writer) during the recovery process / failover. Any > feedback, thoughts? How would this work if a relfilenode number that belonged to an old relation got recycled for a new relation? I think something like this could be made to work -- both on the master and the standby, and not just while waiting for a failover -- if we did something like this: (1) Limit the number of deferred drops to a reasonably small number (one cache line? 1kB?). (2) Background writer regularly does scans to clean out everything from the deferred-drop list. (3) If a foreground process finds the deferred-drop list is full when it needs to add to it, it forces a clean-out of the list contents + whatever new stuff it has in a single pass. (4) If we are about to generate a relfilenode that's in the list, we either force a clean-out or generate a different relfilenode instead. (5) Every buffer cleaning operation checks the deferred-drop list and invalidates without write-out if the buffer is found. It's not clear to me whether it would be worth the overhead of doing something like this. Making relation drops faster at the cost of making buffer cleaning slower could be a loser.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Explain buffers wrong counter with parallel plans
Hi, I'm not an expert in the area of the code, but here's a review anyway. I did not read through the entire thread. I think we should try to get this fixed soon, to make some progress towards release-ability. Or just declare it to be entirely unrelated to the release, and remove it from the open items list; not an unreasonable choice either. This has been an open item for quite a while... > diff --git a/src/backend/executor/execParallel.c > b/src/backend/executor/execParallel.c > index 60aaa822b7e..ac22bedf0e2 100644 > --- a/src/backend/executor/execParallel.c > +++ b/src/backend/executor/execParallel.c > @@ -979,9 +979,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc) > /* Report workers' query for monitoring purposes */ > pgstat_report_activity(STATE_RUNNING, debug_query_string); > > - /* Prepare to track buffer usage during query execution. */ > - InstrStartParallelQuery(); > - > /* Attach to the dynamic shared memory area. */ > area_space = shm_toc_lookup(toc, PARALLEL_KEY_DSA, false); > area = dsa_attach_in_place(area_space, seg); > @@ -993,6 +990,9 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc) > queryDesc->planstate->state->es_query_dsa = area; > ExecParallelInitializeWorker(queryDesc->planstate, toc); > > + /* Prepare to track buffer usage during query execution. */ > + InstrStartParallelQuery(); > + > /* Run the plan */ > ExecutorRun(queryDesc, ForwardScanDirection, 0L, true); Isn't this an independent change? And one with potentially negative side effects? I think there's some arguments for changing this (and some against), but I think it's a bad idea to do so in the same patch. 
> diff --git a/src/backend/executor/execProcnode.c > b/src/backend/executor/execProcnode.c > index 36d2914249c..a0d49ec0fba 100644 > --- a/src/backend/executor/execProcnode.c > +++ b/src/backend/executor/execProcnode.c > @@ -737,6 +737,13 @@ ExecShutdownNode(PlanState *node) > > planstate_tree_walker(node, ExecShutdownNode, NULL); > > + /* > + * Allow instrumentation to count stats collected during shutdown for > + * nodes that are executed atleast once. > + */ > + if (node->instrument && node->instrument->running) > + InstrStartNode(node->instrument); > + > switch (nodeTag(node)) > { > case T_GatherState: > @@ -755,5 +762,12 @@ ExecShutdownNode(PlanState *node) > break; > } > > + /* > + * Allow instrumentation to count stats collected during shutdown for > + * nodes that are executed atleast once. > + */ > + if (node->instrument && node->instrument->running) > + InstrStopNode(node->instrument, 0); > + > return false; > } Duplicating the comment doesn't seem like a good idea. Just say something like "/* see comment for InstrStartNode above */". > diff --git a/src/backend/executor/nodeLimit.c > b/src/backend/executor/nodeLimit.c > index ac5a2ff0e60..cf1851e235f 100644 > --- a/src/backend/executor/nodeLimit.c > +++ b/src/backend/executor/nodeLimit.c > @@ -134,6 +134,8 @@ ExecLimit(PlanState *pstate) > node->position - node->offset >= > node->count) > { > node->lstate = LIMIT_WINDOWEND; > + /* Allow nodes to release or shut down > resources. */ > + (void) ExecShutdownNode(outerPlan); > return NULL; > } > Um, is this actually correct? What if somebody rewinds afterwards? That won't happen for parallel query currently, but architecturally I don't think we can do so unconditionally? Greetings, Andres Freund
Re: Documenting that queries can be run over replication protocol
On 24/07/2018 09:54, Chris Travers wrote: > In the process of building some tooling based on replication we > discovered that PostgreSQL 10 uses COPY to populate the initial table on > logical replication, see commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920 > > This is not currently documented. Attached is a patch which adds that > documentation to the logical replication part of the protocol section. It already says earlier in that same section: "In logical replication walsender mode, the replication commands shown below as well as normal SQL commands can be issued." -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Getting rid of "accept incoming network connections" prompts on OS X
On 26/07/2018 23:45, Tom Lane wrote: > This came up again today, and I've confirmed that the issue still exists > in current macOS. Did you get any response to your bug report, and if > so what did they say? There hasn't been any response to the radar. I think our analysis is correct, it's an OS bug. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.
Hi, On 2018-07-20 08:38:09 -0400, Robert Haas wrote: > I'm going to study this some more now, but I really think this is > going in the wrong direction. We're going to have to get somewhere on this topic soon. This thread has been open for nearly half a year, and we're late in the beta phase now. What can we do to get to an agreement soon? Greetings, Andres Freund
Re: Making "COPY partitioned_table FROM" faster
On 30/07/2018 15:26, David Rowley wrote: >> - Add some tests. The if (nBufferedTuples > 0) that flushes the tuples >> when the partition changes is not currently exercised. > > That seems like a good idea. In fact, it uncovered a bug around > ConvertPartitionTupleSlot() freeing the previously stored tuple out > the slot which resulted in a crash. I didn't notice before because my > test had previously not required any tuple conversions. I think we need to think of a better place to put that temporary file, and clean it up properly afterwards. I'm not sure whether we have existing uses like that. Also, maybe the test should check afterwards that the right count of rows ended up in each partition? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Adding a note to protocol.sgml regarding CopyData
Hello Tatsuo-san, Minor suggestions, although I'm not a native English speaker. In libpq.sgml following is stated: Before PostgreSQL protocol 3.0, it was necessary for the application to explicitly send the two characters \. as a final line to indicate to the server that it ISTM That "it" may refer to the server... Put "the application" instead? had finished sending COPY data. While this still works, it is deprecated and the special meaning of \. can be expected to be removed in a future release. Maybe be more assertive, "will be removed"? It is sufficient to call PQendcopy after "It is now sufficient ..."? having sent the actual data. I think this should be mentioned in protocol.sgml as well. Developers who wish to develop programs that understand frontend/backend protocol should be able to focus on protocol.sgml. Attached is a patch for this. -- Fabien.
Re: [HACKERS] PoC: full merge join on comparison clause
On 18/07/18 at 16:58, Ashutosh Bapat wrote: Thanks for the commit messages. I would use the word "in-equality" instead of "comparison" since equality is also a comparison. Fixed. Comparing this with the original code, I think is_mj_equality should be true if restrictinfo->mergeopfamilies is not NIL. My mistake, fixed. With this work the meaning of oprcanmerge (see the pg_operator catalog and also the CREATE OPERATOR syntax) changes. Every btree operator can now be used to perform a merge join. oprcanmerge, however, only indicates whether an operator is an equality or not. Have you thought about that? Do we need to re-define oprcanmerge? For now we can test with the old oprcanmerge meaning, so as not to bump the catalog version. Merge join needs only the BTORDER_PROC function, which is required for btree opfamilies. This means that it should always be possible to merge join on operators that correspond to standard btree strategies. We could set oprcanmerge to true for all built-in btree comparison operators, and leave the possibility to disable it for custom operators. I think it should be possible to use this technique with more than one inequality clause as long as all the operators require the input to be ordered in the same direction and the clauses are ANDed. In that case, for a given outer tuple, the matching inner tuples form a contiguous interval. Consider a table "t(a int, b int)", where the value of each column can be 1, 2, 3, 4 and the table contains all possible combinations. If the merge condition is "a < 2 and b < 2", for each of the four possible sorting directions, the result set won't be contiguous. Generally speaking, this happens when we have several groups with the same value of the first column, and the first column matches the join condition, but inside each group, for some rows, the second column doesn't match.
-- Alexander Kuzmenkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company >From c817ac1f93b83bcf43afac4af2dbaed37403a4a2 Mon Sep 17 00:00:00 2001 From: Alexander Kuzmenkov Date: Tue, 10 Apr 2018 12:31:21 +0300 Subject: [PATCH 2/2] Inequality merge join. Perform merge joins on inequality clause. The current merge join algorithm requires minimal modification to support one inequality clause at the final position. This has performance benefits in some cases, and also allows to perform full joins on inequality, which was not possible before. This commit modifies the merge join path generation logic and cost functions to account for inequality clauses, and adds some tests. --- src/backend/executor/nodeMergejoin.c | 136 +-- src/backend/optimizer/path/costsize.c| 27 ++- src/backend/optimizer/path/joinpath.c| 27 ++- src/backend/optimizer/path/pathkeys.c| 218 --- src/backend/optimizer/plan/initsplan.c | 28 ++- src/backend/utils/adt/selfuncs.c | 40 +++-- src/backend/utils/cache/lsyscache.c | 40 + src/include/nodes/execnodes.h| 5 + src/include/optimizer/paths.h| 3 +- src/include/utils/lsyscache.h| 1 + src/test/regress/expected/join.out | 250 +++ src/test/regress/expected/partition_join.out | 4 + src/test/regress/sql/join.sql| 66 ++- src/test/regress/sql/partition_join.sql | 5 + 14 files changed, 750 insertions(+), 100 deletions(-) diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c index 7298e1c..d6e5556 100644 --- a/src/backend/executor/nodeMergejoin.c +++ b/src/backend/executor/nodeMergejoin.c @@ -89,6 +89,45 @@ * proceed to another state. This state is stored in the node's * execution state information and is preserved across calls to * ExecMergeJoin. -cim 10/31/89 + * + * This algorithm can work almost as-is when the last join clause + * is an inequality. 
This introduces an additional restriction to + * the ordering of the inputs: when moving to the next outer tuple, + * the beginning of the matching interval of inner tuples must not + * change. For example, if the join operator is ">=", inputs must + * be in ascending order. + * + * Consider this example: + * outer >= innner + * 1 0 - first match for outer = 1, 2 + * 2 1 - last match for outer = 1 + * 2 - last match for outer = 2 + * + * And if the inputs were sorted in descending order: + * outer >= inner + * 2 2 - first match for outer = 2 + * 1 1 - first match for outer = 1 + * 0 - last match for outer = 1, 2 + * + * It can be seen that the beginning of the matching interval of + * inner tuples changes when we move to the next outer tuple. + * Supporting this, i.e. testing and advancing the marked tuple, + * would complicate the join algorithm. Instead of that, we have + * the planner ensure that the inputs are suitably ordered, and + * recheck this on
Re: request for new parameter for disable promote (slave only mode)
On Fri, Jul 27, 2018 at 12:05 PM, Ioseph Kim wrote: > I want to build a one-master, multi-slave environment using physical > replication. > Slave nodes have low hardware specs, so I changed the max_connections server > parameter and tried to start a slave node. > But I could not start the slave nodes, > because the CheckRequiredParameterValues function (in > src/backend/access/transam/xlog.c) rejects values lower than the master's > values. > > Maybe this concept is based on the idea that any slave node can be promoted to > master. No, it's because the machinery that computes MVCC snapshots on slaves relies on max_connections being set high enough. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel Append implementation
On Sun, Jul 29, 2018 at 5:49 PM, Thomas Munro wrote: > On Thu, May 10, 2018 at 7:08 AM, Robert Haas wrote: >> [parallel-append-doc-v2.patch] > > +plans just as they can in any other plan. However, in a parallel plan, > +it is also possible that the planner may choose to substitute a > +Parallel Append node. > > Maybe drop "it is also possible that "? It seems a bit unnecessary > and sounds a bit odd followed by "may ", but maybe it's just me. Changed. > +Also, unlike a regular Append node, which can only > have > +partial children when used within a parallel plan, Parallel > +Append node can have both partial and non-partial child plans. > > Missing "a" before "Parallel". Fixed. > +Non-partial children will be scanned by only a single worker, since > > Are we using "worker" in a more general sense that possibly includes > the leader? Hmm, yes, other text on this page does that too. Ho hum. Tried to be more careful about this. New version attached. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company parallel-append-doc-v3.patch Description: Binary data
Re: Would like to help with documentation for Postgres 11
Thanks for the replies, I'll investigate further.. On Sun, Jul 29, 2018 at 7:11 PM Tatsuo Ishii wrote: > > Justin Pryzby writes: > >> On Sun, Jul 29, 2018 at 11:50:40AM -0500, Michael Goldshteyn wrote: > >>> I would like to offer some help writing and improving the English > >>> documentation for some of the new features and changes in Postgres 11. > If I > >>> can get an email of where such help would be appreciated, so I can > choose a > >>> feature I am familiar with, I would be glad to help. > > > >> The documentation is expected to be commited with the feature, so what's > >> currently in place is expected to be adequate and accuate. > > > > There is, however, a fair amount of stuff that was written by people > whose > > first language isn't English, and it shows. So plain old copy-editing > > is surely welcome, if you have a mind to do that. > > BTW, the CommitFest application seems not to pick up threads other > than in pgsql-hackers. This is usually fine. However if people want to > contribute document *only* patches and send it to pgsql-docs then > tries to register it into CF, the CF app does not recognize it. I am > not sure if this is intended behavior or not. > > Best regards, > -- > Tatsuo Ishii > SRA OSS, Inc. Japan > English: http://www.sraoss.co.jp/index_en.php > Japanese:http://www.sraoss.co.jp >
Re: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack
On 7/29/18, 7:35 PM, "Michael Paquier" wrote: > Yeah, I was testing that yesterday night and bumped on this case when > trying do a REINDEX SCHEMA pg_class. The point is that you can simplify > the check and remove pg_database_ownercheck as there is already an ACL > check on the database/system/schema at the top of the routine, so you > are already sure that pg_database_ownercheck() or > pg_namespace_ownercheck would return true. This shaves a few cycles as > well for each relation checked. Makes sense. >> I also noticed that this patch causes shared relations to be skipped >> silently. Perhaps we should emit a WARNING or DEBUG message when this >> happens, at least for REINDEXOPT_VERBOSE. > > That's intentional. I thought about that as well, but I am hesitant to > do so as we don't bother mentioning the other relations skipped. > REINDEX VERBOSE also shows up what are the tables processed, so it is > easy to guess what are the tables skipped, still more painful. And the > documentation changes added cover the gap. Okay, that seems reasonable to me, too. > +1, I have included your suggestions. The patch attached applies easily > down to 9.5 where REINDEX SCHEMA was added. For 9.4 and 9.3, there is > no schema case, still the new check is similar. The commit message is > slightly changed so as there is no mention of REINDEX SCHEMA. 9.3 needs > a slight change compared to 9.4 as well. For 9.3 and 9.4, it might be nice to add the "even if the user is the owner of the specified database" part, too. > So attached are patches for 9.5~master, 9.4 and 9.3 with commit > messages. Does that look fine to folks of this thread? Looks good to me. Since REINDEX can be used to block calls to load_critical_index() from new connections, back-patching seems appropriate. Nathan
Re: GiST VACUUM
On 29/07/18 14:47, Andrey Borodin wrote: Fixed both problems. PFA v14. Thanks, took a really quick look at this. The text being added to README is outdated for these latest changes. In the second step I still use palloc'd memory, but only to store two bitmaps: a bitmap of internal pages and a bitmap of empty leaves. The second physical scan only reads internal pages. I can omit that bitmap, if I'll scan everything. Also, I can replace the emptyLeafs bitmap with an array/list, but I do not really think it will be big. On a typical GiST index, what's the ratio of leaf vs. internal pages? Perhaps an array would indeed be better. If you have a really large index, the bitmaps can take a fair amount of memory, on top of the memory used for tracking the dead TIDs. I.e. that memory will be in addition to maintenance_work_mem. That's not nice, but I think it's OK in practice, and not worth spending too much effort to eliminate. For a 1 TB index with default 8k block size, the two bitmaps will take 32 MB of memory in total. If you're dealing with a database of that size, you ought to have some memory to spare. But if an array would use less memory, that'd be better. If you go with bitmaps, please use the existing Bitmapset instead of rolling your own. Saves some code, and it provides more optimized routines for iterating through all the set bits, too (bms_next_member()). Another possibility would be to use Tidbitmap, in the "lossy" mode, i.e. add the pages with tbm_add_page(). That might save some memory, compared to Bitmapset, if the bitmap is very sparse. Not sure how it compares with a plain array. A straightforward little optimization would be to skip scanning the internal pages, when the first scan didn't find any empty pages. And stop the scan of the internal pages as soon as all the empty pages have been recycled. - Heikki
Re: Explain buffers wrong counter with parallel plans
> On Jul 28, 2018, at 2:14 AM, Amit Kapila wrote: > > On Fri, Jul 27, 2018 at 11:12 PM, Jonathan S. Katz > wrote: >> >>> On Jul 27, 2018, at 8:31 AM, Amit Kapila wrote: >>> >>> >>> Yeah, that would be better. Today, I have tried the patch on both >>> Head and PG11 and I am getting same and correct results. >> >> I have applied the patch to PG11beta2 and tested. >> > > I think we should backpatch this till 9.6 where the parallel query was > introduced. Note, that for HEAD and 11, the patch is same. For PG10, > the patch code is same, but because surrounding code is different, the > same patch didn't apply. For 9.6, we don't need to collect stats in > ExecShutdownNode. I have tested it in all the back branches and it > works fine. The logic on backpatching seems sound. I confirmed my tests of the respective patches against 9.6.9 and 10.4. I'll defer to someone else for comments on the code, but from my read it appears to be a consistent approach for each version. Jonathan signature.asc Description: Message signed with OpenPGP
Re: Making "COPY partitioned_table FROM" faster
On 30 July 2018 at 20:33, Peter Eisentraut wrote: > Two more thoughts: > > - Add some tests. The if (nBufferedTuples > 0) that flushes the tuples > when the partition changes is not currently exercised. That seems like a good idea. In fact, it uncovered a bug around ConvertPartitionTupleSlot() freeing the previously stored tuple out the slot which resulted in a crash. I didn't notice before because my test had previously not required any tuple conversions. While ensuring my test was working correctly I noticed that I had a thinko in the logic that decided if another avgTuplesPerPartChange calculation was due. The problem was that the (cstate->cur_lineno - RECHECK_MULTI_INSERT_THRESHOLD) is unsigned and when cur_lineno is below RECHECK_MULTI_INSERT_THRESHOLD it results in an underflow which makes the if test always true until cur_lineno gets up to 1000. I considered just making all those counters signed, but chickened out when it came to changing "processed". I thought it was quite strange to have "processed" unsigned and the other counters that do similar things signed. Of course, signed would have enough range, but it would mean changing the return type of CopyFrom() which I wasn't willing to do. In the end, I just protected the if test so that it only calculates the average again if cur_lineno is at least RECHECK_MULTI_INSERT_THRESHOLD. This slightly changes when avgTuplesPerPartChange is first set, but it'll still be at least 1000 tuples before we start doing multi-inserts. Another option would be to cast the expression as int64... I'm a bit undecided what's best here. > - With proute becoming a function-level variable, > cstate->partition_tuple_routing is obsolete and could be removed. (No > point in saving this in cstate if it's only used in one function anyway.) Good idea. I've removed that. I've attached a delta patch based on the v4 patch. 
-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v5_multiinsert_copy_delta.patch Description: Binary data
Re: partition tree inspection functions
Hi Amit, On 07/30/2018 05:21 AM, Amit Langote wrote: As 0 is a valid return value for root nodes I think we should use -1 instead for these cases. Makes sense, changed to be that way. Looks good, the documentation for pg_partition_level could be expanded to describe the -1 scenario. New status: Ready for Committer Best regards, Jesper
Re: Usability fail with psql's \dp command
Hello Kyotaro-san, Note that 'No privileges' could be somehow interpreted as "default privileges" (no "special/given" privileges) or as "no permissions at all", so there is still some ambiguity, at least for me. FWIW "No privileges" seems to me as "The user cannot access it at all" with no ambiguity. Ok. For me '' and 'No privileges' still look like they mean the same, whereas the point of the patch is to solve the ambiguity. Currently the behavior is documented here. (This needs to be edited.) https://www.postgresql.org/docs/10/static/sql-grant.html Sure, but user-friendliness would suggest that the output should not be misleading from the start. So it changes the existing documented behavior. Sure. The behavior is misleading, and documentation is of little help in such a case. \dp is a convenient shortcut for users so the output should be intuitive or easy-to-grasp. Yes! [...]

 relacl | Access privileges
--------+-------------------
 (null) | '(default)'
 {}     | '(no privilege)'

This suggestion is better as it avoids the "empty/no" ambiguity. It breaks the documented behavior, but I'm fine with that anyway. The human I am now has to know what "default" permissions are depending on the kind of object, where it could have said it to me directly. Moreover, the line is not self-contained because the default permission depends on the owner, but "\dp" does not tell who the owner is, which is another annoyance. A benefit of your approach is that its coding is easy because it does not have to fetch the owner tuple and reconstruct the default perms depending on the kind of object. A cleaner approach would be to have a NOT NULL column and have the default always explicit, instead of having a lazy instantiation of the field managed on GRANT/REVOKE but not on initialization. However this is probably too big a change for the problem at hand, and it would not solve the ambiguity issue for previous versions. -- Fabien.
Re: Tips on committing
On 23/07/2018 05:58, Tom Lane wrote: > 013f320dc reminds me of something I check for religiously: look for > alternative output files for any regression test you're updating the > output of. > > Actually updating said files, once you notice you need to, can be tricky > in itself. Most of the time it works to just apply the same patch to the > other variants as the delta you're observing for the output file that's > relevant to your own platform. Or you may be able to change configuration > so as to replicate the correct output for another alternative file. But > sometimes you just have to guess (and then watch the buildfarm to see if > you guessed right). There is a recipe for doing this using the "merge" command here: https://wiki.postgresql.org/wiki/Regression_test_authoring#Updating_an_existing_regression_test YMMV -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] pg_hba.conf : new auth option : clientcert=verify-full
On 07/19/2018 03:00 AM, Thomas Munro wrote: Some more comments: if (parsedline->auth_method == uaCert) { - parsedline->clientcert = true; + parsedline->clientcert = clientCertOn; } The "cert" method is technically redundant with this patch, because it's equivalent to "trust clientcert=verify-full", right? That gave me an idea: wouldn't the control flow be a bit nicer if you just treated uaCert the same as uaTrust in the big switch statement, but also enforced that clientcert = clientCertFull in the hunk above? Then you could just have one common code path to reach CheckCertAuth() for all auth methods after that switch statement, instead of the more complicated conditional coding you have now with two ways to reach it. Would that be better? That would result in a couple less LOC and a bit clearer conditionals, I agree. If there are no objections to make uaCert a quasi-alias of uaTrust with clientcert=verify-full, I'll go ahead and change the code accordingly. I'll make sure that the error messages are still handled based on the auth method and not only depending on the clientcert flag. In the "auth-cert" section there are already a couple of examples using lower case "cn": will be sent to the client. The cn (Common Name) is a check that the cn attribute matches the database I guess you should do the same, or if there is a good reason to use upper case "CN" instead then you should change those to match. I see that there are currently no places that use CN right now. However, we refer to Certification Authority as CA in most cases (without the literal tag surrounding it). The different fields of certificates seem to be abbreviated with capital letters in most literature; That was my reasoning for capitalizing it in the first place. I'm open for suggestions, but in absence of objections I might just capitalize all occurrences of CN. 
There is still a place in the documentation where we refer to "1":

In a pg_hba.conf record specifying certificate authentication, the authentication option clientcert is assumed to be 1, and it cannot be turned off since a client certificate is necessary for this method. What the cert method adds to the basic clientcert certificate validity test is a check that the cn attribute matches the database user name.

Maybe we should use "verify-ca" here, though as I noted above I think it should really be "verify-full" and we should simplify the flow.

Yes, we should adopt the new style in all places. I'll rewrite that passage to indicate that cert authentication essentially results in the same behaviour as clientcert=verify-full. The existing text is somewhat ambiguous anyway, in case somebody decides to skip over the restriction described in the second sentence.

I think we should also document that "1" and "0" are older spellings still accepted, where the setting names are introduced.

+typedef enum ClientCertMode
+{
+	clientCertOff,
+	clientCertOn,
+	clientCertFull
+} ClientCertMode;
+

+1

What do you think about using clientCertCA for the enumerator name instead of clientCertOn? That would correspond better to the names "verify-ca" and "verify-full".

+1 I'm not sure if Magnus had any other cases in mind when he named it clientCertOn?

Should I mark this entry as "Needs review" in the commitfest? It seems some discussion is still needed to get this committed...

kind regards Julian
Re: ssl_library parameter
On 26/06/2018 11:49, Daniel Gustafsson wrote: >> Extracted from the GnuTLS thread/patch, here is a patch to add a >> server-side read-only parameter ssl_library, which currently reports >> either 'OpenSSL' or an empty string, depending on what SSL library was >> built with. This is analogous to the libpq function call >> PQsslAttribute(conn, "library"), but there was no equivalent >> functionality on the server side. > > Looks good to me, +1 committed -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Make deparsing of column defaults faster
On Mon, Jul 30, 2018 at 7:03 AM, Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote:
> On 07/07/2018 20:07, Jeff Janes wrote:
> > One case that your patch doesn't improve (neither does my posted one) is
> > check constraints. To fix that, pg_get_constraintdef_worker would also
> > need to grow a cache as well. I don't know how often people put check
> > constraints on most of the columns of a table. Some people like their
> > NOT NULL constraints to be named, not implicit.
> >
> > But from the bigger picture of making pg_upgrade faster, a major issue
> > is that while pg_dump -s gets faster for the column default case, the
> > restore of that dump is still slow (again, my posted patch also doesn't
> > fix that). In that case it is deparse_context_for called from
> > StoreAttrDefault which is slow.
>
> Any thoughts on how to proceed here? It seems there is more work to do
> to cover all the issues with dumping and restoring tables with many
> columns. Since the original report was in the context of pg_upgrade, we
> should surely address at least the pg_restore slowness.

I'll work on solving the problem using a hash table at the lowest level (making column names unique) for a future commitfest. That should drop it from N^3 to N^2, which, since N can't go above 1600, should be good enough. So we can set this to rejected, as that will be an entirely different approach. Your caching patch might be worthwhile on its own, though.

Cheers, Jeff
Re: Make deparsing of column defaults faster
On 07/07/2018 20:07, Jeff Janes wrote: > One case that your patch doesn't improve (neither does my posted one) is > check constraints. To fix that, pg_get_constraintdef_worker would also > need to grow a cache as well. I don't know how often people put check > constraints on most of the columns of a table. Some people like their > NOT NULL constraints to be named, not implicit. > > But from the bigger picture of making pg_upgrade faster, a major issue > is that while pg_dump -s gets faster for the column default case, the > restore of that dump is still slow (again, my posted patch also doesn't > fix that). In that case it is deparse_context_for called from > StoreAttrDefault which is slow. Any thoughts on how to proceed here? It seems there is more work to do to cover all the issues with dumping and restoring tables with many columns. Since the original report was in the context of pg_upgrade, we should surely address at least the pg_restore slowness. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Fix for documentation of Covering Indexes
On 04/18/2018 12:52 PM, Heikki Linnakangas wrote: On 11/04/18 04:20, Michael Paquier wrote: Hi all, The documentation of covering indexes is incorrect for CREATE and ALTER TABLE: - ALTER TABLE's page is missing the call. - Exclusion constraints can use INCLUDE clauses. In order to simplify the documentation, please let me suggest the attached which moves the definition of the INCLUDE clause into the section index_parameters, which is compatible with what I read from the parser. Committed, thanks! - Heikki Following this change, I believe we need to modify UNIQUE and PRIMARY KEY descriptions in CREATE TABLE as they still mention INCLUDE but not the other index_parameters. The attached patch fixes this inconsistency, as well as adds a separate paragraph for INCLUDE in CREATE TABLE to clarify its purpose and avoid repetition in constraint descriptions. It also reorders the paragraphs in CREATE INDEX to follow the syntax. -- Liudmila Mantrova Technical writer at Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml index 3c1223b..c67f187 100644 --- a/doc/src/sgml/ref/create_index.sgml +++ b/doc/src/sgml/ref/create_index.sgml @@ -145,52 +145,6 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] - INCLUDE - - -The optional INCLUDE clause specifies a -list of columns which will be included in the index -as non-key columns. A non-key column cannot -be used in an index scan search qualification, and it is disregarded -for purposes of any uniqueness or exclusion constraint enforced by -the index. However, an index-only scan can return the contents of -non-key columns without having to visit the index's table, since -they are available directly from the index entry. Thus, addition of -non-key columns allows index-only scans to be used for queries that -otherwise could not use them. 
- - - -It's wise to be conservative about adding non-key columns to an -index, especially wide columns. If an index tuple exceeds the -maximum size allowed for the index type, data insertion will fail. -In any case, non-key columns duplicate data from the index's table -and bloat the size of the index, thus potentially slowing searches. - - - -Columns listed in the INCLUDE clause don't need -appropriate operator classes; the clause can include -columns whose data types don't have operator classes defined for -a given access method. - - - -Expressions are not supported as included columns since they cannot be -used in index-only scans. - - - -Currently, only the B-tree index access method supports this feature. -In B-tree indexes, the values of columns listed in the -INCLUDE clause are included in leaf tuples which -correspond to heap tuples, but are not included in upper-level -index entries used for tree navigation. - - - - - name @@ -317,6 +271,52 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] + INCLUDE + + +The optional INCLUDE clause specifies a +list of columns which will be included in the index +as non-key columns. A non-key column cannot +be used in an index scan search qualification, and it is disregarded +for purposes of any uniqueness or exclusion constraint enforced by +the index. However, an index-only scan can return the contents of +non-key columns without having to visit the index's table, since +they are available directly from the index entry. Thus, addition of +non-key columns allows index-only scans to be used for queries that +otherwise could not use them. + + + +It's wise to be conservative about adding non-key columns to an +index, especially wide columns. If an index tuple exceeds the +maximum size allowed for the index type, data insertion will fail. +In any case, non-key columns duplicate data from the index's table +and bloat the size of the index, thus potentially slowing searches. 
+ + + +Columns listed in the INCLUDE clause don't need +appropriate operator classes; the clause can include +columns whose data types don't have operator classes defined for +a given access method. + + + +Expressions are not supported as included columns since they cannot be +used in index-only scans. + + + +Currently, only the B-tree index access method supports this feature. +In B-tree indexes, the values of columns listed in the +INCLUDE clause are included in leaf
Re: [PATCH] Include application_name in "connection authorized" log message
On 13/07/2018 20:20, Don Seiler wrote: > See attached for latest revision. This doesn't compile with SSL enabled because there is a comma missing. This implementation doesn't run the application_name through check_application_name(), so it could end up logging application_name values that are otherwise not acceptable. I'm not sure of the best way to fix that. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Improve geometric types
> OK, thanks for confirming. I'll get it committed and we'll see what the
> animals think soon.

Thank you for fixing this. I wanted to preserve this code but wasn't sure about the correct place or whether it is still necessary. There are more places where we produce -0. The regression tests have alternative results to cover them. I have the "float-zero" patch for this, although I am not sure if it is a correct fix. I think we should find the correct fix and apply it globally to floating point operations. It could be enabled only for platforms which produce -0, so the others don't have to pay the price.
Re: patch to allow disable of WAL recycling
At Mon, 30 Jul 2018 10:43:20 +0200, Peter Eisentraut wrote in
> On 19/07/2018 05:59, Kyotaro HORIGUCHI wrote:
> > My result is that we cannot disable recycling perfectly just by
> > setting min/max_wal_size.
>
> Maybe the behavior of min_wal_size should be rethought? Elsewhere in
> this thread, there was also a complaint that max_wal_size isn't actually
> a max. It seems like there might be some interest in making these
> settings more accurate.
>
> I mean, what is the point of the min_wal_size setting if not controlling
> this very thing?

Sorry, I forgot to mention it. The definition of the variable is "we won't reduce the WAL below this many segments (but expressed in MB) even if we don't need that many segments until the next checkpoint". I couldn't come up with a proper value for it to express the behavior "I don't want to keep (recycle) preallocated segments even for the expected checkpoint interval". In short, I thought it was not intuitive at the time.

Reconsidering the candidate values: 0 seems to mean keeping segments for the next checkpoint interval; -1 seems to just disable segment reduction (which is the same as setting it to the same value as max_wal_size?). Maybe we could use -1 for this purpose.

guc.c
|   {"min_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
|     gettext_noop("Sets the minimum size to shrink the WAL to."),
+     gettext_noop("-1 turns off WAL recycling."),

# This seems somewhat... out of the blue?

wal-configuration.html
| The number of WAL segment files in pg_wal directory depends on
| min_wal_size, max_wal_size and the amount of WAL generated in
| previous checkpoint cycles. When old log segment files are no
| longer needed, they are removed or recycled (that is, renamed
| to become future segments in the numbered sequence). If, due to ...
| extent.
| min_wal_size puts a minimum on the amount of WAL files
| recycled for future usage; that much WAL is always recycled for
| future use, even if the system is idle and the WAL usage
| estimate suggests that little WAL is needed.
+ If you don't need the recycling feature, setting min_wal_size
+ to -1 disables the feature and WAL files are created on demand.

# I'm not sure this makes sense for readers.

Besides the above, I suppose that this would also turn off preallocation of a whole segment at first use, which could cause problems here and there... If we allowed a string value like 'no-prealloc' for min_wal_size, it might be more comprehensible?

# Sorry for the scattered thoughts

regards.

-- Kyotaro Horiguchi NTT Open Source Software Center
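If the -1 proposal sketched above were adopted, the configuration might look like the following. This is purely hypothetical: no released server accepts this value, and the semantics are exactly what is still under discussion in this thread.

```
# hypothetical postgresql.conf fragment under the proposal being discussed
min_wal_size = -1       # would disable WAL segment recycling/preallocation
max_wal_size = 1GB      # soft limit at which checkpoints are forced
```

The 'no-prealloc' string-value alternative mentioned above would replace the -1 with a self-describing token, at the cost of making the GUC's type mixed.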
Re: partition tree inspection functions
Hi, On 2018/07/27 21:21, Jesper Pedersen wrote: > Hi Amit, > > On 07/26/2018 10:33 PM, Amit Langote wrote: >> Optional parameter sounds good, so made it get_partition_level(regclass [ >> , regclass ]) in the updated patch. Although, adding that argument is not >> without possible surprises its result might evoke. Like, what happens if >> you try to find the level of the root table by passing a leaf partition >> oid for the root table argument, or pass a totally unrelated table for the >> root table argument. For now, I've made the function return 0 for such >> cases. >> > > As 0 is a valid return value for root nodes I think we should use -1 > instead for these cases. Makes sense, changed to be that way. Thanks, Amit >From bf12e5a4ee378603dec0216201c8234c96d7a973 Mon Sep 17 00:00:00 2001 From: amit Date: Tue, 16 Jan 2018 19:02:13 +0900 Subject: [PATCH v7] Add assorted partition reporting functions --- doc/src/sgml/func.sgml | 86 ++ src/backend/catalog/partition.c | 244 ++- src/backend/utils/cache/lsyscache.c | 22 +++ src/include/catalog/pg_proc.dat | 48 ++ src/include/utils/lsyscache.h| 1 + src/test/regress/expected/partition_info.out | 204 ++ src/test/regress/parallel_schedule | 2 +- src/test/regress/serial_schedule | 1 + src/test/regress/sql/partition_info.sql | 91 ++ 9 files changed, 694 insertions(+), 5 deletions(-) create mode 100644 src/test/regress/expected/partition_info.out create mode 100644 src/test/regress/sql/partition_info.sql diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index edc9be92a6..e1b7ace898 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19995,6 +19995,92 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup()); The function returns the number of new collation objects it created. 
+ +Partitioning Information Functions + + + Name Return Type Description + + + + + pg_partition_parent(regclass) + regclass + get parent if table is a partition, NULL otherwise + + + pg_partition_root_parent(regclass) + regclass + get topmost parent of a partition within partition tree + + + pg_partition_level(regclass, regclass) + regclass + get level of a partition within partition tree with respect to given parent + + + pg_partition_level(regclass) + regclass + get level of a partition within partition tree with respect to topmost root parent + + + pg_partition_children(regclass, bool) + setof regclass + +get partitions of a table; only immediate partitions are returned, +unless all tables in the partition tree, including itself and +partitions of lower levels, are requested by passing +true for second argument + + + + pg_partition_children(regclass) + setof regclass + Shorthand for pg_partition_children(..., false) + + + pg_partition_leaf_children(regclass) + setof regclass + get all leaf partitions of a given table + + + + + + +If the table passed to pg_partition_root_parent is not +a partition, the same table is returned as the result. 
+ + + +For example, to check the total size of the data contained in +measurement table described in +, use the following +query: + + + +select pg_size_pretty(sum(pg_relation_size(p))) as total_size from pg_partition_children('measurement', true) p; + total_size + + 24 kB +(1 row) + + + +One could have used pg_partition_leaf_children in +this case and got the same result as shown below: + + + +select pg_size_pretty(sum(pg_relation_size(p))) as total_size from pg_partition_leaf_children('measurement') p; + total_size + + 24 kB +(1 row) + + + diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index 558022647c..d90aaf8bb8 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -23,13 +23,16 @@ #include "catalog/partition.h" #include "catalog/pg_inherits.h" #include "catalog/pg_partitioned_table.h" +#include "funcapi.h" #include "nodes/makefuncs.h" #include "optimizer/clauses.h" #include "optimizer/prep.h" #include "optimizer/var.h" #include "partitioning/partbounds.h" #include "rewrite/rewriteManip.h" +#include "utils/builtins.h" #include "utils/fmgroids.h" +#include "utils/lsyscache.h" #include "utils/partcache.h" #include "utils/rel.h" #include "utils/syscache.h" @@ -38,16 +41,14 @@ static Oid get_partition_parent_worker(Relation inhRel, Oid relid); static void get_partition_ancestors_worker(Relation inhRel, Oid relid,
Re: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack
At Mon, 30 Jul 2018 09:34:22 +0900, Michael Paquier wrote in <20180730003422.ga2...@paquier.xyz>
> On Sun, Jul 29, 2018 at 04:11:38PM +, Bossart, Nathan wrote:
> > On 7/27/18, 7:10 PM, "Michael Paquier" wrote:
> > This is added to ReindexMultipleTables(), which is used for REINDEX
> > SCHEMA, REINDEX SYSTEM, and REINDEX DATABASE. Presently, REINDEX
> > SCHEMA requires being the owner of the schema, and REINDEX SYSTEM and
> > REINDEX DATABASE require being the owner of the database. So, if a
> > role is an owner of a database or the pg_catalog schema, they are able
> > to reindex shared catalogs like pg_authid.
>
> Yeah, I was testing that yesterday night and bumped on this case when
> trying to do a REINDEX SCHEMA pg_class. The point is that you can simplify
> the check and remove pg_database_ownercheck as there is already an ACL
> check on the database/system/schema at the top of the routine, so you
> are already sure that pg_database_ownercheck() or
> pg_namespace_ownercheck() would return true. This shaves a few cycles as
> well for each relation checked.

There's a case where the database owner differs from the catalogs' owner. ALTER DATABASE OWNER TO can make such a configuration. In this case, the previous code allows REINDEX of a shared catalog by the database owner, who is not the catalog's owner.

Superuser      : alice
Database owner : joe
Catalog owner  : andy
Schema owner   : joe

"REINDEX SCHEMA" by joe allows the shared catalog to be reindexed, even though he is *not* the owner of the catalog. The revised patch behaves differently in this case: "REINDEX SCHEMA" by joe *skips* the shared catalog, since he is not the owner of the catalog.

# Uh? Alice isn't involved anywhere..

I feel that just being a database owner doesn't justify causing this problem innocently. Catalog owner is also dubious, but we can carefully configure the ownerships to avoid the problem since only a superuser can change it. So I vote +1 for the revised patch.
> > I also noticed that this patch causes shared relations to be skipped
> > silently. Perhaps we should emit a WARNING or DEBUG message when this
> > happens, at least for REINDEXOPT_VERBOSE.
>
> That's intentional. I thought about that as well, but I am hesitant to
> do so as we don't bother mentioning the other relations skipped.
> REINDEX VERBOSE also shows up what are the tables processed, so it is
> easy to guess what are the tables skipped, still more painful. And the
> documentation changes added cover the gap.

Even if it is written in the "Notes" section, doesn't the following need some fix? It is the same for the DATABASE item.

| Parameters
...
| SYSTEM
| Recreate all indexes on system catalogs within the current
| database. *Indexes on shared system catalogs are included*.
| Indexes on user tables are not processed. This form
| of REINDEX cannot be executed inside a transaction block.

> > I noticed that there is no mention that the owner of a schema can do
> > REINDEX SCHEMA, which seems important to note. Also, the proposed
> > wording might seem slightly ambiguous for the REINDEX DATABASE case.
> > It might be clearer to say something like the following:
> >
> > Reindexing a single index or table requires being the owner of
> > that index or table. REINDEX DATABASE and REINDEX SYSTEM
> > require being the owner of the database, and REINDEX SCHEMA
> > requires being the owner of the schema (note that the user can
> > therefore rebuild indexes of tables owned by other users).
> > Reindexing a shared catalog requires being the owner of the
> > shared catalog, even if the user is the owner of the specified
> > database or schema. Of course, superusers can always reindex
> > anything.
>
> +1, I have included your suggestions. The patch attached applies easily
> down to 9.5 where REINDEX SCHEMA was added. For 9.4 and 9.3, there is
> no schema case, still the new check is similar.
> The commit message is slightly changed so that there is no mention of
> REINDEX SCHEMA. 9.3 needs a slight change compared to 9.4 as well.
>
> So attached are patches for 9.5~master, 9.4 and 9.3 with commit
> messages. Does that look fine to folks of this thread?

This apparently changes the documented behavior, and the problem seems to be the result of a rather malicious/intentional combination of operations (especially a named vacuum on a shared catalog). I vote -0.5 to backpatch unless we categorize this as a security issue.

regards.

-- Kyotaro Horiguchi NTT Open Source Software Center
Re: patch to allow disable of WAL recycling
On 19/07/2018 05:59, Kyotaro HORIGUCHI wrote: > My result is that we cannot disable recycling perfectly just by > setting min/max_wal_size. Maybe the behavior of min_wal_size should be rethought? Elsewhere in this thread, there was also a complaint that max_wal_size isn't actually a max. It seems like there might be some interest in making these settings more accurate. I mean, what is the point of the min_wal_size setting if not controlling this very thing? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Making "COPY partitioned_table FROM" faster
On 2018/07/30 17:33, Peter Eisentraut wrote: > - With proute becoming a function-level variable, > cstate->partition_tuple_routing is obsolete and could be removed. (No > point in saving this in cstate if it's only used in one function anyway.) +1. Also seems to apply to transition_capture, which afaict, was added in cstate only because partition_tuple_routing is there. Thanks, Amit
Re: Making "COPY partitioned_table FROM" faster
Two more thoughts: - Add some tests. The if (nBufferedTuples > 0) that flushes the tuples when the partition changes is not currently exercised. - With proute becoming a function-level variable, cstate->partition_tuple_routing is obsolete and could be removed. (No point in saving this in cstate if it's only used in one function anyway.) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018/07/28 10:54, David Rowley wrote: > On 27 July 2018 at 19:11, Amit Langote wrote: >> I've attached a delta patch to make the above changes. I'm leaving the >> hash table rename up to you though. > > Thanks for the delta patch. I took all of it, just rewrote a comment slightly. > > I also renamed the hash table to your suggestion and changed a few more > things. > > Attached a delta based on v2 and the full v3 patch. > > This includes another small change to make > PartitionDispatchData->indexes an array that's allocated in the same > memory as the PartitionDispatchData. This will save a palloc() call > and also should be a bit more cache friendly. > > This also required a rebase on master again due to 3e32109049. Thanks for the updated patch. I couldn't find much to complain about in the latest v3, except I noticed a few instances of the word "setup" where I think what's really meant is "set up". + * must be setup, but any sub-partitioned tables can be setup lazily as + * A ResultRelInfo has not been setup for this partition yet, By the way, when going over the updated code, I noticed that the code around child_parent_tupconv_maps could use some refactoring too. Especially, I noticed that ExecSetupChildParentMapForLeaf() allocates child-to-parent map array needed for transition tuple capture even if not needed by any of the leaf partitions. I'm attaching here a patch that applies on top of your v3 to show what I'm thinking we could do. Thanks, Amit From 6ce1654aa929c7f8112c430914af7f464474ed31 Mon Sep 17 00:00:00 2001 From: amit Date: Mon, 30 Jul 2018 14:05:17 +0900 Subject: [PATCH 2/2] Some refactoring around child_parent_tupconv_maps Just like parent_child_tupconv_maps, we should allocate it only if needed. Also, if none of the partitions ended up needing a map, we should not have allocated the child_parent_tupconv_maps array, only the child_parent_map_not_required one. 
So, get rid of ExecSetupChildParentMapForLeaf(), which currently does initial, possibly useless, allocation of both of the above mentioned arrays. Instead, have TupConvMapForLeaf() allocate the needed array, just like ExecInitRoutingInfo does, when it needs to store a parent-to-child map. Finally, rename the function TupConvMapForLeaf to LeafToParentTupConvMapForTC for clarity; TC stands for "Transition Capture". --- src/backend/commands/copy.c| 19 +- src/backend/executor/execPartition.c | 102 - src/backend/executor/nodeModifyTable.c | 4 +- src/include/executor/execPartition.h | 12 ++-- 4 files changed, 59 insertions(+), 78 deletions(-) diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 6fc1e2b41c..6d0e9229e0 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2503,22 +2503,9 @@ CopyFrom(CopyState cstate) * CopyFrom tuple routing. */ if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) - { - PartitionTupleRouting *proute; - - proute = cstate->partition_tuple_routing = + cstate->partition_tuple_routing = ExecSetupPartitionTupleRouting(NULL, cstate->rel); - /* -* If we are capturing transition tuples, they may need to be -* converted from partition format back to partitioned table format -* (this is only ever necessary if a BEFORE trigger modifies the -* tuple). 
-*/ - if (cstate->transition_capture != NULL) - ExecSetupChildParentMapForLeaf(proute); - } - /* * It's more efficient to prepare a bunch of tuples for insertion, and * insert them in one heap_multi_insert() call, than call heap_insert() @@ -2666,8 +2653,8 @@ CopyFrom(CopyState cstate) */ cstate->transition_capture->tcs_original_insert_tuple = NULL; cstate->transition_capture->tcs_map = - TupConvMapForLeaf(proute, saved_resultRelInfo, - leaf_part_index); + LeafToParentTupConvMapForTC(proute, saved_resultRelInfo, + leaf_part_index); } else { diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c index 2a18a30b3e..d183e8b758 100644 --- a/src/backend/executor/execPartition.c +++ b/src/backend/executor/execPartition.c @@ -400,7 +400,7 @@ ExecExpandRoutingArrays(PartitionTupleRouting *proute)
Re: adding tab completions
On Sun, Jul 29, 2018 at 07:42:43PM -0500, Justin Pryzby wrote:
> Your suggestion is good, so attached updated patch.

The patch is in good shape. It compiles without errors. The patch doesn't need documentation changes. I marked the patch as "Ready for Committer".

> > > Actually.. another thought: since toast tables may be VACUUM-ed, should I
> > > introduce Query_for_list_of_tpmt ?
> > I didn't include this one yet though.
> > Feel free to bump to next CF.

I think it could be done by a separate patch.

-- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: Temporary tables prevent autovacuum, leading to XID wraparound
On Fri, Jul 27, 2018 at 08:27:26AM +, Tsunakawa, Takayuki wrote:
> I don't have a strong opinion, but I wonder which of namespace.c or
> autovacuum.c is suitable, because isTempNamespaceInUse is specific to
> autovacuum.

I think that there is also a point in allowing other backends to use it as well, so I left it in namespace.c. I have been doing more testing with this patch today. In order to catch code paths where this is triggered, I have for example added an infinite loop in do_autovacuum when finding a temp table, which exits once a given on-disk file is found. This leaves plenty of time to attach autovacuum to a debugger and play with other sessions in parallel, so I have checked the transactional assumption this patch relies on, and tested down to v10, as that's where removal of orphaned temp relations has been made more aggressive. This patch can also be applied cleanly there.

The check on MyDatabaseId does not actually matter much, as the same temporary namespace OID would get reused only after an OID wraparound with a backend using a different database, but I left it anyway as it is more consistent to me to check for both database and namespace, and added a sanity check to make sure that the routine gets called only by a process connected to a database.

I am going to be out of town for a couple of days, and the next minor release planned is close by, so I am not pushing that today, but I'd like to do so after the next minor release if there are no objections. Attached is a patch with a proposal of commit message.

-- Michael

From 503868a13ff3c9cf41a3f3295977bd7c41f4ab6f Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Mon, 30 Jul 2018 16:54:27 +0900
Subject: [PATCH] Make autovacuum more aggressive to remove orphaned temp tables

Commit dafa084, added in 10, made the removal of temporary orphaned tables more aggressive. This commit goes one step further in the aggressiveness by adding a flag in each backend's MyProc which tracks down any namespace currently in use.
The flag is set when the namespace gets created and can be reset if the temporary namespace has been created in a transaction or sub-transaction which got aborted. This flag can be updated in a lock-less fashion, as autovacuum workers would scan pg_class without seeing the temporary namespace or any temporary relations on the transaction creating it gets committed. The base idea of this patch comes from Robert Haas. Author: Tsunakawa Takayuki Reviewed-by: Michael Paquier, Kyotaro Horiguchi Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8A4DC6@G01JPEXMBYT05 Backpatch: 10- --- src/backend/catalog/namespace.c | 65 - src/backend/postmaster/autovacuum.c | 19 - src/backend/storage/lmgr/proc.c | 2 + src/include/catalog/namespace.h | 1 + src/include/storage/proc.h | 3 ++ 5 files changed, 77 insertions(+), 13 deletions(-) diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index 0f67a122ed..b0fbb0b010 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -47,7 +47,7 @@ #include "parser/parse_func.h" #include "storage/ipc.h" #include "storage/lmgr.h" -#include "storage/sinval.h" +#include "storage/sinvaladt.h" #include "utils/acl.h" #include "utils/builtins.h" #include "utils/catcache.h" @@ -3204,6 +3204,45 @@ isOtherTempNamespace(Oid namespaceId) return isAnyTempNamespace(namespaceId); } +/* + * isTempNamespaceInUse - is the given namespace owned and actively used + * by a backend? + * + * Note: this can be used while scanning relations in pg_class to detect + * orphaned temporary tables or namespaces with a backend connected to a + * given database. + */ +bool +isTempNamespaceInUse(Oid namespaceId) +{ + PGPROC *proc; + int backendId; + + Assert(OidIsValid(MyDatabaseId)); + + backendId = GetTempNamespaceBackendId(namespaceId); + + if (backendId == InvalidBackendId || + backendId == MyBackendId) + return false; + + /* Is the backend alive? 
*/ + proc = BackendIdGetProc(backendId); + if (proc == NULL) + return false; + + /* Is the backend connected to the same database we are looking at? */ + if (proc->databaseId != MyDatabaseId) + return false; + + /* Does the backend own the temporary namespace? */ + if (proc->tempNamespaceId != namespaceId) + return false; + + /* all good to go */ + return true; +} + /* * GetTempNamespaceBackendId - if the given namespace is a temporary-table * namespace (either my own, or another backend's), return the BackendId @@ -3893,6 +3932,14 @@ InitTempTableNamespace(void) myTempNamespace = namespaceId; myTempToastNamespace = toastspaceId; + /* + * Mark MyProc as owning this namespace which other processes can use to + * decide if a temporary namespace is in use or not. Even if this is an + * atomic operation, this can be safely done lock-less as no temporary + * relations associated to it can be seen yet if scanning pg_class. + */ +
RE: How can we submit code patches that implement our (pending) patents?
From: Chris Travers [mailto:chris.trav...@adjust.com]
> For example a competitor of yours could copy the relevant pieces of the
> PostgreSQL code, refactor this into a library, and then use it as a
> derivative work, and this would be entirely within the copyright license.
> They could then license that library out, and you could not use your patents
> or copyrights to stop them. As such, a patent license would probably be
> very similar from a business perspective to a global public grant, even though
> the two strike me as something which might not offer protection in the case
> of a clean-room implementation.
>
> I certainly wouldn't be opposed to accepting patents where a license was
> granted to the PostgreSQL Global Developer Group and the community as a
> whole to make full use of the patents in any way covered by the copyright
> license of PostgreSQL (i.e. where any use that involves utilizing the
> copyright license for PostgreSQL extends to the patents in question). But
> I am not sure that a business case could be made for releasing any patents
> under such a license, since it means that for anyone else, using the patents
> even in commercial software becomes trivial, and enforcement would become
> very difficult indeed.

It looks like you are right that we can't restrict the use of patents in PostgreSQL code to PostgreSQL-based software... Re-reading the Apache License 2.0, it doesn't have any such restriction either. But, like the Apache License, I think the patents can still be useful to prevent litigation.

Regards
Takayuki Tsunakawa