Re: Needless additional partition check in INSERT?
On 10 May 2018 at 05:33, David Rowleywrote: > On 10 May 2018 at 16:13, Amit Langote wrote: >> The patch to ExecInsert looks good, but I think we also need to do the >> same thing in CopyFrom. > > I think so too. > > Updated patch attached. Patch is good. The cause of this oversight is the lack of comments to explain the original coding, so we need to correct that in this patch, please. -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
PANIC during crash recovery of a recently promoted standby
Hello, I recently investigated a problem where a standby is promoted to be the new master. The promoted standby crashes shortly thereafter for whatever reason. Upon running the crash recovery, the promoted standby (now master) PANICs with message such as: PANIC,XX000,"WAL contains references to invalid pages""XLogCheckInvalidPages, xlogutils.c:242","" After investigation, I could recreate a reproduction scenario for this problem. The attached TAP test (thanks Alvaro from converting my bash script to a TAP test) demonstrates the problem. The test is probably sensitive to timing, but it reproduces the problem consistently at least at my end. While the original report was for 9.6, I can reproduce it on the master and thus it probably affects all supported releases. Investigations point to a possible bug where we fail to update the minRecoveryPoint after completing the ongoing restart point upon promotion. IMV after promotion the new master must always recover to the end of the WAL to ensure that all changes are applied correctly. But what we've instead is that minRecoveryPoint remains set to a prior location because of this: /* * Update pg_control, using current time. Check that it still shows * IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing; * this is a quick hack to make sure nothing really bad happens if somehow * we get here after the end-of-recovery checkpoint. */ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY && ControlFile->checkPointCopy.redo < lastCheckPoint.redo) { ControlFile->checkPoint = lastCheckPointRecPtr; ControlFile->checkPointCopy = lastCheckPoint; ControlFile->time = (pg_time_t) time(NULL); /* * Ensure minRecoveryPoint is past the checkpoint record. Normally, * this will have happened already while writing out dirty buffers, * but not necessarily - e.g. because no buffers were dirtied. We do * this because a non-exclusive base backup uses minRecoveryPoint to * determine which WAL files must be included in the backup, and the * file (or files) containing the checkpoint record must be included, * at a minimum. Note that for an ordinary restart of recovery there's * no value in having the minimum recovery point any earlier than this * anyway, because redo will begin just after the checkpoint record. */ if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr) { ControlFile->minRecoveryPoint = lastCheckPointEndPtr; ControlFile->minRecoveryPointTLI = lastCheckPoint.ThisTimeLineID; /* update local copy */ minRecoveryPoint = ControlFile->minRecoveryPoint; minRecoveryPointTLI = ControlFile->minRecoveryPointTLI; } if (flags & CHECKPOINT_IS_SHUTDOWN) ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY; UpdateControlFile(); } LWLockRelease(ControlFileLock); After promotion, the minRecoveryPoint is only updated (cleared) when the first regular checkpoint completes. If a crash happens before that, we will run the crash recovery with a stale minRecoveryPoint, which results into the PANIC that we diagnosed. The test case was written to reproduce the issue as reported to us. Thus the test case TRUNCATEs and extends the table at hand after promotion. The crash shortly thereafter leaves the pages in uninitialised state because the shared buffers are not yet flushed to the disk. During crash recovery, we see uninitialised pages for the WAL records written before the promotion. These pages are remembered and we expect to either see a DROP TABLE or TRUNCATE WAL record before the minRecoveryPoint is reached. But since the minRecoveryPoint is still pointing to a WAL location prior to the TRUNCATE operation, crash recovery hits the minRecoveryPoint before seeing the TRUNCATE WAL record. That results in a PANIC situation. I propose that we should always clear the minRecoveryPoint after promotion to ensure that crash recovery always run to the end if a just-promoted standby crashes before completing its first regular checkpoint. A WIP patch is attached. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services 0002-Ensure-recovery-is-run-to-the-end-upon-promotion-of-.patch Description: Binary data 0001-A-new-TAP-test-to-test-a-recovery-bug.patch Description: Binary data
Re: Needless additional partition check in INSERT?
On 2018/05/10 13:33, David Rowley wrote: > On 10 May 2018 at 16:13, Amit Langotewrote: >> The patch to ExecInsert looks good, but I think we also need to do the >> same thing in CopyFrom. > > I think so too. > > Updated patch attached. Thanks, looks good. Regards, Amit
Re: Needless additional partition check in INSERT?
On 10 May 2018 at 16:13, Amit Langotewrote: > The patch to ExecInsert looks good, but I think we also need to do the > same thing in CopyFrom. I think so too. Updated patch attached. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services remove_needless_additional_partition_check_v2.patch Description: Binary data
Re: Needless additional partition check in INSERT?
On 2018/05/10 12:55, David Rowley wrote: > Hi, > > Scanning ExecInsert, it looks like there's a needless additional > partition constraint check against the tuple. This should only be > required if there's a before row INSERT trigger. The code block up > one from the additional check tries to disable the check, but it goes > ahead regardless, providing there's some other constraint. > > ExecFindPartition should have already located the correct partition > and nothing should have changed in the absence of before row insert > triggers, so it looks like we're fine to not bother re-checking. I think you're right. So if we call ExecConstraints only because there are other tuple constraints (rd_att->constr != NULL), then currently we pass true for whether to check the partition constraint, irrespective of the value of check_partition_constr. I guess 19c47e7c820 should have done it the way your patch teaches it do to begin with. The patch to ExecInsert looks good, but I think we also need to do the same thing in CopyFrom. Thanks, Amit
Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.
On Wed, May 9, 2018 at 5:20 PM, Etsuro Fujitawrote: > > (2018/04/25 18:51), Ashutosh Bapat wrote: >> Actually I noticed that ConvertRowtypeExpr are used to cast a child's >> whole row reference expression (not just a Var node) into that of its >> parent and not. For example a cast like NULL::child::parent produces a >> ConvertRowtypeExpr encapsulating a NULL constant node and not a Var >> node. We are interested only in ConvertRowtypeExprs embedding Var >> nodes with Var::varattno = 0. I have changed this code to use function >> is_converted_whole_row_reference() instead of the above code with >> Assert. In order to use that function, I had to take it out of >> setrefs.c and put it in nodeFuncs.c which seemed to be a better fit. > > This change seems a bit confusing to me because the flag bits > "PVC_INCLUDE_CONVERTROWTYPES" and "PVC_RECURSE_CONVERTROWTYPES" passed to > pull_var_clause look as if it handles any ConvertRowtypeExpr nodes from a > given clause, but really, it only handles ConvertRowtypeExprs you mentioned > above. To make that function easy to understand and use, I think it'd be > better to use the IsA(node, ConvertRowtypeExpr) test as in the first version > of the patch, instead of is_converted_whole_row_reference, which would be > more consistent with other cases such as PlaceHolderVar. I agree that using is_converted_whole_row_reference() is not consistent with the other nodes that are handled by pull_var_clause(). I also agree that PVC_*_CONVERTROWTYPES doesn't reflect exactly what's being done with those options. But using is_converted_whole_row_reference() is the correct thing to do since we are interested only in the whole-row references embedded in ConvertRowtypeExpr. There can be anything encapsulated in ConvertRowtypeExpr(), a non-shippable function for example. We don't want to try to push that down in postgres_fdw's case. Neither in other cases. So, it boils down to changing names of PVC_*_CONVERTROWTYPES to something more meaningful. How about PVC_*_CONVERTWHOLEROWS? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Needless additional partition check in INSERT?
Hi, Scanning ExecInsert, it looks like there's a needless additional partition constraint check against the tuple. This should only be required if there's a before row INSERT trigger. The code block up one from the additional check tries to disable the check, but it goes ahead regardless, providing there's some other constraint. ExecFindPartition should have already located the correct partition and nothing should have changed in the absence of before row insert triggers, so it looks like we're fine to not bother re-checking. Or did I misunderstand? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services remove_needless_additional_partition_check.patch Description: Binary data
Re: Indexes on partitioned tables and foreign partitions
On 2018/05/10 10:02, Michael Paquier wrote: > Something > that I find confusing on HEAD though is that DefineIndex calls itself > around line 1006 and cascades through each children but there is no > context about the error. > > For example if I have this partition layer: > CREATE TABLE measurement ( > logdate date not null, > peaktempint, > unitsales int > ) PARTITION BY RANGE (logdate); > CREATE FOREIGN TABLE measurement_y2016m07 > PARTITION OF measurement FOR VALUES FROM ('2016-07-01') TO > ('2016-08-01') > SERVER postgres_server; > > Then by creating an index on the parent I get that: > =# create index measurement_idx on measurement (peaktemp); > ERROR: 42809: cannot create index on foreign table "measurement_y2016m07" > LOCATION: DefineIndex, indexcmds.c:442 > > This may be confusing for the user as the DDL does not complain directly > about the relation worked on. Perhaps we could add an error context? > We surely don't want multiple contexts either when working on long > chains, but we could build a chained list of relation names in the error > message. How about we error out even *before* calling DefineIndex for the 1st time? I see that ProcessUtilitySlow() gets a list of all partitions when locking them for index creation before calling DefineIndex. Maybe, just go through the list and error out if one of them is a partition that we don't support creating an index on? For example, with the attached patch doing the above, I get: create table p (a int) partition by range (a); create table p1 partition of p for values from (minvalue) to (1); create foreign table p2 partition of p for values from (1) to (maxvalue) server loopback options (table_name 'p2base'); create index on p (a); ERROR: cannot create index on partitioned table containing foreign partitions Thanks, Amit diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index 287addf429..d49e32d83f 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -67,6 +67,7 @@ #include "tcop/utility.h" #include "utils/acl.h" #include "utils/guc.h" +#include "utils/lsyscache.h" #include "utils/syscache.h" #include "utils/rel.h" @@ -1318,12 +1319,26 @@ ProcessUtilitySlow(ParseState *pstate, if (stmt->relation->inh) { Relationrel; + ListCell *lc; /* already locked by RangeVarGetRelidExtended */ rel = heap_open(relid, NoLock); if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) inheritors = find_all_inheritors(relid, lockmode, NULL); + foreach(lc, inheritors) + { + charrelkind = get_rel_relkind(lfirst_oid(lc)); + + if (relkind != RELKIND_RELATION && + relkind != RELKIND_MATVIEW && + relkind != RELKIND_PARTITIONED_TABLE) + ereport(ERROR, + (errcode(ERRCODE_WRONG_OBJECT_TYPE), + errmsg("cannot create index on partitioned table containing foreign partitions"))); + } + + list_free(inheritors); heap_close(rel, NoLock); } @@ -1353,8 +1368,6 @@ ProcessUtilitySlow(ParseState *pstate, parsetree); commandCollected = true; EventTriggerAlterTableEnd(); - - list_free(inheritors); } break;
Re: Indexes on partitioned tables and foreign partitions
On 2018/05/10 10:37, Michael Paquier wrote: > On Thu, May 10, 2018 at 10:15:05AM +0900, Amit Langote wrote: >> While I agree with this, let me point out that we do allow inherited check >> constraints on foreign tables that are not actually enforced locally. >> >> create table p (a int) partition by range (a); >> create table p1 partition of p for values from (minvalue) to (1); >> create table p2base (a int); >> create foreign table p2 partition of p for values from (1) to (maxvalue) >> server loopback options (table_name 'p2base'); >> >> alter table p add check (a between -1000 and 1000); >> >> -- routed to foreign partition, which doesn't enforce check constraints >> insert into p values (1001); >> INSERT 0 1 > > That's not actually a surprise, right? Since foreign tables can be part > of inheritance trees in 9.5, CHECK constraints on foreign tables are not > enforced locally, but used as planner hints to guess how a query would > work remotely. So getting partition children to work the same way is > consistent. > >> We have to do the following to prevent that. >> >> alter table p2base add check (a between -1000 and 1000); >> insert into p values (1001); >> ERROR: new row for relation "p2base" violates check constraint >> "p2base_a_check" >> DETAIL: Failing row contains (1001). >> CONTEXT: remote SQL command: INSERT INTO public.p2base(a) VALUES ($1) > > This bit looks natural to me as well. Yes, I know it is working as designed and documented. I was just trying to comment on this bit of Robert's email: "...because a major point of such an index is to enforce a constraint; we can't allege that we have such a constraint if foreign tables are just silently skipped." So if someday we go ahead and allow indexes to be created on partitioned tables even if there are foreign partitions, how would we choose to deal with a unique constraint? Will it be same as CHECK constraints such that we don't enforce it locally but only assume it to be enforced by the remote data source using whatever method? But it seems I've misinterpreted what he was saying. He doesn't seem to be saying anything about how or whether we enforce the unique constraint on foreign tables. Only that if someone creates a constraint index on the partitioned table, all partitions *including* foreign partitions, must get a copy. So for now, we give users an error if they try to create an index on a partitioned table with a mix of local and foreign partitions. Once we figure out how to allow creating indexes (constraint-enforcing or not) on foreign tables, we can then think of relaxing that restriction. Thanks, Amit
Re: Should we add GUCs to allow partition pruning to be disabled?
David Rowley wrote: > On 10 May 2018 at 14:01, Alvaro Herrerawrote: > > I'm thinking something a bit more radical. First, since partition > > pruning is the future and constraint exclusion is soon to be a thing of > > the past, we should describe pruning first, and then describe exclusion > > in terms of pruning. > > But... that's not true. The chapter describes inheritance partitioned > tables too, and we're not getting rid of constraint exclusion because > it's needed for those. Oh, I'm sure it is, but nobody is going to set up new inheritance partitioned tables anymore, except people who pg_upgrade from older releases. (And while I haven't tried, I'm sure it's possible to migrate from old-style to new-style partitioned tables without incurring full table rewrites, with little downside and lots to gain.) Now, maybe you argue that we could have a structure like this instead: 5.10.1. Overview 5.10.2. Declarative Partitioning 5.10.3. Partition Pruning 5.10.4. Implementation Using Inheritance 5.10.5. Constraint Exclusion I wouldn't oppose that. > However, that might not mean your patch has to be changed. I'd better > have a look... Thanks :-) -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Should we add GUCs to allow partition pruning to be disabled?
On 10 May 2018 at 14:01, Alvaro Herrerawrote: > I'm thinking something a bit more radical. First, since partition > pruning is the future and constraint exclusion is soon to be a thing of > the past, we should describe pruning first, and then describe exclusion > in terms of pruning. But... that's not true. The chapter describes inheritance partitioned tables too, and we're not getting rid of constraint exclusion because it's needed for those. However, that might not mean your patch has to be changed. I'd better have a look... -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Should we add GUCs to allow partition pruning to be disabled?
David Rowley wrote: > Thanks for reviewing again. Hi, I'm thinking something a bit more radical. First, since partition pruning is the future and constraint exclusion is soon to be a thing of the past, we should describe pruning first, and then describe exclusion in terms of pruning. Second, I'd put constraint exclusion as a inside the that describes pruning (but keep the XML "id" the same, so that old links continue to work.) I took a stab at this, but ran out of time before trimming the text for constraint exclusion. What do you think of this rough sketch? I'm thinking 5.10.4 is close to its final form (wording suggestions of course welcome), but 5.10.4.1 still needs to be trimmed heavily, to avoid repeating what was already explained in 5.10.4 (we need only explain how exclusion differs from pruning.) I'm a bit undecided on where to leave the . (Note: make -C doc/src/sgml html XSLTPROCFLAGS='--stringparam rootid ddl' builds only the 'ddl' chapter, which is nice when proofreading.) -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 3f3f567222..2152b4d16d 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -3759,7 +3759,151 @@ ANALYZE measurement; - + + Partition Pruning + + +partition pruning + + + +Partition pruning is a query optimization technique +that improves performance for partitioned tables. As an example: + + +SET enable_partition_pruning = on;-- the default +SELECT count(*) FROM measurement WHERE logdate = DATE '2008-01-01'; + + +Without partition pruning, the above query would scan each of the +the partitions of the measurement table. With +partition pruning enabled, the planner will examine the definition of each +partition and prove that the partition need not +be scanned because it could not contain any rows meeting the query's +WHERE clause. When the planner can prove this, it +excludes the partition from the query plan. + + + +You can use the EXPLAIN command to show the difference +between a plan with enable_partition_pruning on and a plan +with it off. A typical unoptimized plan for this type of table setup is: + + +SET enable_partition_pruning = off; +EXPLAIN SELECT count(*) FROM measurement WHERE logdate = DATE '2008-01-01'; +QUERY PLAN +─── + Aggregate (cost=188.76..188.77 rows=1 width=8) + - Append (cost=0.00..181.05 rows=3085 width=0) + - Seq Scan on measurement_y2006m02 (cost=0.00..33.12 rows=617 width=0) + Filter: (logdate = '2008-01-01'::date) + - Seq Scan on measurement_y2006m03 (cost=0.00..33.12 rows=617 width=0) + Filter: (logdate = '2008-01-01'::date) + - Seq Scan on measurement_y2007m11 (cost=0.00..33.12 rows=617 width=0) + Filter: (logdate = '2008-01-01'::date) + - Seq Scan on measurement_y2007m12 (cost=0.00..33.12 rows=617 width=0) + Filter: (logdate = '2008-01-01'::date) + - Seq Scan on measurement_y2008m01 (cost=0.00..33.12 rows=617 width=0) + Filter: (logdate = '2008-01-01'::date) + + +Some or all of the partitions might use index scans instead of +full-table sequential scans, but the point here is that there +is no need to scan the older partitions at all to answer this query. +When we enable partition pruning, we get a significantly +cheaper plan that will deliver the same answer: + + +SET enable_partition_pruning = on; +EXPLAIN SELECT count(*) FROM measurement WHERE logdate = DATE '2008-01-01'; +QUERY PLAN +─── + Aggregate (cost=37.75..37.76 rows=1 width=8) + - Append (cost=0.00..36.21 rows=617 width=0) + - Seq Scan on measurement_y2008m01 (cost=0.00..33.12 rows=617 width=0) + Filter: (logdate = '2008-01-01'::date) + + + + +Note that partition pruning is driven only by the constraints defined by +the partition keys, not by the presence of indexes. Therefore it isn't +necessary to define indexes on the key columns. Whether an index +needs to be created for a given partition depends on whether you +expect that queries that scan the partition will generally scan +a large part of the partition or just a small part. An index will +be helpful in the latter case but not the former. + + + +Partition pruning +can be performed not only during the planning of a given query, but also +during its execution. This is useful as it can allow more partitions to +be pruned
Re: Postgres, fsync, and OSs (specifically linux)
On 10 May 2018 at 06:55, Andres Freundwrote: > Do you have a patchset including that? I didn't find anything after a > quick search... There was an earlier rev on the other thread but without msync checks. I've added panic for msync in the attached, and tidied the comments a bit. I didn't add comments on why we panic to each individual pg_fsync or FileSync caller that panics; instead pg_fsync points to pg_fsync_no_writethrough which explains it briefly and has links. I looked at callers of pg_fsync, pg_fsync_no_writethrough, pg_fsync_writethrough, mdsync, and FileSync when writing this. WAL writing already PANIC'd on fsync failure, which helps, though we now know that's not sufficient for complete safety. Patch on top of v11 HEAD @ ddc1f32ee507 -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services From a6cade9e1de68962d95374127841b0af8eb4cfe0 Mon Sep 17 00:00:00 2001 From: Craig Ringer Date: Tue, 10 Apr 2018 14:08:32 +0800 Subject: [PATCH v2] PANIC when we detect a possible fsync I/O error instead of retrying fsync Panic the server on fsync failure in places where we can't simply repeat the whole operation on retry. Most imporantly, panic when fsync fails during a checkpoint. This will result in log messages like: PANIC: 58030: could not fsync file "base/12367/16386": Input/output error LOG: 0: checkpointer process (PID 10799) was terminated by signal 6: Aborted and, if the condition persists during redo: LOG: 0: checkpoint starting: end-of-recovery immediate PANIC: 58030: could not fsync file "base/12367/16386": Input/output error LOG: 0: startup process (PID 10808) was terminated by signal 6: Aborted Why? In a number of places PostgreSQL we responded to fsync() errors by retrying the fsync(), expecting that this would force the operating system to repeat any write attempts. The code assumed that fsync() would return an error on all subsequent calls until any I/O error was resolved. This is not what actually happens on some platforms, including Linux. The operating system may give up and drop dirty buffers for async writes on the floor and mark the page mapping as bad. The first fsync() clears any error flag from the page entry and/or our file descriptor. So a subsequent fsync() returns success, even though the data PostgreSQL wrote was really discarded. We have no way to find out which writes failed, and no way to ask the kernel to retry indefinitely, so all we can do is PANIC. Redo will attempt the write again, and if it fails again, it will also PANIC. This doesn't completely prevent fsync reliability issues, because it only handles cases where the kernel actually reports the error to us. It's entirely possible for a buffered write to be lost without causing fsync to report an error at all (see discussion below). Work on addressing those issues and documenting them is ongoing and will be committed separately. See: * https://www.postgresql.org/message-id/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com * https://www.postgresql.org/message-id/20180427222842.in2e4mibx45zd...@alap3.anarazel.de * https://lwn.net/Articles/752063/ * https://lwn.net/Articles/753650/ * https://lwn.net/Articles/752952/ * https://lwn.net/Articles/752613/ --- src/backend/access/heap/rewriteheap.c | 6 +++--- src/backend/access/transam/timeline.c | 4 ++-- src/backend/access/transam/twophase.c | 2 +- src/backend/access/transam/xlog.c | 4 ++-- src/backend/replication/logical/snapbuild.c | 3 +++ src/backend/storage/file/fd.c | 29 +++-- src/backend/storage/smgr/md.c | 22 -- src/backend/utils/cache/relmapper.c | 2 +- 8 files changed, 55 insertions(+), 17 deletions(-) diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c index 8d3c861a33..0320baffec 100644 --- a/src/backend/access/heap/rewriteheap.c +++ b/src/backend/access/heap/rewriteheap.c @@ -965,7 +965,7 @@ logical_end_heap_rewrite(RewriteState state) while ((src = (RewriteMappingFile *) hash_seq_search(_status)) != NULL) { if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0) - ereport(ERROR, + ereport(PANIC, (errcode_for_file_access(), errmsg("could not fsync file \"%s\": %m", src->path))); FileClose(src->vfd); @@ -1180,7 +1180,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r) */ pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC); if (pg_fsync(fd) != 0) - ereport(ERROR, + ereport(PANIC, (errcode_for_file_access(), errmsg("could not fsync file \"%s\": %m", path))); pgstat_report_wait_end(); @@ -1279,7 +1279,7 @@ CheckPointLogicalRewriteHeap(void) */ pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC); if (pg_fsync(fd) != 0)
Re: Indexes on partitioned tables and foreign partitions
On Thu, May 10, 2018 at 10:15:05AM +0900, Amit Langote wrote: > While I agree with this, let me point out that we do allow inherited check > constraints on foreign tables that are not actually enforced locally. > > create table p (a int) partition by range (a); > create table p1 partition of p for values from (minvalue) to (1); > create table p2base (a int); > create foreign table p2 partition of p for values from (1) to (maxvalue) > server loopback options (table_name 'p2base'); > > alter table p add check (a between -1000 and 1000); > > -- routed to foreign partition, which doesn't enforce check constraints > insert into p values (1001); > INSERT 0 1 That's not actually a surprise, right? Since foreign tables can be part of inheritance trees in 9.5, CHECK constraints on foreign tables are not enforced locally, but used as planner hints to guess how a query would work remotely. So getting partition children to work the same way is consistent. > We have to do the following to prevent that. > > alter table p2base add check (a between -1000 and 1000); > insert into p values (1001); > ERROR: new row for relation "p2base" violates check constraint > "p2base_a_check" > DETAIL: Failing row contains (1001). > CONTEXT: remote SQL command: INSERT INTO public.p2base(a) VALUES ($1) This bit looks natural to me as well. -- Michael signature.asc Description: PGP signature
Re: Indexes on partitioned tables and foreign partitions
On 2018/05/09 23:57, Robert Haas wrote: > For right now, I think the options are (1) throw an ERROR if we > encounter a foreign table or (2) silently skip the foreign table. I > think (2) is defensible for non-UNIQUE indexes, because the index is > just a performance optimization. Along with others, my +1 to option 1. > However, for UNIQUE indexes, at > least, it seems like we'd better do (1), because a major point of such > an index is to enforce a constraint; we can't allege that we have such > a constraint if foreign tables are just silently skipped. While I agree with this, let me point out that we do allow inherited check constraints on foreign tables that are not actually enforced locally. create table p (a int) partition by range (a); create table p1 partition of p for values from (minvalue) to (1); create table p2base (a int); create foreign table p2 partition of p for values from (1) to (maxvalue) server loopback options (table_name 'p2base'); alter table p add check (a between -1000 and 1000); -- routed to foreign partition, which doesn't enforce check constraints insert into p values (1001); INSERT 0 1 -- whereas, local partition does insert into p values (-1001); ERROR: new row for relation "p1" violates check constraint "p_a_check" DETAIL: Failing row contains (-1001). We have to do the following to prevent that. alter table p2base add check (a between -1000 and 1000); insert into p values (1001); ERROR: new row for relation "p2base" violates check constraint "p2base_a_check" DETAIL: Failing row contains (1001). CONTEXT: remote SQL command: INSERT INTO public.p2base(a) VALUES ($1) Thanks, Amit
Re: Indexes on partitioned tables and foreign partitions
On Wed, May 09, 2018 at 11:10:43AM -0400, Tom Lane wrote: > Agreed about unique indexes. I suggest that we throw an error for both > cases, because (1) having unique and non-unique indexes behave differently > for this purpose seems pretty weird; (2) throwing an error now preserves > our options to do something different later. Given where we are in the > release cycle, we should be taking the most conservative path we can. Heartfully agreed to throw an error for all cases for now. Something that I find confusing on HEAD though is that DefineIndex calls itself around line 1006 and cascades through each children but there is no context about the error. For example if I have this partition layer: CREATE TABLE measurement ( logdate date not null, peaktempint, unitsales int ) PARTITION BY RANGE (logdate); CREATE FOREIGN TABLE measurement_y2016m07 PARTITION OF measurement FOR VALUES FROM ('2016-07-01') TO ('2016-08-01') SERVER postgres_server; Then by creating an index on the parent I get that: =# create index measurement_idx on measurement (peaktemp); ERROR: 42809: cannot create index on foreign table "measurement_y2016m07" LOCATION: DefineIndex, indexcmds.c:442 This may be confusing for the user as the DDL does not complain directly about the relation worked on. Perhaps we could add an error context? We surely don't want multiple contexts either when working on long chains, but we could build a chained list of relation names in the error message. -- Michael signature.asc Description: PGP signature
Re: [HACKERS] path toward faster partition pruning
On 2018/05/09 22:43, Alvaro Herrera wrote: > Amit Langote wrote: >> On 2018/05/09 11:31, David Rowley wrote: >>> On 9 May 2018 at 14:29, Amit Langotewrote: On 2018/05/09 11:20, Michael Paquier wrote: > While looking at this code, is there any reason to not make > gen_partprune_steps static? This is only used in partprune.c for now, > so the intention is to make it available for future patches? > >> Here is a patch that does that. > > Pushed, thanks. Thank you. Regards, Amit
Re: Postgres, fsync, and OSs (specifically linux)
On 2018-05-01 09:38:03 +0800, Craig Ringer wrote: > On 1 May 2018 at 00:09, Andres Freundwrote: > > > It's not. Only SYNC_FILE_RANGE_WAIT_{BEFORE,AFTER} eat errors. Which > > seems sensible, because they could be considered data integrity > > operations. > > Ah, I misread that. Thankyou. > > >> I'm very suspicious about the safety of the msync() path too. > > > > That seems justified however: > > I'll add EIO tests there. Do you have a patchset including that? I didn't find anything after a quick search... Greetings, Andres Freund
Re: [HACKERS] path toward faster partition pruning
On Wed, May 09, 2018 at 10:39:07AM -0300, Alvaro Herrera wrote: > Michael Paquier wrote: >> This removes a useless default clause in partprune.c and it got >> forgotten in the crowd. Just attaching it again here, and it can just >> be applied on top of the rest. > > Done, thanks for insisting. Thanks! -- Michael signature.asc Description: PGP signature
Re: [SPAM] Re: Local partitioned indexes and pageinspect
On Wed, May 9, 2018 at 2:04 PM, Michael Paquierwrote: > On Wed, May 09, 2018 at 02:28:50PM -0300, Alvaro Herrera wrote: >> I pushed some fixes produced here. Attached is the remainder of the >> patch you submitted. I notice now that we haven't actually fixed >> Peter's source of complaint, though. AFAICS your patch just adds test >> cases, and upthread discussion apparently converges on not doing >> anything about the code. I'm not yet sure what to think of that ... > > Thanks Álvaro. I tend to think that there is little point in changing > the error handling, still it would be good to get test coverage. That's > not really a bug though as in all those cases we don't get errors like > "could not open file" or such. So we could also let things as they are > now. Now that the relkind issue is documented, I wouldn't mind just leaving it as-is. -- Peter Geoghegan
Re: [SPAM] Re: Local partitioned indexes and pageinspect
On Wed, May 09, 2018 at 02:28:50PM -0300, Alvaro Herrera wrote: > I pushed some fixes produced here. Attached is the remainder of the > patch you submitted. I notice now that we haven't actually fixed > Peter's source of complaint, though. AFAICS your patch just adds test > cases, and upthread discussion apparently converges on not doing > anything about the code. I'm not yet sure what to think of that ... Thanks Álvaro. I tend to think that there is little point in changing the error handling, still it would be good to get test coverage. That's not really a bug though as in all those cases we don't get errors like "could not open file" or such. So we could also let things as they are now. -- Michael signature.asc Description: PGP signature
Re: parallel.sgml for Gather with InitPlans
On Tue, May 8, 2018 at 11:21 PM, Amit Kapilawrote: > On Tue, May 8, 2018 at 5:27 PM, Robert Haas wrote: >> On Mon, May 7, 2018 at 11:34 PM, Amit Kapila wrote: >>> I think we can cover InitPlan and Subplans that can be parallelized in >>> a separate section "Parallel Subplans" or some other heading. I think >>> as of now we have enabled parallel subplans and initplans in a >>> limited, but useful cases (as per TPC-H benchmark) and it might be >>> good to cover them in a separate section. I can come up with an >>> initial patch (or I can review it if you write the patch) if you and >>> or others think that makes sense. >> >> We could go that way, but what I wrote is short and -- I think -- accurate. >> > > Okay, again thinking about it after your explanation, it appears > correct to me, but it was not apparent on the first read. I think > other alternatives could be (a) Evaluation of initplan OR (b) > Execution of initplan. I think it makes sense to add what you have > written or one of the alternatives suggested by me as you deem most > appropriate. I think one can always write a detailed explanation as a > separate patch. I've committed what I suggested before. If you want to propose another patch, feel free, but I'm not sure how much energy this is worth. The more detailed we make the documentation the more we have to update the next time something changes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Clock with Adaptive Replacement
On Thu, May 3, 2018 at 5:16 PM, Peter Geogheganwrote: > On Thu, May 3, 2018 at 12:37 PM, Robert Haas wrote: >> On Wed, May 2, 2018 at 3:06 PM, Vladimir Sitnikov >> wrote: >>> Sample output can be seen here: >>> https://github.com/vlsi/pgsqlstat/tree/pgsqlio#pgsqlio >> >> Neat. Not sure what generated this trace, but note this part: > I don't have time to check this out just now, but it seems like an > excellent idea. It would be nice if it could be enhanced further, so > you get some idea of what the blocks are without having to decode them > yourself using tools like pageinspect. Although I'm a Linux user, I added --enable-dtrace to my standard release build configure alias several months back. I wanted to be able to use the user space probes it adds with BCC [1], which is a really nice tool (the dtrace/system tap probes seem to be totally generic). If you have a fairly recent kernel, many really nice tricks are possible with only modest effort. I've been meaning to polish up a selection of one-liners I came up with several months back, to make them available for others. Here is a one-liner that traces buffer reads within the backend with PID 6617: pg@bat:~/notes/setup$ sudo /usr/share/bcc/tools/trace -p 6617 --time 'u:/code/postgresql/root/install/bin/postgres:postgresql:buffer__read__start "fork: %d, blockNum: %u, spcNode: %u, dbNode: %u, relNode: %u, backend: %d, isExtend: %d" arg1, arg2, arg3, arg4, arg5, arg6, arg7' If I run the query "select * from pgbench_accounts where aid = 555" within that backend, I can see the following: TIME PIDTIDCOMM FUNC - 12:17:28 6617 6617 postgres buffer__read__start fork: 0, blockNum: 290, spcNode: 1663, dbNode: 13100, relNode: 16404, backend: -1, isExtend: 0 12:17:28 6617 6617 postgres buffer__read__start fork: 0, blockNum: 3, spcNode: 1663, dbNode: 13100, relNode: 16404, backend: -1, isExtend: 0 12:17:28 6617 6617 postgres buffer__read__start fork: 0, blockNum: 2, spcNode: 1663, dbNode: 13100, relNode: 16404, backend: -1, isExtend: 0 12:17:28 6617 6617 postgres buffer__read__start fork: 0, blockNum: 9, spcNode: 1663, dbNode: 13100, relNode: 16396, backend: -1, isExtend: 0 This is pretty easy to interpret: 290 is the root page, 3 in another internal page, and 2 in the leaf page. Finally, block 9 is from the heap relation. I haven't actually instrumented the amount of time each read takes here, so it's not quite as good as your dtrace script. I'm sure that that would be trivial, since BCC really does seem to make customization very easy. I just don't want to spend more than 10 minutes on this today. [1] https://github.com/iovisor/bcc -- Peter Geoghegan
Re: [HACKERS] Parallel Append implementation
On Tue, May 8, 2018 at 5:05 PM, Thomas Munrowrote: > +scanning them more than once would preduce duplicate results. Plans that > > s/preduce/produce/ Fixed, thanks. > +Append or MergeAppend plan node. > vs. > +Append of regular Index Scan plans; each > > I think we should standardise on Foo Bar, > FooBar or foo bar when > discussing executor nodes on this page. Well, EXPLAIN prints MergeAppend but Index Scan, and I think we should follow that precedent here. As for vs. , I think the reason I ended up using in the section on scans was because I thought that Parallel Seq Scan might be confusing (what's a "seq"?), so I tried to fudge my way around that by referring to it as an abstract idea rather than the exact EXPLAIN output. You then copied that style in the join section, and, well, like you say, now we have a sort of hodgepodge of styles. Maybe that's a problem for another patch, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company parallel-append-doc-v2.patch Description: Binary data
Re: [HACKERS] Clock with Adaptive Replacement
On Wed, May 9, 2018 at 1:37 PM, Merlin Moncurewrote: > On Wed, May 9, 2018 at 11:00 AM Robert Haas wrote: >> Independently of that, it would be probably also useful to avoid >> bumping the reference count multiple times when a buffer is accessed >> by the same backend several times in quick succession. Perhaps this >> could even be as simple as maintaining a list of the last two buffer >> IDs for which we bumped the usage count; if we see one of them again, >> don't bump the usage count again. > > Hm. Is the objective here to optimize access patterns or to reduce atomic > operations (or both)? All else being equal, an algorithm that delivers > the similar eviction results with less cache synchronization ought to be > preferred...are the various algorithms scored in that way? The CAR paper explains steps they've taken to minimize atomic operations, so it is something that people do think about. The main reason for wanting to avoid extra usage count bumps is to avoid distorting the usage counts. Any savings we get from reducing atomic ops is a bonus. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Clock with Adaptive Replacement
On Wed, May 9, 2018 at 11:00 AM Robert Haaswrote: > Independently of that, it would be probably also useful to avoid > bumping the reference count multiple times when a buffer is accessed > by the same backend several times in quick succession. Perhaps this > could even be as simple as maintaining a list of the last two buffer > IDs for which we bumped the usage count; if we see one of them again, > don't bump the usage count again. Hm. Is the objective here to optimize access patterns or to reduce atomic operations (or both)? All else being equal, an algorithm that delivers the similar eviction results with less cache synchronization ought to be preferred...are the various algorithms scored in that way? merlin
Re: [HACKERS] Clock with Adaptive Replacement
On Tue, May 8, 2018 at 10:22 AM, Юрий Соколовwrote: > "just prior to the point" means we have to have machinery for information > expiration without eviction. For example, same "clock hand" should walk > over all buffers continiously, and decrement "usage count", but without > eviction performed. Right? Right. > As alternative, some approximation of Working Set algorithm could be used: > - on every access "access time" should be written to item, > - items accessed before some "expiration interval" are considered for > expiration. > This way, there is also fresh information about usage even without any > expiration performed yet. That's basically the same kind of thing. Ideally we want to have one mechanism that operates all the time, rather than one that works when no eviction is occurring and a completely separate one that operates when eviction is occurring. Anyway, I think we really ought to investigate CAR. CAR could be made to piggyback on our existing GLOCK scheme. See section III.B of the the CAR paper, page 4, especially the second paragraph in the second column. What would change is that instead of sweeping through all buffers, there would be two sets of buffers T1 and T2, each with a separate clock hand. Also, we'd need to store the history lists B1 and B2 someplace. Independently of that, it would be probably also useful to avoid bumping the reference count multiple times when a buffer is accessed by the same backend several times in quick succession. Perhaps this could even be as simple as maintaining a list of the last two buffer IDs for which we bumped the usage count; if we see one of them again, don't bump the usage count again. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Cutting initdb's runtime (Perl question embedded)
Alvaro Herrerawrites: > Tom Lane wrote: >> Robert Haas writes: >>> Who says we need a portable way? If we had something that worked on >>> Linux and macOS, it would cover most developer environments. I wonder >>> if readlink("/etc/localtime", buf, sz) might be a viable approach. >> I wondered about that, but I'm afraid it's often a hardlink not a >> symlink. Still, we could try it. > In Debian systems, it's a symlink. Apparently in RHEL6 and older it's a > copy or hardlink, and the file /etc/sysconfig/clock contains a ZONE > variable that points to the right zone. Yeah, on my RHEL6 box, $ ls -l /etc/localtime -rw-r--r--. 1 root root 3519 May 4 2010 /etc/localtime The mod date's a little astonishing, considering this system was built in 2013. It is *not* a hardlink, or even an exact match, to anything under /usr/share/zoneinfo, though perhaps it was originally. Also: $ cat /etc/sysconfig/clock # The time zone of the system is defined by the contents of /etc/localtime. # This file is only for evaluation by system-config-date, do not rely on its # contents elsewhere. ZONE="America/New York" I'm inclined to respect the comment, especially since I see they are not spelling the zone name canonically anyway (note space not underscore); so looking at this wouldn't work without some ill-defined heuristics. However, more modern Red Hat systems seem to have /etc/localtime as a symlink, so checking it would work there, and also macOS seems to do it that way for as far back as I can check. > This comment is insightful: > https://github.com/HowardHinnant/date/issues/269#issuecomment-353792132 > It's talking about this code: > https://github.com/HowardHinnant/date/blob/master/src/tz.cpp#L3652 Interesting. Not sure if we want to try all those files. But I'll take a look at whipping up something that checks /etc/localtime. regards, tom lane
Re: Indexes on partitioned tables and foreign partitions
Simon Riggswrites: > Indexes on foreign tables cause an ERROR, so yes, we already just > don't create them. > > You're suggesting silently skipping the ERROR. I can't see a reason for that. Truly, I was inaccurate. I mean that index propagation is a nice feature, and making it available for mix of foreign and local partitions skipping the foreign ones is useful thing for users who understand the consequences (of course there should be a warning in the docs -- my patch was just an illustration of the idea). As I said, FDW-based sharding often involves mix of foreign and local partitions, even in simple manual forms. Imagine that you have a table which is too big to fit into one server, so you partition it and move some partiitons to another machine; you still would like to access the whole table to avoid manual tuple routing. Or, you partition table by timestamps and keep recent data on fast primary server, gradually moving old partitions to another box. Again, it would be nice to access table as a whole to e.g. calculate some analytics. Anyway, given other opinions, I assume that the community has chosen more conservative approach. Indeed, we can't guarantee uniqueness this way; that's fully users responsiblity. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Cutting initdb's runtime (Perl question embedded)
On Wed, May 9, 2018 at 11:39 AM, Alvaro Herrerawrote: > In Debian systems, it's a symlink. Apparently in RHEL6 and older it's a > copy or hardlink, and the file /etc/sysconfig/clock contains a ZONE > variable that points to the right zone. Maybe if we add enough > platform-dependent hacks, we would use the slow fallback only for rare > cases. (Maybe have initdb emit a warning when the fallback is used, so > that we know what else to look for.) I just checked a couple of RHEL7 systems and it seems to be a symlink there. It's also a symlink on my laptop (macOS 10.13.3). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Indexes on partitioned tables and foreign partitions
On Wed, May 9, 2018 at 11:12 AM, Simon Riggswrote: > If we can assume an index exists on a foreign table, why can we not > just assume a unique index exists?? Why the difference? We can't assume either of those things, and I didn't say that we should. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Indexes on partitioned tables and foreign partitions
On Wed, May 9, 2018 at 11:20 AM, Simon Riggswrote: > On 9 May 2018 at 16:15, Robert Haas wrote: >> On Wed, May 9, 2018 at 11:14 AM, Simon Riggs wrote: >>> On 9 May 2018 at 16:10, Tom Lane wrote: Robert Haas writes: > On Wed, May 9, 2018 at 9:08 AM, Simon Riggs wrote: >> Shouldn't the fix be to allow creation of indexes on foreign tables? >> (Maybe they would be virtual or foreign indexes??) > It might be useful to invent the concept of a foreign index, but not > for v11 a month after feature freeze. Yeah. That's a can of worms we can *not* open at this stage. >>> >>> Lucky nobody suggested that then, eh? Robert's just making a joke. >> >> Someone did suggest that. It was you. > > Oh, you weren't joking. I think we're having serious problems with > people putting words in my mouth again then. > > Please show me where I suggested doing anything for v11? Come on, Simon. It's in the quoted text. I realize you didn't say v11 specifically, but this is the context of a patch that is proposed a bug-fix for v11. If you meant that we should apply the patch as proposed now, or some other one, and do the other thing later, you could have said so. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Indexes on partitioned tables and foreign partitions
On 9 May 2018 at 16:15, Robert Haaswrote: > On Wed, May 9, 2018 at 11:14 AM, Simon Riggs wrote: >> On 9 May 2018 at 16:10, Tom Lane wrote: >>> Robert Haas writes: On Wed, May 9, 2018 at 9:08 AM, Simon Riggs wrote: > Shouldn't the fix be to allow creation of indexes on foreign tables? > (Maybe they would be virtual or foreign indexes??) >>> It might be useful to invent the concept of a foreign index, but not for v11 a month after feature freeze. >>> >>> Yeah. That's a can of worms we can *not* open at this stage. >> >> Lucky nobody suggested that then, eh? Robert's just making a joke. > > Someone did suggest that. It was you. Oh, you weren't joking. I think we're having serious problems with people putting words in my mouth again then. Please show me where I suggested doing anything for v11? -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Cutting initdb's runtime (Perl question embedded)
Robert Haaswrites: > On Tue, May 8, 2018 at 4:48 PM, Tom Lane wrote: >> Really the only thing here that jumps out as being unduly expensive for >> what it's doing is select_default_timezone. That is, and always has been, >> a brute-force algorithm; I wonder if there's a way to do better? > Who says we need a portable way? If we had something that worked on > Linux and macOS, it would cover most developer environments. I wonder > if readlink("/etc/localtime", buf, sz) might be a viable approach. I wondered about that, but I'm afraid it's often a hardlink not a symlink. Still, we could try it. > Also, how about having a --timezone option for initdb? We already have that, it's called the TZ environment variable. There was an effort awhile back to try to speed up the buildfarm by having that get set automatically, but it failed for reasons I don't recall ATM. regards, tom lane
Re: Indexes on partitioned tables and foreign partitions
On Wed, May 9, 2018 at 11:14 AM, Simon Riggswrote: > On 9 May 2018 at 16:10, Tom Lane wrote: >> Robert Haas writes: >>> On Wed, May 9, 2018 at 9:08 AM, Simon Riggs wrote: Shouldn't the fix be to allow creation of indexes on foreign tables? (Maybe they would be virtual or foreign indexes??) >> >>> It might be useful to invent the concept of a foreign index, but not >>> for v11 a month after feature freeze. >> >> Yeah. That's a can of worms we can *not* open at this stage. > > Lucky nobody suggested that then, eh? Robert's just making a joke. Someone did suggest that. It was you. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Indexes on partitioned tables and foreign partitions
On 9 May 2018 at 16:10, Tom Lanewrote: > Robert Haas writes: >> On Wed, May 9, 2018 at 9:08 AM, Simon Riggs wrote: >>> Shouldn't the fix be to allow creation of indexes on foreign tables? >>> (Maybe they would be virtual or foreign indexes??) > >> It might be useful to invent the concept of a foreign index, but not >> for v11 a month after feature freeze. > > Yeah. That's a can of worms we can *not* open at this stage. Lucky nobody suggested that then, eh? Robert's just making a joke. -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Indexes on partitioned tables and foreign partitions
On 9 May 2018 at 15:57, Robert Haaswrote: > For right now, I think the options are (1) throw an ERROR if we > encounter a foreign table or (2) silently skip the foreign table. I > think (2) is defensible for non-UNIQUE indexes, because the index is > just a performance optimization. However, for UNIQUE indexes, at > least, it seems like we'd better do (1), because a major point of such > an index is to enforce a constraint; we can't allege that we have such > a constraint if foreign tables are just silently skipped. If we can assume an index exists on a foreign table, why can we not just assume a unique index exists?? Why the difference? -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Indexes on partitioned tables and foreign partitions
Robert Haaswrites: > On Wed, May 9, 2018 at 9:08 AM, Simon Riggs wrote: >> Shouldn't the fix be to allow creation of indexes on foreign tables? >> (Maybe they would be virtual or foreign indexes??) > It might be useful to invent the concept of a foreign index, but not > for v11 a month after feature freeze. Yeah. That's a can of worms we can *not* open at this stage. > For right now, I think the options are (1) throw an ERROR if we > encounter a foreign table or (2) silently skip the foreign table. I > think (2) is defensible for non-UNIQUE indexes, because the index is > just a performance optimization. However, for UNIQUE indexes, at > least, it seems like we'd better do (1), because a major point of such > an index is to enforce a constraint; we can't allege that we have such > a constraint if foreign tables are just silently skipped. Agreed about unique indexes. I suggest that we throw an error for both cases, because (1) having unique and non-unique indexes behave differently for this purpose seems pretty weird; (2) throwing an error now preserves our options to do something different later. Given where we are in the release cycle, we should be taking the most conservative path we can. regards, tom lane
Re: [HACKERS] Cutting initdb's runtime (Perl question embedded)
On Tue, May 8, 2018 at 4:48 PM, Tom Lanewrote: > Really the only thing here that jumps out as being unduly expensive for > what it's doing is select_default_timezone. That is, and always has been, > a brute-force algorithm; I wonder if there's a way to do better? We can > probably guess that every non-Windows platform is using the IANA timezone > data these days. If there were some way to extract the name of the active > timezone setting directly, we wouldn't have to try to reverse-engineer it. > But I don't know of any portable way :-( Who says we need a portable way? If we had something that worked on Linux and macOS, it would cover most developer environments. I wonder if readlink("/etc/localtime", buf, sz) might be a viable approach. You could, at least, try any time zone you can derive that way first, before you try any others. Also, how about having a --timezone option for initdb? Then you could determine the time zone just once and pass it down to subsequent initdb invocations. Or just hardcode the regression tests to use some fixed time zone, like Antarctica/Troll. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Indexes on partitioned tables and foreign partitions
On Wed, May 9, 2018 at 9:08 AM, Simon Riggswrote: > How much sense is it to have a partitioned table with a mix of local > and foreign tables? Fair question, but we put some effort into making it work, so I think it should keep working. > Shouldn't the fix be to allow creation of indexes on foreign tables? > (Maybe they would be virtual or foreign indexes??) It might be useful to invent the concept of a foreign index, but not for v11 a month after feature freeze. For right now, I think the options are (1) throw an ERROR if we encounter a foreign table or (2) silently skip the foreign table. I think (2) is defensible for non-UNIQUE indexes, because the index is just a performance optimization. However, for UNIQUE indexes, at least, it seems like we'd better do (1), because a major point of such an index is to enforce a constraint; we can't allege that we have such a constraint if foreign tables are just silently skipped. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Global snapshots
On Tue, May 8, 2018 at 4:51 PM, Stas Kelvichwrote: >> On 7 May 2018, at 20:04, Robert Haas wrote: >> But what happens if a transaction starts on node A at time T0 but >> first touches node B at a much later time T1, such that T1 - T0 > >> global_snapshot_defer_time? > > Such transaction will get "global snapshot too old" error. Ouch. That's not so bad at READ COMMITTED, but at higher isolation levels failure becomes extremely likely. Any multi-statement transaction that lasts longer than global_snapshot_defer_time is pretty much doomed. > In principle such behaviour can be avoided by calculating oldest > global csn among all cluster nodes and oldest xmin on particular > node will be held only when there is some open old transaction on > other node. It's easy to do from global snapshot point of view, > but it's not obvious how to integrate that into postgres_fdw. Probably > that will require bi-derectional connection between postgres_fdw nodes > (also distributed deadlock detection will be easy with such connection). I don't think holding back xmin is a very good strategy. Maybe it won't be so bad if and when we get zheap, since only the undo log will bloat rather than the table. But as it stands, holding back xmin means everything bloats and you have to CLUSTER or VACUUM FULL the table in order to fix it. If the behavior were really analogous to our existing "snapshot too old" feature, it wouldn't be so bad. Old snapshots continue to work without error so long as they only read unmodified data, and only error out if they hit modified pages. SERIALIZABLE works according to a similar principle: it worries about data that is written by one transaction and read by another, but if there's a portion of the data that is only read and not written, or at least not written by any transactions that were active around the same time, then it's fine. While the details aren't really clear to me, I'm inclined to think that any solution we adopt for global snapshots ought to leverage this same principle in some way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Indexes on partitioned tables and foreign partitions
On 9 May 2018 at 15:26, Arseny Sherwrote: > > Simon Riggs writes: > >> How much sense is it to have a partitioned table with a mix of local >> and foreign tables? > > Well, as much sense as fdw-based sharding has, for instance. It is > arguable, but it exists. > >> Shouldn't the fix be to allow creation of indexes on foreign tables? >> (Maybe they would be virtual or foreign indexes??) > > Similar ideas were discussed at [1]. There was no wide consensus of even > what problems such feature would solve. Since currently indexes on > foreign tables are just forbidden, it seems to me that the best what > partitioning code can do today is just not creating them. > > [1] > https://www.postgresql.org/message-id/flat/4F62FD69.2060007%40lab.ntt.co.jp#4f62fd69.2060...@lab.ntt.co.jp Indexes on foreign tables cause an ERROR, so yes, we already just don't create them. You're suggesting silently skipping the ERROR. I can't see a reason for that. -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] path toward faster partition pruning
Marina Polyakova wrote: > Hello everyone in this thread! > I got a similar server crash as in [1] on the master branch since the commit > 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails because > the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, but is > an ArrayCoerceExpr (see [2]): Hello Marina, thanks for reporting this. I have pushed all fixes derived from this report -- thanks to Amit and Michaël for those. I verified your test case no longer crashes. If you have more elaborate test cases, please do try these too. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Indexes on partitioned tables and foreign partitions
Simon Riggswrites: > How much sense is it to have a partitioned table with a mix of local > and foreign tables? Well, as much sense as fdw-based sharding has, for instance. It is arguable, but it exists. > Shouldn't the fix be to allow creation of indexes on foreign tables? > (Maybe they would be virtual or foreign indexes??) Similar ideas were discussed at [1]. There was no wide consensus of even what problems such feature would solve. Since currently indexes on foreign tables are just forbidden, it seems to me that the best what partitioning code can do today is just not creating them. [1] https://www.postgresql.org/message-id/flat/4F62FD69.2060007%40lab.ntt.co.jp#4f62fd69.2060...@lab.ntt.co.jp -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] path toward faster partition pruning
Amit Langote wrote: > On 2018/05/09 11:31, David Rowley wrote: > > On 9 May 2018 at 14:29, Amit Langotewrote: > >> On 2018/05/09 11:20, Michael Paquier wrote: > >>> While looking at this code, is there any reason to not make > >>> gen_partprune_steps static? This is only used in partprune.c for now, > >>> so the intention is to make it available for future patches? > Here is a patch that does that. Pushed, thanks. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] path toward faster partition pruning
Michael Paquier wrote: > Alvaro, could it be possible to consider as well the patch I posted > here? > https://www.postgresql.org/message-id/20180424012042.gd1...@paquier.xyz > > This removes a useless default clause in partprune.c and it got > forgotten in the crowd. Just attaching it again here, and it can just > be applied on top of the rest. Done, thanks for insisting. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Indexes on partitioned tables and foreign partitions
On 9 May 2018 at 12:50, Arseny Sherwrote: > Hi, > > 8b08f7d4 added propagation of indexes on partitioned tables to > partitions, which is very cool. However, index creation also recurses > down to foreign tables. I doubt this is intentional, as such indexes are > forbidden as not making much sense; attempt to create index on > partitioned table with foreign partition leads to an error > now. Attached lines fix this. "Fix"? How much sense is it to have a partitioned table with a mix of local and foreign tables? Shouldn't the fix be to allow creation of indexes on foreign tables? (Maybe they would be virtual or foreign indexes??) -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Indexes on partitioned tables and foreign partitions
On Wed, May 9, 2018 at 5:20 PM, Arseny Sherwrote: > Hi, > > 8b08f7d4 added propagation of indexes on partitioned tables to > partitions, which is very cool. However, index creation also recurses > down to foreign tables. I doubt this is intentional, as such indexes are > forbidden as not making much sense; attempt to create index on > partitioned table with foreign partition leads to an error > now. Attached lines fix this. > The patch looks good, but I guess that we have to do some tricks with the validity of index on the partitioned table since not all partitions have a given index when one of those is a foreign partition. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.
(2018/05/08 16:20), Ashutosh Bapat wrote: On Tue, May 1, 2018 at 5:00 PM, Etsuro Fujitawrote: Here are my review comments on patch 0003-postgres_fdw-child-join-with-ConvertRowtypeExprs-cau.patch: * This assertion in deparseConvertRowtypeExpr wouldn't always be true because of cases like ALTER FOREIGN TABLE DROP COLUMN plus ALTER FOREIGN TABLE ADD COLUMN: + Assert(parent_desc->natts == child_desc->natts); I think we should remove that assertion. Removed. * But I don't think that change is enough. Here is another oddity with that assertion removed. To fix this, I think we should skip the deparseColumnRef processing for dropped columns in the parent table. Done. Thanks for those changes! I have changed the test file a bit to test these scenarios. +1 * I think it would be better to do EXPLAIN with the VERBOSE option to the queries this patch added to the regression tests, to see if ConvertRowtypeExprs are correctly deparsed in the remote queries. The reason VERBOSE option was omitted for partition-wise join was to avoid repeated and excessive output for every child-join. I think that still applies. I'd like to leave that for the committer's judge. PFA patch-set with the fixes. Thanks for updating the patch! I also noticed that the new function deparseConvertRowtypeExpr is not quite following the naming convention of the other deparseFoo functions. Foo is usually the type of the node the parser would produced when the SQL string produced by that function is parsed. That doesn't hold for the SQL string produced by ConvertRowtypeExpr but then naming it as generic deparseRowExpr() wouldn't be correct either. And then there are functions like deparseColumnRef which may produce a variety of SQL strings which get parsed into different node types e.g. a whole-row reference would produce ROW(...) which gets parsed as a RowExpr. Please let me know if you have any suggestions for the name. To be honest, I don't have any strong opinion about that. But I like "deparseConvertRowtypeExpr" because that name seems to well represent the functionality, so I'd vote for that. BTW, one thing I'm a bit concerned about is this: (2018/04/25 18:51), Ashutosh Bapat wrote: > Actually I noticed that ConvertRowtypeExpr are used to cast a child's > whole row reference expression (not just a Var node) into that of its > parent and not. For example a cast like NULL::child::parent produces a > ConvertRowtypeExpr encapsulating a NULL constant node and not a Var > node. We are interested only in ConvertRowtypeExprs embedding Var > nodes with Var::varattno = 0. I have changed this code to use function > is_converted_whole_row_reference() instead of the above code with > Assert. In order to use that function, I had to take it out of > setrefs.c and put it in nodeFuncs.c which seemed to be a better fit. This change seems a bit confusing to me because the flag bits "PVC_INCLUDE_CONVERTROWTYPES" and "PVC_RECURSE_CONVERTROWTYPES" passed to pull_var_clause look as if it handles any ConvertRowtypeExpr nodes from a given clause, but really, it only handles ConvertRowtypeExprs you mentioned above. To make that function easy to understand and use, I think it'd be better to use the IsA(node, ConvertRowtypeExpr) test as in the first version of the patch, instead of is_converted_whole_row_reference, which would be more consistent with other cases such as PlaceHolderVar. Best regards, Etsuro Fujita
Indexes on partitioned tables and foreign partitions
Hi, 8b08f7d4 added propagation of indexes on partitioned tables to partitions, which is very cool. However, index creation also recurses down to foreign tables. I doubt this is intentional, as such indexes are forbidden as not making much sense; attempt to create index on partitioned table with foreign partition leads to an error now. Attached lines fix this. *** a/src/backend/commands/indexcmds.c --- b/src/backend/commands/indexcmds.c *** DefineIndex(Oid relationId, *** 915,920 --- 915,926 int maplen; childrel = heap_open(childRelid, lockmode); + /* Foreign table doesn't need indexes */ + if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE) + { + heap_close(childrel, NoLock); + continue; + } childidxs = RelationGetIndexList(childrel); attmap = convert_tuples_by_name_map(RelationGetDescr(childrel), *** a/src/backend/commands/tablecmds.c --- b/src/backend/commands/tablecmds.c *** AttachPartitionEnsureIndexes(Relation re *** 14352,14357 --- 14352,14361 MemoryContext cxt; MemoryContext oldcxt; + /* Foreign table doesn't need indexes */ + if (attachrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE) + return; + cxt = AllocSetContextCreate(CurrentMemoryContext, "AttachPartitionEnsureIndexes", ALLOCSET_DEFAULT_SIZES); -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Setting libpq TCP keepalive parameters from environment
On Wed, May 9, 2018 at 1:58 AM, Craig Ringerwrote: > > > > It would be much more convenient to just set the environment variable > when > > running the script and let it affect the whole process and its children. > > > > Would a patch be welcome? > > I can't really comment on that part, but: > > PGOPTIONS="-c tcp_keepalives_count=5 -c tcp_keepalives_interval=60" > psql 'host=localhost' > > should enable server-side keepalives. Unsure how much that helps you > if you need client-side keepalives too. > Good point, in our specific case it appears to work as well if it's the server who sends the keepalives. Thank you, -- Alex
Re: Should we add GUCs to allow partition pruning to be disabled?
On 2018/05/09 13:14, Amit Langote wrote: > Hi David. > > Thanks for addressing my comments. > > On 2018/05/07 15:00, David Rowley wrote: >> v2 patch is attached. > > Looks good to me. Sorry, I should've seen noticed v3 before sending my email. v3 looks good too, but when going through it, I noticed one bit in 5.10.4. Partitioning and Constraint Exclusion: A good rule of thumb is that partitioning constraints should contain only comparisons of the partitioning column(s) to constants using B-tree-indexable operators, which applies even to partitioned tables, because only B-tree-indexable column(s) are allowed in the partition key. I think the part after ", which applies even to partitioned tables,.." should be removed. Attached find the updated patch. Thanks, Amit diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index ffea744cb8..76606a8535 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3841,6 +3841,11 @@ ANY num_sync ( for more information +on partition pruning and partitioning. + diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 004ecacbbf..d02edd771f 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -3760,7 +3760,7 @@ ANALYZE measurement; - Partitioning and Constraint Exclusion + Inheritance Partitioning and Constraint Exclusion constraint exclusion @@ -3768,9 +3768,8 @@ ANALYZE measurement; Constraint exclusion is a query optimization technique -that improves performance for partitioned tables defined in the -fashion described above (both declaratively partitioned tables and those -implemented using inheritance). As an example: +that improves performance for inheritance partitioned tables defined in the +fashion described above. As an example: SET constraint_exclusion = on; @@ -3847,15 +3846,14 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate = DATE '2008-01-01'; is actually neither on nor off, but an intermediate setting called partition, which causes the technique to be -applied only to queries that are likely to be working on partitioned +applied only to queries that are likely to be working on inheritance partitioned tables. The on setting causes the planner to examine CHECK constraints in all queries, even simple ones that are unlikely to benefit. -The following caveats apply to constraint exclusion, which is used by -both inheritance and partitioned tables: +The following caveats apply to constraint exclusion: @@ -3877,11 +3875,7 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate = DATE '2008-01-01'; range tests for range partitioning, as illustrated in the preceding examples. A good rule of thumb is that partitioning constraints should contain only comparisons of the partitioning column(s) to constants - using B-tree-indexable operators, which applies even to partitioned - tables, because only B-tree-indexable column(s) are allowed in the - partition key. (This is not a problem when using declarative - partitioning, since the automatically generated constraints are simple - enough to be understood by the planner.) + using B-tree-indexable operators. @@ -3898,6 +3892,94 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate = DATE '2008-01-01'; + + + Declarative Partitioning and Partition Pruning + + +partition pruning + + + +Partition pruning is a query optimization technique +similar to constraint exclusion, but applies only to declaratively +partitioned tables. Like constraint exclusion, this uses (but is not +limited to using) the query's WHERE clause to exclude +partitions which cannot possibly contain any matching records. + + + +Partition pruning is much more efficient than constraint exclusion, since +it avoids scanning each partition's metadata to determine if the partition +is required for a particular query. + + + +Partition pruning is also more powerful than constraint exclusion as it +can be performed not only during the planning of a given query, but also +during its execution. This is useful as it can allow more partitions to +be pruned when clauses contain expressions whose values are unknown to the +query planner. For example, parameters defined in a +PREPARE statement, using a value obtained from a +subquery or using a parameterized value on the inner side of a nested loop +join. + + + +Partition pruning during execution can be performed at any of the +following times: + + + + + During initialization of the query plan. Partition pruning can be + performed here for parameter values which are known during the + initialization phase of execution.
Re: [HACKERS] path toward faster partition pruning
On Wed, May 09, 2018 at 02:01:26PM +0900, Amit Langote wrote: > On 2018/05/09 11:31, David Rowley wrote: >> On 9 May 2018 at 14:29, Amit Langotewrote: >>> On 2018/05/09 11:20, Michael Paquier wrote: While looking at this code, is there any reason to not make gen_partprune_steps static? This is only used in partprune.c for now, so the intention is to make it available for future patches? >>> >>> Yeah, making it static might be a good idea. I had made it externally >>> visible, because I was under the impression that the runtime pruning >>> related code would want to call it from elsewhere within the planner. >>> But, instead it introduced a make_partition_pruneinfo() which in turn >>> calls get_partprune_steps. >> >> Yeah. Likely left over from when run-time pruning was generating the >> steps during execution rather than during planning. > > Here is a patch that does that. Thanks, Amit. Alvaro, could it be possible to consider as well the patch I posted here? https://www.postgresql.org/message-id/20180424012042.gd1...@paquier.xyz This removes a useless default clause in partprune.c and it got forgotten in the crowd. Just attaching it again here, and it can just be applied on top of the rest. -- Michael diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c index f8844ef2eb..cbbb4c1827 100644 --- a/src/backend/partitioning/partprune.c +++ b/src/backend/partitioning/partprune.c @@ -2950,10 +2950,6 @@ perform_pruning_combine_step(PartitionPruneContext *context, } } break; - - default: - elog(ERROR, "invalid pruning combine op: %d", - (int) cstep->combineOp); } return result; signature.asc Description: PGP signature