Re: Loaded footgun open_datasync on Windows
Michael Paquier wrote: > On Thu, Sep 13, 2018 at 03:55:20PM +0200, Laurenz Albe wrote: > > Attached is the new, improved version of the patch. > > Thanks, finally pushed. I have made sure that recovery TAP tests, > upgrade tests and bin/ tests are working properly. Thanks for being interested and doing the work. If it turns out not to break anything, would you consider backpatching? On the one hand it fixes a bug, on the other hand it affects all frontend executables... I wonder why nobody noticed the problem in pg_test_fsync earlier. Is it that people running Windows care less if their storage is reliable? Yours, Laurenz Albe
Re: hot_standby_feedback vs excludeVacuum and snapshots
At Sat, 4 Aug 2018 14:09:18 +1200, Thomas Munro wrote in > On Mon, Jul 9, 2018 at 6:51 AM, Noah Misch wrote: > > On Fri, Jul 06, 2018 at 04:32:56PM +0100, Simon Riggs wrote: > >> On 6 July 2018 at 03:30, Thomas Munro > >> wrote: > >> > On Thu, Jul 5, 2018 at 8:27 AM, Noah Misch wrote: > >> >>> However, 49bff5300d527 also introduced a similar bug where > >> >>> subtransaction > >> >>> commit would fail to release an AccessExclusiveLock, leaving the > >> >>> lock to > >> >>> be removed sometimes early and sometimes late. This commit fixes > >> >>> that bug also. Backpatch to PG10 needed. > >> >> > >> >> Subtransaction commit is too early to release an arbitrary > >> >> AccessExclusiveLock. The primary releases every AccessExclusiveLock at > >> >> top-level transaction commit, top-level transaction abort, or > >> >> subtransaction > >> >> abort. CommitSubTransaction() doesn't do that; it transfers locks to > >> >> the > >> >> parent sub(xact). Standby nodes can't safely remove an arbitrary lock > >> >> earlier > >> >> than the primary would. > >> > > >> > But we don't release locks acquired by committing subxacts until the > >> > top level xact commits. Perhaps that's what the git commit message > >> > meant by "early", as opposed to "late" meaning when > >> > StandbyReleaseOldLocks() next runs because an XLOG_RUNNING_XACTS > >> > record is replayed? > >> > >> Locks held by subtransactions were not released at the correct timing > >> of top-level commit; they are now. > > > > I read the above-quoted commit message as saying that the commit expands the > > set of locks released when replaying subtransaction commit. The only lock > > that should ever be released at subxact commit is the subxact XID lock. If > > that continues to be the only lock released at subxact commit, good. 
> You can still see these "late" lock releases on 10, since the above
> quoted commit (15378c1a) hasn't been back-patched yet:
>
> CREATE TABLE foo ();
> BEGIN; SAVEPOINT s; LOCK foo; COMMIT;
>
> An AccessExclusiveLock is held on the standby until the next
> XLOG_RUNNING_XACTS comes along, up to LOG_SNAPSHOT_INTERVAL_MS = 15
> seconds later.
>
> Does anyone know why StandbyReleaseLocks() releases all locks if
> passed InvalidTransactionId? When would that happen?

AFAICS, it used to be used at shutdown time ever since hot standby was
introduced by efc16ea520, from
ShutdownRecoveryTransactionEnvironment/StandbyReleaseAllLocks.
c172b7b02e (Jan 23 2012) modified StandbyReleaseAllLocks not to call
StandbyReleaseLocks with InvalidTransactionId, so that code path became
useless, and it still is. So I think the feature has been obsolete for a
long time.

On a related note, the following commands leave an AEL behind even
though the savepoint is rolled back:

BEGIN; SAVEPOINT s; LOCK foo; CHECKPOINT; ROLLBACK TO SAVEPOINT s;

This is because the checkpoint issues XLOG_STANDBY_LOCK on foo with the
top-transaction XID. Every checkpoint issues it for all existing locks,
so the RecoveryLockList(s) can be bloated with duplicate lock entries,
which increases the lock counts. Although it doesn't seem common to hold
AELs long enough for the list length to hurt, I don't think such
duplication is good. Patch attached.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index f9c12e603b..9de46e0bea 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -632,6 +632,7 @@ StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
 	xl_standby_lock *newlock;
 	LOCKTAG		locktag;
 	bool		found;
+	ListCell   *lc;
 
 	/* Already processed? */
 	if (!TransactionIdIsValid(xid) ||
@@ -653,6 +654,20 @@ StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
 		entry->locks = NIL;
 	}
 
+	/*
+	 * Multiple locks for the same object can be given by XLOG_STANDBY_LOCK
+	 * logs.  We need no more than one of them.
+	 */
+	foreach(lc, entry->locks)
+	{
+		xl_standby_lock *l = (xl_standby_lock *) lfirst(lc);
+
+		Assert(l->xid == xid);
+
+		if (l->dbOid == dbOid && l->relOid == relOid)
+			return;
+	}
+
 	newlock = palloc(sizeof(xl_standby_lock));
 	newlock->xid = xid;
 	newlock->dbOid = dbOid;
Re: Cache lookup errors with functions manipulation object addresses
On Fri, Sep 14, 2018 at 11:22:12AM +0900, Michael Paquier wrote:
> Attached are rebased versions. This has been around for some time, so I
> am planning to move on with this patch set pretty soon, as that's mainly
> cleanup for getObjectIdentity as it triggers elog("cache lookup") or
> such for undefined objects. Patch 0001 extends FDW and server routines
> so that it is possible to skip missing entries, without breaking
> compatibility. Patch 0002 adds a missing_ok flag when doing
> subscription and publication lookups.
>
> Any objections?

And I forgot to attach the patches..
--
Michael

From fba6464f1555ec05029a58e2bf378ef83ce73172 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Sun, 1 Jul 2018 23:26:10 +0900
Subject: [PATCH 1/3] Extend lookup routines for FDW and foreign server with
 NULL handling

The cache lookup routines for foreign-data wrappers and foreign servers
are extended with an extra argument able to control if an error or a
NULL object is returned to the caller in the event of an undefined
object. This is added in a set of new routines so as not to
unnecessarily impact any FDW plugins.
---
 doc/src/sgml/fdwhandler.sgml  | 30 +
 src/backend/foreign/foreign.c | 36 +-
 src/include/foreign/foreign.h |  4 +
 3 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 4ce88dd77c..a68e264261 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1408,6 +1408,21 @@ ReparameterizeForeignPathByChild(PlannerInfo *root, List *fdw_private,
 ForeignDataWrapper *
+GetForeignDataWrapperExtended(Oid fdwid, bool missing_ok);
+
+     This function returns a ForeignDataWrapper
+     object for the foreign-data wrapper with the given OID. A
+     ForeignDataWrapper object contains properties
+     of the FDW (see foreign/foreign.h for details).
+     If missing_ok is true, a NULL
+     result is returned to the caller instead of an error for an undefined
+     FDW.
+
+ForeignDataWrapper *
 GetForeignDataWrapper(Oid fdwid);
@@ -1420,6 +1435,21 @@ GetForeignDataWrapper(Oid fdwid);
 ForeignServer *
+GetForeignServerExtended(Oid serverid, bool missing_ok);
+
+     This function returns a ForeignServer object
+     for the foreign server with the given OID. A
+     ForeignServer object contains properties
+     of the server (see foreign/foreign.h for details).
+     If missing_ok is true, a NULL
+     result is returned to the caller instead of an error for an undefined
+     foreign server.
+
+ForeignServer *
 GetForeignServer(Oid serverid);
diff --git a/src/backend/foreign/foreign.c b/src/backend/foreign/foreign.c
index eac78a5d31..01b5175e71 100644
--- a/src/backend/foreign/foreign.c
+++ b/src/backend/foreign/foreign.c
@@ -33,6 +33,18 @@
  */
 ForeignDataWrapper *
 GetForeignDataWrapper(Oid fdwid)
+{
+	return GetForeignDataWrapperExtended(fdwid, false);
+}
+
+
+/*
+ * GetForeignDataWrapperExtended - look up the foreign-data wrapper
+ * by OID. If missing_ok is true, return NULL if the object cannot be
+ * found instead of raising an error.
+ */
+ForeignDataWrapper *
+GetForeignDataWrapperExtended(Oid fdwid, bool missing_ok)
 {
 	Form_pg_foreign_data_wrapper fdwform;
 	ForeignDataWrapper *fdw;
@@ -43,7 +55,11 @@ GetForeignDataWrapper(Oid fdwid)
 	tp = SearchSysCache1(FOREIGNDATAWRAPPEROID, ObjectIdGetDatum(fdwid));
 
 	if (!HeapTupleIsValid(tp))
-		elog(ERROR, "cache lookup failed for foreign-data wrapper %u", fdwid);
+	{
+		if (!missing_ok)
+			elog(ERROR, "cache lookup failed for foreign-data wrapper %u", fdwid);
+		return NULL;
+	}
 
 	fdwform = (Form_pg_foreign_data_wrapper) GETSTRUCT(tp);
@@ -91,6 +107,18 @@ GetForeignDataWrapperByName(const char *fdwname, bool missing_ok)
  */
 ForeignServer *
 GetForeignServer(Oid serverid)
+{
+	return GetForeignServerExtended(serverid, false);
+}
+
+
+/*
+ * GetForeignServerExtended - look up the foreign server definition. If
+ * missing_ok is true, return NULL if the object cannot be found instead
+ * of raising an error.
+ */
+ForeignServer *
+GetForeignServerExtended(Oid serverid, bool missing_ok)
 {
 	Form_pg_foreign_server serverform;
 	ForeignServer *server;
@@ -101,7 +129,11 @@ GetForeignServer(Oid serverid)
 	tp = SearchSysCache1(FOREIGNSERVEROID, ObjectIdGetDatum(serverid));
 
 	if (!HeapTupleIsValid(tp))
-		elog(ERROR, "cache lookup failed for foreign server %u", serverid);
+	{
+		if (!missing_ok)
+			elog(ERROR, "cache lookup failed for foreign server %u", serverid);
+		return NULL;
+	}
 
 	serverform = (Form_pg_foreign_server) GETSTRUCT(tp);
diff --git a/src/include/foreign/foreign.h b/src/include/foreign/foreign.h
index 3ca12e64d2..5cc89e967c 100644
--- a/src/include/foreign/foreign.h
+++ b/src/include/foreign/foreign.h
@@ -70,9 +70,13 @@ typedef struct ForeignTable
 
 extern ForeignServer *GetForeignServer(Oid serverid);
+extern ForeignSe
Re: Consistent segfault in complex query
> "Tom" == Tom Lane writes:

>> Your other idea of forcing initPlan parameters to be evaluated
>> before we enter the EPQ execution environment is probably more
>> workable.

Tom> Concretely, the attached seems to be enough to fix it (though I
Tom> only tried the simplest case you posted).

If it helps, here is a patch that adds isolation tests to
eval-plan-qual.spec for two test cases (one with CTE, one without). I've
verified that these reproduce the crash, and that they run successfully
with your patch. I can't currently see any more specific code paths to
probe in these tests.

--
Andrew (irc:RhodiumToad)

diff --git a/src/test/isolation/expected/eval-plan-qual.out b/src/test/isolation/expected/eval-plan-qual.out
index 9bbfdc1b5d..49b3fb3446 100644
--- a/src/test/isolation/expected/eval-plan-qual.out
+++ b/src/test/isolation/expected/eval-plan-qual.out
@@ -184,6 +184,37 @@ ta_id ta_value tb_row
 1 newTableAValue (1,tableBValue)
 step c2: COMMIT;
 
+starting permutation: updateforcip updateforcip2 c1 c2 read_a
+step updateforcip:
+	UPDATE table_a SET value = NULL WHERE id = 1;
+
+step updateforcip2:
+	UPDATE table_a SET value = COALESCE(value, (SELECT text 'newValue')) WHERE id = 1;
+
+step c1: COMMIT;
+step updateforcip2: <... completed>
+step c2: COMMIT;
+step read_a: SELECT * FROM table_a ORDER BY id;
+id value
+
+1 newValue
+
+starting permutation: updateforcip updateforcip3 c1 c2 read_a
+step updateforcip:
+	UPDATE table_a SET value = NULL WHERE id = 1;
+
+step updateforcip3:
+	WITH d(val) AS (SELECT text 'newValue' FROM generate_series(1,1))
+	UPDATE table_a SET value = COALESCE(value, (SELECT val FROM d)) WHERE id = 1;
+
+step c1: COMMIT;
+step updateforcip3: <... completed>
+step c2: COMMIT;
+step read_a: SELECT * FROM table_a ORDER BY id;
+id value
+
+1 newValue
+
 starting permutation: wrtwcte readwcte c1 c2
 step wrtwcte: UPDATE table_a SET value = 'tableAValue2' WHERE id = 1;
 step readwcte:
diff --git a/src/test/isolation/specs/eval-plan-qual.spec b/src/test/isolation/specs/eval-plan-qual.spec
index 0b70ad55ba..367922de75 100644
--- a/src/test/isolation/specs/eval-plan-qual.spec
+++ b/src/test/isolation/specs/eval-plan-qual.spec
@@ -92,6 +92,13 @@ step "updateforss"	{
 	UPDATE table_b SET value = 'newTableBValue' WHERE id = 1;
 }
 
+# these tests exercise EvalPlanQual with conditional InitPlans which
+# have not been executed prior to the EPQ
+
+step "updateforcip"	{
+	UPDATE table_a SET value = NULL WHERE id = 1;
+}
+
 # these tests exercise mark/restore during EPQ recheck, cf bug #15032
 
 step "selectjoinforupdate"	{
@@ -129,6 +136,13 @@ step "readforss"	{
 	FROM table_a ta WHERE ta.id = 1 FOR UPDATE OF ta;
 }
 
+step "updateforcip2"	{
+	UPDATE table_a SET value = COALESCE(value, (SELECT text 'newValue')) WHERE id = 1;
+}
+step "updateforcip3"	{
+	WITH d(val) AS (SELECT text 'newValue' FROM generate_series(1,1))
+	UPDATE table_a SET value = COALESCE(value, (SELECT val FROM d)) WHERE id = 1;
+}
 step "wrtwcte"	{ UPDATE table_a SET value = 'tableAValue2' WHERE id = 1; }
 step "wrjt"	{ UPDATE jointest SET data = 42 WHERE id = 7; }
 step "c2"	{ COMMIT; }
@@ -137,6 +151,7 @@ session "s3"
 setup		{ BEGIN ISOLATION LEVEL READ COMMITTED; }
 step "read"	{ SELECT * FROM accounts ORDER BY accountid; }
 step "read_ext"	{ SELECT * FROM accounts_ext ORDER BY accountid; }
+step "read_a"	{ SELECT * FROM table_a ORDER BY id; }
 
 # this test exercises EvalPlanQual with a CTE, cf bug #14328
 step "readwcte"	{
@@ -171,6 +186,8 @@ permutation "wx2" "partiallock" "c2" "c1" "read"
 permutation "wx2" "lockwithvalues" "c2" "c1" "read"
 permutation "wx2_ext" "partiallock_ext" "c2" "c1" "read_ext"
 permutation "updateforss" "readforss" "c1" "c2"
+permutation "updateforcip" "updateforcip2" "c1" "c2" "read_a"
+permutation "updateforcip" "updateforcip3" "c1" "c2" "read_a"
 permutation "wrtwcte" "readwcte" "c1" "c2"
 permutation "wrjt" "selectjoinforupdate" "c2" "c1"
 permutation "wrtwcte" "multireadwcte" "c1" "c2"
Re: Cache lookup errors with functions manipulation object addresses
On Mon, Jul 02, 2018 at 08:54:25PM +0900, Michael Paquier wrote: > I am fine with any conclusion. As the patch has rotten a bit, I > switched it from "Ready for committer" to "Needs Review" by the way. > That seems more appropriate as the thing has rotten a bit. Attached are rebased versions. This has been around for some time, so I am planning to move on with this patch set pretty soon, as that's mainly cleanup for getObjectIdentity as it triggers elog("cache lookup") or such for undefined objects. Patch 0001 extends FDW and server routines so that it is possible to skip missing entries, without breaking compatibility. Patch 0002 adds a missing_ok flag when doing subscription and publication lookups. Any objections? -- Michael signature.asc Description: PGP signature
logical decoding bug when mapped relation with toast contents is rewritten repeatedly
Hi,

(Tomas, CCing you because you IIRC mentioned encountering an issue like
this)

I just spent quite a while debugging an issue where running logical
decoding yielded a:
ERROR: could not map filenode "base/$X/$Y" to relation OID
error. After discarding like 30 different theories, I have found the
cause:

During rewrites (i.e. VACUUM FULL / CLUSTER) of a mapped relation with a
toast table with actual live toasted tuples (pg_proc in my case and
henceforth), heap inserts with the toast data happen into the new toast
relation, triggered by:

static void
raw_heap_insert(RewriteState state, HeapTuple tup)
...
	/*
	 * If the new tuple is too big for storage or contains already toasted
	 * out-of-line attributes from some other relation, invoke the toaster.
	 *
	 * Note: below this point, heaptup is the data we actually intend to store
	 * into the relation; tup is the caller's original untoasted data.
	 */
	if (state->rs_new_rel->rd_rel->relkind == RELKIND_TOASTVALUE)
	{
		/* toast table entries should never be recursively toasted */
		Assert(!HeapTupleHasExternal(tup));
		heaptup = tup;
	}
	else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
		heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
										 HEAP_INSERT_SKIP_FSM |
										 (state->rs_use_wal ?
										  0 : HEAP_INSERT_SKIP_WAL));
	else
		heaptup = tup;

At that point the new toast relation does *NOT* appear to be a system
catalog; it just appears as an "independent" table. Therefore we do not
trigger, in heap_insert():

/*
 * RelationIsLogicallyLogged
 *		True if we need to log enough information to extract the data from the
 *		WAL stream.
 *
 * We don't log information for unlogged tables (since they don't WAL log
 * anyway) and for system tables (their content is hard to make sense of, and
 * it would complicate decoding slightly for little gain). Note that we *do*
 * log information for user defined catalog tables since they presumably are
 * interesting to the user...
 */
#define RelationIsLogicallyLogged(relation) \
	(XLogLogicalInfoActive() && \
	 RelationNeedsWAL(relation) && \
	 !IsCatalogRelation(relation))

	/*
	 * For logical decoding, we need the tuple even if we're doing a full
	 * page write, so make sure it's included even if we take a full-page
	 * image. (XXX We could alternatively store a pointer into the FPW).
	 */
	if (RelationIsLogicallyLogged(relation))
	{
		xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
		bufflags |= REGBUF_KEEP_DATA;
	}

i.e. the inserted toast tuple will be marked as
XLH_INSERT_CONTAINS_NEW_TUPLE - which it shouldn't be, because it's a
system table, which we currently do not allow to be logically decoded.

That normally ends up being harmless, because ReorderBufferCommit() has
the following check:

	if (!RelationIsLogicallyLogged(relation))
		goto change_done;

but to reach that check, we first have to map the relfilenode from the
WAL to the corresponding OID:

	reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
								change->data.tp.relnode.relNode);

That works correctly if there's only one rewrite - the relmapper
contains the data for the new toast table. But if there have been *two*
consecutive rewrites, the relmapper *does not* contain the intermediary
relfilenode of pg_proc. There's no such problem for non-mapped tables,
because historic snapshots allow us to access the relevant data, but the
relmapper isn't MVCC. Therefore the catalog-rewrite escape hatch of:

	/*
	 * Catalog tuple without data, emitted while catalog was
	 * in the process of being rewritten.
	 */
	if (reloid == InvalidOid &&
		change->data.tp.newtuple == NULL &&
		change->data.tp.oldtuple == NULL)
		goto change_done;

does not trigger and we run into:

	else if (reloid == InvalidOid)
		elog(
Re: stat() on Windows might cause error if target file is larger than 4GB
On Thu, Sep 13, 2018 at 02:23:47PM -0400, Robert Haas wrote: > This, to me, seems way too clever. Replacing 'struct stat' with > something else everywhere in the code is more churn, but far less > likely to have surprising consequences down the road. Or so I would > think, anyway. I don't have the mindset to work on that today (enough Windows-ism for the day), but I would rather use the clever solution because, as far as I know, we want a back-patchable solution, so things should not be invasive. -- Michael signature.asc Description: PGP signature
Re: Allowing printf("%m") only where it actually works
On Wed, Sep 12, 2018 at 01:40:01PM -0400, Tom Lane wrote:
> Michael Paquier writes:
> > I would have liked to look at this patch in details, but it failed to
> > apply. Could you rebase?
>
> Ah, yeah, the dlopen patch touched a couple of the same places.
> Rebase attached --- no substantive changes.

-	if (handleDLL == NULL)
-		ereport(FATAL,
-				(errmsg_internal("could not load netmsg.dll: error code %lu",
-								 GetLastError())));

In 0001, this is replaced by a non-FATAL error for the backend, which does not seem like a good idea to me because the user loses visibility when this DLL cannot be loaded. I still have to see this error...

And about 0002. I am surprised by the amount of cleanup that the removal of all the special wrappers for %m gives, especially expand_fmt_string. So +1 for the concept. I have been testing both patches individually on Windows, and compilation as well as tests do not show any issues. The tests have been done only with MSVC.

Could you drop the configure checks for snprintf and vsnprintf in a separate patch? The cleanup of %m and the removal of those switches should be treated separately in my opinion.
--
Michael

signature.asc Description: PGP signature
Re: Getting ERROR: could not open file "base/13164/t3_16388" with partition table with ON COMMIT
On 2018/09/13 23:13, Tom Lane wrote:
> Amit Langote writes:
>> On 2018/09/13 1:14, Tom Lane wrote:
>>> That seems excessively restrictive. Anything that has storage (e.g.
>>> matviews) ought to be truncatable, no?
>
>> Not by heap_truncate it seems. The header comment of heap_truncate says
>> that it concerns itself only with ON COMMIT truncation of temporary tables:
>
> Ah. Well, in that case I'm OK with just a simple test for
> RELKIND_RELATION, but I definitely feel that it should be inside
> heap_truncate. Callers don't need to know about the limited scope
> of what that does.

I guess you meant inside heap_truncate_one_rel. I updated the patch that
way, but I wonder if we shouldn't also allow other relkinds that have
storage which RelationTruncate et al can technically deal with.

Thanks,
Amit

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 9176f6280b..57df70f7b9 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -3195,6 +3195,10 @@ heap_truncate_one_rel(Relation rel)
 {
 	Oid			toastrelid;
 
+	/* Only certain relkinds have storage. */
+	if (rel->rd_rel->relkind != RELKIND_RELATION)
+		return;
+
 	/* Truncate the actual file (and discard buffers) */
 	RelationTruncate(rel, 0);
Re: Loaded footgun open_datasync on Windows
On Thu, Sep 13, 2018 at 03:55:20PM +0200, Laurenz Albe wrote: > Attached is the new, improved version of the patch. Thanks, finally pushed. I have made sure that recovery TAP tests, upgrade tests and bin/ tests are working properly. -- Michael signature.asc Description: PGP signature
Re: Indicate anti-wraparound autovacuum in log_autovacuum_min_duration
On Thu, Sep 13, 2018 at 12:00:49PM +0300, Sergei Kornilov wrote: > Looks better for me. Updated patch attached. Thanks Sergei for the new version, pushed. -- Michael signature.asc Description: PGP signature
Re: Consistent segfault in complex query
Andrew Gierth writes: > "Tom" == Tom Lane writes: > Tom> Don't think that's going to work; the EPQ environment doesn't have > Tom> any way to know that an execPlan link is pointing to something in > Tom> a different estate. > But can't such a way be created? e.g. by pointing execPlan to a special > proxy node that points back to the original estate? I don't really think the amount of complexity that would add is something to consider for a back-patchable fix. > Well, the case of UPDATE ... SET foo = case when x then (select thing > from big_cte) else (select thing from other_big_cte) end will be rather > annoying if we end up forcing both initplans to execute. Given that this bug has been there since the late bronze age and just now got detected, I think that optimizing the fix for especially improbable cases ought not be the first thing on our minds. Let's just get it to work reliably. regards, tom lane
Re: Consistent segfault in complex query
I wrote:
> Your other idea of forcing initPlan parameters to be evaluated before we
> enter the EPQ execution environment is probably more workable.

Concretely, the attached seems to be enough to fix it (though I only
tried the simplest case you posted).

I don't find anything to love about ExecEvalParamExecParams: it's badly
named, badly located, full of undocumented assumptions, and probably
causes a memory leak. Plus it doesn't exist as far back as we need it
for this. But fixing those problems is a separable task. In the
meantime, this is an expedient way to test whether this approach can
work.

regards, tom lane

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c583e02..35c9eb2 100644
*** a/src/backend/executor/execMain.c
--- b/src/backend/executor/execMain.c
*** 46,51 ****
--- 46,52 ----
  #include "commands/matview.h"
  #include "commands/trigger.h"
  #include "executor/execdebug.h"
+ #include "executor/execExpr.h"
  #include "foreign/fdwapi.h"
  #include "mb/pg_wchar.h"
  #include "miscadmin.h"
*** EvalPlanQualBegin(EPQState *epqstate, ES
*** 3078,3083 ****
--- 3079,3087 ----
  	{
  		int			i;
  
+ 		/* First, force evaluation of any initPlans needed by subplan */
+ 		ExecEvalParamExecParams(planstate->plan->extParam, parentestate);
+ 
  		i = list_length(parentestate->es_plannedstmt->paramExecTypes);
  
  		while (--i >= 0)
*** EvalPlanQualStart(EPQState *epqstate, ES
*** 3170,3175 ****
--- 3174,3182 ----
  	{
  		int			i;
  
+ 		/* First, force evaluation of any initPlans needed by subplan */
+ 		ExecEvalParamExecParams(planTree->extParam, parentestate);
+ 
  		i = list_length(parentestate->es_plannedstmt->paramExecTypes);
  		estate->es_param_exec_vals = (ParamExecData *)
  			palloc0(i * sizeof(ParamExecData));
Re: Consistent segfault in complex query
> "Tom" == Tom Lane writes: >> What I'm wondering is whether the param in the copied estate >> shouldn't rather be just a proxy for the one in the original estate >> - if we need to evaluate it, then do so in the original estate, >> store the value there, and copy the value back into the EPQ >> plantree. Tom> Don't think that's going to work; the EPQ environment doesn't have Tom> any way to know that an execPlan link is pointing to something in Tom> a different estate. But can't such a way be created? e.g. by pointing execPlan to a special proxy node that points back to the original estate? Tom> Your other idea of forcing initPlan parameters to be evaluated Tom> before we enter the EPQ execution environment is probably more Tom> workable. It would be annoying to do that for every initPlan in Tom> sight, but I think we could look at the subplan's extParam to see Tom> whether it potentially references that parameter. (Although Tom> really, in most scenarios it wouldn't matter because all the Tom> initPlans in a data-modifying query are probably referenced in the Tom> subplan anyhow ...) Well, the case of UPDATE ... SET foo = case when x then (select thing from big_cte) else (select thing from other_big_cte) end will be rather annoying if we end up forcing both initplans to execute. -- Andrew (irc:RhodiumToad)
Re: Consistent segfault in complex query
Andrew Gierth writes: > "Tom" == Tom Lane writes: > Tom> The second of those; what we need is for any referenced InitPlans > Tom> to be executed afresh under EPQ rules. (I'm not entirely sure that > Tom> an InitPlan could need to see different input tuples under EPQ > Tom> than it'd see otherwise, but I'm not sure it couldn't, either.) > Shouldn't the InitPlan pretty much by definition be independent of the > tuples being locked/updated? [ thinks for awhile... ] Yeah, I wasn't thinking clearly enough. The point of the special EPQ rules is that, other than the target row-being- updated, any tuples from other tables should be the *same* tuples we'd joined that row to before EPQ. The logical extension of that to InitPlans is that it should be the same InitPlan output as before, not potentially a different value ... > And doesn't executing them again run the risk of getting a different > value for other reasons, for example if an initplan is volatile? ... and that's another good argument for not doing the initplan over. > What I'm wondering is whether the param in the copied estate shouldn't > rather be just a proxy for the one in the original estate - if we need > to evaluate it, then do so in the original estate, store the value > there, and copy the value back into the EPQ plantree. Don't think that's going to work; the EPQ environment doesn't have any way to know that an execPlan link is pointing to something in a different estate. Your other idea of forcing initPlan parameters to be evaluated before we enter the EPQ execution environment is probably more workable. It would be annoying to do that for every initPlan in sight, but I think we could look at the subplan's extParam to see whether it potentially references that parameter. (Although really, in most scenarios it wouldn't matter because all the initPlans in a data-modifying query are probably referenced in the subplan anyhow ...) regards, tom lane
Re: Bug fix for glibc broke freebsd build in REL_11_STABLE
On 2018-09-04 17:51:30 -0700, Andres Freund wrote:
> My current proposal is thus to do add a check that does
> #if defined(__clang__) && defined(__i386__) && !defined(__SSE2_MATH__)
> something-that-fails
> #endif
> in an autoconf test, and have configure complain if that fails.
> Something roughly along the lines of "Compiling PostgreSQL with clang,
> on 32bit x86, requires SSE2 support. Use -msse2 or use gcc."

Here's a patch along those lines.

Greetings,

Andres Freund

From 4a457a5b907d1390c133097e9081ecf6fb1d30ef Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Thu, 13 Sep 2018 14:18:43 -0700
Subject: [PATCH] Error out for clang on x86-32 without SSE2 support, no
 -fexcess-precision.

As clang currently doesn't support -fexcess-precision=standard,
compiling x86-32 code with SSE2 disabled can lead to problems with
floating point overflow checks and the like.

This issue was noticed because clang, on at least some BSDs, defaults
to i386 compatibility, whereas it defaults to pentium4 on Linux.

Our forced usage of __builtin_isinf() led to some overflow checks not
triggering when compiling for i386, e.g. when the result of the
calculation didn't overflow in 80bit registers, but did so in 64bit.

While we could just fall back to a non-builtin isinf, it seems likely
that the use of 80bit registers leads to other problems (which is why
we force the flag for GCC already). Therefore error out when detecting
clang in that situation.

Reported-By: Victor Wagner
Analyzed-By: Andrew Gierth and Andres Freund
Author: Andres Freund
Discussion: https://postgr.es/m/20180905005130.ewk4xcs5dgyzc...@alap3.anarazel.de
Backpatch: 9.3-, all supported versions are affected
---
 configure    | 33 +++++++++++++++++++++++++++++++++
 configure.in | 18 ++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/configure b/configure
index c6a44a9078a..9b304023d3d 100755
--- a/configure
+++ b/configure
@@ -7023,6 +7023,39 @@ fi
 rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
 fi
 
+# Defend against clang being used on x86-32 without SSE2 enabled. As current
+# versions of clang do not understand -fexcess-precision=standard, the use of
+# x87 floating point operations leads to problems like isinf possibly returning
+# false for a value that is infinite when converted from the 80bit register to
+# the 8byte memory representation.
+#
+# Only perform the test if the compiler doesn't understand
+# -fexcess-precision=standard, that way a potentially fixed compiler will work
+# automatically.
+if test "$pgac_cv_prog_CC_cflags__fexcess_precision_standard" = no; then
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+
+#if defined(__clang__) && defined(__i386__) && !defined(__SSE2_MATH__)
+choke me
+#endif
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+
+else
+  as_fn_error $? "Compiling PostgreSQL with clang, on 32bit x86, requires SSE2 support. Use -msse2 or use gcc." "$LINENO" 5
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+
 ac_ext=c
 ac_cpp='$CPP $CPPFLAGS'
 ac_compile='$CC -c $CFLAGS $CPPFLAGS conftest.$ac_ext >&5'
diff --git a/configure.in b/configure.in
index 3ada48b5f95..2e60a89502c 100644
--- a/configure.in
+++ b/configure.in
@@ -624,6 +624,24 @@ choke me
 @%:@endif])], [], [AC_MSG_ERROR([do not put -ffast-math in CFLAGS])])
 fi
 
+# Defend against clang being used on x86-32 without SSE2 enabled. As current
+# versions of clang do not understand -fexcess-precision=standard, the use of
+# x87 floating point operations leads to problems like isinf possibly returning
+# false for a value that is infinite when converted from the 80bit register to
+# the 8byte memory representation.
+#
+# Only perform the test if the compiler doesn't understand
+# -fexcess-precision=standard, that way a potentially fixed compiler will work
+# automatically.
+if test "$pgac_cv_prog_CC_cflags__fexcess_precision_standard" = no; then
+AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [
+@%:@if defined(__clang__) && defined(__i386__) && !defined(__SSE2_MATH__)
+choke me
+@%:@endif
+])], [],
+[AC_MSG_ERROR([Compiling PostgreSQL with clang, on 32bit x86, requires SSE2 support. Use -msse2 or use gcc.])])
+fi
+
 AC_PROG_CPP
 AC_SUBST(GCC)
--
2.18.0.rc2.dirty
Re: Consistent segfault in complex query
> "Tom" == Tom Lane writes: >> So I can see exactly where the problem is, but I'm not sure what the >> solution should be. >> EvalPlanQualStart copies the param_exec value list explicitly _not_ >> including the execPlan link, which obviously isn't going to work if >> the value has not been computed yet. Should it be forcing the >> evaluation of initplans that haven't been run yet, or should the EPQ >> scan evaluate them itself from a copy of the plan, or does there >> need to be some way to share state? (having the InitPlan be run more >> than once might be a problem?) Tom> The second of those; what we need is for any referenced InitPlans Tom> to be executed afresh under EPQ rules. (I'm not entirely sure that Tom> an InitPlan could need to see different input tuples under EPQ Tom> than it'd see otherwise, but I'm not sure it couldn't, either.) Obviously you know this code better than I do... but I'm not convinced. Shouldn't the InitPlan pretty much by definition be independent of the tuples being locked/updated? And doesn't executing them again run the risk of getting a different value for other reasons, for example if an initplan is volatile? What I'm wondering is whether the param in the copied estate shouldn't rather be just a proxy for the one in the original estate - if we need to evaluate it, then do so in the original estate, store the value there, and copy the value back into the EPQ plantree. -- Andrew (irc:RhodiumToad)
Re: pg_dump test instability
I wrote: > Peter Eisentraut writes: >> I see. Why not shift all items up to the i'th up by one place, instead >> of moving only the first one? That way the sortedness would be >> preserved. Otherwise we'd move the first one into the middle, then >> sorting would move it to the front again, etc. > Hmmm ... might be worth doing, but I'm not sure. The steady-state cycle > will probably be that after one task has been dispatched, we'll sleep > until some task finishes and then that will unblock some pending items, > resulting in new entries at the end of the list, forcing a sort anyway > before we next dispatch a task. So I was expecting that avoiding a sort > here wasn't really going to be worth expending much effort for. But my > intuition about that could be wrong. I'll run a test case with some > instrumentation added and see how often we could avoid sorts by > memmove'ing. OK, my intuition was faulty. At least when testing with the regression database, situations where we are taking the slow path at all seem to involve several interrelated dump objects (eg indexes of a table) that are all waiting for the same lock, such that we may be able to dispatch a number of unrelated tasks before anything gets added from the pending list. Doing it as you suggest eliminates a significant fraction of the re-sort operations. Attached updated patch does it like that and makes the cosmetic adjustments you suggested. I also went ahead and did the renaming of par_prev/par_next/par_list_xxx that I'd suggested upthread. I think this is committable ... 
regards, tom lane diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h index 42cf441..ba79821 100644 *** a/src/bin/pg_dump/pg_backup.h --- b/src/bin/pg_dump/pg_backup.h *** extern void ConnectDatabase(Archive *AH, *** 252,269 extern void DisconnectDatabase(Archive *AHX); extern PGconn *GetConnection(Archive *AHX); - /* Called to add a TOC entry */ - extern void ArchiveEntry(Archive *AHX, - CatalogId catalogId, DumpId dumpId, - const char *tag, - const char *namespace, const char *tablespace, - const char *owner, bool withOids, - const char *desc, teSection section, - const char *defn, - const char *dropStmt, const char *copyStmt, - const DumpId *deps, int nDeps, - DataDumperPtr dumpFn, void *dumpArg); - /* Called to write *data* to the archive */ extern void WriteData(Archive *AH, const void *data, size_t dLen); --- 252,257 diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index 36e3383..3f7a658 100644 *** a/src/bin/pg_dump/pg_backup_archiver.c --- b/src/bin/pg_dump/pg_backup_archiver.c *** typedef struct _outputContext *** 49,54 --- 49,72 int gzOut; } OutputContext; + /* + * State for tracking TocEntrys that are ready to process during a parallel + * restore. (This used to be a list, and we still call it that, though now + * it's really an array so that we can apply qsort to it.) + * + * tes[] is sized large enough that we can't overrun it. + * The valid entries are indexed first_te .. last_te inclusive. + * We periodically sort the array to bring larger-by-dataLength entries to + * the front; "sorted" is true if the valid entries are known sorted. + */ + typedef struct _parallelReadyList + { + TocEntry **tes; /* Ready-to-dump TocEntrys */ + int first_te; /* index of first valid entry in tes[] */ + int last_te; /* index of last valid entry in tes[] */ + bool sorted; /* are valid entries currently sorted? 
*/ + } ParallelReadyList; + /* translator: this is a module name */ static const char *modulename = gettext_noop("archiver"); *** static void restore_toc_entries_parallel *** 95,107 TocEntry *pending_list); static void restore_toc_entries_postfork(ArchiveHandle *AH, TocEntry *pending_list); ! static void par_list_header_init(TocEntry *l); ! static void par_list_append(TocEntry *l, TocEntry *te); ! static void par_list_remove(TocEntry *te); ! static void move_to_ready_list(TocEntry *pending_list, TocEntry *ready_list, RestorePass pass); ! static TocEntry *get_next_work_item(ArchiveHandle *AH, ! TocEntry *ready_list, ParallelState *pstate); static void mark_dump_job_done(ArchiveHandle *AH, TocEntry *te, --- 113,132 TocEntry *pending_list); static void restore_toc_entries_postfork(ArchiveHandle *AH, TocEntry *pending_list); ! static void pending_list_header_init(TocEntry *l); ! static void pending_list_append(TocEntry *l, TocEntry *te); ! static void pending_list_remove(TocEntry *te); ! static void ready_list_init(ParallelReadyList *ready_list, int tocCount); ! static void ready_list_free(ParallelReadyList *ready_list); ! static void ready_list_insert(ParallelReadyList *ready_list, TocEntry *te); ! static void ready_list_remove(ParallelReadyList *
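The order-preserving removal discussed above -- shift the entries ahead of the removed slot instead of swapping the first entry into the hole -- is just an array `memmove`. A minimal standalone sketch (illustrative only, not the actual pg_dump code; the names and the int stand-ins for TocEntry pointers are made up):

```c
#include <assert.h>
#include <string.h>

/*
 * Toy ready-list modeled loosely on the patch: a sorted array with
 * first/last indices bracketing the valid entries.
 */
typedef struct
{
	int			items[16];		/* stand-in for TocEntry pointers */
	int			first;			/* index of first valid entry */
	int			last;			/* index of last valid entry */
} ToyReadyList;

/*
 * Remove items[i] while preserving sortedness: shift the entries before i
 * up by one place and advance 'first', rather than moving items[first]
 * into the hole (which would break the sort order and force a re-sort
 * before the next dispatch).
 */
static void
toy_ready_list_remove(ToyReadyList *list, int i)
{
	memmove(&list->items[list->first + 1],
			&list->items[list->first],
			(i - list->first) * sizeof(int));
	list->first++;
}
```

After removing the middle entry, the remaining valid entries are still in descending "dataLength" order, so no re-sort is needed before dispatching the next item.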
Re: patch to allow disable of WAL recycling
Hi Peter, I'll take a look at that. I had been trying to keep the patch as minimal as possible, but I'm happy to work through this. Thanks, Jerry On Tue, Sep 11, 2018 at 9:39 AM, Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 10/09/2018 16:10, Jerry Jelinek wrote: > > Thank you again for running all of these tests on your various hardware > > configurations. I was not aware of the convention that the commented > > example in the config file is expected to match the default value, so I > > was actually trying to show what to use if you didn't want the default, > > but I am happy to update the patch so the comment matches the default. > > Beyond that, I am unsure what the next steps are for this proposal. > > Could you organize the code so that the block below > > /* > * Initialize info about where to try to recycle to. > */ > > isn't executed if recycling is off, since we don't need it. > > -- > Peter Eisentraut http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >
Re: Index Skip Scan
On 13/09/18 at 18:39, Jesper Pedersen wrote:
> Yes, this doesn't look good. Using your test case I'm seeing that unique
> is being chosen when the group size is below 34, and skip above. This is
> with the standard initdb configuration; did you change something else?
> Or did you force the default plan?

Sorry, I didn't mention this: the first column is indeed forced skip scan, just to see how it compares to index scan.

> This is something to look at -- maybe there is a way to use
> btpo_next/btpo_prev instead/too in order to speed things up. At the
> moment we just have the scan key in BTScanOpaqueData. I'll take a look
> after my upcoming vacation; feel free to contribute those changes in the
> meantime of course.

I probably won't be able to contribute the changes, but I'll try to review them.

-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [Patch] Create a new session in postmaster by calling setsid()
Hi,

On 2018-09-13 15:27:58 +0800, Paul Guo wrote:
> From the perspective of users rather than developers, I think for such an
> important service the tty should be dissociated, i.e. postmaster should be
> daemonized by default (and perhaps even have a default log filename?) or
> at least be setsid()-ed.

I don't think it'd be good to switch to postgres daemonizing itself. If we were starting fresh, maybe, but there are plenty of people invoking postgres from scripts etc. And tools like systemd don't actually want daemons to fork away, so there's no clear need from that side either.

> I'm not sure whether the several-years-ago LINUX_OOM_ADJ issue (to resolve
> which, silent_mode was removed) still exists or not. I don't have context
> about that, so I conceded to use setsid() even though I prefer the deleted
> silent_mode code.

Yes, the OOM issues are still relevant.

Greetings,

Andres Freund
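For reference, the setsid() step under discussion is the classic first half of daemonizing: fork so the caller's child is not a process-group leader, then call setsid() in the child to start a new session with no controlling terminal. A minimal POSIX sketch (illustrative only, not the postmaster code):

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Detach from the controlling terminal: fork, let the parent exit (as the
 * removed silent_mode used to, returning control to the shell), and call
 * setsid() in the child.  setsid() fails with EPERM if the caller is
 * already a process group leader, which is why the fork comes first.
 */
static int
detach_session(void)
{
	pid_t		pid = fork();

	if (pid < 0)
		return -1;				/* fork failed */
	if (pid > 0)
		_exit(0);				/* parent exits; child carries on */

	if (setsid() < 0)			/* child becomes its own session leader */
		return -1;
	return 0;
}
```

Note that tools like systemd prefer services that do *not* do this fork-away dance, which is part of the argument against daemonizing by default.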
Re: Consistent segfault in complex query
Andrew Gierth writes: > So I can see exactly where the problem is, but I'm not sure what the > solution should be. > EvalPlanQualStart copies the param_exec value list explicitly _not_ > including the execPlan link, which obviously isn't going to work if the > value has not been computed yet. Should it be forcing the evaluation of > initplans that haven't been run yet, or should the EPQ scan evaluate > them itself from a copy of the plan, or does there need to be some way > to share state? (having the InitPlan be run more than once might be a > problem?) The second of those; what we need is for any referenced InitPlans to be executed afresh under EPQ rules. (I'm not entirely sure that an InitPlan could need to see different input tuples under EPQ than it'd see otherwise, but I'm not sure it couldn't, either.) Also, copying the execPlan links would be bad because it'd allow EPQ to execute planstate subtrees that are outside the portion of the planstate tree that it made a working copy of, which doesn't seem safe (e.g., the planstates could easily have dependencies on the particular EState they are children of). I think that the expectation of this code was that empty execPlan links would be filled at need during EvalPlanQualStart's ExecInitNode calls. That's not happening because the InitPlan we're concerned about is not attached to any plan node in the part of the tree that we copied; it's attached to the ModifyTable node itself. We could fix that for the particular scenario we're looking at here, perhaps by having such initplans be initialized the same way EvalPlanQualStart treats subplans. I'm worried though about whether any referenceable initplans might be attached even higher in the plan tree. If we can ensure that the planner will never do that, then a fix along these lines should be fairly straightforward. regards, tom lane
Re: [HACKERS] Bug in to_timestamp().
Alexander Korotkov writes: > I've closed commitfest entry. I think we can add new commitfest entry if > required. Regarding FX part, it easy to extract it as separate patch, but > it's hard to find consensus. I think there are at least three possible > decisions. > 1) Change FX mode to require separators to be the same. > 2) Leave FX mode "as is". > 3) Introduce GUC variable controlling behavior of FX mode. > Any thoughts? A GUC variable is a horrid solution. Years ago we thought it'd be OK to have GUCs that change query behavior, but we've regretted it every time we did that, and often removed them again later (e.g. regex_flavor, sql_inheritance). Applications that want to be portable have to contend with all possible values of the GUC, and that's no fun for anybody. Given the lack of consensus, it's hard to make a case for breaking backwards compatibility, so I'd have to vote for option 2. regards, tom lane
Re: stat() on Windows might cause error if target file is larger than 4GB
On Thu, Sep 13, 2018 at 10:29 AM, Tom Lane wrote: > What I was vaguely imagining is that win32_port.h could #include > whichever Windows header defines these functions and structs, and > then do > > #define stat __stat64 > > static inline ... __stat64(...) { return _stat64(...); } > > What would need testing is whether the #define has nasty side-effects > even if we've already included the system header. I don't think it'd > hurt, eg, local variables named "stat"; though people might be surprised > when examining things in a debugger. This, to me, seems way too clever. Replacing 'struct stat' with something else everywhere in the code is more churn, but far less likely to have surprising consequences down the road. Or so I would think, anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Bug in to_timestamp().
On Tue, Sep 11, 2018 at 6:18 PM Arthur Zakirov wrote: > On Sun, Sep 09, 2018 at 09:52:46PM +0300, Alexander Korotkov wrote: > > So, pushed! Thanks to every thread participant for review and feedback. > > Great! Should we close the commitfest entry? There is FX part of the > patch though. But it seems that nobody is happy with it. It could be > done with a separate patch anyway. > I've closed commitfest entry. I think we can add new commitfest entry if required. Regarding FX part, it easy to extract it as separate patch, but it's hard to find consensus. I think there are at least three possible decisions. 1) Change FX mode to require separators to be the same. 2) Leave FX mode "as is". 3) Introduce GUC variable controlling behavior of FX mode. Any thoughts? -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Avoid extra Sort nodes between WindowAggs when sorting can be reused
Here's what I have queued up to push. My changes are: - added to the header comment of list_concat_unique that callers have ordering expectations. Didn't touch list_union, since I ended up sticking with list_concat_unique for this patch. - WindowClauseSortNode renamed WindowClauseSortData - added and rewrote some comments - tidied up some casting in common_prefix_cmp - pgindent and some layout tweaks -- Andrew (irc:RhodiumToad) >From 9c89883ffe2153cc9d047f71a2b0e611f2c60452 Mon Sep 17 00:00:00 2001 From: Andrew Gierth Date: Thu, 13 Sep 2018 18:12:37 +0100 Subject: [PATCH] Order active window clauses for greater reuse of Sort nodes. By sorting the active window list lexicographically by the sort clause list but putting longer clauses before shorter prefixes, we generate more chances to elide Sort nodes when building the path. Author: Daniel Gustafsson (with some editorialization by me) Reviewed-by: Alexander Kuzmenkov, Masahiko Sawada, Tom Lane Discussion: https://postgr.es/m/124A7F69-84CD-435B-BA0E-2695BE21E5C2%40yesql.se --- src/backend/nodes/list.c | 7 +- src/backend/optimizer/plan/planner.c | 154 +-- src/test/regress/expected/window.out | 60 +++--- src/test/regress/sql/window.sql | 16 4 files changed, 177 insertions(+), 60 deletions(-) diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c index f3e1800708..55fd4c359b 100644 --- a/src/backend/nodes/list.c +++ b/src/backend/nodes/list.c @@ -1011,8 +1011,11 @@ list_append_unique_oid(List *list, Oid datum) * via equal(). * * This is almost the same functionality as list_union(), but list1 is - * modified in-place rather than being copied. Note also that list2's cells - * are not inserted in list1, so the analogy to list_concat() isn't perfect. + * modified in-place rather than being copied. However, callers of this + * function may have strict ordering expectations -- i.e. that the relative + * order of those list2 elements that are not duplicates is preserved. 
Note + * also that list2's cells are not inserted in list1, so the analogy to + * list_concat() isn't perfect. */ List * list_concat_unique(List *list1, List *list2) diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 96bf0601a8..94b85721fa 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -110,6 +110,17 @@ typedef struct int *tleref_to_colnum_map; } grouping_sets_data; +/* + * Temporary structure for use during WindowClause reordering in order to be + * be able to sort WindowClauses on partitioning/ordering prefix. + */ +typedef struct +{ + WindowClause *wc; + List *uniqueOrder; /* A List of unique ordering/partitioning + * clauses per Window */ +} WindowClauseSortData; + /* Local functions */ static Node *preprocess_expression(PlannerInfo *root, Node *expr, int kind); static void preprocess_qual_conditions(PlannerInfo *root, Node *jtnode); @@ -237,6 +248,7 @@ static void create_partitionwise_grouping_paths(PlannerInfo *root, static bool group_by_has_partkey(RelOptInfo *input_rel, List *targetList, List *groupClause); +static int common_prefix_cmp(const void *a, const void *b); /* @@ -5260,68 +5272,120 @@ postprocess_setop_tlist(List *new_tlist, List *orig_tlist) static List * select_active_windows(PlannerInfo *root, WindowFuncLists *wflists) { - List *result; - List *actives; + List *windowClause = root->parse->windowClause; + List *result = NIL; ListCell *lc; + int nActive = 0; + WindowClauseSortData *actives = palloc(sizeof(WindowClauseSortData) + * list_length(windowClause)); - /* First, make a list of the active windows */ - actives = NIL; - foreach(lc, root->parse->windowClause) + /* First, construct an array of the active windows */ + foreach(lc, windowClause) { WindowClause *wc = lfirst_node(WindowClause, lc); /* It's only active if wflists shows some related WindowFuncs */ Assert(wc->winref <= wflists->maxWinRef); - if (wflists->windowFuncs[wc->winref] != NIL) - 
actives = lappend(actives, wc); + if (wflists->windowFuncs[wc->winref] == NIL) + continue; + + actives[nActive].wc = wc; /* original clause */ + + /* + * For sorting, we want the list of partition keys followed by the + * list of sort keys. But pathkeys construction will remove duplicates + * between the two, so we can as well (even though we can't detect all + * of the duplicates, since some may come from ECs - that might mean + * we miss optimization chances here). We must, however, ensure that + * the order of entries is preserved with respect to the ones we do + * keep. + * + * partitionClause and orderClause had their own duplicates removed in + * parse analysis, so we're only concerned here with removing + * orderClause entries tha
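The ordering rule the commit message describes -- lexicographic on the sort clause list, with longer clauses sorting before their own prefixes -- can be sketched with a toy comparator. (This is a hypothetical stand-in: the real common_prefix_cmp compares WindowClauseSortData entries by sort-group refs, operators, and so on, while here a "sort clause" is just an array of column numbers.)

```c
#include <assert.h>
#include <stdlib.h>

/* Toy window "sort clause list": an array of column numbers. */
typedef struct
{
	int			keys[4];
	int			nkeys;
} ToyWindowSort;

/*
 * Lexicographic comparison, except that when one key list is a prefix of
 * the other, the longer list sorts first.  That way a window needing only
 * the prefix lands immediately after the longer one and can reuse its
 * Sort node instead of requiring a new one.
 */
static int
toy_prefix_cmp(const void *a, const void *b)
{
	const ToyWindowSort *wa = a;
	const ToyWindowSort *wb = b;
	int			n = (wa->nkeys < wb->nkeys) ? wa->nkeys : wb->nkeys;

	for (int i = 0; i < n; i++)
	{
		if (wa->keys[i] != wb->keys[i])
			return wa->keys[i] - wb->keys[i];
	}
	return wb->nkeys - wa->nkeys;	/* longer first on a common prefix */
}
```

Sorting windows keyed on (1,2,3), (1,2), (1), (2) in that order means only two sorts are needed: one for the (1,2,3) group, whose output also satisfies (1,2) and (1), and one for (2).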
Re: [patch] Support LLVM 7
On 2018-09-12 23:07:34 +0200, Christoph Berg wrote: > Re: Andres Freund 2018-09-12 > <20180912210338.h3vsss5lkuu26...@alap3.anarazel.de> > > Hi, > > > > On 2018-09-12 14:45:17 +0200, Christoph Berg wrote: > > > LLVM 7 landed in Debian unstable, this patch teaches ./configure to use > > > it. (General patch, not specific to Debian.) > > > > Thanks. Yes, I think we should do that, especially because my patches > > to add proper debugging and profiling support only landed in LLVM > > 7. Therefore I'm planning to add this to both v11 and master. Unless > > somebody protests? > > I plan to switch postgresql-11.deb to LLVM 7 over the next days > because of the support for non-x86 architectures, so this should > definitely land in 11. Pushed, thanks for the patch! Greetings, Andres Freund
Re: cache lookup failed for constraint when alter table referred by partition table
On 2018-Sep-13, Tom Lane wrote: > Alvaro Herrera writes: > > That's the problem all right. The solution is to drop all > > index/constraint objects together in one performMultipleDeletions() > > instead of performDeletion() one by one, as in the attached patch. > > Looks reasonable as far as it goes. Given that we no longer require > any of this: > > - * Now we can drop the existing constraints and indexes --- constraints > - * first, since some of them might depend on the indexes. In fact, we > - * have to delete FOREIGN KEY constraints before UNIQUE constraints, but > - * we already ordered the constraint list to ensure that would happen. > > can we make any simplifications in earlier steps? At the very least, > look for comments related to this assumption. Ah, I had looked, but not hard enough. In this new version I removed some code in ATExecAlterColumnType that's now irrelevant. I tested this by changing both lappend calls to lcons in that function; seems to behave the same. (Also added more constraints to the test case.) Another thing I found I can change is to move the add_object_address() calls to the other loops scanning the same lists, so that we don't have to walk each the two lists twice. 
-- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 9a9fe65eeb0bc3074793474b9d85b10c3e260e78 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Thu, 13 Sep 2018 13:26:18 -0300 Subject: [PATCH v2] fix ALTER TYPE --- src/backend/commands/tablecmds.c | 71 +++ src/test/regress/expected/foreign_key.out | 12 ++ src/test/regress/sql/foreign_key.sql | 11 + 3 files changed, 47 insertions(+), 47 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index e96512e051..03c0b739b3 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -9527,33 +9527,12 @@ ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel, { char *defstring = pg_get_constraintdef_command(foundObject.objectId); - /* -* Put NORMAL dependencies at the front of the list and -* AUTO dependencies at the back. This makes sure that -* foreign-key constraints depending on this column will -* be dropped before unique or primary-key constraints of -* the column; which we must have because the FK -* constraints depend on the indexes belonging to the -* unique constraints. -*/ - if (foundDep->deptype == DEPENDENCY_NORMAL) - { - tab->changedConstraintOids = - lcons_oid(foundObject.objectId, - tab->changedConstraintOids); - tab->changedConstraintDefs = - lcons(defstring, - tab->changedConstraintDefs); - } - else - { - tab->changedConstraintOids = - lappend_oid(tab->changedConstraintOids, - foundObject.objectId); - tab->changedConstraintDefs = - lappend(tab->changedConstraintDefs, - defstring); - } + tab->changedConstraintOids = + lappend_oid(tab->changedConstraintOids, + foundObject.objectId); + tab->changedConstraintDefs = + lappend(tab->changedConstraintDefs, + defstring); } break; @@ -9893,10 +9872,18 @@ static void ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab, LOCKMODE lockmode) { Objec
PostgreSQL 11 {Beta 4, RC1} Release: 2018-09-20
Hi, We are planning to have another release of PostgreSQL 11, either Beta 4 or RC1, next week on Thursday, 2018-09-20. The version will be determined based on the state of the open items list[1] around the time of stamping. Thanks, Jonathan [1] https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Open_Issues
Re: cache lookup failed for constraint when alter table referred by partition table
Alvaro Herrera writes: > That's the problem all right. The solution is to drop all > index/constraint objects together in one performMultipleDeletions() > instead of performDeletion() one by one, as in the attached patch. Looks reasonable as far as it goes. Given that we no longer require any of this: -* Now we can drop the existing constraints and indexes --- constraints -* first, since some of them might depend on the indexes. In fact, we -* have to delete FOREIGN KEY constraints before UNIQUE constraints, but -* we already ordered the constraint list to ensure that would happen. can we make any simplifications in earlier steps? At the very least, look for comments related to this assumption. regards, tom lane
Re: cache lookup failed for constraint when alter table referred by partition table
On 2018-Sep-10, Alvaro Herrera wrote: > ATPostAlterTypeCleanup is trying to search the original constraint by > OID in order to drop it, but it's not there -- I suppose it has already > been dropped by recursion in a previous step. That's the problem all right. The solution is to drop all index/constraint objects together in one performMultipleDeletions() instead of performDeletion() one by one, as in the attached patch. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 4e1ec5fcb396e3c71fc399c986d9bbd9610d7849 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Thu, 13 Sep 2018 13:26:18 -0300 Subject: [PATCH] fix ALTER TYPE --- src/backend/commands/tablecmds.c | 28 ++-- src/test/regress/expected/foreign_key.out | 12 src/test/regress/sql/foreign_key.sql | 11 +++ 3 files changed, 37 insertions(+), 14 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index e96512e051..f8c5d71ccf 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -9893,6 +9893,7 @@ static void ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab, LOCKMODE lockmode) { ObjectAddress obj; + ObjectAddresses *objects; ListCell *def_item; ListCell *oid_item; @@ -9963,29 +9964,28 @@ ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab, LOCKMODE lockmode) } /* -* Now we can drop the existing constraints and indexes --- constraints -* first, since some of them might depend on the indexes. In fact, we -* have to delete FOREIGN KEY constraints before UNIQUE constraints, but -* we already ordered the constraint list to ensure that would happen. It -* should be okay to use DROP_RESTRICT here, since nothing else should be -* depending on these objects. +* Now we can drop the existing constraints and indexes. Do them all in a +* single call, so that we don't have to worry about dependencies among +* them. 
It should be okay to use DROP_RESTRICT here, since nothing else +* should be depending on these objects. */ + objects = new_object_addresses(); foreach(oid_item, tab->changedConstraintOids) { - obj.classId = ConstraintRelationId; - obj.objectId = lfirst_oid(oid_item); - obj.objectSubId = 0; - performDeletion(&obj, DROP_RESTRICT, PERFORM_DELETION_INTERNAL); + ObjectAddressSet(obj, ConstraintRelationId, lfirst_oid(oid_item)); + add_exact_object_address(&obj, objects); } foreach(oid_item, tab->changedIndexOids) { - obj.classId = RelationRelationId; - obj.objectId = lfirst_oid(oid_item); - obj.objectSubId = 0; - performDeletion(&obj, DROP_RESTRICT, PERFORM_DELETION_INTERNAL); + ObjectAddressSet(obj, RelationRelationId, lfirst_oid(oid_item)); + add_exact_object_address(&obj, objects); } + performMultipleDeletions(objects, DROP_RESTRICT, PERFORM_DELETION_INTERNAL); + + free_object_addresses(objects); + /* * The objects will get recreated during subsequent passes over the work * queue. diff --git a/src/test/regress/expected/foreign_key.out b/src/test/regress/expected/foreign_key.out index b90c4926e2..1854867381 100644 --- a/src/test/regress/expected/foreign_key.out +++ b/src/test/regress/expected/foreign_key.out @@ -1518,6 +1518,18 @@ DETAIL: Key (a, b)=(2500, 2502) is still referenced from table "fk_partitioned_ ALTER TABLE fk_partitioned_fk DROP CONSTRAINT fk_partitioned_fk_a_fkey; -- done. DROP TABLE fk_notpartitioned_pk, fk_partitioned_fk; +-- Altering a type referenced by a foreign key needs to drop/recreate the FK. +-- Ensure that works. 
+CREATE TABLE fk_notpartitioned_pk (a INT, PRIMARY KEY(a)); +CREATE TABLE fk_partitioned_fk (a INT REFERENCES fk_notpartitioned_pk(a)) PARTITION BY RANGE(a); +CREATE TABLE fk_partitioned_fk_1 PARTITION OF fk_partitioned_fk FOR VALUES FROM (MINVALUE) TO (MAXVALUE); +INSERT INTO fk_notpartitioned_pk VALUES (1); +INSERT INTO fk_partitioned_fk VALUES (1); +ALTER TABLE fk_notpartitioned_pk ALTER COLUMN a TYPE bigint; +DELETE FROM fk_notpartitioned_pk WHERE a = 1; +ERROR: update or delete on table "fk_notpartitioned_pk" violates foreign key constraint "fk_partitioned_fk_a_fkey" on table "fk_partitioned_fk" +DETAIL: Key (a)=(1) is still referenced from table "fk_partitioned_fk". +DROP TABLE fk_notpartitioned_pk, fk_partitioned_fk; -- Test some other exotic foreign key features: MATCH SIMPLE, ON UPDATE/DELETE -- actions CREATE TABLE fk_notpartitioned_pk (a int, b int, primary key (a, b)); diff --git a/src/test/regress/sql/foreign_key.sql b/src/test/regress/sq
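The failure mode this patch fixes can be modeled abstractly: if dropping one object also drops a dependent object that is still on our to-drop list, a later one-at-a-time lookup fails, while collecting everything first and deleting in one pass does not. A toy model (entirely hypothetical; the real dependency machinery lives in dependency.c, and the names below are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

#define NOBJS 4

static bool objExists[NOBJS];
static int	objCascadesTo[NOBJS];	/* -1 = no dependent object */

/* Drop an object plus anything depending on it. */
static void
dropWithDependents(int oid)
{
	objExists[oid] = false;
	if (objCascadesTo[oid] >= 0)
		objExists[objCascadesTo[oid]] = false;
}

/*
 * Drop each listed object in turn; fails if a prior drop's recursion
 * already removed a later list entry -- the toy analogue of
 * "cache lookup failed for constraint".
 */
static bool
dropOneByOne(const int *oids, int n)
{
	for (int i = 0; i < n; i++)
	{
		if (!objExists[oids[i]])
			return false;
		dropWithDependents(oids[i]);
	}
	return true;
}

/* Collect the whole set first, then delete once, tolerating overlap. */
static bool
dropAllAtOnce(const int *oids, int n)
{
	for (int i = 0; i < n; i++)
	{
		if (objExists[oids[i]])
			dropWithDependents(oids[i]);
	}
	return true;
}
```

This mirrors the fix: performMultipleDeletions() receives the full set of constraint and index OIDs up front, so internal dependencies among them cannot cause a failed lookup partway through.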
Re: Index Skip Scan
Hi Alexander.

On 9/13/18 9:01 AM, Alexander Kuzmenkov wrote:
> While testing this patch

Thanks for the review !

> I noticed that the current implementation doesn't perform well when we
> have lots of small groups of equal values. Here is the execution time of
> index skip scan vs unique over index scan, in ms, depending on the group
> size. The benchmark script is attached.
>
>   group size      skip    unique
>            1  2,293.85    132.55
>            5    464.40    106.59
>           10    239.61    102.02
>           50     56.59     98.74
>          100     32.56    103.04
>          500      6.08     97.09

Yes, this doesn't look good. Using your test case I'm seeing that unique is being chosen when the group size is below 34, and skip above. This is with the standard initdb configuration; did you change something else? Or did you force the default plan?

> So, the current implementation can lead to performance regression, and
> the choice of the plan depends on the notoriously unreliable ndistinct
> statistics.

Yes, Peter mentioned this, which I'm still looking at.

> The regression is probably because skip scan always does _bt_search to
> find the next unique tuple.

Very likely.

> I think we can improve this, and the skip scan can be strictly faster
> than index scan regardless of the data. As a first approximation, imagine
> that we somehow skipped equal tuples inside _bt_next instead of sending
> them to the parent Unique node. This would already be marginally faster
> than Unique + Index scan. A more practical implementation would be to
> remember our position in the tree (that is, the BTStack returned by
> _bt_search) and use it to skip pages in bulk. This looks straightforward
> to implement for a tree that does not change, but I'm not sure how to
> make it work with concurrent modifications. Still, this looks a
> worthwhile direction to me, because if we have a strictly faster skip
> scan, we can just use it always and not worry about our unreliable
> statistics. What do you think?

This is something to look at -- maybe there is a way to use btpo_next/btpo_prev instead/too in order to speed things up.
At the moment we just have the scan key in BTScanOpaqueData. I'll take a look after my upcoming vacation; feel free to contribute those changes in the meantime of course. Thanks again! Best regards, Jesper
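The trade-off in the numbers above can be sketched outside the btree code: extracting distinct values by repeated binary search ("skip") does roughly O(groups x log n) work, while a linear scan feeding a Unique-style filter does O(n), so skipping wins with large groups and loses with tiny ones. A toy sketch over a sorted array (not the actual _bt_search/_bt_next machinery):

```c
#include <assert.h>
#include <stddef.h>

/* Return the first index in sorted[] holding a value greater than key. */
static size_t
upper_bound(const int *sorted, size_t n, int key)
{
	size_t		lo = 0,
				hi = n;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (sorted[mid] <= key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * "Skip scan": emit the value at the current position, then binary-search
 * past all of its duplicates -- analogous to re-descending the tree with
 * _bt_search.  Returns the number of distinct values copied into out[].
 */
static size_t
skip_scan_distinct(const int *sorted, size_t n, int *out)
{
	size_t		i = 0,
				ndistinct = 0;

	while (i < n)
	{
		out[ndistinct++] = sorted[i];
		i += upper_bound(sorted + i, n - i, sorted[i]);	/* skip the group */
	}
	return ndistinct;
}
```

With group size 1 the binary searches buy nothing and only add log-factor overhead over a plain scan, which matches the regression observed above; with group size 500 each search leaps over hundreds of tuples.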
Re: speeding up planning with partitions
On Tue, Sep 4, 2018 at 12:44 PM, Amit Langote wrote: > Hi Dilip, > > Thanks for taking a look. > > On 2018/09/03 20:57, Dilip Kumar wrote: >> The idea looks interesting while going through the patch I observed >> this comment. >> >> /* >> * inheritance_planner >> * Generate Paths in the case where the result relation is an >> * inheritance set. >> * >> * We have to handle this case differently from cases where a source relation >> * is an inheritance set. Source inheritance is expanded at the bottom of the >> * plan tree (see allpaths.c), but target inheritance has to be expanded at >> * the top. >> >> I think with your patch these comments needs to be change? > > Yes, maybe a good idea to mention that partitioned table result relations > are not handled here. > > Actually, I've been wondering if this patch (0001) shouldn't get rid of > inheritance_planner altogether and implement the new approach for *all* > inheritance sets, not just partitioned tables, but haven't spent much time > on that idea yet. That will be interesting. > >> if (parse->resultRelation && >> - rt_fetch(parse->resultRelation, parse->rtable)->inh) >> + rt_fetch(parse->resultRelation, parse->rtable)->inh && >> + rt_fetch(parse->resultRelation, parse->rtable)->relkind != >> + RELKIND_PARTITIONED_TABLE) >> inheritance_planner(root); >> else >> grouping_planner(root, false, tuple_fraction); >> >> I think we can add some comments to explain if the target rel itself >> is partitioned rel then why >> we can directly go to the grouping planner. > > Okay, I will try to add more explanatory comments here and there in the > next version I will post later this week. Okay. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: stat() on Windows might cause error if target file is larger than 4GB
Michael Paquier writes: > On Wed, Sep 12, 2018 at 08:17:05PM -0400, Tom Lane wrote: >> And we can't just "#define stat __stat64" because >> that would break the calls to stat(). Or wait, maybe we could do that >> and also have a one-liner function named __stat64 that would catch the >> calls and redirect to _stat64? > I don't think I grab your point here. "#define stat __stat64" cannot > exist from the start so how would you do that? What I was vaguely imagining is that win32_port.h could #include whichever Windows header defines these functions and structs, and then do #define stat __stat64 static inline ... __stat64(...) { return _stat64(...); } What would need testing is whether the #define has nasty side-effects even if we've already included the system header. I don't think it'd hurt, eg, local variables named "stat"; though people might be surprised when examining things in a debugger. regards, tom lane
Re: Avoid extra Sort nodes between WindowAggs when sorting can be reused
Andrew Gierth writes: > (aside: I itch to rewrite the comment that says "the spec requires that > there be only one sort" - number of sorts is an implementation detail > about which the spec is silent, what it _actually_ requires is that peer > rows must be presented in the same order in all order-equivalent > windows, which we choose to implement by ensuring there is only one sort > for such windows, rather than, for example, adding extra sort keys to > provide stability.) Sure, rewrite away. > (Perhaps worth noting for future work is that this code and the grouping > sets code have a common issue: currently we allow only one sort order to > be requested as query_pathkeys, but this means that both window paths > and grouping sets paths have to make an essentially arbitrary choice of > query_pathkeys, rather than having a set of possible "useful" orderings > and taking whichever can be produced most cheaply.) Yeah, I've had a bee in my bonnet for awhile about replacing query_pathkeys with a list of potentially-desirable result orderings. So far there hasn't been a truly compelling reason to do it, but if somebody felt like generalizing the window function ordering stuff in that direction, it'd be a nice project. regards, tom lane
Re: [HACKERS] Cutting initdb's runtime (Perl question embedded)
Thomas Munro writes: > I noticed that the patch does a bunch of s/Olson/IANA/. That leaves > only one place in the tree that still refers to the "Olson" database: > dt_common.c. Might want to change that too? Ah, I didn't realize we were that close to wiping out the old terminology. Yeah, I'll get that one too. Thanks for noticing. regards, tom lane
Re: Getting ERROR: could not open file "base/13164/t3_16388" with partition table with ON COMMIT
Amit Langote writes: > On 2018/09/13 1:14, Tom Lane wrote: >> That seems excessively restrictive. Anything that has storage (e.g. >> matviews) ought to be truncatable, no? > Not by heap_truncate it seems. The header comment of heap_truncate says > that it concerns itself only with ON COMMIT truncation of temporary tables: Ah. Well, in that case I'm OK with just a simple test for RELKIND_RELATION, but I definitely feel that it should be inside heap_truncate. Callers don't need to know about the limited scope of what that does. regards, tom lane
Re: Avoid extra Sort nodes between WindowAggs when sorting can be reused
> "Tom" == Tom Lane writes: Tom> I'm also a bit suspicious as to whether the code is even correct; Tom> does it *really* match what will happen later when we create sort Tom> plan nodes? (Maybe it's fine; I haven't looked.) As things stand before the patch, the code that actually generates the path of sort+window nodes requires only this assumption: that order-equivalent windows (as defined by the spec) must end up together in the list, or otherwise separated only by entries that don't add a new sort node. (aside: I itch to rewrite the comment that says "the spec requires that there be only one sort" - number of sorts is an implementation detail about which the spec is silent, what it _actually_ requires is that peer rows must be presented in the same order in all order-equivalent windows, which we choose to implement by ensuring there is only one sort for such windows, rather than, for example, adding extra sort keys to provide stability.) The path-generation code simply concatenates the partition and order lists and creates pathkeys. The pathkeys creation removes redundant entries. So if we're guaranteed that two entries considered equal by the patch code are also considered equal by the pathkey mechanism, which I believe is the case, then the logic is still correct (enough to satisfy the spec and produce correct query results). There are optimizations that can be done once we have the pathkeys that can't be anticipated by select_active_windows because that function is run before we set up equivalence classes. This might lead path creation to produce fewer sorts than anticipated, but not more sorts. So I'm satisfied, as far as I can tell, that the logic is both correct and an improvement over what we currently have. 
(Perhaps worth noting for future work is that this code and the grouping sets code have a common issue: currently we allow only one sort order to be requested as query_pathkeys, but this means that both window paths and grouping sets paths have to make an essentially arbitrary choice of query_pathkeys, rather than having a set of possible "useful" orderings and taking whichever can be produced most cheaply.) -- Andrew (irc:RhodiumToad)
Re: pg_dump test instability
Peter Eisentraut writes: > On 12/09/2018 18:06, Tom Lane wrote: >> No. In both code paths, the array slot at index first_te is being >> physically dropped from the set of valid entries (by incrementing >> first_te). In the first path, that slot holds the item we want to >> remove logically from the set, so that incrementing first_te is >> all we have to do: the remaining entries are still in the range >> first_te..last_te, and they're still sorted. In the second code >> path, the item that was in that slot is still wanted as part of >> the set, so we copy it into the valid range (overwriting the item >> in slot i, which is no longer wanted). Now the valid range is >> probably not sorted, so we have to flag that a re-sort is needed. > I see. Why not shift all items up to the i'th up by one place, instead > of moving only the first one? That way the sortedness would be > preserved. Otherwise we'd move the first one into the middle, then > sorting would move it to the front again, etc. Hmmm ... might be worth doing, but I'm not sure. The steady-state cycle will probably be that after one task has been dispatched, we'll sleep until some task finishes and then that will unblock some pending items, resulting in new entries at the end of the list, forcing a sort anyway before we next dispatch a task. So I was expecting that avoiding a sort here wasn't really going to be worth expending much effort for. But my intuition about that could be wrong. I'll run a test case with some instrumentation added and see how often we could avoid sorts by memmove'ing. regards, tom lane
Re: Loaded footgun open_datasync on Windows
Michael Paquier wrote: > On Mon, Sep 10, 2018 at 04:46:40PM +0200, Laurenz Albe wrote: > > I didn't get pg_upgrade to work without the log file hacks; I suspect > > that there is more than just log file locking going on, but my Windows > > skills are limited. > > > > How shall I proceed? > > I do like this patch, and we have an occasion to clean a bunch of things > in pg_upgrade, so this argument is enough to me to put my hands in the > dirt and check by myself, so... > > I really thought that this was not ambitious enough, so I have hacked on > top of your patch, so as pg_upgrade concurrent issues are removed, and I > have found one barrier in pg_ctl which decides that it is smarter to > redirect the log file (here pg_upgrade_server.log) using CMD. The > problem is that the lock taken by the process which does the redirection > does not work nicely with what pg_upgrade does in parallel. So I think > that it is better to drop that part. I got you now. So I won't try to deal with that at this point. > +#ifdef WIN32 > + if ((infile = fopen(path, "rt")) == NULL) > +#else > if ((infile = fopen(path, "r")) == NULL) > +#endif > This should have a comment, saying roughly that as this uses > win32_fopen, text mode needs to be enforced to get proper CRLF. Done. > One spot for open() is missed in file_utils.c, please see > pre_sync_fname(). Done. > The patch fails to apply for pg_verify_checksums, with a conflict easy > enough to fix. Fixed. > Laurenz, could you update your patch? I am switching that as waiting on > author for now. Attached is the new, improved version of the patch. Thanks again! Yours, Laurenz Albe From 8cd3d1a75fc3b18e6928b9b26841f1947d1f72ba Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Thu, 13 Sep 2018 14:24:59 +0200 Subject: [PATCH] Use pgwin32_open in frontend code on Windows This is particularly important for pg_test_fsync which does not handle O_DSYNC correctly otherwise, leading to false claims that disks are unsafe. 
With pgwin32_open, files won't get locked when opened. This could be used to get rid of some of the workarounds for Windows' file locking, but that is left for later because it proved not to be as trivial as hoped. Discussion: https://www.postgresql.org/message-id/1527846213.2475.31.camel%40cybertec.at Laurenz Albe, reviewed by Michael Paquier and Kuntal Ghosh --- src/bin/initdb/initdb.c | 8 src/bin/pg_basebackup/pg_receivewal.c | 2 +- src/bin/pg_verify_checksums/pg_verify_checksums.c | 2 +- src/common/file_utils.c | 4 ++-- src/include/port.h| 3 --- 5 files changed, 12 insertions(+), 7 deletions(-) diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index 32746c7703..bf234c0763 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -490,7 +490,15 @@ readfile(const char *path) char *buffer; int c; +#ifdef WIN32 + /* + * On Windows, we have to open the file in text mode + * so that "carriage return" characters are stripped. + */ + if ((infile = fopen(path, "rt")) == NULL) +#else if ((infile = fopen(path, "r")) == NULL) +#endif { fprintf(stderr, _("%s: could not open file \"%s\" for reading: %s\n"), progname, path, strerror(errno)); diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c index 8be8d48a8a..912aed8d7c 100644 --- a/src/bin/pg_basebackup/pg_receivewal.c +++ b/src/bin/pg_basebackup/pg_receivewal.c @@ -288,7 +288,7 @@ FindStreamingStart(uint32 *tli) snprintf(fullpath, sizeof(fullpath), "%s/%s", basedir, dirent->d_name); - fd = open(fullpath, O_RDONLY | PG_BINARY); + fd = open(fullpath, O_RDONLY | PG_BINARY, 0); if (fd < 0) { fprintf(stderr, _("%s: could not open compressed file \"%s\": %s\n"), diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c index d46bac2cd6..507f83ca4c 100644 --- a/src/bin/pg_verify_checksums/pg_verify_checksums.c +++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c @@ -80,7 +80,7 @@ scan_file(const char *fn, 
BlockNumber segmentno) int f; BlockNumber blockno; - f = open(fn, O_RDONLY | PG_BINARY); + f = open(fn, O_RDONLY | PG_BINARY, 0); if (f < 0) { fprintf(stderr, _("%s: could not open file \"%s\": %s\n"), diff --git a/src/common/file_utils.c b/src/common/file_utils.c index 48876061c3..d952bc8c88 100644 --- a/src/common/file_utils.c +++ b/src/common/file_utils.c @@ -222,7 +222,7 @@ pre_sync_fname(const char *fname, bool isdir, const char *progname) { int fd; - fd = open(fname, O_RDONLY | PG_BINARY); + fd = open(fname, O_RDONLY | PG_BINARY, 0); if (fd < 0) { @@ -283,7 +283,7 @@ fsync_fname(const char *fname, bool isdir, const char *progname) * unsupported operations, e.g. opening a directory under Windows), and * logging others. */ - fd = open(fname, flags); + fd = open(fname, flags, 0); if (fd
Re: Index Skip Scan
Hi Jesper,

While testing this patch I noticed that the current implementation doesn't perform well when we have lots of small groups of equal values. Here is the execution time of index skip scan vs unique over index scan, in ms, depending on the size of the group. The benchmark script is attached.

group size    skip      unique
1             2,293.85  132.55
5               464.40  106.59
10              239.61  102.02
50               56.59   98.74
100              32.56  103.04
500               6.08   97.09

So, the current implementation can lead to performance regression, and the choice of the plan depends on the notoriously unreliable ndistinct statistics. The regression is probably because skip scan always does _bt_search to find the next unique tuple. I think we can improve this, and the skip scan can be strictly faster than index scan regardless of the data. As a first approximation, imagine that we somehow skipped equal tuples inside _bt_next instead of sending them to the parent Unique node. This would already be marginally faster than Unique + Index scan. A more practical implementation would be to remember our position in the tree (that is, the BTStack returned by _bt_search) and use it to skip pages in bulk. This looks straightforward to implement for a tree that does not change, but I'm not sure how to make it work with concurrent modifications. Still, this looks a worthwhile direction to me, because if we have a strictly faster skip scan, we can just use it always and not worry about our unreliable statistics. What do you think?

-- Alexander Kuzmenkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

test-skip.sql Description: application/sql
Re: Protect syscache from bloating with negative cache entries
Hello. Thank you for looking at this.

At Wed, 12 Sep 2018 05:16:52 +, "Ideriha, Takeshi" wrote in <4E72940DA2BF16479384A86D54D0988A6F197012@G01JPEXMBKW04>
> Hi,
>
> >Subject: Re: Protect syscache from bloating with negative cache entries
> >
> >Hello. The previous v4 patchset was just broken.
> >Somehow the 0004 was merged into the 0003 and applying 0004 results in
> >failure. I removed 0004 part from the 0003 and rebased and repost it.
>
> I have some questions about syscache and relcache pruning
> though they may be discussed at upper thread or out of point.
>
> Can I confirm about catcache pruning?
> syscache_memory_target is the max figure per CatCache.
> (Any CatCache has the same max value.)
> So the total max size of catalog caches is estimated around or
> slightly more than # of SysCache array times syscache_memory_target.

Right.

> If correct, I'm thinking writing down the above estimation to the document
> would help db administrators with estimation of memory usage.
> Current description might lead misunderstanding that syscache_memory_target
> is the total size of catalog cache in my impression.

Honestly, I'm not sure that is the right design. However, I don't think providing such a formula helps users, since they don't know exactly how many CatCaches and their siblings live in their server, it is a soft limit, and in the end only a few, or just one, of the catalog caches can reach the limit. The current design is based on the assumption that we would have only one extremely growable cache in any one use case.

> Related to the above I just thought changing sysycache_memory_target per
> CatCache would make memory usage more efficient.

We could easily have per-cache settings in CatCache, but how do we provide the knobs for them? I can only think of overly complicated solutions for that.

> Though I haven't checked if there's a case that each system catalog cache
> memory usage varies largely,
> pg_class cache might need more memory than others and others might need less.
> But it would be difficult for users to check each CatCache memory usage and
> tune it because right now postgresql hasn't provided a handy way to check them.

I assumed that this would be used without such a means: someone suffering from syscache bloat can just set this GUC to avoid the bloat.

Apart from that, in the current patch syscache_memory_target is not exact at all in the first place, to avoid the overhead of counting the correct size. The major difference comes from the size of the cache tuple itself, but I came to think that is too much to omit. As a *PoC*, in the attached patch (which applies to current master), the sizes of the CTups are counted as the catcache size. It also provides a pg_catcache_size system view, just to give a rough idea of what such a view looks like. I'll consider this further, but do you have any opinion on it?

=# select relid::regclass, indid::regclass, size from pg_syscache_sizes order by size desc;
    relid     |              indid               |  size
--------------+----------------------------------+--------
 pg_class     | pg_class_oid_index               | 131072
 pg_class     | pg_class_relname_nsp_index       | 131072
 pg_cast      | pg_cast_source_target_index      |   5504
 pg_operator  | pg_operator_oprname_l_r_n_index  |   4096
 pg_statistic | pg_statistic_relid_att_inh_index |   2048
 pg_proc      | pg_proc_proname_args_nsp_index   |   2048
..

> Another option is that users only specify the total memory target size and
> postgres dynamically change each CatCache memory target size according to a
> certain metric.
> (, which still seems difficult and expensive to develop per benefit)
> What do you think about this?

Given that few caches bloat at once, its effect is not so different from the current design.

> As you commented here, guc variable syscache_memory_target and
> syscache_prune_min_age are used for both syscache and relcache (HTAB), right?

Right, just so as not to add knobs for unclear reasons. Since ...

> Do syscache and relcache have the similar amount of memory usage?

They may be different, but the difference would not matter much in the case of cache bloat.
> If not, I'm thinking that introducing separate guc variable would be fine. > So as syscache_prune_min_age. I implemented that so that it is easily replaceable in case, but I'm not sure separating them makes significant difference.. Thanks for the opinion, I'll put consideration on this more. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index bee4afbe4e..6a00141fc9 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1617,6 +1617,44 @@ include_dir 'conf.d' + + syscache_memory_target (integer) + + syscache_memory_target configuration parameter + +
Re: Collation versioning
Greetings, * Christoph Berg (m...@debian.org) wrote: > Re: Peter Eisentraut 2018-09-13 > <4f60612c-a7b5-092d-1532-21ff7a106...@2ndquadrant.com> > > Moreover, the fix for a collation version mismatch is, in the simplest > > case, to go around and REINDEX everything. Making the collation or > > collation version global doesn't fix that. It would actually make it > > harder because you couldn't run ALTER COLLATION REFRESH VERSION until > > after you have rebuilt all affected objects *in all databases*. > > Btw, I think a "reindexdb --all --collation" (and the SQL per-database > equivalent) that only rebuilds indexes that are affected by collations > would be immensely useful to have. As I was discussing w/ Peter G during PostgresOpen, we'd have to wait until that reindexdb is complete before actually using anything in the system and that's pretty painful. While it sounds like it'd be a good bit of work, it seems like we really need to have a way to support multiple collation versions concurrently and to do that we'll need to have the library underneath actually providing that to us. Once we have that, we can build new indexes concurrently and swap to them (or, ideally, just issue REINDEX CONCURRENTLY once we support that..). Until then, it seems like we really need to have a way to realize that a given upgrade is going to require a big reindex, before actually doing the reindex and suddenly discovering that we can't use a bunch of indexes because they're out of date and extending the downtime for the upgrade to be however long it takes to rebuild those indexes... Thanks! Stephen signature.asc Description: PGP signature
Re: executor relation handling
Hi Amit,

On 9/13/18 12:58 AM, Amit Langote wrote:
> Attached updated patches. Besides the issue that caused the eval-plan-qual
> isolation test to crash, I also spotted and fixed an oversight in the 0002
> patch which would lead to EState.es_output_cid being set to a wrong value,
> causing an unexpected error during tuple locking as a result.

Thanks for the update. However, the subscription TAP test (src/test/subscription/t/001_rep_changes.pl) is still failing. CFBot shows the same log (https://travis-ci.org/postgresql-cfbot/postgresql/builds/42769) as I see locally.

Best regards, Jesper
Re: Problem while setting the fpw with SIGHUP
On Mon, Sep 10, 2018 at 11:54 AM Amit Kapila wrote:
> Thanks, but what I wanted you to verify is the commit that broke it in
> 9.5. On again looking at it, I think it is below code in commit
> 2076db2aea that caused this problem. If possible, can you once test
> it before and at this commit or at least do the manual review of same
> to cross-verify?

I have myself investigated this further and found that this problem started occurring due to code rearrangement in commits 2c03216d83 and 2076db2aea. In commit 2076db2aea, the below check comparing the old and new values of fullPageWrites was changed:

> - if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
> + if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
> {

However, it alone didn't lead to this problem, because XLogRecordAssemble() gives a valid value of fpw_lsn irrespective of whether full-page writes were enabled or not. Later, in commit 2c03216d83, we changed XLogRecordAssemble() such that it gives a valid value of fpw_lsn only if doPageWrites is enabled, basically this part of the commit:

+ /* Determine if this block needs to be backed up */
+ if (regbuf->flags & REGBUF_FORCE_IMAGE)
+ needs_backup = true;
+ else if (regbuf->flags & REGBUF_NO_IMAGE)
+ needs_backup = false;
+ else if (!doPageWrites)
+ needs_backup = false;
+ else
  {
- /* Simple data, just include it */
- len += rdt->len;
+ /*
+  * We assume page LSN is first data on *every* page that can be
+  * passed to XLogInsert, whether it has the standard page layout
+  * or not.
+  */
+ XLogRecPtr page_lsn = PageGetLSN(regbuf->page);
+
+ needs_backup = (page_lsn <= RedoRecPtr);
+ if (!needs_backup)
+ {
+ if (*fpw_lsn == InvalidXLogRecPtr || page_lsn < *fpw_lsn)
+ *fpw_lsn = page_lsn;
+ }
  }

So, the problem started appearing after some rearrangement of code in both the above-mentioned commits. I verified that this problem doesn't exist in versions <= 9.4, so backpatch-through 9.5.
-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Restricting maximum keep segments by repslots
Hello. Thank you for the comments, Sawada-san, Peter.

At Mon, 10 Sep 2018 19:52:24 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20180910.195224.22629595.horiguchi.kyot...@lab.ntt.co.jp>
> At Thu, 6 Sep 2018 22:32:21 +0200, Peter Eisentraut wrote in
> <29bbd79d-696b-509e-578a-0fc38a3b9...@2ndquadrant.com>
> Thanks for pointing that. That's a major cause of confusion. Does
> the following make sense?
>
> > Specify the maximum size of WAL files that
> linkend="streaming-replication-slots">replication slots
> > are allowed to retain in the pg_wal
> > directory at checkpoint time. If
> > max_slot_wal_keep_size is zero (the
> > default), replication slots retain unlimited size of WAL files.
> + If restart_lsn of a replication slot gets behind more than that
> + bytes from the current LSN, the standby using the slot may not
> + be able to reconnect due to removal of required WAL records.
...
> > Also, I don't think 0 is a good value for the default behavior. 0 would
> > mean that a slot is not allowed to retain any more WAL than already
> > exists anyway. Maybe we don't want to support that directly, but it's a
> > valid configuration. So maybe use -1 for infinity.
>
> In relation to the reply just sent to Sawada-san, the remainder for a
> slot can be at most 16MB in the 0 case with the default segment
> size. So you're right in this sense. Will fix in the coming
> version.

Thanks. I did the following things in the new version.

- Changed the disable (or infinite) and default value of max_slot_wal_keep_size to -1 from 0. (patch 1, 2. guc.c, xlog.c: GetOldestKeepSegment())
- Fixed the documentation for max_slot_wal_keep_size to mention what happens when WAL exceeds the size, plus additional rewrites. (patch 4, catalogs.sgml, config.sgml)
- Folded the parameter list of GetOldestKeepSegment(). (patch 1, 2. xlog.c)
- Provided the plural form of the errdetail of the checkpoint-time warning. (patch 1, xlog.c: KeepLogSeg())
- Some cosmetic changes and small refactoring. (patch 1, 2, 3)

regards.
-- Kyotaro Horiguchi NTT Open Source Software Center >From ee8ddfa69b6fb6832307d15374ea5f2446bda85f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/4] Add WAL relief vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. --- src/backend/access/transam/xlog.c | 108 -- src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 4 files changed, 97 insertions(+), 25 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 85a7b285ec..deda43607d 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = -1; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -867,6 +868,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9505,6 +9507,53 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number the next checkpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. 
+ * + * currLSN is the current insert location + * minSlotLSN is the minimum restart_lsn of all active slots + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + uint64 keepSegs = 0; + XLogSegNo currSeg; + XLogSegNo minSlotSeg; + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); + + /* + * Calculate keep segments by slots first. The second term of the + * condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; + + /* Cap keepSegs by max_slot_wal_keep_size */ + if (max_slot_wal_keep_size_mb >= 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Apply max_slot_wal_keep_size to keepSegs */ + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg
Re: Indicate anti-wraparound autovacuum in log_autovacuum_min_duration
Hello > The brackets look rather useless to me, wouldn't it be better to remove > them? By doing so the longest message becomes: > "automatic aggressive vacuum to prevent wraparound of table blah.blah" Hmm, > 2018-09-13 11:48:09.303 MSK 6994 @ from [vxid:6/713 txid:0] [] LOG: > automatic aggressive vacuum (to prevent wraparound) of table > "template0.pg_toast.pg_toast_12252": index scans: 0 or > 2018-09-13 11:54:55.095 MSK 10115 @ from [vxid:3/100278 txid:0] [] LOG: > automatic aggressive vacuum to prevent wraparound of table > "template0.pg_toast.pg_toast_12252": index scans: 0 Looks better for me. Updated patch attached. regards, Sergeidiff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c index 5649a70..8996d36 100644 --- a/src/backend/commands/vacuumlazy.c +++ b/src/backend/commands/vacuumlazy.c @@ -374,10 +374,20 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params, * emitting individual parts of the message when not applicable. */ initStringInfo(&buf); - if (aggressive) -msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n"); + if (params->is_wraparound) + { +if (aggressive) + msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n"); +else + msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n"); + } else -msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n"); + { +if (aggressive) + msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n"); +else + msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n"); + } appendStringInfo(&buf, msgfmt, get_database_name(MyDatabaseId), get_namespace_name(RelationGetNamespace(onerel)),
Re: Collation versioning
Re: Peter Eisentraut 2018-09-13 <4f60612c-a7b5-092d-1532-21ff7a106...@2ndquadrant.com> > Moreover, the fix for a collation version mismatch is, in the simplest > case, to go around and REINDEX everything. Making the collation or > collation version global doesn't fix that. It would actually make it > harder because you couldn't run ALTER COLLATION REFRESH VERSION until > after you have rebuilt all affected objects *in all databases*. Btw, I think a "reindexdb --all --collation" (and the SQL per-database equivalent) that only rebuilds indexes that are affected by collations would be immensely useful to have. Christoph
Re: Unused argument from execute_sql_string()
On Thu, Sep 13, 2018 at 03:47:26PM +0900, Tatsuo Ishii wrote: > Anyway, considering it's a static function, chance of breaking > backward compatibility is minimum, I think. > > So +1 to remove the unused argument. Same opinion and arguments here, so I have committed the patch. -- Michael
Re: pg_dump test instability
On 12/09/2018 18:06, Tom Lane wrote: > No. In both code paths, the array slot at index first_te is being > physically dropped from the set of valid entries (by incrementing > first_te). In the first path, that slot holds the item we want to > remove logically from the set, so that incrementing first_te is > all we have to do: the remaining entries are still in the range > first_te..last_te, and they're still sorted. In the second code > path, the item that was in that slot is still wanted as part of > the set, so we copy it into the valid range (overwriting the item > in slot i, which is no longer wanted). Now the valid range is > probably not sorted, so we have to flag that a re-sort is needed. I see. Why not shift all items up to the i'th up by one place, instead of moving only the first one? That way the sortedness would be preserved. Otherwise we'd move the first one into the middle, then sorting would move it to the front again, etc. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Patch] Create a new session in postmaster by calling setsid()
On Thu, Sep 13, 2018 at 8:20 AM, Michael Paquier wrote:
> On Wed, Sep 12, 2018 at 03:55:00PM -0400, Tom Lane wrote:
> > Although pg_ctl could sneak it in between forking and execing, it seems
> > like it'd be cleaner to have the postmaster proper do it in response to
> > a switch that pg_ctl passes it. That avoids depending on the fork/exec
> > separation, and makes the functionality available for other postmaster
> > startup mechanisms, and opens the possibility of delaying the detach
> > to the end of startup.
>
> I would be fine with something happening only once the postmaster thinks
> that consistency has been reached and is open for business, though I'd
> still think that this should be controlled by a switch, where the
> default does what we do now. Feel free to ignore me if I am outvoted ;)
> --
> Michael

From the perspective of users rather than developers, I think that for such an important service the tty should be dissociated, i.e. the postmaster should be daemonized by default (and perhaps even have a default log filename?), or at least be setsid()-ed. As for the developers' concerns, maybe a shell alias, an internal switch in pg_ctl, or an environment-variable guard in the postmaster code could address them. I'm not sure whether the LINUX_OOM_ADJ issue from several years ago (to resolve which silent_mode was removed) still exists. I don't have context about that, so I conceded to using setsid(), even though I prefer the deleted silent_mode code.
Re: Collation versioning
On 12/09/2018 13:25, Christoph Berg wrote: > Re: Peter Eisentraut 2018-09-12 > <0447ec7b-cdb6-7252-7943-88a4664e7...@2ndquadrant.com> >>> Naive idea: make that catalog shared? Collations are system-wide after >>> all. >> >> By the same argument, extensions should be shared, but they are not. > > But extensions put a lot of visible stuff into a database, whereas a > collation is just a line in some table that doesn't get into the way. How about C functions? They are just a system catalog representation of something that exists on the OS. Anyway, we also want to support application-specific collation definitions, so that users can CREATE COLLATION "my_specific_requirements" and use that that in their application, so global collations wouldn't be appropriate for that. Moreover, the fix for a collation version mismatch is, in the simplest case, to go around and REINDEX everything. Making the collation or collation version global doesn't fix that. It would actually make it harder because you couldn't run ALTER COLLATION REFRESH VERSION until after you have rebuilt all affected objects *in all databases*. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services