Re: ALTER TABLE ADD COLUMN fast default
On Fri, Mar 30, 2018 at 10:08 AM, Andrew Dunstan wrote: > On Fri, Mar 30, 2018 at 5:00 AM, Andres Freund wrote: [ missing values are loaded in the TupleDesc by RelationBuildTupleDesc ] >>> > I still strongly object to doing this unconditionally for queries >>> > that never need any of this. We can end up with a significant number >>> > of large tuples in memory here, and then copy that through dozens of >>> > tupledesc copies in queries. >> >>> We're just doing the same thing we do for default values. >> >> That's really not a defense. It's hurting us there too. >> > > > I will take a look and see if it can be avoided easily. > > For missing values there are three places where we would need to load them on demand: getmissingattr(), slot_getmissingattrs() and expand_tuple(). However, none of those functions have obvious access to the information required to load the missing values. We could probably do something very ugly like getting the relid from the TupleDesc's first attribute. Maybe someone else has a better option. But I can't see a simple and clean way to do this. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
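As a rough illustration of the "very ugly" option mentioned above, here is a minimal sketch (not from any posted patch; the helper name is hypothetical) of fishing a relation OID out of a TupleDesc via its first attribute, which is what loading missing values on demand would seem to require:

#include "postgres.h"
#include "access/tupdesc.h"

/*
 * Hypothetical helper: guess the owning relation of a TupleDesc from its
 * first attribute.  Only descriptors built from the catalogs carry a
 * meaningful attrelid, which is exactly why this is "very ugly".
 */
static Oid
tupdesc_guess_relid(TupleDesc tupdesc)
{
    if (tupdesc->natts == 0)
        return InvalidOid;

    return TupleDescAttr(tupdesc, 0)->attrelid;
}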
Re: Rewriting the test of pg_upgrade as a TAP test - take two
On Mon, Apr 02, 2018 at 10:35:09AM -0400, Peter Eisentraut wrote: > I took a quick look at that part. It appears to be quite invasive, more > than I would have hoped. Basically, it imposes that from now on all > program invocations must observe the bindir setting, which someone is > surely going to forget. The pg_upgrade tests aren't going to exercise > all possible paths where programs are called, so this is going to lead > to omissions and inconsistencies -- which will then possibly only be > found much later by the extensions that Craig was talking about. I'd > like to see this more isolated, maybe via a path change, or something > inside system_or_bail or something less invasive and more maintainable. Hm. PostgresNode.pm now uses a couple of different ways to trigger commands, which makes the problem harder than it seems: - system_or_bail - system_log - IPC::Run::timeout - IPC::Run::run Because of that, if one wants to minimize the number of places where bindir is referenced, the remaining option would be to modify the command strings themselves. What I could think of would be to have a wrapper like build_bin_path which adds $bindir on top of the binary, but that would still require all the callers to use this wrapper. The use of this wrapper could still be forgotten. Enforcing PATH temporarily in one code path or another implies that the previous value of PATH needs to be added back, which could be forgotten as well. At the end, it seems to me that what I am proposing is the most readable approach, and with proper documentation things should be handled fine... Or is there an approach you have in mind that I do not foresee? -- Michael
Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.
Hello. At Mon, 2 Apr 2018 16:11:12 -0300, Alvaro Herrera wrote in <20180402191112.wneiyj4v5upnfjst@alvherre.pgsql> > Why do we need AccessExclusiveLock on all children of a relation that we > want to scan to search for rows not satisfying the constraint? I think > it should be enough to get ShareLock, which prevents INSERT, no? I have > a feeling I'm missing something here, but I don't know what, and all > tests pass with that change. Mmm. I'm not sure if there's a lock-upgrade case but the following sentense is left at the last of one of the modified comments. ATREwriteTables is called with AEL after that (that has finally no effect in this case). | But we cannot risk a deadlock by taking | * a weaker lock now and the stronger one only when needed. I don't actually find places where the children's lock can be raised but this seems just following the lock parent first principle. By the way check_default_allows_bound() (CREATE TABLE case) contains a similar usage of find_all_inheritors(default_rel, AEL). > Also: the change proposed to remove the get_default_oid_from_partdesc() > call actually fixes the bug, but to me it was not at all obvious why. CommandCounterIncrement updates the content of a relcache entry via invalidation. It can be surprising for callers of a function like StorePartitionBound. CommandCounterIncrement RelationCacheInvalidateEntry RelationFlushRelation RelationClearRelation > To figure out why, I first had to realize that > ValidatePartitionConstraints was lying to me, both in name and in > comments: it doesn't actually validate anything, it merely queues > entries so that alter table's phase 3 would do the validation. I found > this extremely confusing, so I fixed the comments to match reality, and > later decided to rename the function also. It is reasonable. Removing exccessive extension of lower-level partitions is also reasonable. The previous code extended inheritances in different manner for levels at odd and even depth. > At that point I was able to understand what the problem was: when > attaching the default partition, we were setting the constraint to be > validated for that partition to the correctly computed partition > constraint; and later in the same function we would set the constraint > to "the immature constraint" because we were now seeing that the default > partition OID was not invalid. So it was rather a bug in > ValidatePartitionConstraints, in that it was accepting to set the > expression to validate when another expression had already been set! I > added an assert to protect against this. And then changed the decision > of whether or not to run this block based on the attached partition > being the default one or not; because as the "if" test was, it was just > a recipe for confusion. (Now, if you look carefully you realize that > the new test for bound->is_default I added is kinda redundant: it can > only be set if there was a default partition OID at start of the > function, and therefore we know we're not adding a default partition. > And for the case where we *are* adding the default partition, it is > still Invalid because we don't re-read its own OID. But relying on that > seems brittle because it breaks if we happen to update the default > partition OID in the middle of that function, which is what we were > doing. Hence the new is_default test.) Seems reasonable. 
But even if we assume so, rereading defaultPartOid is still breaking the assumption that defaultPartOid is that at the time of entering to this function and the added condition just conceals the fact. So I think it should be an assertion. | if (OidIsValid(defaultPartOid)) | { | /* | * The command cannot be adding default partition if the | * defaultPartOid is valid. | */ | Assert(!cmd->bound->is_default); > I looked at the implementation of ValidatePartitionConstraints and > didn't like it, so I rewrote it also. > > This comment is mistaken, too: > - /* > -* Skip if the partition is itself a partitioned table. We can only > -* ever scan RELKIND_RELATION relations. > -*/ > ... because it ignores the possibility of a partition being a foreign > table. The code seems to work, but I think there is no test to cover > the case that a foreign table containing data that doesn't satisfy the > constraint is attached, because when I set that case to return doing > nothing (ie. ATGetQueueEntry is not called for a foreign partition), no > test failed. Foreign tables are intentionally not verified on attaching (in my understanding). ATRewriteTables ignores foreign tables so it contradicts to what ATRewriteTables thinks. As my understanding foreign tables are assumed to be unrestrictable by local constraint after attaching so the users are responsible to the consistency. > Generally speaking, I didn't like ATExecAttachPartition; it's doing too > much that should have b
Re: Optimizing nested ConvertRowtypeExpr execution
On Mon, Apr 2, 2018 at 1:40 AM, Andres Freund wrote: > Hi, > > On 2018-02-26 17:20:05 +0530, Ashutosh Bapat wrote: >> In a multi-level partitioned table, a parent whole-row reference gets >> translated into nested ConvertRowtypeExpr with child whole-row >> reference as the leaf. During the execution, the child whole-row >> reference gets translated into all intermediate parents' whole-row >> references, ultimately represented as parent's whole-row reference. >> AFAIU, the intermediate translations are unnecessary. The leaf child >> whole-row can be directly translated into top parent's whole-row >> reference. Here's a WIP patch which does that by eliminating >> intermediate ConvertRowtypeExprs during ExecInitExprRec(). > > Why is this done appropriately at ExecInitExpr() time, rather than at > plan time? Seems like eval_const_expressions() would be a bit more > appropriate (being badly named aside...)? That seems to be a better idea. Here's a patch. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company

From b99ae826931516a7c6ad981435e8ac9ef754f15e Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat
Date: Tue, 3 Apr 2018 09:00:19 +0530
Subject: [PATCH] Optimize nested ConvertRowtypeExprs

A ConvertRowtypeExpr is used to translate a whole-row reference of a
child to that of a parent. The planner produces nested
ConvertRowtypeExprs while translating the whole-row reference of a leaf
partition in a multi-level partition hierarchy. The executor then
translates the whole-row reference from the leaf partition into all the
intermediate parents' whole-row references before arriving at the final
whole-row reference. It could instead translate the whole-row reference
from the leaf partition directly to the top-most parent's whole-row
reference, skipping any intermediate translations.

Ashutosh Bapat, reviewed by Andres Freund.
---
 src/backend/optimizer/util/clauses.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index ed6b680..3af5033 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -3657,6 +3657,31 @@ eval_const_expressions_mutator(Node *node,
                                         context);
             }
             break;
+        case T_ConvertRowtypeExpr:
+            {
+                ConvertRowtypeExpr *cre = castNode(ConvertRowtypeExpr, node);
+                Expr       *arg = cre->arg;
+                ConvertRowtypeExpr *newcre;
+
+                /*
+                 * In case of a nested ConvertRowtypeExpr, we can convert the
+                 * leaf row directly to the topmost row format without any
+                 * intermediate conversions.
+                 */
+                Assert(arg != NULL);
+                while (IsA(arg, ConvertRowtypeExpr))
+                    arg = castNode(ConvertRowtypeExpr, arg)->arg;
+
+                newcre = makeNode(ConvertRowtypeExpr);
+                newcre->arg = arg;
+                newcre->resulttype = cre->resulttype;
+                newcre->convertformat = cre->convertformat;
+                newcre->location = cre->location;
+
+                if (IsA(arg, Const))
+                    return ece_evaluate_expr((Node *) newcre);
+                return (Node *) newcre;
+            }
         default:
             break;
     }
-- 
1.7.9.5
Re: BUG #14941: Vacuum crashes
On Fri, Mar 30, 2018 at 05:19:46PM -0700, Andres Freund wrote: > On 2018-03-30 18:39:01 +, Bossart, Nathan wrote: >> I think the most straightforward approach for fixing this is to add >> skip-locked functionality in find_all_inheritors(), >> find_inheritance_children(), and vac_open_indexes(), but I am curious >> what others think. For vac_open_indexes() it may make the most sense to introduce a new try_open_index which wraps on top of try_relation_open with checks on the opened relation's relkind. > I'm actually wondering if we shouldn't just ignore this problem. While > the other cases VACUUM (SKIP LOCKED) are intended to solve seem common, > these seem less so. But it'd be a bit weird, too.. Wouldn't simply ignoring it cause deadlocks if one session uses VACUUM (SKIP_LOCKED) while another session is running vacuum or autovacuum? That looks more dangerous to me than simply skipping a lock that cannot be taken. -- Michael
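A minimal sketch of the try_open_index() wrapper floated above -- the name and shape are hypothetical, not from any posted patch; it simply wraps try_relation_open() and gives up silently if the relation has vanished or turns out not to be an index:

#include "postgres.h"
#include "access/heapam.h"
#include "catalog/pg_class.h"
#include "utils/rel.h"

/*
 * Hypothetical wrapper: open an index, returning NULL instead of erroring
 * out if the relation no longer exists or is not an index.
 */
static Relation
try_open_index(Oid indexoid, LOCKMODE lockmode)
{
    Relation    indrel = try_relation_open(indexoid, lockmode);

    if (indrel == NULL)
        return NULL;

    if (indrel->rd_rel->relkind != RELKIND_INDEX)
    {
        relation_close(indrel, lockmode);
        return NULL;
    }

    return indrel;
}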
Re: [HACKERS] Add support for tuple routing to foreign partitions
Fujita-san, On 2018/04/02 21:29, Etsuro Fujita wrote: > (2018/04/02 18:49), Amit Langote wrote: >> I looked at the new patch. It looks good overall, although I have one >> question -- IIUC, BeginForeignInsert() performs actions that are >> equivalent of performing PlanForeignModify() + BeginForeignModify() for an >> INSERT as if it was directly executed on a given foreign table/partition. >> Did you face any problems in doing the latter itself in the core code? >> Doing it that way would mean no changes to a FDW itself will be required >> (and hence no need for additional APIs), but I may be missing something. > > The biggest issue in performing PlanForeignModify() plus > BeginForeignModify() instead of the new FDW API would be: can the core > code create a valid-looking planner state passed to PlanForeignModify() > such as the ModifyTable plan node or the query tree stored in PlannerInfo? Hmm, I can see the point. Passing mostly-dummy (garbage) PlannerInfo and query tree from the core code to FDW seems bad. By defining the new API with a clean interface (receiving fully valid ModifyTableState), we're not required to do that across the interface, but only inside the FDW's implementation. I was just slightly concerned that the new FDW function would have to implement what would normally be carried out across multiple special purpose callbacks, but maybe that's OK as long as it's clearly documented what its job is. Noticed a couple of things in the patch: + + When this is called by a COPY FROM command, the + plan-related global data in mtstate is not provided + and the planSlot parameter of + ExecForeignInsert called for each inserted tuple is How about s/called/subsequently called/g? + NULL, wether the foreign table is the partition chosen Typo: s/wether/whether/g Thanks, Amit
Re: Add default role 'pg_access_server_files'
On Mon, Apr 02, 2018 at 05:09:21PM -0400, Stephen Frost wrote: > * Michael Paquier (mich...@paquier.xyz) wrote: >> No refactoring for pg_file_unlink and its v1.1? > > I considered each function and thought about if it'd make sense to > refactor them or if they were simple enough that the additional function > wouldn't really be all that useful. I'm open to revisiting that, but > just want to make it clear that it was something I thought about and > considered. Since pg_file_unlink is basically two function calls, I > didn't think it worthwhile to refactor those into their own function. I don't mind if this is done your way. >> The argument checks are exactly the same for pg_file_rename and >> pg_file_rename_v1_1. What about just passing fcinfo around to simplify >> the patch? > > In general, I prefer to keep the PG_FUNCTION_ARGS abstraction when we > can. Unfortunately, that does fall apart when wrapping an SRF as in > pg_logdir_ls(), but with pg_file_rename we can maintain it and it's > really not that much code to do so. As with the refactoring of > pg_file_unlink, this is something which could really go either way. Okay... > I'm not sure how useful it is to REVOKE the rights on the simple SQL > function; that would just mean that an admin has to go GRANT the rights > on that function as well as the three-argument version. Indeed, I had a brain fade here. > The more I think about it, the more I like the approach of just dropping > these deprecated "alternative names for things in core" with the new > adminpack version. In terms of applications, as I understand it, they > aren't used in the latest version of pgAdmin3 and they also aren't used > with pgAdmin4, so I don't think we need to be worrying about supporting > them in v11. +1 to simplify the code a bit. -- Michael
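For reference, a skeleton of the "pass fcinfo around" structure being discussed -- the entry points mirror adminpack's names, but the worker body here is a stub for illustration, not the actual implementation:

#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(pg_file_rename);
PG_FUNCTION_INFO_V1(pg_file_rename_v1_1);

/* shared worker: reads its arguments straight from fcinfo */
static Datum
pg_file_rename_internal(FunctionCallInfo fcinfo)
{
    /* argument checks and the actual rename would live here */
    PG_RETURN_BOOL(true);
}

Datum
pg_file_rename(PG_FUNCTION_ARGS)
{
    /* adminpack 1.0 entry point (superuser check would go here) */
    return pg_file_rename_internal(fcinfo);
}

Datum
pg_file_rename_v1_1(PG_FUNCTION_ARGS)
{
    /* adminpack 1.1 entry point, access controlled via GRANT/REVOKE */
    return pg_file_rename_internal(fcinfo);
}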
Re: [HACKERS] [PATCH] Lockable views
On Mon, 2 Apr 2018 18:32:53 +0900 Yugo Nagata wrote: > On Thu, 29 Mar 2018 17:26:36 -0700 > Andres Freund wrote: > > Thank you for your comments. I attach a patch to fix issues > you've pointed out. I found a typo in the documentation and attach the updated patch. Regards, > > > Hi, > > > > On 2018-03-28 20:26:48 +0900, Yugo Nagata wrote: > > > diff --git a/doc/src/sgml/ref/lock.sgml b/doc/src/sgml/ref/lock.sgml > > > index b2c7203..96d477c 100644 > > > --- a/doc/src/sgml/ref/lock.sgml > > > +++ b/doc/src/sgml/ref/lock.sgml > > > @@ -46,6 +46,11 @@ LOCK [ TABLE ] [ ONLY ] > > class="parameter">name [ * ] > > > > > > > > > > > > + When a view is specified to be locked, all relations appearing in the > > > view > > > + definition query are also locked recursively with the same lock mode. > > > + > > > > Trailing space added. I'd remove "specified to be" from the sentence. > > Fixed. > > > > > I think mentioning how this interacts with permissions would be a good > > idea too. Given how operations use the view's owner to recurse, that's > > not obvious. Should also explain what permissions are required to do the > > operation on the view. > > Added a description about permissions into the documentation. > > > > > > > > @@ -86,15 +92,17 @@ RangeVarCallbackForLockTable(const RangeVar *rv, Oid > > > relid, Oid oldrelid, > > > return; /* woops, concurrently > > > dropped; no permissions > > >* check */ > > > > > > - /* Currently, we only allow plain tables to be locked */ > > > - if (relkind != RELKIND_RELATION && relkind != RELKIND_PARTITIONED_TABLE) > > > + > > > > This newline looks spurious to me. > > Removed. > > > > > > > > /* > > > + * Apply LOCK TABLE recursively over a view > > > + * > > > + * All tables and views appearing in the view definition query are locked > > > + * recursively with the same lock mode. > > > + */ > > > + > > > +typedef struct > > > +{ > > > + Oid root_reloid; > > > + LOCKMODE lockmode; > > > + bool nowait; > > > + Oid viewowner; > > > + Oid viewoid; > > > +} LockViewRecurse_context; > > > > Probably wouldn't hurt to pgindent the larger changes in the patch. > > > > > > > +static bool > > > +LockViewRecurse_walker(Node *node, LockViewRecurse_context *context) > > > +{ > > > + if (node == NULL) > > > + return false; > > > + > > > + if (IsA(node, Query)) > > > + { > > > + Query *query = (Query *) node; > > > + ListCell*rtable; > > > + > > > + foreach(rtable, query->rtable) > > > + { > > > + RangeTblEntry *rte = lfirst(rtable); > > > + AclResultaclresult; > > > + > > > + Oid relid = rte->relid; > > > + char relkind = rte->relkind; > > > + char *relname = get_rel_name(relid); > > > + > > > + /* The OLD and NEW placeholder entries in the view's > > > rtable are skipped. */ > > > + if (relid == context->viewoid && > > > + (!strcmp(rte->eref->aliasname, "old") || > > > !strcmp(rte->eref->aliasname, "new"))) > > > + continue; > > > + > > > + /* Currently, we only allow plain tables or views to be > > > locked. */ > > > + if (relkind != RELKIND_RELATION && relkind != > > > RELKIND_PARTITIONED_TABLE && > > > + relkind != RELKIND_VIEW) > > > + continue; > > > + > > > + /* Check infinite recursion in the view definition. */ > > > + if (relid == context->root_reloid) > > > + ereport(ERROR, > > > + > > > (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), > > > + errmsg("infinite recursion > > > detected in rules for relation \"%s\"", > > > + > > > get_rel_name(context->root_reloid; > > > > Hm, how can that happen? 
And if it can happen, why can it only happen > > with the root relation? > > For example, the following queries cause the infinite recursion of views. > This is detected and the error is raised. > > create table t (i int); > create view v1 as select 1; > create view v2 as select * from v1; > create or replace view v1 as select * from v2; > begin; > lock v1; > abort; > > However, I found that the previous patch could not handle the following > situation in which the root relation itself doesn't have infinite recursion. > > create view v3 as select * from v1; > begin; > lock v3; > abort; > > This fixed in the attached patch. > > Regards, > > > > > Greetings, > > > > Andres Freund > > > -- > Yugo Nagata -- Yugo Nagata diff --git a/doc/src/sgml/ref/lock.sgml b/doc/src/sgml/ref/lock.sgml index 96d477c..a225cea
Re: [HACKERS] Add support for tuple routing to foreign partitions
On 2018/04/02 21:26, Etsuro Fujita wrote: > (2018/03/30 22:28), Alvaro Herrera wrote: >> Etsuro Fujita wrote: >> >>> Now we have ON CONFLICT for partitioned tables, which requires the >>> conversion map to be computed in ExecInitPartitionInfo, so I updated the >>> patch so that we keep the former step in ExecInitPartitionInfo and >>> ExecSetupPartitionTupleRouting and so that we just init the FDW in >>> ExecInitExtraPartitionInfo (changed to ExecInitForeignRouting). And I >>> added >>> comments to ExecInitForeignRouting and ExecPrepareTupleRouting. >> >> Hmm, I may be misreading what you mean here ... but as far as I >> understand, ON CONFLICT does not support foreign tables, does it? > > It doesn't, except for ON CONFLICT DO NOTHING without an inference > specification. > >> If >> that is so, I'm not sure why the conversion map would be necessary, if >> it is for ON CONFLICT alone. > > We wouldn't need that for foreign partitions (When DO NOTHING with an > inference specification or DO UPDATE on a partitioned table containing > foreign partitions, the planner would throw an error before we get to > ExecInitPartitionInfo). Actually, as you might know, since it is not possible to create an index on a partitioned table that has a foreign partition, there is no possibility of supporting any form of ON CONFLICT that requires an inference specification. > The reason why I updated the patch that way is > just to make the patch simple, but on reflection I don't think that's a > good idea; I think we should delay the map-creation step as well as the > FDW-initialization step for UPDATE subplan partitions as before for > improved efficiency for UPDATE-of-partition-key. However, I don't think > it'd still be a good idea to delay those steps for partitions created by > ExecInitPartitionInfo the same way as for the subplan partitions, because > in that case it'd be better to perform CheckValidResultRel against a > partition right after we do InitResultRelInfo for the partition in > ExecInitPartitionInfo, as that would save cycles in cases when the > partition is considered nonvalid by that check. It seems like a good idea to perform CheckValidResultRel right after the InitResultRelInfo in ExecInitPartitionInfo. However, if the partition is considered non-valid, that results in an error afaik, so I don't understand the point about saving cycles. > So, What I'm thinking is: > a) for the subplan partitions, we delay those steps as before, and b) for > the partitions created by ExecInitPartitionInfo, we do that check for a > partition right after InitResultRelInfo in that function, (and if valid, > we create a map and initialize the FDW for the partition in that function). Sounds good to me. I'm assuming that you're talking about delaying the is-valid-for-insert-routing check (using CheckValidResultRel) and parent-to-child map creation for a sub-plan result relation until ExecPrepareTupleRouting(). On a related note, I wonder if it'd be a good idea to rename the flag ri_PartitionIsValid to something that signifies that we only need it to be true if we want to do tuple routing (aka insert) using it. Maybe, ri_PartitionReadyForRouting or ri_PartitionReadyForInsert. I'm afraid that ri_PartitionIsValid is a bit ambiguous, although I'm not saying the name options I'm suggesting are the best. Thanks, Amit
Re: WIP: Covering + unique indexes.
On Mon, Apr 2, 2018 at 4:27 PM, Alexander Korotkov wrote: > I thought abut that another time and I decided that it would be safer > to use 13th bit in index tuple flags. There are already attempt to > use whole 6 bytes of tid for not heap pointer information [1]. Thus, it > would be safe to use 13th bit for indicating alternative offset meaning > in pivot tuples, because it wouldn't block further work. Revised patchset > in the attachment implements it. This is definitely not the only time someone has talked about this 13th bit -- it's quite coveted. It also came up with UPSERT, and with WARM. That's just the cases that I can personally remember. I'm glad that you found a way to make this work, that will keep things flexible for future patches, and make testing easier. I think that we can find a flexible representation that makes almost everyone happy. > I don't know. We still need an offset number to check expected number > of attributes. Passing index tuple as separate attribute would be > redundant and open door for extra possible errors. You're right. I must have been tired when I wrote that. :-) >> Do you store an attribute number in the "minus infinity" item (the >> leftmost one of internal pages)? I guess that that should be zero, >> because it's totally truncated. > > > Yes, I store zero number of attributes in "minus infinity" item. See this > part of the patch. > However, note that I've to store (number_of_attributes + 1) in the offset > in order to correctly store zero number of attributes. Otherwise, assertion > is faised in ItemPointerIsValid() macro. Makes sense. > Yes. But that depends on how difficulty to adopt patch to big picture > correlate with difficulty, which non-adopted patch makes to that big > picture. My point was that second difficulty isn't high. But we can be > satisfied with implementation in the attached patchset (probably some > small enhancements are still required), then the first difficulty isn't high > too. I think it's possible. I didn't have time to look at this properly today, but I will try to do so tomorrow. Thanks -- Peter Geoghegan
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Mon, Apr 2, 2018 at 7:54 PM, Robert Haas wrote: > Also, this really does make it impossible to write reliable programs. > Imagine that, while the server is running, somebody runs a program > which opens a file in the data directory, calls fsync() on it, and > closes it. If the fsync() fails, postgres is now borked and has no > way of being aware of the problem. If we knew, we could PANIC, but > we'll never find out, because the unrelated process ate the error. > This is exactly the sort of ill-considered behavior that makes fcntl() > locking nearly useless. I fear that the conventional wisdom from the Kernel people is now "you should be using O_DIRECT for granular control". The LWN article Thomas linked (https://lwn.net/Articles/718734) cites Ted Ts'o: "Monakhov asked why a counter was needed; Layton said it was to handle multiple overlapping writebacks. Effectively, the counter would record whether a writeback had failed since the file was opened or since the last fsync(). Ts'o said that should be fine; applications that want more information should use O_DIRECT. For most applications, knowledge that an error occurred somewhere in the file is all that is necessary; applications that require better granularity already use O_DIRECT." -- Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
Hi, On 2018-04-02 19:40:12 -0700, Peter Geoghegan wrote: > > So what happens if there's a concurrent insertion of a potentially > > matching tuple? > > It's not a special case. In all likelihood, you get a dup violation. > This is a behavior that I argued for from an early stage. Right. I think that should be mentioned in the comment... Greetings, Andres Freund
Re: [HACKERS] MERGE SQL Statement for PG11
On Mon, Apr 2, 2018 at 8:11 PM, Pavan Deolasee wrote: > Honestly I don't think Peter ever raised concerns about the join, though I > could be missing early discussions when I wasn't paying attention. It's > there from day 1. Peter raised concerns about the two RTE stuff which was > necessitated when we added support for partitioned table. We discussed that > at some length, with your inputs and agreed that it's not necessarily a bad > thing and probably the only way to deal with partitioned tables. > > Personally, I don't see why an internal join is bad. That's what MERGE is > doing anyways, so it closely matches with the overall procedure. The issue is not that there is a join as such. It's how it's represented in the parser, and how that affects other things. There is a lot of special case logic to make it work. -- Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
On Tue, Apr 3, 2018 at 8:31 AM, Robert Haas wrote: > On Mon, Apr 2, 2018 at 10:40 PM, Peter Geoghegan wrote: > > On Mon, Apr 2, 2018 at 7:18 PM, Andres Freund > wrote: > >> I did a scan through this, as I hadn't been able to keep with the thread > >> previously. Sorry if some of the things mentioned here have been > >> discussed previously. I am just reading through the patch in its own > >> order, so please excuse if there's things I remark on that only later > >> fully make sense. > >> > >> > >> later update: TL;DR: I don't think the parser / executor implementation > >> of MERGE is architecturally sound. I think creating hidden joins during > >> parse-analysis to implement MERGE is a seriously bad idea and it needs > >> to be replaced by a different executor structure. > > > > +1. I continue to have significant misgivings about this. It has many > > consequences that we know about, and likely quite a few more that we > > don't. > > +1. I didn't understand from Peter's earlier comments that we were > doing that, and I agree that it isn't a good design choice. > > Honestly I don't think Peter ever raised concerns about the join, though I could be missing early discussions when I wasn't paying attention. It's there from day 1. Peter raised concerns about the two RTE stuff which was necessitated when we added support for partitioned table. We discussed that at some length, with your inputs and agreed that it's not necessarily a bad thing and probably the only way to deal with partitioned tables. Personally, I don't see why an internal join is bad. That's what MERGE is doing anyways, so it closely matches with the overall procedure. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Mon, Apr 2, 2018 at 10:40 PM, Peter Geoghegan wrote: > On Mon, Apr 2, 2018 at 7:18 PM, Andres Freund wrote: >> I did a scan through this, as I hadn't been able to keep with the thread >> previously. Sorry if some of the things mentioned here have been >> discussed previously. I am just reading through the patch in its own >> order, so please excuse if there's things I remark on that only later >> fully make sense. >> >> >> later update: TL;DR: I don't think the parser / executor implementation >> of MERGE is architecturally sound. I think creating hidden joins during >> parse-analysis to implement MERGE is a seriously bad idea and it needs >> to be replaced by a different executor structure. > > +1. I continue to have significant misgivings about this. It has many > consequences that we know about, and likely quite a few more that we > don't. +1. I didn't understand from Peter's earlier comments that we were doing that, and I agree that it isn't a good design choice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos wrote: > Given precisely that the dirty pages which cannot been written-out are > practically thrown away, the semantics of fsync() (after the 4.13 fixes) > are essentially correct: the first call indicates that a writeback error > indeed occurred, while subsequent calls have no reason to indicate an error > (assuming no other errors occurred in the meantime). Like other people here, I think this is 100% unreasonable, starting with "the dirty pages which cannot been written out are practically thrown away". Who decided that was OK, and on the basis of what wording in what specification? I think it's always unreasonable to throw away the user's data. If the writes are going to fail, then let them keep on failing every time. *That* wouldn't cause any data loss, because we'd never be able to checkpoint, and eventually the user would have to kill the server uncleanly, and that would trigger recovery. Also, this really does make it impossible to write reliable programs. Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless. Even leaving that aside, a PANIC means a prolonged outage on a production system - it could easily take tens of minutes or longer to run recovery. So saying "oh, just do that" is not really an answer. Sure, we can do it, but it's like trying to lose weight by intentionally eating a tapeworm. Now, it's possible to shorten the checkpoint_timeout so that recovery runs faster, but then performance drops because data has to be fsync()'d more often instead of getting buffered in the OS cache for the maximum possible time. We could also dodge this issue in another way: suppose that when we write a page out, we don't consider it really written until fsync() succeeds. Then we wouldn't need to PANIC if an fsync() fails; we could just re-write the page. Unfortunately, this would also be terrible for performance, for pretty much the same reasons: letting the OS cache absorb lots of dirty blocks and do write-combining is *necessary* for good performance. > The error reporting is thus consistent with the intended semantics (which > are sadly not properly documented). Repeated calls to fsync() simply do not > imply that the kernel will retry to writeback the previously-failed pages, > so the application needs to be aware of that. Persisting the error at the > fsync() level would essentially mean moving application policy into the > kernel. I might accept this argument if I accepted that it was OK to decide that an fsync() failure means you can forget that the write() ever happened in the first place, but it's hard to imagine an application that wants that behavior. If the application didn't care about whether the bytes really got to disk or not, it would not have called fsync() in the first place. If it does care, reporting the error only once is never an improvement. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
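The "unrelated process ate the error" scenario above can be demonstrated with a tiny stand-alone program along these lines (purely illustrative, not PostgreSQL code): on affected kernels, if a writeback error is pending on the file, this process reports and thereby clears it, and the original writer never sees it.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int     fd;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* any pending writeback error is reported here -- and then cleared */
    if (fsync(fd) != 0)
        perror("fsync");

    close(fd);
    return 0;
}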
RE: Speedup of relation deletes during recovery
From: Jerry Sievers [mailto:gsiever...@comcast.net] > Wonder if this is the case for streaming standbys replaying truncates > also? Yes, as I wrote in my previous mail, TRUNCATE is worse than DROP TABLE. Regards Takayuki Tsunakawa
Re: Diagonal storage model
On Sun, Apr 01, 2018 at 03:48:07PM +0300, Konstantin Knizhnik wrote: > Attach please find patch with first prototype implementation. It > provides about 3.14 times improvement of performance at most of TPC-H > queries. Congratulations in finding a patch able to improve all workloads of Postgres in such a simple and magic way, especially on this particular date. I would have used M_PI if I were you. -- Michael
Re: [HACKERS] MERGE SQL Statement for PG11
On Mon, Apr 2, 2018 at 7:18 PM, Andres Freund wrote: > I did a scan through this, as I hadn't been able to keep with the thread > previously. Sorry if some of the things mentioned here have been > discussed previously. I am just reading through the patch in its own > order, so please excuse if there's things I remark on that only later > fully make sense. > > > later update: TL;DR: I don't think the parser / executor implementation > of MERGE is architecturally sound. I think creating hidden joins during > parse-analysis to implement MERGE is a seriously bad idea and it needs > to be replaced by a different executor structure. +1. I continue to have significant misgivings about this. It has many consequences that we know about, and likely quite a few more that we don't. > So what happens if there's a concurrent insertion of a potentially > matching tuple? It's not a special case. In all likelihood, you get a dup violation. This is a behavior that I argued for from an early stage. > It seems fairly bad architecturally to me that we now have > EvalPlanQual() loops in both this routine *and* > ExecUpdate()/ExecDelete(). I think that that makes sense, because ExecMerge() doesn't use any of the EPQ stuff from ExecUpdate() and ExecDelete(). It did at one point, in fact, and it looked quite a lot worse IMV. > *Gulp*. I'm extremely doubtful this is a sane approach. At the very > least the amount of hackery required to make existing code cope with > the approach / duplicate code seems indicative of that. I made a lot of the fact that there are two RTEs for the target, since that has fairly obvious consequences during every stage of query processing. However, I think you're right that the problem is broader than that. > + if (pstate->p_target_relation->rd_rel->relhasrules) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > +errmsg("MERGE is not supported for relations with rules"))); > > I guess we can add that later, but it's a bit sad that this won't work > against views with INSTEAD OF triggers. It also makes it hard to test deparsing support, which actually made it in in the end. That must be untested. -- Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
Hi, On 2018-03-30 12:10:04 +0100, Simon Riggs wrote: > On 29 March 2018 at 10:50, Simon Riggs wrote: > > On 28 March 2018 at 12:00, Pavan Deolasee wrote: > > > >> v27 attached, though review changes are in > >> the add-on 0005 patch. > > > > This all looks good now, thanks for making all of those changes. > > > > I propose [v27 patch1+patch3+patch5] as the initial commit candidate > > for MERGE, with other patches following later before end CF. > > > > I propose to commit this tomorrow, 30 March, about 26 hours from now. > > That will allow some time for buildfarm fixing/reversion before the > > Easter weekend, then other patches to follow starting 2 April. That > > then gives reasonable time to follow up on other issues that we will > > no doubt discover fairly soon after commit, such as additional runs by > > SQLsmith and more eyeballs. > > No problems found, but moving proposed commit to 2 April pm I did a scan through this, as I hadn't been able to keep with the thread previously. Sorry if some of the things mentioned here have been discussed previously. I am just reading through the patch in its own order, so please excuse if there's things I remark on that only later fully make sense. later update: TL;DR: I don't think the parser / executor implementation of MERGE is architecturally sound. I think creating hidden joins during parse-analysis to implement MERGE is a seriously bad idea and it needs to be replaced by a different executor structure. @@ -3828,8 +3846,14 @@ struct AfterTriggersTableData boolbefore_trig_done; /* did we already queue BS triggers? */ boolafter_trig_done;/* did we already queue AS triggers? */ AfterTriggerEventList after_trig_events;/* if so, saved list pointer */ - Tuplestorestate *old_tuplestore;/* "old" transition table, if any */ - Tuplestorestate *new_tuplestore;/* "new" transition table, if any */ + /* "old" transition table for UPDATE, if any */ + Tuplestorestate *old_upd_tuplestore; + /* "new" transition table for UPDATE, if any */ + Tuplestorestate *new_upd_tuplestore; + /* "old" transition table for DELETE, if any */ + Tuplestorestate *old_del_tuplestore; + /* "new" transition table INSERT, if any */ + Tuplestorestate *new_ins_tuplestore; }; A comment somewhere why we have all of these (presumably because they can now happen in the context of a single statement) would be good. @@ -5744,12 +5796,28 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo, newtup == NULL)); if (oldtup != NULL && - ((event == TRIGGER_EVENT_DELETE && delete_old_table) || -(event == TRIGGER_EVENT_UPDATE && update_old_table))) + (event == TRIGGER_EVENT_DELETE && delete_old_table)) { Tuplestorestate *old_tuplestore; - old_tuplestore = transition_capture->tcs_private->old_tuplestore; + old_tuplestore = transition_capture->tcs_private->old_del_tuplestore; + + if (map != NULL) + { + HeapTuple converted = do_convert_tuple(oldtup, map); + + tuplestore_puttuple(old_tuplestore, converted); + pfree(converted); + } + else + tuplestore_puttuple(old_tuplestore, oldtup); + } Very similar code is now repeated four times. Could you abstract this into a separate function? --- a/src/backend/executor/README +++ b/src/backend/executor/README @@ -37,6 +37,16 @@ the plan tree returns the computed tuples to be updated, plus a "junk" one. For DELETE, the plan tree need only deliver a CTID column, and the ModifyTable node visits each of those rows and marks the row deleted. +MERGE runs one generic plan that returns candidate target rows. 
Each row +consists of a super-row that contains all the columns needed by any of the +individual actions, plus a CTID and a TABLEOID junk columns. The CTID column is "plus a CTID and a TABLEOID junk columns" sounds a bit awkward? +required to know if a matching target row was found or not and the TABLEOID +column is needed to find the underlying target partition, in case when the +target table is a partition table. If the CTID column is set we attempt to +activate WHEN MATCHED actions, or if it is NULL then we will attempt to "if it is NULL then" sounds wrong. +activate WHEN NOT MATCHED actions. Once we know which action is activated we +form the final result row and apply only those changes. + XXX a great deal more documentation needs to be written here... Are we using tableoids in a similar way in other places? +/* + * Given OID of the partition leaf, return the index of the leaf in the + * partition hierarchy. + */ +int +ExecFindPartitionByOid(PartitionTupleRouting *proute, Oid partoid) +{ + int i; + + for (i = 0; i < proute->num_partitions; i++) + { + if (proute->partition_oids[i] == partoid) + break; + } + + Assert(i < proute->num_partitions); +
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Tue, Apr 3, 2018 at 3:03 AM, Craig Ringer wrote: > I see little benefit to not just PANICing unconditionally on EIO, really. It > shouldn't happen, and if it does, we want to be pretty conservative and > adopt a data-protective approach. > > I'm rather more worried by doing it on ENOSPC. Which looks like it might be > necessary from what I recall finding in my test case + kernel code reading. > I really don't want to respond to a possibly-transient ENOSPC by PANICing > the whole server unnecessarily. Yeah, it'd be nice to give an administrator the chance to free up some disk space after ENOSPC is reported, and stay up. Running out of space really shouldn't take down the database without warning! The question is whether the data remains in cache and marked dirty, so that retrying is a safe option (since it's potentially gone from our own buffers, so if the OS doesn't have it the only place your committed data can definitely still be found is the WAL... recovery time). Who can tell us? Do we need a per-filesystem answer? Delayed allocation is a somewhat filesystem-specific thing, so maybe. Interestingly, there don't seem to be many operating systems that can report ENOSPC from fsync(), based on a quick scan through some documentation:

POSIX, AIX, HP-UX, FreeBSD, OpenBSD, NetBSD: no
Illumos/Solaris, Linux, macOS: yes

I don't know if macOS really means it or not; it just tells you to see the errors for read(2) and write(2). By the way, speaking of macOS, I was curious to see if the common BSD heritage would show here. Yeah, somewhat. It doesn't appear to keep buffers on writeback error, if this is the right code[1] (though it could be handling it somewhere else for all I know). [1] https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695 -- Thomas Munro http://www.enterprisedb.com
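A minimal sketch of the policy split being debated here (treat EIO as fatal, but let ENOSPC be retried once the administrator frees space) -- this assumes, and that is exactly the open question above, that the kernel keeps the data dirty so a retry can succeed; the function is illustrative, not PostgreSQL code:

#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

/* returns true if the fsync() failure should be treated as fatal */
bool
fsync_should_panic(int fd)
{
    if (fsync(fd) == 0)
        return false;           /* all flushed, nothing to do */

    /* possibly-transient out-of-space condition: let the caller retry */
    if (errno == ENOSPC)
        return false;

    return true;                /* EIO and anything else: give up */
}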
Re: 2018-03 Commitfest Summary (Andres #1)
On Sat, Mar 31, 2018 at 10:08:34AM +0200, Fabien COELHO wrote: > However, it helps explain the disagreements about pgbench features: pgbench > internal developer-oriented use for postgres is somehow limited, and a lot > of new features are suggested with an external performance benchmarking use > in mind, about which core committers do not seem much interested. Agreed. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Mon, Apr 2, 2018 at 5:05 PM, Andres Freund wrote: >>Even accepting that (I personally go with surprising over violation, as >>if my vote counted), it is highly unlikely that we will convince every >>kernel team to declare "What fools we've been!" and push a change... >>and even if they did, PostgreSQL can look forward to many years of >>running on kernels with the broken semantics. Given that, I think the >>PANIC option is the soundest one, as unappetizing as it is. > > Don't we pretty much already have agreement in that? And Craig is the main > proponent of it? I wonder how bad it will be in practice if we PANIC. Craig said "This isn't as bad as it seems because AFAICS fsync only returns EIO in cases where we should be stopping the world anyway, and many FSes will do that for us". It would be nice to get more information on that. -- Peter Geoghegan
Re: check_ssl_key_file_permissions should be in be-secure-common.c
On Mon, Apr 02, 2018 at 11:39:57AM -0400, Peter Eisentraut wrote: > committed Thanks for the commit, Peter. -- Michael
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
> On Apr 2, 2018, at 17:05, Andres Freund wrote: > > Don't we pretty much already have agreement in that? And Craig is the main > proponent of it? For sure on the second sentence; the first was not clear to me. -- -- Christophe Pettus x...@thebuild.com
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On April 2, 2018 5:03:39 PM PDT, Christophe Pettus wrote: > >> On Apr 2, 2018, at 16:27, Craig Ringer wrote: >> >> They're undocumented and extremely surprising semantics that are >arguably a violation of the POSIX spec for fsync(), or at least a >surprising interpretation of it. > >Even accepting that (I personally go with surprising over violation, as >if my vote counted), it is highly unlikely that we will convince every >kernel team to declare "What fools we've been!" and push a change... >and even if they did, PostgreSQL can look forward to many years of >running on kernels with the broken semantics. Given that, I think the >PANIC option is the soundest one, as unappetizing as it is. Don't we pretty much already have agreement in that? And Craig is the main proponent of it? Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
> On Apr 2, 2018, at 16:27, Craig Ringer wrote: > > They're undocumented and extremely surprising semantics that are arguably a > violation of the POSIX spec for fsync(), or at least a surprising > interpretation of it. Even accepting that (I personally go with surprising over violation, as if my vote counted), it is highly unlikely that we will convince every kernel team to declare "What fools we've been!" and push a change... and even if they did, PostgreSQL can look forward to many years of running on kernels with the broken semantics. Given that, I think the PANIC option is the soundest one, as unappetizing as it is. -- -- Christophe Pettus x...@thebuild.com
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 3 April 2018 at 07:05, Anthony Iliopoulos wrote: > Hi Stephen, > > On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote: > > > > fsync() doesn't reflect the status of given pages, however, it reflects > > the status of the file descriptor, and as such the file, on which it's > > called. This notion that fsync() is actually only responsible for the > > changes which were made to a file since the last fsync() call is pure > > foolishness. If we were able to pass a list of pages or data ranges to > > fsync() for it to verify they're on disk then perhaps things would be > > different, but we can't, all we can do is ask to "please flush all the > > dirty pages associated with this file descriptor, which represents this > > file we opened, to disk, and let us know if you were successful." > > > > Give us a way to ask "are these specific pages written out to persistant > > storage?" and we would certainly be happy to use it, and to repeatedly > > try to flush out pages which weren't synced to disk due to some > > transient error, and to track those cases and make sure that we don't > > incorrectly assume that they've been transferred to persistent storage. > > Indeed fsync() is simply a rather blunt instrument and a narrow legacy > interface but further changing its established semantics (no matter how > unreasonable they may be) is probably not the way to go. > They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it. So I don't buy this argument. It really turns out that this is not how the fsync() semantics work > though, exactly because the nature of the errors: even if the kernel > retained the dirty bits on the failed pages, retrying persisting them > on the same disk location would simply fail. *might* simply fail. It depends on why the error ocurred. I originally identified this behaviour on a multipath system. Multipath defaults to "throw the writes away, nobody really cares anyway" on error. It seems to figure a higher level will retry, or the application will receive the error and retry. (See no_path_retry in multipath config. AFAICS the default is insanely dangerous and only suitable for specialist apps that understand the quirks; you should use no_path_retry=queue). Instead the kernel opts > for marking those pages clean (since there is no other recovery > strategy), and reporting once to the caller who can potentially deal > with it in some manner. It is sadly a bad and undocumented convention. > It could mark the FD. It's not just undocumented, it's a slightly creative interpretation of the POSIX spec for fsync. > > Consider two independent programs where the first one writes to a file > > and then calls the second one whose job it is to go out and fsync(), > > perhaps async from the first, those files. Is the second program > > supposed to go write to each page that the first one wrote to, in order > > to ensure that all the dirty bits are set so that the fsync() will > > actually return if all the dirty pages are written? > > I think what you have in mind are the semantics of sync() rather > than fsync(), but as long as an application needs to ensure data > are persisted to storage, it needs to retain those data in its heap > until fsync() is successful instead of discarding them and relying > on the kernel after write(). 
This is almost exactly what we tell application authors using PostgreSQL: the data isn't written until you receive a successful commit confirmation, so you'd better not forget it. We provide applications with *clear boundaries* so they can know *exactly* what was, and was not, written. I guess the argument from the kernel is the same is true: whatever was written since the last *successful* fsync is potentially lost and must be redone. But the fsync behaviour is utterly undocumented and dubiously standard. > I think we are conflating a few issues here: having the OS kernel being > responsible for error recovery (so that subsequent fsync() would fix > the problems) is one. This clearly is a design which most kernels have > not really adopted for reasons outlined above [citation needed] What do other major platforms do here? The post above suggests it's a bit of a mix of behaviours. Now, there is the issue of granularity of > error reporting: userspace could benefit from a fine-grained indication > of failed pages (or file ranges). Yep. I looked at AIO in the hopes that, if we used AIO, we'd be able to map a sync failure back to an individual AIO write. But it seems AIO just adds more problems and fixes none. Flush behaviour with AIO from what I can tell is inconsistent version to version and generally unhelpful. The kernel should really report such sync failures back to the app on its AIO write mapping, but it seems nothing of the sort happens. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Developmen
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 2018-04-03 01:05:44 +0200, Anthony Iliopoulos wrote: > Would using sync_file_range() be helpful? Potential errors would only > apply to pages that cover the requested file ranges. There are a few > caveats though: To quote sync_file_range(2): Warning This system call is extremely dangerous and should not be used in portable programs. None of these operations writes out the file's metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there are no guarantees that the data will be available after a crash. There is no user interface to know if a write is purely an over‐ write. On filesystems using copy-on-write semantics (e.g., btrfs) an overwrite of existing allocated blocks is impossible. When writing into preallocated space, many filesystems also require calls into the block allocator, which this system call does not sync out to disk. This system call does not flush disk write caches and thus does not provide any data integrity on systems with volatile disk write caches. Given the lack of metadata safety that seems entirely a no go. We use sfr(2), but only to force the kernel's hand around writing back earlier without throwing away cache contents. > > > The application will need to deal with that first error irrespective of > > > subsequent return codes from fsync(). Conceptually every fsync() > > > invocation > > > demarcates an epoch for which it reports potential errors, so the caller > > > needs to take responsibility for that particular epoch. > > > > We do deal with that error- by realizing that it failed and later > > *retrying* the fsync(), which is when we get back an "all good! > > everything with this file descriptor you've opened is sync'd!" and > > happily expect that to be truth, when, in reality, it's an unfortunate > > lie and there are still pages associated with that file descriptor which > > are, in reality, dirty and not sync'd to disk. > > It really turns out that this is not how the fsync() semantics work > though Except on freebsd and solaris, and perhaps others. >, exactly because the nature of the errors: even if the kernel > retained the dirty bits on the failed pages, retrying persisting them > on the same disk location would simply fail. That's not guaranteed at all, think NFS. > Instead the kernel opts for marking those pages clean (since there is > no other recovery strategy), and reporting once to the caller who can > potentially deal with it in some manner. It is sadly a bad and > undocumented convention. It's broken behaviour justified post facto with the only rational that was available, which explains why it's so unconvincing. You could just say "this ship has sailed, and it's to onerous to change because xxx" and this'd be a done deal. But claiming this is reasonable behaviour is ridiculous. Again, you could just continue to error for this fd and still throw away the data. > > Consider two independent programs where the first one writes to a file > > and then calls the second one whose job it is to go out and fsync(), > > perhaps async from the first, those files. Is the second program > > supposed to go write to each page that the first one wrote to, in order > > to ensure that all the dirty bits are set so that the fsync() will > > actually return if all the dirty pages are written? > > I think what you have in mind are the semantics of sync() rather > than fsync() If you open the same file with two fds, and write with one, and fsync with another that's definitely supposed to work. 
And sync() isn't a realistic replacement in any sort of way because it's obviously systemwide, and thus entirely and completely unsuitable. Nor does it have any sort of better error reporting behaviour, does it? Greetings, Andres Freund
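For context, the "sfr(2) as a write-back hint" usage Andres refers to looks roughly like this (a simplified sketch in the spirit of PostgreSQL's pg_flush_data(), not a copy of it): it asks the kernel to start writing a range back without waiting, and -- per the man page caveats quoted above -- provides no durability guarantee by itself.

#define _GNU_SOURCE
#include <fcntl.h>

/* hint the kernel to begin writeback of a file range; not a durability call */
void
hint_writeback(int fd, off_t offset, off_t nbytes)
{
    (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}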
Re: Creating streaming replication standby
> A complaint that I have about such concepts is that it creates a > dependency between utilities which are usually classified by > distributions as client-side (pg_basebackup goes to postgresql-client > for Debian), and the server itself. There is perhaps room for an > extension though. If many vendors are providing such an option at SQL > level, it would mean that there is demand for it. I think the new function/SQL would not necessarily be bound to pg_basebackup. Rather, a new GUC would be introduced to specify which tool is executed by the function/SQL for the backup. For example, create_streaming_standby_command = '/path/to/create_standby.sh %h %p %d' where %h is the host, %p is the port on which the postmaster is listening and %d is the database cluster directory for the standby server to be created. Then the new function (suppose it is called "create_streaming_standby") would be called something like: create_streaming_standby('standby_host', 5432, '/usr/local/pgsql/data') and it executes create_standby.sh with those arguments. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Hi Stephen, On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote: > > fsync() doesn't reflect the status of given pages, however, it reflects > the status of the file descriptor, and as such the file, on which it's > called. This notion that fsync() is actually only responsible for the > changes which were made to a file since the last fsync() call is pure > foolishness. If we were able to pass a list of pages or data ranges to > fsync() for it to verify they're on disk then perhaps things would be > different, but we can't, all we can do is ask to "please flush all the > dirty pages associated with this file descriptor, which represents this > file we opened, to disk, and let us know if you were successful." > > Give us a way to ask "are these specific pages written out to persistant > storage?" and we would certainly be happy to use it, and to repeatedly > try to flush out pages which weren't synced to disk due to some > transient error, and to track those cases and make sure that we don't > incorrectly assume that they've been transferred to persistent storage. Indeed fsync() is simply a rather blunt instrument and a narrow legacy interface but further changing its established semantics (no matter how unreasonable they may be) is probably not the way to go. Would using sync_file_range() be helpful? Potential errors would only apply to pages that cover the requested file ranges. There are a few caveats though: (a) it still messes with the top-level error reporting so mixing it with callers that use fsync() and do care about errors will produce the same issue (clearing the error status). (b) the error-reporting granularity is coarse (failure reporting applies to the entire requested range so you still don't know which particular pages/file sub-ranges failed writeback) (c) the same "report and forget" semantics apply to repeated invocations of the sync_file_range() call, so again action will need to be taken upon first error encountered for the particular ranges. > > The application will need to deal with that first error irrespective of > > subsequent return codes from fsync(). Conceptually every fsync() invocation > > demarcates an epoch for which it reports potential errors, so the caller > > needs to take responsibility for that particular epoch. > > We do deal with that error- by realizing that it failed and later > *retrying* the fsync(), which is when we get back an "all good! > everything with this file descriptor you've opened is sync'd!" and > happily expect that to be truth, when, in reality, it's an unfortunate > lie and there are still pages associated with that file descriptor which > are, in reality, dirty and not sync'd to disk. It really turns out that this is not how the fsync() semantics work though, exactly because the nature of the errors: even if the kernel retained the dirty bits on the failed pages, retrying persisting them on the same disk location would simply fail. Instead the kernel opts for marking those pages clean (since there is no other recovery strategy), and reporting once to the caller who can potentially deal with it in some manner. It is sadly a bad and undocumented convention. > Consider two independent programs where the first one writes to a file > and then calls the second one whose job it is to go out and fsync(), > perhaps async from the first, those files. 
Is the second program > supposed to go write to each page that the first one wrote to, in order > to ensure that all the dirty bits are set so that the fsync() will > actually return if all the dirty pages are written? I think what you have in mind are the semantics of sync() rather than fsync(), but as long as an application needs to ensure data are persisted to storage, it needs to retain those data in its heap until fsync() is successful instead of discarding them and relying on the kernel after write(). The pattern should be roughly like: write() -> fsync() -> free(), rather than write() -> free() -> fsync(). For example, if a partition gets full upon fsync(), then the application has a chance to persist the data in a different location, while the kernel cannot possibly make this decision and recover. > > Callers that are not affected by the potential outcome of fsync() and > > do not react on errors, have no reason for calling it in the first place > > (and thus masking failure from subsequent callers that may indeed care). > > Reacting on an error from an fsync() call could, based on how it's > documented and actually implemented in other OS's, mean "run another > fsync() to see if the error has resolved itself." Requiring that to > mean "you have to go dirty all of the pages you previously dirtied to > actually get a subsequent fsync() to do anything" is really just not > reasonable- a given program may have no idea what was written to > previously nor any particular reason to need to know, on the expectation > that the fsync() call will flush any dirty pages, as it's documente
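To make the ordering above concrete, here is a minimal sketch (purely illustrative, not from any real application; the fallback path is an assumption) of write() -> fsync() -> free(): the caller keeps its own copy of the data until fsync() reports success, and can therefore re-write it to another location when the first attempt fails.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Persist buf durably; the buffer may only be discarded once this returns 0,
 * because a failed fsync() means the kernel may have dropped the dirty pages.
 */
static int
persist(const char *path, const char *fallback_path,
		const char *buf, size_t len)
{
	const char *target = path;

	for (int attempt = 0; attempt < 2; attempt++)
	{
		int			fd = open(target, O_WRONLY | O_CREAT | O_TRUNC, 0600);
		int			ok = 0;

		if (fd >= 0)
		{
			ok = (write(fd, buf, len) == (ssize_t) len && fsync(fd) == 0);
			close(fd);
		}
		if (ok)
			return 0;			/* only now is it safe to free the buffer */

		perror(target);
		target = fallback_path; /* e.g. a partition that is not full */
	}
	return -1;
}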
Re: Disabling memory display in EXPLAIN ANALYZE
Hi, On 2018-04-02 17:01:08 -0400, Stephen Frost wrote: > * Andres Freund (and...@anarazel.de) wrote: > > i.e. memory consumption differs between environments. Which isn't > > surprising. > > > > I wonder if we could disable the display with a separate switch or tie > > it to !'COSTS OFF && TIMING OFF' or such? > > Yeah, I agree with having these suppressed when we're running the > regression tests. I don't particularly care for having them suppressed > under either costs or timing off though, as those are really pretty > specific and independent flags. We already put a number of loosely related things in there, so I don't think that'd be that weird to continue along those lines. > Perhaps what we need is a flag that says 'regression mode'... DWIM mode? reproducible mode? Greetings, Andres Freund
Re: Rethinking -L switch handling and construction of LDFLAGS
So here's a draft patch for this. It fixes the originally complained-of issue about --with-libxml breaking the build on Macs. I've not tested it very far beyond that plus a sanity cross-check on Linux, but I doubt there's any point in further manual testing; we might as well just throw it at the buildfarm. I'm pretty happy with the way this turned out. It's not a complicated change, and we're no longer breaking our own rules about how to manage LDFLAGS adjustments. Some notes: * I ended up adding not only LDFLAGS_INTERNAL and SHLIB_LINK_INTERNAL, but also PG_LIBS_INTERNAL. As things stood, the only uses of PG_LIBS were for within-tree library references, so that it was fine that the make rule that uses this variable put it before LDFLAGS. But if anyone ever tried to use PG_LIBS for the seemingly obvious purpose of referencing an external library, we'd be right back in hazard land. So that variable needs to be split as well. * I documented SHLIB_LINK_INTERNAL and PG_LIBS_INTERNAL in the comments in pgxs.mk, but not in the user-facing documentation about using PGXS. This is on the theory that only within-tree modules need either of those variables, so users don't need them. I could be convinced otherwise, though. * Some of these changes aren't really essential, eg the change in hstore_plperl/Makefile to use SHLIB_LINK_INTERNAL instead of SHLIB_LINK; since there's no -L switch there, it'd work either way. But I thought it best to maintain a consistent rule of using the _INTERNAL variable for within-tree references. * The change to STD_LDFLAGS in src/common/Makefile is because I noticed that "pg_config --ldflags" has been printing "-L../../../src/common", which it surely should not. Apparently we forgot to fix this rule when we rearranged things to create src/common/. I'm tempted to make the rule be STD_LDFLAGS := $(filter-out $(LDFLAGS_INTERNAL),$(LDFLAGS)) in hopes of preventing a repetition of such silliness, but desisted for the moment, because I'm not quite sure if this is a good idea, or if it risks taking too much out. Anybody have an opinion about it? regards, tom lane diff --git a/contrib/dblink/Makefile b/contrib/dblink/Makefile index 5189758..b1a5e06 100644 --- a/contrib/dblink/Makefile +++ b/contrib/dblink/Makefile @@ -3,7 +3,7 @@ MODULE_big = dblink OBJS = dblink.o $(WIN32RES) PG_CPPFLAGS = -I$(libpq_srcdir) -SHLIB_LINK = $(libpq) +SHLIB_LINK_INTERNAL = $(libpq) EXTENSION = dblink DATA = dblink--1.2.sql dblink--1.1--1.2.sql dblink--1.0--1.1.sql \ diff --git a/contrib/hstore_plperl/Makefile b/contrib/hstore_plperl/Makefile index c0906db..f63cba2 100644 --- a/contrib/hstore_plperl/Makefile +++ b/contrib/hstore_plperl/Makefile @@ -28,7 +28,7 @@ ifeq ($(PORTNAME), win32) # these settings are the same as for plperl override CPPFLAGS += -DPLPERL_HAVE_UID_GID -Wno-comment # ... see silliness in plperl Makefile ... -SHLIB_LINK += $(sort $(wildcard ../../src/pl/plperl/libperl*.a)) +SHLIB_LINK_INTERNAL += $(sort $(wildcard ../../src/pl/plperl/libperl*.a)) else rpathdir = $(perl_archlibexp)/CORE SHLIB_LINK += $(perl_embed_ldflags) diff --git a/contrib/hstore_plpython/Makefile b/contrib/hstore_plpython/Makefile index 7ff787a..b81735a 100644 --- a/contrib/hstore_plpython/Makefile +++ b/contrib/hstore_plpython/Makefile @@ -26,7 +26,7 @@ endif # We must link libpython explicitly ifeq ($(PORTNAME), win32) # ... see silliness in plpython Makefile ... 
-SHLIB_LINK += $(sort $(wildcard ../../src/pl/plpython/libpython*.a)) +SHLIB_LINK_INTERNAL += $(sort $(wildcard ../../src/pl/plpython/libpython*.a)) else rpathdir = $(python_libdir) SHLIB_LINK += $(python_libspec) $(python_additional_libs) diff --git a/contrib/jsonb_plpython/Makefile b/contrib/jsonb_plpython/Makefile index 8c7090a..b3c98e6 100644 --- a/contrib/jsonb_plpython/Makefile +++ b/contrib/jsonb_plpython/Makefile @@ -26,7 +26,7 @@ endif # We must link libpython explicitly ifeq ($(PORTNAME), win32) # ... see silliness in plpython Makefile ... -SHLIB_LINK += $(sort $(wildcard ../../src/pl/plpython/libpython*.a)) +SHLIB_LINK_INTERNAL += $(sort $(wildcard ../../src/pl/plpython/libpython*.a)) else rpathdir = $(python_libdir) SHLIB_LINK += $(python_libspec) $(python_additional_libs) diff --git a/contrib/ltree_plpython/Makefile b/contrib/ltree_plpython/Makefile index bc7502b..7e988c7 100644 --- a/contrib/ltree_plpython/Makefile +++ b/contrib/ltree_plpython/Makefile @@ -26,7 +26,7 @@ endif # We must link libpython explicitly ifeq ($(PORTNAME), win32) # ... see silliness in plpython Makefile ... -SHLIB_LINK += $(sort $(wildcard ../../src/pl/plpython/libpython*.a)) +SHLIB_LINK_INTERNAL += $(sort $(wildcard ../../src/pl/plpython/libpython*.a)) else rpathdir = $(python_libdir) SHLIB_LINK += $(python_libspec) $(python_additional_libs) diff --git a/contrib/oid2name/Makefile b/contrib/oid2name/Makefile index 3414b4a..3eef8f6 100644 --- a/contrib/oid2name/Makefile +++ b/contrib/oid2name/Makefile @@ -7
Re: Add default role 'pg_access_server_files'
Michael, all, * Michael Paquier (mich...@paquier.xyz) wrote: > On Sun, Apr 01, 2018 at 09:39:02AM -0400, Stephen Frost wrote: > > Thanks for checking. Attached is an updated version which also includes > > the changes for adminpack, done in a similar manner to how pgstattuple > > was updated, as discussed. Regression tests updated and extended a bit, > > doc updates also included. > > > > If you get a chance to take a look, that'd be great. I'll do my own > > review of it again also after stepping away for a day or so. > > I have spotted some issues mainly in patch 3. Thanks for taking a look. > I am not sure what has happened to your editor, but git diff --check is > throwing a dozen of warnings coming from adminpack.c. Ahhh, those came from switching over to tmux.. I need to figure out how to get it to copy/paste like I had set up before with screen. Anyhow, they were all in patch 3 and I've fixed them all. > c0cbe00 has stolen the OIDs your patch is using for the new roles, so > patch 2 needs a refresh. Fixed and generally rebased. > @@ -68,6 +77,15 @@ convert_and_check_filename(text *arg, bool logAllowed) > [...] > + /* > +* Members of the 'pg_read_server_files' role are allowed to access any > +* files on the server as the PG user, so no need to do any further checks > +* here. > +*/ > + if (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES)) > + return filename; > So... If a user has loaded adminpack v1.0 with Postgres v11, then > convert_and_check_filename would actually be able to read paths out of > the data folder for a user within pg_read_server_files, while with > Postgres v10 then only paths within the data folder were allowed. And > that7s actually fine because a superuser check happens before entering > in this code path. Yes, all of the adminpack v1.0 code paths still have superuser checks, similar to how the older pgstattuple code paths do. When an upgrade to adminpack v1.1 is done, the new v1_1 functions don't have those superuser checks but the upgrade script REVOKE's execute rights from public, so the right to execute the functions has to be explicitly GRANT'd for non-superusers. > pg_file_rename(PG_FUNCTION_ARGS) > +{ > + text *file1; > + text *file2; > + text *file3; > + boolresult; > + > + if (PG_ARGISNULL(0) || PG_ARGISNULL(1)) > + PG_RETURN_NULL(); > + > + file1 = PG_GETARG_TEXT_PP(0); > + file2 = PG_GETARG_TEXT_PP(1); > + > + if (PG_ARGISNULL(2)) > + file3 = NULL; > + else > + file3 = PG_GETARG_TEXT_PP(2); > + > + requireSuperuser(); > Here requireSuperuser() should be called before looking at the > argument values. Moved up. > No refactoring for pg_file_unlink and its v1.1? I considered each function and thought about if it'd make sense to refactor them or if they were simple enough that the additional function wouldn't really be all that useful. I'm open to revisiting that, but just want to make it clear that it was something I thought about and considered. Since pg_file_unlink is basically two function calls, I didn't think it worthwhile to refactor those into their own function. > The argument checks are exactly the same for +pg_file_rename and > pg_file_rename_v1_1. Why about just passing fcinfo around and simplify > the patch? In general, I prefer to keep the PG_FUNCTION_ARGS abstraction when we can. Unfortunately, that does fall apart when wrapping an SRF as in pg_logdir_ls(), but with pg_file_rename we can maintain it and it's really not that much code to do so. 
As with the refactoring of pg_file_unlink, this is something which could really go either way. > +CREATE OR REPLACE FUNCTION pg_catalog.pg_file_rename(text, text) > +RETURNS bool > +AS 'SELECT pg_catalog.pg_file_rename($1, $2, NULL::pg_catalog.text);' > +LANGUAGE SQL VOLATILE STRICT; > You forgot a REVOKE clause for pg_file_rename(text, text). No, I explicitly didn't include it because that's a security-invoker SQL function that basically doesn't do anything but turn around and call pg_file_rename(), which will handle the privilege check: =*> select pg_file_rename('abc','123'); ERROR: permission denied for function pg_file_rename CONTEXT: SQL function "pg_file_rename" statement 1 I'm not sure how useful it is to REVOKE the rights on the simple SQL function; that would just mean that an admin has to go GRANT the rights on that function as well as the three-argument version. > In adminpack.c, convert_and_check_filename is always called with false > as second argument. Why not dropping it and use the version in > genfile.c instead? As far as I can see, both functions are the same. Hmm. I'm pretty sure that function was actually copied from adminpack into core, so I'm not surprised that they're basically the same, but it was implemented in core as static and I'm not really sure that we want to export it- it wasn't when it was first copied into core, after all. Also, they act
Re: [PATCH] Logical decoding of TRUNCATE
> On Apr 2, 2018, at 11:50, Andres Freund wrote: > I'm not convinced. I think it's perfectly reasonable to want to truncate > away data on the live node, but maintain it on an archival node. +1. We've had at least one specific request for this, in maintaining a data warehouse from an OLTP system. -- -- Christophe Pettus x...@thebuild.com
Re: Disabling memory display in EXPLAIN ANALYZE
Greetings, * Andres Freund (and...@anarazel.de) wrote: > i.e. memory consumption differs between environments. Which isn't > surprising. > > I wonder if we could disable the display with a separate switch or tie > it to !'COSTS OFF && TIMING OFF' or such? Yeah, I agree with having these suppressed when we're running the regression tests. I don't particularly care for having them suppressed under either costs or timing off though, as those are really pretty specific and independent flags. Perhaps what we need is a flag that says 'regression mode'... Thanks! Stephen signature.asc Description: PGP signature
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Greetings, * Anthony Iliopoulos (ail...@altatus.com) wrote: > On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote: > > On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote: > > > On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote: > > > > Throwing away the dirty pages *and* persisting the error seems a lot > > > > more reasonable. Then provide a fcntl (or whatever) extension that can > > > > clear the error status in the few cases that the application that wants > > > > to gracefully deal with the case. > > > > > > Given precisely that the dirty pages which cannot been written-out are > > > practically thrown away, the semantics of fsync() (after the 4.13 fixes) > > > are essentially correct: the first call indicates that a writeback error > > > indeed occurred, while subsequent calls have no reason to indicate an > > > error > > > (assuming no other errors occurred in the meantime). > > > > Meh^2. > > > > "no reason" - except that there's absolutely no way to know what state > > the data is in. And that your application needs explicit handling of > > such failures. And that one FD might be used in a lots of different > > parts of the application, that fsyncs in one part of the application > > might be an ok failure, and in another not. Requiring explicit actions > > to acknowledge "we've thrown away your data for unknown reason" seems > > entirely reasonable. > > As long as fsync() indicates error on first invocation, the application > is fully aware that between this point of time and the last call to fsync() > data has been lost. Persisting this error any further does not change this > or add any new info - on the contrary it adds confusion as subsequent write()s > and fsync()s on other pages can succeed, but will be reported as failures. fsync() doesn't reflect the status of given pages, however, it reflects the status of the file descriptor, and as such the file, on which it's called. This notion that fsync() is actually only responsible for the changes which were made to a file since the last fsync() call is pure foolishness. If we were able to pass a list of pages or data ranges to fsync() for it to verify they're on disk then perhaps things would be different, but we can't, all we can do is ask to "please flush all the dirty pages associated with this file descriptor, which represents this file we opened, to disk, and let us know if you were successful." Give us a way to ask "are these specific pages written out to persistant storage?" and we would certainly be happy to use it, and to repeatedly try to flush out pages which weren't synced to disk due to some transient error, and to track those cases and make sure that we don't incorrectly assume that they've been transferred to persistent storage. > The application will need to deal with that first error irrespective of > subsequent return codes from fsync(). Conceptually every fsync() invocation > demarcates an epoch for which it reports potential errors, so the caller > needs to take responsibility for that particular epoch. We do deal with that error- by realizing that it failed and later *retrying* the fsync(), which is when we get back an "all good! everything with this file descriptor you've opened is sync'd!" and happily expect that to be truth, when, in reality, it's an unfortunate lie and there are still pages associated with that file descriptor which are, in reality, dirty and not sync'd to disk. 
Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written? > Callers that are not affected by the potential outcome of fsync() and > do not react on errors, have no reason for calling it in the first place > (and thus masking failure from subsequent callers that may indeed care). Reacting on an error from an fsync() call could, based on how it's documented and actually implemented in other OS's, mean "run another fsync() to see if the error has resolved itself." Requiring that to mean "you have to go dirty all of the pages you previously dirtied to actually get a subsequent fsync() to do anything" is really just not reasonable- a given program may have no idea what was written to previously nor any particular reason to need to know, on the expectation that the fsync() call will flush any dirty pages, as it's documented to do. Thanks! Stephen signature.asc Description: PGP signature
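For concreteness, the two-program scenario boils down to something like the following sketch (illustrative only; the file name is invented): one descriptor dirties the pages, a second, independently opened descriptor on the same file is fsync()ed, and the second fsync() is expected to flush those pages and to report any writeback error affecting them.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	const char *path = "shared_datafile";
	const char	payload[] = "dirty page contents";
	int			wfd,
				sfd;

	/* "writer" role: dirty some pages through one file descriptor */
	wfd = open(path, O_WRONLY | O_CREAT, 0600);
	if (wfd < 0 || write(wfd, payload, sizeof(payload)) < 0)
	{
		perror("write side");
		return 1;
	}

	/* "syncer" role: open the same file independently and fsync it */
	sfd = open(path, O_WRONLY);
	if (sfd < 0 || fsync(sfd) != 0)
	{
		/* any writeback error for the writer's pages should surface here */
		perror("fsync side");
		return 1;
	}

	close(wfd);
	close(sfd);
	return 0;
}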
Re: BRIN FSM vacuuming questions
Alvaro Herrera writes: > Tom Lane wrote: >> I noticed that several places in brin_pageops.c do this sort of thing: >> ... >> ie they put *one* page into the index's free space map and then invoke a >> scan of the index's entire FSM to ensure that that info is bubbled up to >> the top immediately. I wonder how carefully this was thought through. >> We don't generally think that it's worth doing an FSM vacuum for a single >> page worth of free space. > Yeah, there was no precedent for doing that, but it was actually > intentional: there were enough cases where we would extend the relation, > only to have the operation fail for unrelated reasons (things like a > concurrent heap insertion), so it was possible to lose free space very > fast, and since BRIN indexes are slow to grow it would become > noticeable. We could have waited for FreeSpaceMapVacuum to occur > naturally, but this would cause the index to bloat until next vacuum, > which could happen a long time later. Also, the idea behind BRIN is > that the indexed tables are very large, so vacuuming is infrequent. Fair enough. It seemed like an odd thing, but as long as it was intentional I'm good with it. >> some places that do RecordPageWithFreeSpace follow it up with an immediate >> FreeSpaceMapVacuum, and some don't. > Hmm I analyzed the callsites carefully to ensure it wasn't possible to > bail out without vacuuming in the normal corner cases. Maybe you > spotted some gap in my analysis -- or worse, maybe I had to change > nearby code for unrelated reasons and broke it afterwards inadvertently. Well, I did not trace the logic fully, but there are places that record free space and don't have an obviously connected FreeSpaceMapVacuum call. Maybe those all expect the caller to do it. > I agree that brin_getinsertbuffer should have a better comment to > explain the intention (and that routine seems to carry the most guilt in > failing to do the FSM vacuuming). Got a proposal? I can take care of the other stuff, unless you wanted to. regards, tom lane
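As a sketch of the cheaper variant discussed above (not the committed code), the call sites in brin_pageops.c would look roughly like this if the whole-FSM scan were narrowed to the single page that changed:

	/* record the one page and bubble up only that branch of the FSM */
	if (extended)
	{
		Assert(BlockNumberIsValid(newblk));
		RecordPageWithFreeSpace(idxrel, newblk, freespace);
		FreeSpaceMapVacuumRange(idxrel, newblk, newblk + 1);
	}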
Disabling memory display in EXPLAIN ANALYZE
Hi, The, now reverted, MERGE patch used EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF) on a few queries. That doesn't seem like a bad idea when it's interesting to display the plan and run a query without (interesting) results. Otherwise one has to duplicate the query. But right now it triggers differences like: *** *** 887,893 Hash Cond: (s.sid = t_1.tid) -> Seq Scan on source s (actual rows=3 loops=1) -> Hash (actual rows=3 loops=1) !Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on target t_1 (actual rows=3 loops=1) Trigger merge_ard: calls=1 Trigger merge_ari: calls=1 --- 887,893 Hash Cond: (s.sid = t_1.tid) -> Seq Scan on source s (actual rows=3 loops=1) -> Hash (actual rows=3 loops=1) !Buckets: 1024 Batches: 1 Memory Usage: 5kB -> Seq Scan on target t_1 (actual rows=3 loops=1) Trigger merge_ard: calls=1 Trigger merge_ari: calls=1 *** *** 1320,1330 Merge Cond: (t_1.a = s.a) -> Sort (actual rows=50 loops=1) Sort Key: t_1.a !Sort Method: quicksort Memory: 27kB -> Seq Scan on ex_mtarget t_1 (actual rows=50 loops=1) -> Sort (actual rows=100 loops=1) Sort Key: s.a !Sort Method: quicksort Memory: 33kB -> Seq Scan on ex_msource s (actual rows=100 loops=1) (15 rows) --- 1320,1330 Merge Cond: (t_1.a = s.a) -> Sort (actual rows=50 loops=1) Sort Key: t_1.a !Sort Method: quicksort Memory: 19kB -> Seq Scan on ex_mtarget t_1 (actual rows=50 loops=1) -> Sort (actual rows=100 loops=1) Sort Key: s.a !Sort Method: quicksort Memory: 24kB -> Seq Scan on ex_msource s (actual rows=100 loops=1) (15 rows) i.e. memory consumption differs between environments. Which isn't surprising. I wonder if we could disable the display with a separate switch or tie it to !'COSTS OFF && TIMING OFF' or such? Greetings, Andres Freund
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote: > On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote: > > On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote: > > > Throwing away the dirty pages *and* persisting the error seems a lot > > > more reasonable. Then provide a fcntl (or whatever) extension that can > > > clear the error status in the few cases that the application that wants > > > to gracefully deal with the case. > > > > Given precisely that the dirty pages which cannot been written-out are > > practically thrown away, the semantics of fsync() (after the 4.13 fixes) > > are essentially correct: the first call indicates that a writeback error > > indeed occurred, while subsequent calls have no reason to indicate an error > > (assuming no other errors occurred in the meantime). > > Meh^2. > > "no reason" - except that there's absolutely no way to know what state > the data is in. And that your application needs explicit handling of > such failures. And that one FD might be used in a lots of different > parts of the application, that fsyncs in one part of the application > might be an ok failure, and in another not. Requiring explicit actions > to acknowledge "we've thrown away your data for unknown reason" seems > entirely reasonable. As long as fsync() indicates error on first invocation, the application is fully aware that between this point of time and the last call to fsync() data has been lost. Persisting this error any further does not change this or add any new info - on the contrary it adds confusion as subsequent write()s and fsync()s on other pages can succeed, but will be reported as failures. The application will need to deal with that first error irrespective of subsequent return codes from fsync(). Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch. Callers that are not affected by the potential outcome of fsync() and do not react on errors, have no reason for calling it in the first place (and thus masking failure from subsequent callers that may indeed care). Best regards, Anthony
Re: BRIN FSM vacuuming questions
Tom Lane wrote: > I noticed that several places in brin_pageops.c do this sort of thing: > > if (extended) > { > Assert(BlockNumberIsValid(newblk)); > RecordPageWithFreeSpace(idxrel, newblk, freespace); > FreeSpaceMapVacuum(idxrel); > } > > ie they put *one* page into the index's free space map and then invoke a > scan of the index's entire FSM to ensure that that info is bubbled up to > the top immediately. I wonder how carefully this was thought through. > We don't generally think that it's worth doing an FSM vacuum for a single > page worth of free space. Yeah, there was no precedent for doing that, but it was actually intentional: there were enough cases where we would extend the relation, only to have the operation fail for unrelated reasons (things like a concurrent heap insertion), so it was possible to lose free space very fast, and since BRIN indexes are slow to grow it would become noticeable. We could have waited for FreeSpaceMapVacuum to occur naturally, but this would cause the index to bloat until next vacuum, which could happen a long time later. Also, the idea behind BRIN is that the indexed tables are very large, so vacuuming is infrequent. I also considered the fact that we can save the newly acquired page in relcache's "target block", but I was scared of relying just on that because AFAIR it's easily lost on relcache invalidations. > We could make these operations somewhat cheaper, in the wake of 851a26e26, > by doing > > - FreeSpaceMapVacuum(idxrel); > + FreeSpaceMapVacuumRange(idxrel, newblk, newblk + 1); > > so that only the relevant subtree of the FSM gets scanned. But it seems > at least as plausible to just drop the FreeSpaceMapVacuum calls entirely > and let the next VACUUM fix up the map, like the entire rest of the > system does. This is particularly true because the code is inconsistent: > some places that do RecordPageWithFreeSpace follow it up with an immediate > FreeSpaceMapVacuum, and some don't. Hmm I analyzed the callsites carefully to ensure it wasn't possible to bail out without vacuuming in the normal corner cases. Maybe you spotted some gap in my analysis -- or worse, maybe I had to change nearby code for unrelated reasons and broke it afterwards inadvertently. > Or, perhaps, there's actually some method behind this madness and there's > good reason to think that FreeSpaceMapVacuum calls in these specific > places are worth the cost? If so, a comment documenting the reasoning > would be appropriate. Range-vacuuming the FSM seems a good idea for these places -- no need to trawl the rest of the tree. I'm not too excited about dropping the FSM vacuuming completely and waiting till regular VACUUM. (Why do we call this FSM operation "vacuum"? It seems pretty confusing.) I agree that brin_getinsertbuffer should have a better comment to explain the intention (and that routine seems to carry the most guilt in failing to do the FSM vacuuming). > I'm also finding this code in brin_page_cleanup a bit implausible: > > /* Measure free space and record it */ > freespace = br_page_get_freespace(page); > if (freespace > GetRecordedFreeSpace(idxrel, BufferGetBlockNumber(buf))) > { > RecordPageWithFreeSpace(idxrel, BufferGetBlockNumber(buf), freespace); > return true; > } > > because of the fact that GetRecordedFreeSpace will report a rounded-down > result owing to the limited resolution of the FSM's numbers. 
If we > assume that the result of br_page_get_freespace is always maxaligned, > then by my math there's a 75% chance (on maxalign-8 machines; more if > maxalign is 4) that this will think there's a discrepancy even when > there is none. It seems questionable to me that it's worth checking > GetRecordedFreeSpace at all. If it really is, we'd need to get the > recorded free space, invoke RecordPageWithFreeSpace, and then see if > the result of GetRecordedFreeSpace has changed in order to decide whether > anything really changed. But that seems overly complicated, and again > unlike anything done elsewhere. Also, the result value is used to > decide whether FreeSpaceMapVacuum is needed, but I'm unconvinced that > it's a good idea to skip said call just because the leaf FSM values did > not change; that loses the opportunity to clean up any upper-level FSM > errors that may exist. > > In short, I'd suggest making this RecordPageWithFreeSpace call > unconditional, dropping the result value of brin_page_cleanup, > and doing FreeSpaceMapVacuum unconditionally too, back in > brin_vacuum_scan. I don't see a reason for brin to be unlike > the rest of the system here. As I recall, this was just trying to save unnecessary FSM updates, and failed to make the connection with lossiness in GetRecordedFreeSpace. But since FSM updates are quite cheap anyway (since they're not WAL logged), I agree with your proposed fix. > Lastly, I notice that brin_va
Re: BUG #15112: Unable to run pg_upgrade with earthdistance extension
On Fri, Mar 30, 2018 at 04:52:29PM -0400, Bruce Momjian wrote: > On Wed, Mar 14, 2018 at 11:12:26PM +, PG Bug reporting form wrote: > > The following bug has been logged on the website: > > > > Bug reference: 15112 > > Logged by: Keiko Oda > > Email address: keiko...@gmail.com > > PostgreSQL version: 10.3 > > Operating system: Ubuntu > > Description: > > > > This is similar to the bug report from 2016 > > (https://www.postgresql.org/message-id/20161015001424.1413.93...@wrigleys.postgresql.org). > > where earthdistance extension is assuming that it's executed with public > > schema in search_path. > > > > With 10.3 release, Postgres has tighten the search_path during pg_upgrade > > only to pg_catalog. This is problematic as it's now impossible to run > > pg_upgrade with earthdistance extension (it fails with create index with > > ll_to_earth function). > > Prior to this, we were at least able to workaround by adding public schema > > in search_path with pg_upgrade. > > > > As it's recommended in the release note, earthdistance should adjust these > > functions to not assume anything about what search path they are invoked > > under. > > Uh, I can reproduce this failure. :-( > > I tested it by installing earchdistance (and cube) and an index using an > earchdistance function in the old cluster and ran pg_upgrade. I used > the instructions in the referenced email: > > CREATE TABLE zip_code_geo_poly_data ( > id serial NOT NULL PRIMARY KEY, > zip_codeTEXT, > latitudeNUMERIC, > longitude NUMERIC > ); > > CREATE INDEX zip_code_geo_poly_data_ll_to_earth > ON zip_code_geo_poly_data > USING gist(ll_to_earth(latitude, longitude)); > > The failure is actually in pg_dump/restore, but pg_upgrade relies on > that, so it fails too. The failure output is: > > LINE 1: ...ians($1))*sin(radians($2))),earth()*sin(radians($1)))::earth > ^ > QUERY: SELECT > cube(cube(cube(earth()*cos(radians($1))*cos(radians($2))),earth()*cos(radians($1))*sin(radians($2))),earth()*sin(radians($1)))::earth > CONTEXT: SQL function "ll_to_earth" during inlining > Command was: > -- For binary upgrade, must preserve pg_class oids > SELECT > pg_catalog.binary_upgrade_set_next_index_pg_class_oid('16498'::pg_catalog.oid); > > CREATE INDEX "zip_code_geo_poly_data_ll_to_earth" ON > "public"."zip_code_geo_poly_data" USING "gist" > ("public"."ll_to_earth"(("latitude")::double precision, ("longitude")::double > precision)); > > The function definition of ll_to_earth() is (as defined in > earthdistance--1.1.sql): > > CREATE FUNCTION ll_to_earth(float8, float8) > RETURNS earth > LANGUAGE SQL > IMMUTABLE STRICT > PARALLEL SAFE > AS 'SELECT > cube(cube(cube(earth()*cos(radians($1))*cos(radians($2))),earth()*cos(radians($1))*sin(radians($2))),earth()*sin(radians($1)))::earth'; > > As you can see, the SQL function is doing a cast to ::earth, but in this > test case the earth cast is stored in the public schema, and the restore > doesn't reference it. > > I am not sure we can fix this without requiring people to drop and > recreate such indexes. However, I am even at a loss in how to fix the > CREATE FUNCTION to reference a cast in the same schema as the function, > in this case 'public'. We can rewrite the cast to not use :: and use a > function call with schema qualification. e.g. public.earth(), but how do > we know what schema that is in, i.e. what if the extension is loaded > into a schema other than public? > > FYI, earthdistance is certainly not the only case of this problem. 
> > This is also part of a larger problem with our new schema qualification > approach. While we are qualifying where the object is to be created, we > are not always qualifying the object's contents, i.e. in this case, the > SQL function body. I have not received a reply to this so I am CCing the hackers list for more input. (If this is not the proper procedure, please let me know.) The only additional thoughts I have are that when we removed public from pg_dump's search_path and decided to instead fully qualify object creation names, I was concerned about function creation failing because of 'public' schema object references in the function body, but someone pointed out that we don't check function bodies during restore because of: SET check_function_bodies = false; However, what I did not consider was creation of indexes based on function bodies with 'public' schema references. I have run some tests and found out that only SQL functions cause such failures. I believe this is because they are inlined during CREATE INDEX. The attached pg_dump output file recreates the failure, and shows that prefixing it with "public" avoids the failure, and shows that index
fdw_startup/tuple_cost vs. parallel_startup/tuple_cost
postgres_fdw has values called fdw_startup_cost and fdw_tuple_cost that you can adjust in order to control the perceived cost of starting a remote query and schlepping a tuple over the network. Parallel query, similarly, has parallel_startup_cost and parallel_tuple_cost. Today, I noticed that on a relative basis, the defaults are insane:
parallel_startup_cost = 1000
fdw_startup_cost = 100
parallel_tuple_cost = 0.1
fdw_tuple_cost = 0.01
So, we think that starting a query on a remote node is 10x cheaper than starting a worker within the same cluster, and similarly we think that the cost of moving a tuple from the worker to the leader is 10x more than the cost of moving a tuple from a remote node to the local node. I think that's gotta be wrong. Communication within the same database should be considerably *cheaper*, because we don't have to convert tuples to text or fit them through the network pipe. Another useful point of comparison is the cost of an Append node, which is currently defined to be one-half of cpu_tuple_cost, or by default 0.005 - or in other words half the cost of sending a tuple over the network. It's good that it's less, but I think the actual cost of sending a tuple over the network is considerably more than twice the cost of passing it through an Append node. I suspect it's at least 10x more expensive. I don't think all of this mattered very much before we had join pushdown, aggregate pushdown, and partition-wise join and aggregate, but it does matter now. I'm seeing the planner fail to push down aggregates in cases that are obvious wins unless I crank up fdw_tuple_cost. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Passing current_database to BackgroundWorkerInitializeConnection
Hi, On 2018-04-02 14:33:54 -0500, Jeremy Finzel wrote: > Hmmm... not sure if I follow. My goal is to run a SQL statement every 10 > seconds (or what value is chosen) in a particular database, using a > background worker. Those are the two arguments. Am I missing some way to > implement this apart from passing those 2 arguments into the launcher > function? Is the way to do this properly then to allocate shared memory > for it, as opposed to trying to pass args into the main function? Yes, that's the proper way. Allocate shared memory and pass a pointer to that as the argument. Greetings, Andres Freund
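A rough sketch of that approach follows. Names such as WorkerArgs, "my_worker_args", my_extension and my_worker_main are invented for illustration, and a real extension would also reserve the space via RequestAddinShmemSpace() in _PG_init() and protect the slot with a lock (or use one slot per launch); this only shows the shape of passing several launch parameters through shared memory rather than through bgw_main_arg.

#include "postgres.h"

#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/shmem.h"

typedef struct WorkerArgs
{
	Oid			dboid;			/* database the worker should connect to */
	int			naptime;		/* seconds to sleep between runs */
} WorkerArgs;

static void
launch_worker(Oid dboid, int naptime)
{
	BackgroundWorker worker;
	BackgroundWorkerHandle *handle;
	WorkerArgs *args;
	bool		found;

	/* stash the arguments where the new process can find them */
	args = ShmemInitStruct("my_worker_args", sizeof(WorkerArgs), &found);
	args->dboid = dboid;
	args->naptime = naptime;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
		BGWORKER_BACKEND_DATABASE_CONNECTION;
	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
	worker.bgw_restart_time = BGW_NEVER_RESTART;
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_extension");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "my_worker_main");
	snprintf(worker.bgw_name, BGW_MAXLEN, "my worker");
	worker.bgw_main_arg = (Datum) 0;	/* real payload lives in shmem */
	worker.bgw_notify_pid = MyProcPid;

	if (!RegisterDynamicBackgroundWorker(&worker, &handle))
		ereport(ERROR,
				(errmsg("could not register background worker")));

	/*
	 * my_worker_main() would look up "my_worker_args" with ShmemInitStruct()
	 * again, read dboid/naptime, and then call
	 * BackgroundWorkerInitializeConnectionByOid().
	 */
}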
Re: Passing current_database to BackgroundWorkerInitializeConnection
On Mon, Apr 2, 2018 at 2:27 PM, Andres Freund wrote: > Hi, > > On 2018-04-02 14:24:53 -0500, Jeremy Finzel wrote: > > Thank you, this makes sense. However, how can this be done since I can > > only pass one argument to bgw_main? Is there any way to do this without > > having to store the value in shared memory? > > No (I mean you can store it in the filesystem or such as well, but > ...). Pretty fundamentally sharing data between concurrently running > processes needs a medium to share the data over. The bgw infrastructure > allocates just enough so you can put an index to it into > shmem. Allocating more would be wasteful and/or not enough for some > users. > > Greetings, > > Andres Freund > Hmmm... not sure if I follow. My goal is to run a SQL statement every 10 seconds (or what value is chosen) in a particular database, using a background worker. Those are the two arguments. Am I missing some way to implement this apart from passing those 2 arguments into the launcher function? Is the way to do this properly then to allocate shared memory for it, as opposed to trying to pass args into the main function? Thanks, Jeremy
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote: > On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote: > > Throwing away the dirty pages *and* persisting the error seems a lot > > more reasonable. Then provide a fcntl (or whatever) extension that can > > clear the error status in the few cases that the application that wants > > to gracefully deal with the case. > > Given precisely that the dirty pages which cannot been written-out are > practically thrown away, the semantics of fsync() (after the 4.13 fixes) > are essentially correct: the first call indicates that a writeback error > indeed occurred, while subsequent calls have no reason to indicate an error > (assuming no other errors occurred in the meantime). Meh^2. "no reason" - except that there's absolutely no way to know what state the data is in. And that your application needs explicit handling of such failures. And that one FD might be used in a lots of different parts of the application, that fsyncs in one part of the application might be an ok failure, and in another not. Requiring explicit actions to acknowledge "we've thrown away your data for unknown reason" seems entirely reasonable. > The error reporting is thus consistent with the intended semantics (which > are sadly not properly documented). Repeated calls to fsync() simply do not > imply that the kernel will retry to writeback the previously-failed pages, > so the application needs to be aware of that. Which isn't what I've suggested. > Persisting the error at the fsync() level would essentially mean > moving application policy into the kernel. Meh. Greetings, Andres Freund
Re: Passing current_database to BackgroundWorkerInitializeConnection
Hi, On 2018-04-02 14:24:53 -0500, Jeremy Finzel wrote: > Thank you, this makes sense. However, how can this be done since I can > only pass one argument to bgw_main? Is there any way to do this without > having to store the value in shared memory? No (I mean you can store it in the filesystem or such as well, but ...). Pretty fundamentally sharing data between concurrently running processes needs a medium to share the data over. The bgw infrastructure allocates just enough so you can put an index to it into shmem. Allocating more would be wasteful and/or not enough for some users. Greetings, Andres Freund
Re: Passing current_database to BackgroundWorkerInitializeConnection
On Fri, Mar 30, 2018 at 5:37 PM, Andres Freund wrote: > > > On March 30, 2018 3:16:31 PM PDT, Jeremy Finzel wrote: > >> What do you mean with "current database"? Before you > >> BackgroundWorkerInitializeConnection() there is no such thing? > > > > > >My module is based directly off the worker_spi example. The worker is > >dynamically launched via SQL command. But in the worker_spi example, > >the > >database postgres is just hardcoded as the database in which to start > >the > >background worker process. Instead, I want to start it in the database > >in > >which I run the SQL command. > > The started worker isn't associated with the original database. You can > pass the database oid as an argument to the launched bgworker. > > Thank you, this makes sense. However, how can this be done since I can only pass one argument to bgw_main? Is there any way to do this without having to store the value in shared memory? I was going to try passing an array instead of an int, but I'm not liking that much. I am trying to pass naptime and database_name (or oid). Thanks, Jeremy
Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.
Why do we need AccessExclusiveLock on all children of a relation that we want to scan to search for rows not satisfying the constraint? I think it should be enough to get ShareLock, which prevents INSERT, no? I have a feeling I'm missing something here, but I don't know what, and all tests pass with that change. Also: the change proposed to remove the get_default_oid_from_partdesc() call actually fixes the bug, but to me it was not at all obvious why. To figure out why, I first had to realize that ValidatePartitionConstraints was lying to me, both in name and in comments: it doesn't actually validate anything, it merely queues entries so that alter table's phase 3 would do the validation. I found this extremely confusing, so I fixed the comments to match reality, and later decided to rename the function also. At that point I was able to understand what the problem was: when attaching the default partition, we were setting the constraint to be validated for that partition to the correctly computed partition constraint; and later in the same function we would set the constraint to "the immature constraint" because we were now seeing that the default partition OID was not invalid. So it was rather a bug in ValidatePartitionConstraints, in that it was accepting to set the expression to validate when another expression had already been set! I added an assert to protect against this. And then changed the decision of whether or not to run this block based on the attached partition being the default one or not; because as the "if" test was, it was just a recipe for confusion. (Now, if you look carefully you realize that the new test for bound->is_default I added is kinda redundant: it can only be set if there was a default partition OID at start of the function, and therefore we know we're not adding a default partition. And for the case where we *are* adding the default partition, it is still Invalid because we don't re-read its own OID. But relying on that seems brittle because it breaks if we happen to update the default partition OID in the middle of that function, which is what we were doing. Hence the new is_default test.) I looked at the implementation of ValidatePartitionConstraints and didn't like it, so I rewrote it also. This comment is mistaken, too: - /* -* Skip if the partition is itself a partitioned table. We can only -* ever scan RELKIND_RELATION relations. -*/ ... because it ignores the possibility of a partition being a foreign table. The code seems to work, but I think there is no test to cover the case that a foreign table containing data that doesn't satisfy the constraint is attached, because when I set that case to return doing nothing (ie. ATGetQueueEntry is not called for a foreign partition), no test failed. Generally speaking, I didn't like ATExecAttachPartition; it's doing too much that should have been done in ATPrepAttachPartition. The only reason that's not broken is because we don't allow ATTACH PARTITION to appear together with other commands in alter table, which seems disappointing ... for example, when attaching multiple tables and a default partition exist, you have to scan the default one multiple times, which is lame. (Speaking of lame: I noticed that RelationGetPartitionQual obtains lock on the parent relation ... I wonder if this can be used to cause a deadlock during InitResultRelationInfo.) 
BTW, I think this is already broken for the case where the default partition is partitioned and you attach a partition colliding with a row that's being concurrently inserted in a partition of the default partition, though I didn't bother to write a test for that. Anyway, I'm just an onlooker fixing a CommandCounterIncrement change. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index c8da82217d..357b1078f8 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -483,10 +483,9 @@ static void RemoveInheritance(Relation child_rel, Relation parent_rel); static ObjectAddress ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd); static void AttachPartitionEnsureIndexes(Relation rel, Relation attachrel); -static void ValidatePartitionConstraints(List **wqueue, Relation scanrel, -List *scanrel_children, -List *partConstraint, -bool validate_default); +static void QueuePartitionConstraintValidation(List **wqueue, Relation scanrel, + List *partConstraint, + bool validate_default); static vo
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote: > Hi, > > On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote: > > On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote: > > > Craig Ringer writes: > > > > So we should just use the big hammer here. > > > > > > And bitch, loudly and publicly, about how broken this kernel behavior is. > > > If we make enough of a stink maybe it'll get fixed. > > > > It is not likely to be fixed (beyond what has been done already with the > > manpage patches and errseq_t fixes on the reporting level). The issue is, > > the kernel needs to deal with hard IO errors at that level somehow, and > > since those errors typically persist, re-dirtying the pages would not > > really solve the problem (unless some filesystem remaps the request to a > > different block, assuming the device is alive). > > Throwing away the dirty pages *and* persisting the error seems a lot > more reasonable. Then provide a fcntl (or whatever) extension that can > clear the error status in the few cases that the application that wants > to gracefully deal with the case. Given precisely that the dirty pages which cannot been written-out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime). The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel. Best regards, Anthony
Re: Feature Request - DDL deployment with logical replication
Hi, On 2018-04-02 23:07:17 +0800, Craig Ringer wrote: > We then lack any mechanism by which you can NACK, saying "I can't apply > this". Sure, but nothing forces this mechanism to be in-band. > So upstream will wait indefinitely. I guess we just expect the user to > intervene and ROLLBACK if they decide a replica isn't going to get the job > done, or have checked the replica's logs and found it can't apply it for > some hopefully-sane reason. > > It's not like we'd auto-ROLLBACK PREPARED in response to a nack from a > downstream anyway, so all we're missing is probably info in the upstream > logs about which replica(s) cannot apply it and why. > > OK. So it'd be a nice-to-have, but not vital. I'm not sure that an in-band mechanism that's the same for all potential users is flexible enough (actually unsure, not intimating it's wrong). It doesn't seem crazy to do these checks over a separate connection. That'd allow more flexible error handling etc. Greetings, Andres Freund
Re: [PATCH] Logical decoding of TRUNCATE
Hi, On 2018-04-02 07:39:57 +0100, Simon Riggs wrote: > On 1 April 2018 at 21:01, Andres Freund wrote: > > >> *** > >> *** 111,116 CREATE PUBLICATION >> class="parameter">name > >> --- 111,121 > >> and so the default value for this option is > >> 'insert, update, delete'. > >> > >> + > >> +TRUNCATE is treated as a form of > >> +DELETE for the purpose of deciding whether > >> +to publish, or not. > >> + > >> > >> > >> > > > > Why is this a good idea? > > TRUNCATE removes rows, just as DELETE does, so anybody that wants to > publish the removal of rows will be interested in this. I'm not convinced. I think it's perfectly reasonable to want to truncate away data on the live node, but maintain it on an archival node. In that case one commonly wants to continue to maintain OLTP changes (hence DELETE), but not the bulk truncation. I think there's a reasonable counter-argument in that this isn't fine grained enough. > >> + * Write a WAL record to allow this set of actions to be logically > >> decoded. > >> + * We could optimize this away when !RelationIsLogicallyLogged(rel) > >> + * but that doesn't save much space or time. > > > > What you're saying isn't that you're not logging anything, but that > > you're allocating the header regardless? Because this certainly sounds > > like you unconditionally log a WAL record. > > It says that, yes, my idea - as explained. My complaint is that the comment and the actual implementation differ. The above sounds like it's unconditionally WAL logging, but: + /* +* Write a WAL record to allow this set of actions to be logically decoded. +* We could optimize this away when !RelationIsLogicallyLogged(rel) +* but that doesn't save much space or time. +* +* Assemble an array of relids, then an array of seqrelids so we can write +* a single WAL record for the whole action. +*/ + logrelids = palloc(maxrelids * sizeof(Oid)); + foreach (cell, relids_logged) + { + nrelids++; + if (nrelids > maxrelids) + { + maxrelids *= 2; + logrelids = repalloc(logrelids, maxrelids * sizeof(Oid)); + } + logrelids[nrelids - 1] = lfirst_oid(cell); + } + + foreach (cell, seq_relids_logged) + { + nseqrelids++; + if ((nrelids + nseqrelids) > maxrelids) + { + maxrelids *= 2; + logrelids = repalloc(logrelids, maxrelids * sizeof(Oid)); + } + logrelids[nrelids + nseqrelids - 1] = lfirst_oid(cell); + } + + if ((nrelids + nseqrelids) > 0) + { + xl_heap_truncate xlrec; + + xlrec.dbId = MyDatabaseId; + xlrec.nrelids = nrelids; + xlrec.nseqrelids = nseqrelids; + xlrec.flags = 0; + if (behavior == DROP_CASCADE) + xlrec.flags |= XLH_TRUNCATE_CASCADE; + if (restart_seqs) + xlrec.flags |= XLH_TRUNCATE_RESTART_SEQS; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, SizeOfHeapTruncate); + XLogRegisterData((char *) logrelids, + (nrelids + nseqrelids) * sizeof(Oid)); + + XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN); + + (void) XLogInsert(RM_HEAP_ID, XLOG_HEAP_TRUNCATE); + } Note that the XLogInsert() is in an if branch that's only executed if there's either logged relids or sequence relids. I think logrelids should be allocated at the max size immediately, and the comment rewritten to explicitly only talk about the allocation, rather than sounding like it's about WAL logging as well. Greetings, Andres Freund
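To illustrate the suggested allocation change (a sketch only, reusing the relids_logged and seq_relids_logged lists from the quoted patch), the array can simply be sized to its known upper bound once instead of being grown with repalloc inside the loops:

	/* allocate at the final size up front; both list lengths are known */
	int			nlogged = list_length(relids_logged) +
		list_length(seq_relids_logged);
	Oid		   *logrelids = palloc(nlogged * sizeof(Oid));
	int			i = 0;
	ListCell   *cell;

	foreach(cell, relids_logged)
		logrelids[i++] = lfirst_oid(cell);
	foreach(cell, seq_relids_logged)
		logrelids[i++] = lfirst_oid(cell);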
Re: [PATCH] Logical decoding of TRUNCATE
Hi, On 2018-04-02 14:30:50 -0400, Peter Eisentraut wrote: > On 4/1/18 16:01, Andres Freund wrote: > >> + if (SessionReplicationRole != SESSION_REPLICATION_ROLE_REPLICA) > >> + { > >> + > >> + /* > >> + * Check foreign key references. In CASCADE mode, this should > >> be > >> + * unnecessary since we just pulled in all the references; but > >> as a > >> + * cross-check, do it anyway if in an Assert-enabled build. > >> + */ > >> #ifdef USE_ASSERT_CHECKING > >>heap_truncate_check_FKs(rels, false); > >> + #else > >> + if (stmt->behavior == DROP_RESTRICT) > >> + heap_truncate_check_FKs(rels, false); > >> #endif > >> + } > > > > That *can't* be right. > > You mean the part that ignores this in session_replication_role = > replica? I don't like that either. And it's also not clear why it's > needed for this feature. I primarily object to the way the code is written. For me code differing like this between USE_ASSERT_CHECKING and not seems like a recipe for disaster. And yea, I'm not sure why the session_replication_role bit is here either. > > I know this has been discussed in the thread already, but it really > > strikes me as wrong to basically do some mini DDL replication feature > > via per-command WAL records. > > The problem is that the interaction of TRUNCATE and foreign keys is a mess. > > Let's say I have a provider database with table1, table2, and table3, > all connected by foreign keys, and a replica database with table1, > table2, and table4, all connected by foreign keys. I run TRUNCATE > table1 CASCADE on the provider. What should replication do? > > The proposed patch applies the TRUNCATE table1 CASCADE on the replica, > which ends up truncating table1, table2, and table4, which is different > from what happened on the origin. I actually think that's a fairly sane behaviour because it allows you to have different schemas on both sides and still deal in a reasonably sane manner. What I'm concerned about is more that we're developing a one-off DDL replication feature w/ corresponding bespoke WAL-logging. > An alternative might be to make the replication protocol message say "I > truncated table1, table2" (let's say table3 is not replicated). (This > sounds more like logical replication rather than DDL replication.) That > will then fail to apply on the replica, because it requires that you > include all tables connected by foreign keys. And it entirely fails if there are schema differences. > We could then do what we do in the DELETE case and just ignore foreign > keys altogether, which is obviously bad and not a forward-looking behavior. > > Or we could do what we *should* be doing in the DELETE case and apply > cascading actions on the replica to table4, but that kind of row-by-row > processing is what we want to avoid by using TRUNCATE in the first place. Well, you could also queue a re-check at the end of the processed message, doing a bulk re-check at the end. But that's obviously more work. > >> + > >> +TRUNCATE is treated as a form of > >> +DELETE for the purpose of deciding whether > >> +to publish, or not. > >> + > >> > >> > >> > > > > Why is this a good idea? > > I think it seemed like a good idea at the time, so to speak, but several > people have argued against it, so we should probably change this in the > final version. I think it's e.g. perfectly reasonable to want OLTP changes to be replicated, but not to truncate historical data. So for me those actions should be separate... Greetings, Andres Freund
Re: [PATCH] Logical decoding of TRUNCATE
On 2 April 2018 at 19:30, Peter Eisentraut wrote: >>> *** >>> *** 111,116 CREATE PUBLICATION >> class="parameter">name >>> --- 111,121 >>> and so the default value for this option is >>> 'insert, update, delete'. >>> >>> + >>> +TRUNCATE is treated as a form of >>> +DELETE for the purpose of deciding whether >>> +to publish, or not. >>> + >>> >>> >>> >> >> Why is this a good idea? > > I think it seemed like a good idea at the time, so to speak, but several > people have argued against it, so we should probably change this in the > final version. Who has argued against it? I see that Andres has asked twice about it and been answered twice, but expressed no opinion. Petr has said he thinks that's right. Nobody else has commented. -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: disable SSL compression?
Hi, On 2018-04-02 12:46:25 -0400, Tom Lane wrote: > It seemed like the attack you described wasn't all that dependent on > whether the data is compressed or not: if you can see the size of the > server's reply to "select ... where account_number = x", you can pretty > well tell the difference between 0 and 1 rows, with or without > compression. So I'm still not very clear on what the threat model is. Imagine that the attacker has control over *parts* of a query. The server always sends a query like SELECT mysecret FROM secrets WHERE ...; but somehow, and there are definitely sneaky ways, a second query is sent alongside: SELECT '$user-defined-string'; If those are compressed together *and* the attacker can observe the size of the returned result, the attacker can change $user-defined-string iteratively and infer knowledge about the contents of mysecret. If $user-defined-string and mysecret are the same, you're going to get a smaller total response packet. One such injection vector that's commonly changeable can just be a username or such. Greetings, Andres Freund
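To make the side channel concrete, here is a small standalone C sketch (not from this thread; the secret and the guesses are invented) showing that when attacker-controlled text is compressed in the same stream as a secret, the compressed size alone reveals whether a guess matches. Build with "cc crime_demo.c -lz".

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    static unsigned long
    compressed_size(const char *s)
    {
        Bytef       out[1024];
        uLongf      outlen = sizeof(out);

        if (compress2(out, &outlen, (const Bytef *) s, strlen(s),
                      Z_BEST_COMPRESSION) != Z_OK)
            return 0;
        return (unsigned long) outlen;
    }

    int
    main(void)
    {
        /* stands in for the query containing mysecret */
        const char *secret = "SELECT mysecret FROM secrets WHERE k = 'hunter2';";
        /* attacker-controlled strings sent alongside */
        const char *guesses[] = {"SELECT 'zzzzzzz';", "SELECT 'hunter2';"};
        char        buf[512];

        for (int i = 0; i < 2; i++)
        {
            /* both strings end up in the same compressed stream */
            snprintf(buf, sizeof(buf), "%s%s", secret, guesses[i]);
            printf("%s -> %lu bytes compressed\n",
                   guesses[i], compressed_size(buf));
        }
        /* the guess matching the secret typically compresses a few bytes smaller */
        return 0;
    }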
Re: [PATCH] Logical decoding of TRUNCATE
On 4/1/18 16:01, Andres Freund wrote: > On 2018-01-25 14:21:15 +0100, Marco Nenciarini wrote: >> +if (SessionReplicationRole != SESSION_REPLICATION_ROLE_REPLICA) >> +{ >> + >> +/* >> + * Check foreign key references. In CASCADE mode, this should >> be >> + * unnecessary since we just pulled in all the references; but >> as a >> + * cross-check, do it anyway if in an Assert-enabled build. >> + */ >> #ifdef USE_ASSERT_CHECKING >> heap_truncate_check_FKs(rels, false); >> + #else >> +if (stmt->behavior == DROP_RESTRICT) >> +heap_truncate_check_FKs(rels, false); >> #endif >> +} > > That *can't* be right. You mean the part that ignores this in session_replication_role = replica? I don't like that either. And it's also not clear why it's needed for this feature. >> +case REORDER_BUFFER_CHANGE_TRUNCATE: >> +appendStringInfoString(ctx->out, " TRUNCATE:"); >> + >> +if (change->data.truncate_msg.restart_seqs >> +|| change->data.truncate_msg.cascade) >> +{ >> +if (change->data.truncate_msg.restart_seqs) >> +appendStringInfo(ctx->out, " >> restart_seqs"); >> +if (change->data.truncate_msg.cascade) >> +appendStringInfo(ctx->out, " cascade"); >> +} >> +else >> +appendStringInfoString(ctx->out, " (no-flags)"); >> +break; >> default: >> Assert(false); >> } > > I know this has been discussed in the thread already, but it really > strikes me as wrong to basically do some mini DDL replication feature > via per-command WAL records. The problem is that the interaction of TRUNCATE and foreign keys is a mess. Let's say I have a provider database with table1, table2, and table3, all connected by foreign keys, and a replica database with table1, table2, and table4, all connected by foreign keys. I run TRUNCATE table1 CASCADE on the provider. What should replication do? The proposed patch applies the TRUNCATE table1 CASCADE on the replica, which ends up truncating table1, table2, and table4, which is different from what happened on the origin. An alternative might be to make the replication protocol message say "I truncated table1, table2" (let's say table3 is not replicated). (This sounds more like logical replication rather than DDL replication.) That will then fail to apply on the replica, because it requires that you include all tables connected by foreign keys. We could then do what we do in the DELETE case and just ignore foreign keys altogether, which is obviously bad and not a forward-looking behavior. Or we could do what we *should* be doing in the DELETE case and apply cascading actions on the replica to table4, but that kind of row-by-row processing is what we want to avoid by using TRUNCATE in the first place. None of these solutions are good. I don't have any other ideas, though. >> *** >> *** 111,116 CREATE PUBLICATION > class="parameter">name >> --- 111,121 >> and so the default value for this option is >> 'insert, update, delete'. >> >> + >> +TRUNCATE is treated as a form of >> +DELETE for the purpose of deciding whether >> +to publish, or not. >> + >> >> >> > > Why is this a good idea? I think it seemed like a good idea at the time, so to speak, but several people have argued against it, so we should probably change this in the final version. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: disable SSL compression?
Hi, On 2018-04-02 10:25:04 -0400, Robert Haas wrote: > In general, I'd expect compressing data to be beneficial for the > security of encryption because it should increase the entropy of the > encrypted bytes, but obviously it's not hard to hypothesize cases > where the opposite is true for one reason or another. I don't think it's actually ever a really positive thing for security to compress before encrypting, and compressing after encrypting should always be useless. The problem is that it opens one hell of a side-channel attack, because you're suddenly leaking information about the compressibility of the transferred data. If there's any way attackers have knowledge of, or worse influence over, any of the transported data, that allows them to make inferences about the content and potentially the key. Whereas there should never be an observable difference in the encrypted stream if you use a sane cipher mode (i.e. NOT ECB). Greetings, Andres Freund
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Hi, On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote: > On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote: > > Craig Ringer writes: > > > So we should just use the big hammer here. > > > > And bitch, loudly and publicly, about how broken this kernel behavior is. > > If we make enough of a stink maybe it'll get fixed. > > It is not likely to be fixed (beyond what has been done already with the > manpage patches and errseq_t fixes on the reporting level). The issue is, > the kernel needs to deal with hard IO errors at that level somehow, and > since those errors typically persist, re-dirtying the pages would not > really solve the problem (unless some filesystem remaps the request to a > different block, assuming the device is alive). Throwing away the dirty pages *and* persisting the error seems a lot more reasonable. Then provide an fcntl (or whatever) extension that can clear the error status, for the few cases where an application wants to gracefully deal with the case. > Keeping around dirty > pages that cannot possibly be written out is essentially a memory leak, > as those pages would stay around even after the application has exited. Why do dirty pages need to be kept around in the case of persistent errors? I don't think the lack of automatic recovery in that case is what anybody is complaining about. It's that the error goes away and there's no reasonable way to separate out such an error from potential transient errors. Greetings, Andres Freund
Re: [HACKERS] logical decoding of two-phase transactions
On 2018-04-02 09:23:10 +0100, Simon Riggs wrote: > On 29 March 2018 at 23:24, Andres Freund wrote: > > >> I agree with the former, of course - docs are a must. I disagree with > >> the latter, though - there have been about no proposals how to do it > >> without the locking. If there are, I'd like to hear about it. > > > > I don't care. Either another solution needs to be found, or the locking > > needs to be automatically performed when necessary. > > That seems unreasonable. > It's certainly a nice future goal to have it all happen automatically, > but we don't know what the plugin will do. No, fighting too complicated APIs is not unreasonable. And we've found an alternative. > How can we ever make an unknown task happen automatically? We can't. The task isn't unknown, so this just seems like a non sequitur. Greetings, Andres Freund
Re: disable SSL compression?
On Mon, Apr 02, 2018 at 12:46:25PM -0400, Tom Lane wrote: > Peter Eisentraut writes: > > I agree the attack is less likely to be applicable in typical database > > installations. I think we should move forward with considering protocol > > compression proposals, but any final result should put a warning in the > > documentation that using compression is potentially insecure. > > It seemed like the attack you described wasn't all that dependent on > whether the data is compressed or not: I think it is. I wrote something longer about this in 2013. https://www.postgresql.org/message-id/20130115172253.ga6...@isc.upenn.edu The key idea is that you can now do a radix search of a secret rather than a single 1-bit match/doesn't-match oracle type attack, which is not feasible when you have a high-entropy, low-density value like a big ASCII number or hex string. I do something like set my preferred email (or anything an application lets a user change) to something like a3bf and keep extending the guess as I go; since it sometimes gets compressed along with the secret, rather than having to guess a whole application session id all at once with a work factor of O(2^128) or something, I've got a very tractable search. It might not be a problem for everyone, but in some situations it's attackable. Garick
Re: disable SSL compression?
On 4/2/18 12:46, Tom Lane wrote: > Peter Eisentraut writes: >> I agree the attack is less likely to be applicable in typical database >> installations. I think we should move forward with considering protocol >> compression proposals, but any final result should put a warning in the >> documentation that using compression is potentially insecure. > > It seemed like the attack you described wasn't all that dependent on > whether the data is compressed or not: if you can see the size of the > server's reply to "select ... where account_number = x", you can pretty > well tell the difference between 0 and 1 rows, with or without > compression. So I'm still not very clear on what the threat model is. Well these could also be update commands or procedure calls with a constant response size. Also, it doesn't matter whether the select returns anything. Maybe it's not querying the main accounts table. But it already shows that the client thinks that the account number is a real one. There are probably even better examples. But the main point is that if an attacker can control part of what you send alongside some secret thing, compression is going to be a security concern for some. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: tab complete for procedures for \sf and \ef commands
On 4/1/18 04:14, Pavel Stehule wrote: > Hi > > small bugfix patch committed -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: disable SSL compression?
Peter Eisentraut writes: > I agree the attack is less likely to be applicable in typical database > installations. I think we should move forward with considering protocol > compression proposals, but any final result should put a warning in the > documentation that using compression is potentially insecure. It seemed like the attack you described wasn't all that dependent on whether the data is compressed or not: if you can see the size of the server's reply to "select ... where account_number = x", you can pretty well tell the difference between 0 and 1 rows, with or without compression. So I'm still not very clear on what the threat model is. regards, tom lane
Re: new function for tsquery creartion
I've fixed a bug and added some tests and documentation. -- Dmitry Ivanov Postgres Professional: http://www.postgrespro.com The Russian Postgres Companydiff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 5abb1c46fb..c3b7be6e4e 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -9609,6 +9609,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple phraseto_tsquery('english', 'The Fat Rats') 'fat' <-> 'rat' + + + + websearch_to_tsquery + + websearch_to_tsquery( config regconfig , query text) + +tsquery +produce tsquery from a web search style query +websearch_to_tsquery('english', '"fat rat" or rat') +'fat' <-> 'rat' | 'rat' + diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml index 610b7bf033..19f58511c8 100644 --- a/doc/src/sgml/textsearch.sgml +++ b/doc/src/sgml/textsearch.sgml @@ -797,13 +797,16 @@ UPDATE tt SET ti = PostgreSQL provides the functions to_tsquery, -plainto_tsquery, and -phraseto_tsquery +plainto_tsquery, +phraseto_tsquery and +websearch_to_tsquery for converting a query to the tsquery data type. to_tsquery offers access to more features than either plainto_tsquery or -phraseto_tsquery, but it is less forgiving -about its input. +phraseto_tsquery, but it is less forgiving about its +input. websearch_to_tsquery is a simplified version +of to_tsquery with an alternative syntax, similar +to the one used by web search engines. @@ -962,6 +965,87 @@ SELECT phraseto_tsquery('english', 'The Fat & Rats:C'); + +websearch_to_tsquery( config regconfig, querytext text) returns tsquery + + + +websearch_to_tsquery creates a tsquery +value from querytext using an alternative +syntax in which simple unformatted text is a valid query. +Unlike plainto_tsquery +and phraseto_tsquery, it also recognizes certain +operators. Moreover, this function should never raise syntax errors, +which makes it possible to use raw user-supplied input for search. +The following syntax is supported: + + + +unquoted text: text not inside quote marks will be +converted to terms separated by & operators, as +if processed by +plainto_tsquery. + + + + +"quoted text": text inside quote marks will be +converted to terms separated by <-> +operators, as if processed by phraseto_tsquery. + + + + + OR: logical or will be converted to + the | operator. + + + + + -: the logical not operator, converted to the + the ! operator. 
+ + + + + +Examples: + + select websearch_to_tsquery('english', 'The fat rats'); + websearch_to_tsquery + - + 'fat' & 'rat' + (1 row) + + + select websearch_to_tsquery('english', '"supernovae stars" -crab'); + websearch_to_tsquery + -- + 'supernova' <-> 'star' & !'crab' + (1 row) + + + select websearch_to_tsquery('english', '"sad cat" or "fat rat"'); + websearch_to_tsquery + --- + 'sad' <-> 'cat' | 'fat' <-> 'rat' + (1 row) + + + select websearch_to_tsquery('english', 'signal -"segmentation fault"'); + websearch_to_tsquery + --- + 'signal' & !( 'segment' <-> 'fault' ) + (1 row) + + + select websearch_to_tsquery('english', '""" )( dummy \\ query <->'); + websearch_to_tsquery + -- + 'dummi' & 'queri' + (1 row) + + diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c index ea5947a3a8..6055fb6b4e 100644 --- a/src/backend/tsearch/to_tsany.c +++ b/src/backend/tsearch/to_tsany.c @@ -490,7 +490,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS) query = parse_tsquery(text_to_cstring(in), pushval_morph, PointerGetDatum(&data), - false); + 0); PG_RETURN_TSQUERY(query); } @@ -520,7 +520,7 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS) query = parse_tsquery(text_to_cstring(in), pushval_morph, PointerGetDatum(&data), - true); + P_TSQ_PLAIN); PG_RETURN_POINTER(query); } @@ -551,7 +551,7 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS) query = parse_tsquery(text_to_cstring(in), pushval_morph, PointerGetDatum(&data), - true); + P_TSQ_PLAIN); PG_RETURN_TSQUERY(query); } @@ -567,3 +567,35 @@ phraseto_tsquery(PG_FUNCTION_ARGS) ObjectIdGetDatum(cfgId), PointerGetDatum(in))); } + +Datum +websearch_to_tsquery_byid(P
Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.
> /* > - * Check whether default partition has a row that would fit the > partition > - * being attached. > + * Check if the default partition contains a row that would belong in > the > + * partition being attached. >*/ > - defaultPartOid = > - get_default_oid_from_partdesc(RelationGetPartitionDesc(rel)); > if (OidIsValid(defaultPartOid)) Oh my. This code is terrible, and I think this patch is wrong. More later. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: disable SSL compression?
On 4/2/18 11:05, Robert Haas wrote: > Ah. Yeah, that makes sense. Although to use PG as a vector, it seems > like the attacker would need to the ability to snoop network traffic > between the application server and PG. If both of those are > presumably inside the bank's network and yet an attacker can sniff > them, to some degree you've already lost. Now it could be that a > rogue bank employee is trying to gain access from within the bank, or > maybe your bank exposes application-to-database traffic on the public > Internet. But in general that seems like traffic that should be > well-secured anyway for lots of reasons, as opposed to the case where > one part of your browser is trying to hide information from another > part of your browser, which is a lot harder to isolate thoroughly. I agree the attack is less likely to be applicable in typical database installations. I think we should move forward with considering protocol compression proposals, but any final result should put a warning in the documentation that using compression is potentially insecure. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: check_ssl_key_file_permissions should be in be-secure-common.c
On 4/2/18 02:51, Michael Paquier wrote: > It seems to me that this routine should be logically put into > be-secure-common.c so as future SSL implementations can use it. This > makes the code more consistent with the frontend refactoring that has > happened in f75a959. I would not have bothered about this refactoring > if be-secure-openssl.c did not exist yet, but as it does I think that we > should bite the bullet, and do that for v11 so as a good base is in > place for the future. committed -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key
On Wed, Mar 28, 2018 at 2:12 PM, Andres Freund wrote: > How will it break it? They'll see an invalid ctid and conclude that the > tuple is dead? Without any changes that's already something that can > happen if a later tuple in the chain has been pruned away. Sure, that > code won't realize it should error out because the tuple is now in a > different partition, but neither would a infomask bit. > > I think my big problem is that I just don't see what the worst that can > happen is. We'd potentially see a broken ctid chain, something that very > commonly happens, and consider the tuple to be invisible. That seems > pretty sane behaviour for unadapted code, and not any worse than other > potential solutions. This is more or less my feeling as well. I think it's better to conserve our limited supply of infomask bits as much as we can, and I do think that we should try to reclaimed HEAP_MOVED_IN and HEAP_MOVED_OFF in the future instead of defining the combination of the two of them to mean something now. The only scenario in which I can see this patch really leading to disaster is if there's some previous release out there where the bit pattern chosen by this patch has some other, incompatible meaning. As far as we know, that's not the case: this bit pattern was previously unused. Code seeing that bit pattern could potentially therefore (1) barf on the valid CTID, but the whole point of this is to throw an ERROR anyway, so if that happens then we're getting basically the right behavior with the wrong error message or (2) just treat it as a broken CTID link, in which case the result should be pretty much the same as if this patch hadn't been committed in the first place. Where the multixact patch really caused us a lot of trouble is that the implications weren't just for the heap itself -- the relevant SLRUs became subject to new retention requirements which in turn affected vacuum, autovacuum, and checkpoint behavior. There is no similar problem here -- the flag indicating the problematic situation, however it ends up being stored, doesn't point to any external data. Now, that doesn't mean there can't be some other kind of problem with this patch, but I don't think that we should block the patch on the theory that it might have an undiscovered problem that destroys the entire PostgreSQL ecosystem with no theory as to what that problem might actually be. Modulo implementation quality, I think the risk level of this patch is somewhat but not vastly higher than 37484ad2aacef5ec794f4dd3d5cf814475180a78, which similarly defined a previously-unused bit pattern in the tuple header. The reason I think this one might be somewhat riskier is because AFAICS it's not so easy to make sure we've found all the code, even in core, that might care, as it was in that case; and also because updates happen more than freezing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Feature Request - DDL deployment with logical replication
On 1 April 2018 at 00:57, Andres Freund wrote: > On 2018-03-31 22:13:42 +0800, Craig Ringer wrote: > > We'll still need a mechanism to transport them to downstreams (like WAL > > messages) and to send responses upstream. For responses I think we will > > finally want to add a backchannel to the logical replication protocol as > > I've wanted for a long while: downstream can send a COPY message on COPY > > BOTH proto back to upstream, which passes it to a callback on the output > > plugin for the output plugin to act on. > > Not necessarily? You can just send out the prepare, wait for all > clients to ack it, and then commit/rollback prepared. > We then lack any mechanism by which you can NACK, saying "I can't apply this". So upstream will wait indefinitely. I guess we just expect the user to intervene and ROLLBACK if they decide a replica isn't going to get the job done, or have checked the replica's logs and found it can't apply it for some hopefully-sane reason. It's not like we'd auto-ROLLBACK PREPARED in response to a nack from a downstream anyway, so all we're missing is probably info in the upstream logs about which replica(s) cannot apply it and why. OK. So it'd be a nice-to-have, but not vital. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: disable SSL compression?
On Mon, Apr 2, 2018 at 10:52 AM, Peter Eisentraut wrote: > On 4/2/18 10:25, Robert Haas wrote: >> As I understand it on a brief review of the Google search >> results^W^W^Wliterature, the reason that was done was to prevent >> things like the CRIME attack, which apparently involves Javascript >> running in your browser from deducing information that it shouldn't be >> able to get, like the Cookies that are sent along with the requests it >> initiates. No similar attack should be possible against PostgreSQL >> because there's no similar kind of privilege separation. Your PG >> driver doesn't have untrusted Javascript running inside of it, we >> hope. > > I think the attack is much more general involving two server end points, > one of which is under the control of the attacker and provides > information that the client is using to query the second server, > independent of the protocols. > > For example, if your application code does maybe a geoip lookup and then > does > > select * from sometable > where a = $geo_data and b = 'secret bank account number'; > > then the operator of the geoip service (or someone impersonating it, of > course) can just rotate the lookup results through the bank account > number space until they notice that the compression result changes, in > which case they have matched the bank account number. > > In the web space, that is easier because the client code is typically > viewable by the attacker, and this kind of query is more common (the > "bank account number" is typically a session key), but the principle is > the same. Ah. Yeah, that makes sense. Although to use PG as a vector, it seems like the attacker would need to the ability to snoop network traffic between the application server and PG. If both of those are presumably inside the bank's network and yet an attacker can sniff them, to some degree you've already lost. Now it could be that a rogue bank employee is trying to gain access from within the bank, or maybe your bank exposes application-to-database traffic on the public Internet. But in general that seems like traffic that should be well-secured anyway for lots of reasons, as opposed to the case where one part of your browser is trying to hide information from another part of your browser, which is a lot harder to isolate thoroughly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Change RangeVarGetRelidExtended() to take flags argument?
On 3/30/18, 7:09 PM, "Andres Freund" wrote: > Pushed. Thanks! Nathan
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 2 April 2018 at 02:24, Thomas Munro wrote: > > Maybe my drive-by assessment of those kernel routines is wrong and > someone will correct me, but I'm starting to think you might be better > to assume the worst on all systems. Perhaps a GUC that defaults to > panicking, so that users on those rare OSes could turn that off? Even > then I'm not sure if the failure mode will be that great anyway or if > it's worth having two behaviours. Thoughts? > > I see little benefit to not just PANICing unconditionally on EIO, really. It shouldn't happen, and if it does, we want to be pretty conservative and adopt a data-protective approach. I'm rather more worried by doing it on ENOSPC. Which looks like it might be necessary from what I recall finding in my test case + kernel code reading. I really don't want to respond to a possibly-transient ENOSPC by PANICing the whole server unnecessarily. BTW, the support team at 2ndQ is presently working on two separate issues where ENOSPC resulted in DB corruption, though neither of them involve logs of lost page writes. I'm planning on taking some time tomorrow to write a torture tester for Pg's ENOSPC handling and to verify ENOSPC handling in the test case I linked to in my original StackOverflow post. If this is just an EIO issue then I see no point doing anything other than PANICing unconditionally. If it's a concern for ENOSPC too, we should try harder to fail more nicely whenever we possibly can. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
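A minimal sketch of that conservative policy (illustrative only; the file name is made up and this is not PostgreSQL code): any fsync() failure is treated as fatal rather than retried, because on the kernels discussed here a later fsync() can report success even though the dirty pages were already dropped.

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        int         fd = open("some-datafile", O_WRONLY | O_CREAT, 0600);

        if (fd < 0 || write(fd, "payload", 7) != 7)
        {
            perror("open/write");
            return EXIT_FAILURE;
        }

        if (fsync(fd) != 0)
        {
            /*
             * EIO, ENOSPC, ...: the state of write-back is now unknown, and a
             * retried fsync() may report success without the data actually
             * being durable.  The only safe reaction is to abort ("PANIC")
             * and recover from the WAL.
             */
            fprintf(stderr, "PANIC: fsync failed: %s\n", strerror(errno));
            abort();
        }

        close(fd);
        return EXIT_SUCCESS;
    }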
Re: [HACKERS] Partition-wise aggregation/grouping
On Thu, Mar 29, 2018 at 9:02 AM, Jeevan Chalke wrote: > Yep, I see the risk. Committed 0001 last week and 0002 just now. I don't really see 0003 as a critical need. If somebody demonstrates that this saves a meaningful amount of planning time, we can consider that part for a future release. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: disable SSL compression?
On 4/2/18 10:25, Robert Haas wrote: > As I understand it on a brief review of the Google search > results^W^W^Wliterature, the reason that was done was to prevent > things like the CRIME attack, which apparently involves Javascript > running in your browser from deducing information that it shouldn't be > able to get, like the Cookies that are sent along with the requests it > initiates. No similar attack should be possible against PostgreSQL > because there's no similar kind of privilege separation. Your PG > driver doesn't have untrusted Javascript running inside of it, we > hope. I think the attack is much more general involving two server end points, one of which is under the control of the attacker and provides information that the client is using to query the second server, independent of the protocols. For example, if your application code does maybe a geoip lookup and then does select * from sometable where a = $geo_data and b = 'secret bank account number'; then the operator of the geoip service (or someone impersonating it, of course) can just rotate the lookup results through the bank account number space until they notice that the compression result changes, in which case they have matched the bank account number. In the web space, that is easier because the client code is typically viewable by the attacker, and this kind of query is more common (the "bank account number" is typically a session key), but the principle is the same. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Rewriting the test of pg_upgrade as a TAP test - take two
On 3/21/18 21:49, Michael Paquier wrote: > On Thu, Mar 22, 2018 at 09:42:35AM +0800, Craig Ringer wrote: >> I'm super excited by the idea of multi-version support in TAP, if that's >> what you mean. >> >> Why? Because I use TAP heavily in extensions. Especially replication >> extensions. Which like to talk across multiple versions. I currently need >> external test frameworks and some hideous hacks to do this. > > Okay, in front of such enthusiasm we could keep at least the refactoring > part of PostgresNode.pm :) I took a quick look at that part. It appears to be quite invasive, more than I would have hoped. Basically, it imposes that from now on all program invocations must observe the bindir setting, which someone is surely going to forget. The pg_upgrade tests aren't going to exercise all possible paths where programs are called, so this is going to lead to omissions and inconsistencies -- which will then possibly only be found much later by the extensions that Craig was talking about. I'd like to see this more isolated, maybe via a path change, or something inside system_or_bail or something less invasive and more maintainable. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: disable SSL compression?
On Wed, Mar 28, 2018 at 7:16 PM, Andres Freund wrote: > +analysis of whether that's safe to do from a cryptographic POV. There's a > reason compression has been disabled for just about all SSL/TLS libraries. As I understand it on a brief review of the Google search results^W^W^Wliterature, the reason that was done was to prevent things like the CRIME attack, which apparently involves Javascript running in your browser from deducing information that it shouldn't be able to get, like the Cookies that are sent along with the requests it initiates. No similar attack should be possible against PostgreSQL because there's no similar kind of privilege separation. Your PG driver doesn't have untrusted Javascript running inside of it, we hope. In general, I'd expect compressing data to be beneficial for the security of encryption because it should increase the entropy of the encrypted bytes, but obviously it's not hard to hypothesize cases where the opposite is true for one reason or another. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Foreign keys and partitioned tables
Comments on the code: @@ -7226,12 +7235,23 @@ ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel, * numbers) */ if (pkrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) + { + /* fix recursion in ATExecValidateConstraint to enable this case */ + if (fkconstraint->skip_validation && !fkconstraint->initially_valid) + ereport(ERROR, + (errcode(ERRCODE_WRONG_OBJECT_TYPE), +errmsg("cannot add NOT VALID foreign key to relation \"%s\"", + RelationGetRelationName(pkrel; + } Maybe this error message should be more along the lines of "is not supported yet". Also, this restriction should perhaps be mentioned in the documentation additions that we have been discussing. The first few hunks in ri_triggers.c (the ones that refer to the pktable) are apparently for the postponed part of the feature, foreign keys referencing partitioned tables. So I think those hunks should be dropped from this patch. The removal of the ONLY for the foreign key queries also affects inherited tables, in undesirable ways. For example, consider this script: create table test1 (a int primary key); insert into test1 values (1), (2), (3); create table test2 (x int primary key, y int references test1 on delete cascade); insert into test2 values (11, 1), (22, 2), (33, 3); create table test2a () inherits (test2); insert into test2a values (111, 1), (222, 2); delete from test1 where a = 1; select * from test1 order by 1; select * from test2 order by 1, 2; With the patch, this will have deleted the (111, 1) record in test2a, which it did not do before. I think the trigger functions need to look up whether the table is a partitioned table and decide whether the use ONLY based on that. (This will probably also apply to the PK side, when that is implemented.) In pg_dump, maybe this can be refined: + /* +* For partitioned tables, it doesn't work to emit constraints as not +* inherited. +*/ + if (tbinfo->relkind == RELKIND_PARTITIONED_TABLE) + only = ""; + else + only = "ONLY "; We use ONLY for foreign keys on inherited tables, but foreign keys are not inherited anyway, so this is at best future-proofing. We could just drop the ONLY unconditionally. Or at least explain this more. Other than that, it looks OK. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
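A sketch of that last suggestion for the trigger functions (not the actual fix; it assumes the fk_rel Relation that the RI trigger functions already have at hand), mirroring the quoted pg_dump hunk:

    /*
     * Use ONLY for plain inheritance, where the foreign key applies to the
     * parent alone, but not for partitioned tables, where the constraint is
     * meant to cover all partitions.
     */
    const char *only = (fk_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
        ? "" : "ONLY ";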
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote: > Craig Ringer writes: > > So we should just use the big hammer here. > > And bitch, loudly and publicly, about how broken this kernel behavior is. > If we make enough of a stink maybe it'll get fixed. It is not likely to be fixed (beyond what has been done already with the manpage patches and errseq_t fixes on the reporting level). The issue is, the kernel needs to deal with hard IO errors at that level somehow, and since those errors typically persist, re-dirtying the pages would not really solve the problem (unless some filesystem remaps the request to a different block, assuming the device is alive). Keeping around dirty pages that cannot possibly be written out is essentially a memory leak, as those pages would stay around even after the application has exited. Best regards, Anthony
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Sun, Apr 01, 2018 at 12:13:09AM +0800, Craig Ringer wrote: >On 31 March 2018 at 21:24, Anthony Iliopoulos <[1]ail...@altatus.com> >wrote: > > On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote: > > > >> Yeah, I see why you want to PANIC. > > > > > > Indeed. Even doing that leaves question marks about all the kernel > > > versions before v4.13, which at this point is pretty much everything > > > out there, not even detecting this reliably. This is messy. > > There may still be a way to reliably detect this on older kernel > versions from userspace, but it will be messy whatsoever. On EIO > errors, the kernel will not restore the dirty page flags, but it > will flip the error flags on the failed pages. One could mmap() > the file in question, obtain the PFNs (via /proc/pid/pagemap) > and enumerate those to match the ones with the error flag switched > on (via /proc/kpageflags). This could serve at least as a detection > mechanism, but one could also further use this info to logically > map the pages that failed IO back to the original file offsets, > and potentially retry IO just for those file ranges that cover > the failed pages. Just an idea, not tested. > >That sounds like a huge amount of complexity, with uncertainty as to how >it'll behave kernel-to-kernel, for negligble benefit. Those interfaces have been around since the kernel 2.6 times and are rather stable, but I was merely responding to your original post comment regarding having a way of finding out which page(s) failed. I assume that indeed there would be no benefit, especially since those errors are usually not transient (typically they come from hard medium faults), and although a filesystem could theoretically mask the error by allocating a different logical block, I am not aware of any implementation that currently does that. >I was exploring the idea of doing selective recovery of one relfilenode, >based on the assumption that we know the filenode related to the fd that >failed to fsync(). We could redo only WAL on that relation. But it fails >the same test: it's too complex for a niche case that shouldn't happen in >the first place, so it'll probably have bugs, or grow bugs in bitrot over >time. Fully agree, those cases should be sufficiently rare that a complex and possibly non-maintainable solution is not really warranted. >Remember, if you're on ext4 with errors=remount-ro, you get shut down even >harder than a PANIC. So we should just use the big hammer here. I am not entirely sure what you mean here, does Pg really treat write() errors as fatal? Also, the kind of errors that ext4 detects with this option is at the superblock level and govern metadata rather than actual data writes (recall that those are buffered anyway, no actual device IO has to take place at the time of write()). Best regards, Anthony
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote: > >> Yeah, I see why you want to PANIC. > > > > Indeed. Even doing that leaves question marks about all the kernel > > versions before v4.13, which at this point is pretty much everything > > out there, not even detecting this reliably. This is messy. There may still be a way to reliably detect this on older kernel versions from userspace, but it will be messy whatsoever. On EIO errors, the kernel will not restore the dirty page flags, but it will flip the error flags on the failed pages. One could mmap() the file in question, obtain the PFNs (via /proc/pid/pagemap) and enumerate those to match the ones with the error flag switched on (via /proc/kpageflags). This could serve at least as a detection mechanism, but one could also further use this info to logically map the pages that failed IO back to the original file offsets, and potentially retry IO just for those file ranges that cover the failed pages. Just an idea, not tested. Best regards, Anthony
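Purely to illustrate the enumeration described above (with the same caveat -- just an idea, not tested): a user-space sketch that maps a file, looks up the physical frame of each resident page through /proc/self/pagemap, and checks /proc/kpageflags for the error flag. It needs root to see PFNs, and the default file name is a placeholder.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <linux/kernel-page-flags.h>    /* KPF_ERROR */

    int
    main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "some-relation-segment";
        long        pagesize = sysconf(_SC_PAGESIZE);
        struct stat st;
        int         fd = open(path, O_RDONLY);
        int         pagemap = open("/proc/self/pagemap", O_RDONLY);
        int         kpageflags = open("/proc/kpageflags", O_RDONLY);
        unsigned char *base;

        if (fd < 0 || pagemap < 0 || kpageflags < 0 || fstat(fd, &st) < 0)
        {
            perror("open/fstat");
            return EXIT_FAILURE;
        }

        base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED)
        {
            perror("mmap");
            return EXIT_FAILURE;
        }

        for (off_t off = 0; off < st.st_size; off += pagesize)
        {
            volatile unsigned char touch = base[off];   /* fault the page in */
            uint64_t    pm_entry,
                        flags;
            off_t       pm_off = (off_t) (((uintptr_t) (base + off) / pagesize) * 8);

            (void) touch;
            if (pread(pagemap, &pm_entry, 8, pm_off) != 8)
                continue;
            if (!(pm_entry & (1ULL << 63)))     /* page not resident */
                continue;
            /* bits 0-54 of the pagemap entry are the page frame number */
            if (pread(kpageflags, &flags, 8,
                      (off_t) ((pm_entry & ((1ULL << 55) - 1)) * 8)) != 8)
                continue;
            if (flags & (1ULL << KPF_ERROR))
                printf("write-back error flag set for file offset %lld\n",
                       (long long) off);
        }
        return EXIT_SUCCESS;
    }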
Re: [HACKERS] logical decoding of two-phase transactions
> > Quick thought: Should be simple to release lock when interacting with network. I don’t think this will be that simple. The network calls will typically happen from inside the plugins and we don’t want to make plugin authors responsible for that. > Could also have abort signal lockers. With the decodegroup locking we do have access to all the decoding backend pids. So we could signal them. But I am not sure signaling will work if the plugin is in the midst of a network call. I agree with Petr. With this decodegroup lock implementation we have an inexpensive but workable implementation for locking around the plugin call. Sure, the abort will be penalized, but it’s bounded by the WAL sender timeout or a max of one change apply cycle. As he mentioned, if we can optimize this later we can do so without changing plugin coding semantics. Regards, Nikhils > > Andres > -- > Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: hot_standby_feedback vs excludeVacuum and snapshots
On Sun, Apr 1, 2018 at 3:30 PM, Simon Riggs wrote: > On 31 March 2018 at 14:21, Amit Kapila wrote: >> >> I think the vacuum assigns xids only if it needs to truncate some of >> the pages in the relation which happens towards the end of vacuum. >> So, it shouldn't hold back the xmin horizon for long. > > Yes, that's the reason. I recall VACUUMs giving lots of problems > during development of Hot Standby. > > VACUUM FULL was the thing that needed to be excluded in the past > because it needed an xid to move rows. > > Greg's concern is a good one and his noticing that we hadn't > specifically excluded VACUUMs is valid, so we should exclude them. > Well spotted, Greg. > > So although this doesn't have the dramatic effect it might have had, > there is still the possibility of some effect and I think we should > treat it as a bug. > +1. I think you missed updating the comment on top of the modified function ("Similar to GetSnapshotData but returns more information. We include all PGXACTs with an assigned TransactionId, even VACUUM processes."). It seems the last part of that sentence should be omitted after your patch; otherwise, the patch looks good to me. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH] Verify Checksums during Basebackups
Hi! On Sun, Apr 01, 2018 at 05:27:21PM +0200, Magnus Hagander wrote: > It might be a micro-optimisation, but ISTM that we shouldn't do the > basename(palloc(fn)) in is_heap_file() unless we actually need it -- so not > at the top of the function. (And surely "." and ".." should not occur here? > I think that's a result of copy/paste from the online checksum patch? > > We also do exactly the same basename(palloc(fn)) in sendFile(). Can we > find a way to reuse that duplication? Perhaps we want to split it into the > two pieces already out in sendFile() and pass it in as separate parameters? I've done the latter now, although it looks a bit weird that the second argument data is a subset of the first. I couldn't find another way to not have it done twice, though. > If not then this second one should at least only be done inside the if > (verify_checksums). We can't have both, as we need to call the is_heap_file() function in order to determine whether we should verify the checksums. > There is a bigger problem next to that though -- I believe basename() does > not exist on Win32. I haven't tested it, but there is zero documentation of > it existing, which usually indicates it doesn't. That's the part that > definitely needs to get fixed. > > I think you need to look into the functionality in port/path.c, in > particular last_dir_separator()? Thanks for the pointer, I've used that now; I mentioned before that basename() might be a portability hazard, but couldn't find a good substitute myself. V6 of the patch is attached. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index 8c488506fa..b94dd4ac65 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -2382,7 +2382,7 @@ The commands accepted in replication mode are: -BASE_BACKUP [ LABEL 'label' ] [ PROGRESS ] [ FAST ] [ WAL ] [ NOWAIT ] [ MAX_RATE rate ] [ TABLESPACE_MAP ] +BASE_BACKUP [ LABEL 'label' ] [ PROGRESS ] [ FAST ] [ WAL ] [ NOWAIT ] [ MAX_RATE rate ] [ TABLESPACE_MAP ] [ NOVERIFY_CHECKSUMS ] BASE_BACKUP @@ -2481,6 +2481,17 @@ The commands accepted in replication mode are: + + +NOVERIFY_CHECKSUMS + + + By default, checksums are verified during a base backup if they are + enabled. Specifying NOVERIFY_CHECKSUMS disables + this verification. + + + diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml index e8921b1bb4..95045669c9 100644 --- a/doc/src/sgml/ref/pg_basebackup.sgml +++ b/doc/src/sgml/ref/pg_basebackup.sgml @@ -506,6 +506,22 @@ PostgreSQL documentation + + -k + --no-verify-checksums + + +Disables verification of checksums, if they are enabled on the server +the base backup is taken from. + + +By default, checksums are verified and checksum failures will result in +a non-zero exit status. However, the base backup will not be removed in +this case, as if the --no-clean option was used. 
+ + + + -v --verbose diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index 516eea57f8..3c1fd620ae 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -16,6 +16,7 @@ #include #include +#include "access/xlog.h" #include "access/xlog_internal.h" /* for pg_start/stop_backup */ #include "catalog/catalog.h" #include "catalog/pg_type.h" @@ -31,6 +32,8 @@ #include "replication/basebackup.h" #include "replication/walsender.h" #include "replication/walsender_private.h" +#include "storage/bufpage.h" +#include "storage/checksum.h" #include "storage/dsm_impl.h" #include "storage/fd.h" #include "storage/ipc.h" @@ -99,6 +102,15 @@ static TimeOffset elapsed_min_unit; /* The last check of the transfer rate. */ static TimestampTz throttled_last; +/* The starting XLOG position of the base backup. */ +static XLogRecPtr startptr; + +/* Total number of checksum failures during base backup. */ +static int64 total_checksum_failures; + +/* Do not verify checksums. */ +static bool noverify_checksums = false; + /* * The contents of these directories are removed or recreated during server * start so they are not included in backups. The directories themselves are @@ -194,7 +206,6 @@ base_backup_cleanup(int code, Datum arg) static void perform_base_backup(basebackup_options *opt) { - XLogRecPtr startptr; TimeLineID starttli; XLogRecPtr endptr; TimeLineID endtli; @@ -210,6 +2
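As a side note on the portability point above, the pattern with src/port's last_dir_separator() is roughly the following; the helper here is only a stand-in for the real function so the example compiles on its own, and the path is made up:

    #include <stdio.h>

    /*
     * Stand-in for src/port/path.c's last_dir_separator(), which handles
     * both '/' and '\\'; the backend code uses the real function.
     */
    static const char *
    last_dir_sep(const char *filename)
    {
        const char *ret = NULL;
        const char *p;

        for (p = filename; *p; p++)
            if (*p == '/' || *p == '\\')
                ret = p;
        return ret;
    }

    int
    main(void)
    {
        const char *path = "./base/16384/16385";    /* made-up relation path */
        const char *sep = last_dir_sep(path);
        const char *filename = sep ? sep + 1 : path;

        printf("would verify checksums in \"%s\"\n", filename);
        return 0;
    }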
Re: Foreign keys and partitioned tables
On 3/31/18 18:21, Alvaro Herrera wrote: > Yeah, I started by putting what I thought was going to be just ALTER > TABLE in that test, then moved to the other file and added what I > thought were more complete tests there and failed to move stuff to > alter_table. Honestly, I think these should mostly all belong in > foreign_key, right > > - Partitioned tables do not support EXCLUDE or > - FOREIGN KEY constraints; however, you can define > - these constraints on individual partitions. > + Partitioned tables do not support EXCLUDE > constraints; > + however, you can define these constraints on individual partitions. > + Also, while it's possible to define PRIMARY KEY > + constraints on partitioned tables, it is not supported to create > foreign > + keys cannot that reference them. This restriction will be lifted in a This doesn't read correctly. > + future release. > > - tables and permanent tables. > + tables and permanent tables. Also note that while it is permitted to > + define a foreign key on a partitioned table, declaring a foreign key > + that references a partitioned table is not allowed. > Maybe use "possible" or "supported" instead of "allowed" and "permitted" in this and similar cases. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Add support for tuple routing to foreign partitions
(2018/04/02 18:49), Amit Langote wrote: 2. If I understand the description you provided in [1] correctly, the point of having ri_PartitionIsValid and ExecInitExtraPartitionInfo() is to avoid possibly-redundantly performing following two steps in ExecSetupPartitionTupleRouting() for *reused* partition ResultRelInfos that may not be used for tuple routing after all: - create the parent_child_tupconv_maps[] entry for it - perform FDW tuple routing initialization. Sorry, my explanation was not enough, but that was just one of the reasons why I introduced those; another is to do CheckValidResultRel against a partition after the partition was chosen for tuple routing rather than in ExecSetupPartitionTupleRouting, to avoid aborting an UPDATE of a partition key unnecessarily due to non-routable foreign-partitions that may not be chosen for tuple routing after all. I see. So, currently in ExecSetupPartitionTupleRouting, we call CheckValidResultRel() to check if resultRelInfos reused from those initialized for UPDATE are valid for insertions (or rather for routing inserted tuples into it). But you're proposing to delay that check until ExecPrepareTupleRouting is called for a given resultRelInfo, at which point it's clear that we actually want to insert using that resultRelInfo. That seems reasonable. That's exactly what I'm thinking. Now we have ON CONFLICT for partitioned tables, which requires the conversion map to be computed in ExecInitPartitionInfo, so I updated the patch so that we keep the former step in ExecInitPartitionInfo and ExecSetupPartitionTupleRouting and so that we just init the FDW in ExecInitExtraPartitionInfo (changed to ExecInitForeignRouting). And I added comments to ExecInitForeignRouting and ExecPrepareTupleRouting. That looks good. I revisited this. Please see the reply to Alvaro I sent right now. I looked at the new patch. It looks good overall, although I have one question -- IIUC, BeginForeignInsert() performs actions that are equivalent of performing PlanForeignModify() + BeginForeignModify() for an INSERT as if it was directly executed on a given foreign table/partition. Did you face any problems in doing the latter itself in the core code? Doing it that way would mean no changes to a FDW itself will be required (and hence no need for additional APIs), but I may be missing something. The biggest issue in performing PlanForeignModify() plus BeginForeignModify() instead of the new FDW API would be: can the core code create a valid-looking planner state passed to PlanForeignModify() such as the ModifyTable plan node or the query tree stored in PlannerInfo? One thing that seems good about your approach is that newly added support for COPY FROM on foreign tables/partitions takes minimal changes as implemented by using the new API, whereas with the alternative approach it would require duplicated code (same code would have to be present in both copy.c and execPartition.c, but it could perhaps be avoided). I agree. Thanks for the review! Best regards, Etsuro Fujita
Re: [HACKERS] Add support for tuple routing to foreign partitions
(2018/03/30 22:28), Alvaro Herrera wrote: Etsuro Fujita wrote: Now we have ON CONFLICT for partitioned tables, which requires the conversion map to be computed in ExecInitPartitionInfo, so I updated the patch so that we keep the former step in ExecInitPartitionInfo and ExecSetupPartitionTupleRouting and so that we just init the FDW in ExecInitExtraPartitionInfo (changed to ExecInitForeignRouting). And I added comments to ExecInitForeignRouting and ExecPrepareTupleRouting. Hmm, I may be misreading what you mean here ... but as far as I understand, ON CONFLICT does not support foreign tables, does it? It doesn't, except for ON CONFLICT DO NOTHING without an inference specification. If that is so, I'm not sure why the conversion map would be necessary, if it is for ON CONFLICT alone. We wouldn't need that for foreign partitions (When DO NOTHING with an inference specification or DO UPDATE on a partitioned table containing foreign partitions, the planner would throw an error before we get to ExecInitPartitionInfo). The reason why I updated the patch that way is just to make the patch simple, but on reflection I don't think that's a good idea; I think we should delay the map-creation step as well as the FDW-initialization step for UPDATE subplan partitions as before for improved efficiency for UPDATE-of-partition-key. However, I don't think it'd still be a good idea to delay those steps for partitions created by ExecInitPartitionInfo the same way as for the subplan partitions, because in that case it'd be better to perform CheckValidResultRel against a partition right after we do InitResultRelInfo for the partition in ExecInitPartitionInfo, as that would save cycles in cases when the partition is considered nonvalid by that check. So, What I'm thinking is: a) for the subplan partitions, we delay those steps as before, and b) for the partitions created by ExecInitPartitionInfo, we do that check for a partition right after InitResultRelInfo in that function, (and if valid, we create a map and initialize the FDW for the partition in that function). Best regards, Etsuro Fujita
Re: Diagonal storage model
> 2 апр. 2018 г., в 16:57, Ashutosh Bapat > написал(а): > On Mon, Apr 2, 2018 at 3:49 AM, Marko Tiikkaja wrote: >> >> I'm a little worried about the fact that even with this model we're still >> limited to only two dimensions. That's bound to cause problems sooner or >> later. > How about a 3D storage model, whose first dimension gives horizontal > view, second provides vertical or columnar view and third one provides > diagonal view. It also provides capability to add extra dimensions to > provide additional views like double diagonal view. Alas! it all > collapses since I was late to the party. BTW, MDX expression actually provides mulitidimensional result. They have COLUMNS, ROWS, PAGES, SECTIONS, CHAPTERS, and AXIS(N) for those who is not satisfied with named dimensions. Best regards, Andrey Borodin.
Re: [HACKERS] path toward faster partition pruning
On 2 April 2018 at 17:18, Amit Langote wrote: > On 2018/03/31 0:55, David Rowley wrote: >> explain (analyze, costs off, summary off, timing off) execute q1 (1,2,2,0); >>QUERY PLAN >> >> Result (actual rows=0 loops=1) >>One-Time Filter: false >> (2 rows) > > Hmm. It is the newly added inversion step that's causing this. When > creating a generic plan (that is when the planning happens via > BuildCachedPlan called with boundParams set to NULL), the presence of > Params will cause an inversion step's source step to produce > scan-all-partitions sort of result, which the inversion step dutifully > inverts to a scan-no-partitions result. > > I have tried to attack that problem by handling the > no-values-to-prune-with case using a side-channel to propagate the > scan-all-partitions result through possibly multiple steps. That is, a > base pruning step will set datum_offsets in a PruneStepResult only if > pruning is carried out by actually comparing values with the partition > bounds. If no values were provided (like in the generic plan case), it > will set a scan_all_nonnull flag instead and return without setting > datum_offsets. Combine steps perform their combining duty only if > datum_offset contains a valid value, that is, if scan_all_nonnulls is not set. I'm afraid this is still not correct :-( The following code is not doing the right thing: + case COMBINE_UNION: + foreach(lc1, cstep->source_stepids) + { + int step_id = lfirst_int(lc1); + PruneStepResult *step_result; + + /* + * step_results[step_id] must contain a valid result, + * which is confirmed by the fact that cstep's step_id is + * greater than step_id and the fact that results of the + * individual steps are evaluated in sequence of their + * step_ids. + */ + if (step_id >= cstep->step.step_id) + elog(ERROR, "invalid pruning combine step argument"); + step_result = step_results[step_id]; + Assert(step_result != NULL); + + result->scan_all_nonnull = step_result->scan_all_nonnull; The last line there is not properly performing a union, it just sets the result_scan_all_nonnull to whatever the last step's value was. At the very least it should be |= but I don't really like this new code. Why did you move away from just storing the matching partitions in a Bitmapset? If you want to store all non-null partitions, then why not just set the bits for all non-null partitions? That would cut down on bugs like this since the combining of step results would just be simple unions or intersects. Also, the following code could be made a bit nicer + result = (PruneStepResult *) palloc0(sizeof(PruneStepResult)); + + switch (context->strategy) + { + case PARTITION_STRATEGY_HASH: + result->bound_offsets = get_matching_hash_bound(context, + opstep->opstrategy, + values, nvalues, + partsupfunc, + opstep->nullkeys, + &result->scan_all_nonnull); Why not allocate the PruneStepResult inside the get_matching_*_bound, that way you wouldn't need all those out parameters to set the bool fields. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
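A sketch of what the union combine could look like if bound_offsets went back to being a Bitmapset, as in earlier versions of the patch (this uses only the fields quoted above and elides the step_id validation; it is not the committed code):

    case COMBINE_UNION:
        foreach(lc1, cstep->source_stepids)
        {
            PruneStepResult *step_result = step_results[lfirst_int(lc1)];

            /* union of the matched partition bound offsets */
            result->bound_offsets = bms_add_members(result->bound_offsets,
                                                    step_result->bound_offsets);
            /* and the flag must be OR-ed, not overwritten */
            result->scan_all_nonnull |= step_result->scan_all_nonnull;
        }
        break;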
Re: Diagonal storage model
On Mon, Apr 2, 2018 at 3:49 AM, Marko Tiikkaja wrote: > On Sun, Apr 1, 2018 at 3:48 PM, Konstantin Knizhnik > wrote: >> >> I want to announce new model, "diagonal storage" which combines benefits >> of both approaches. >> The idea is very simple: we first store column 1 of first record, then >> column 2 of second record, ... and so on until we reach the last column. >> After it we store second column of first record, third column of the >> second record,... > > > I'm a little worried about the fact that even with this model we're still > limited to only two dimensions. That's bound to cause problems sooner or > later. > How about a 3D storage model, whose first dimension gives horizontal view, second provides vertical or columnar view and third one provides diagonal view. It also provides capability to add extra dimensions to provide additional views like double diagonal view. Alas! it all collapses since I was late to the party. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: json(b)_to_tsvector with numeric values
> On 2 April 2018 at 11:27, Arthur Zakirov wrote: > On Mon, Apr 02, 2018 at 11:41:12AM +0300, Oleg Bartunov wrote: >> On Mon, Apr 2, 2018 at 9:45 AM, Arthur Zakirov >> wrote: >> I found this bug when working on a presentation about FTS and it looked >> annoying, since it violates >> the consistency of FTS. I think this is a bug, which needs to be fixed, >> else the inconsistency with the existing full text search will get >> deeper. >> >> The fix looks trivial, but needs a review, of course. > Oh, I understood. The code looks good, tests passed. But maybe it is > better to use NumericGetDatum() instead of PointerGetDatum()? Well, technically speaking they're the same, but yes, NumericGetDatum would be more precise. I've modified it in the attached patch. jsonb_to_tsvector_numeric_v2.patch Description: Binary data
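For what it's worth, the reason the two are interchangeable here is that a Numeric is a pass-by-reference varlena, and NumericGetDatum() is just a thin wrapper around PointerGetDatum(); the change documents intent rather than altering behavior. A rough sketch of the kind of conversion being discussed (the helper and variable names are invented for illustration, not taken from the patch):

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"
#include "utils/jsonb.h"
#include "utils/numeric.h"

/*
 * Render a numeric jsonb scalar as text, via numeric_out(), so that it can
 * be handed to the text-search parser like any string value.  Passing the
 * Numeric with NumericGetDatum() produces the same Datum as
 * PointerGetDatum(), but says explicitly what the Datum contains.
 */
static char *
jsonb_numeric_to_cstring(JsonbValue *jb)
{
    Assert(jb->type == jbvNumeric);

    return DatumGetCString(DirectFunctionCall1(numeric_out,
                                               NumericGetDatum(jb->val.numeric)));
}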
Re: [HACKERS] Add support for tuple routing to foreign partitions
Fujita-san, Thanks for updating the patch. On 2018/03/30 21:56, Etsuro Fujita wrote: > I modified the patch to use the existing API ExecForeignInsert instead of > that API and removed that API including this doc. OK. >> 2. If I understand the description you provided in [1] correctly, the >> point of having ri_PartitionIsValid and ExecInitExtraPartitionInfo() is to >> avoid possibly-redundantly performing following two steps in >> ExecSetupPartitionTupleRouting() for *reused* partition ResultRelInfos >> that may not be used for tuple routing after all: >> >> - create the parent_child_tupconv_maps[] entry for it >> - perform FDW tuple routing initialization. > > Sorry, my explanation was not enough, but that was just one of the reasons > why I introduced those; another is to do CheckValidResultRel against a > partition after the partition was chosen for tuple routing rather than in > ExecSetupPartitionTupleRouting, to avoid aborting an UPDATE of a partition > key unnecessarily due to non-routable foreign-partitions that may not be > chosen for tuple routing after all. I see. So, currently in ExecSetupPartitionTupleRouting, we call CheckValidResultRel() to check if resultRelInfos reused from those initialized for UPDATE are valid for insertions (or rather for routing inserted tuples into it). But you're proposing to delay that check until ExecPrepareTupleRouting is called for a given resultRelInfo, at which point it's clear that we actually want to insert using that resultRelInfo. That seems reasonable. > Now we have ON CONFLICT for partitioned tables, which requires the > conversion map to be computed in ExecInitPartitionInfo, so I updated the > patch so that we keep the former step in ExecInitPartitionInfo and > ExecSetupPartitionTupleRouting and so that we just init the FDW in > ExecInitExtraPartitionInfo (changed to ExecInitForeignRouting). And I > added comments to ExecInitForeignRouting and ExecPrepareTupleRouting. That looks good. I looked at the new patch. It looks good overall, although I have one question -- IIUC, BeginForeignInsert() performs actions that are equivalent of performing PlanForeignModify() + BeginForeignModify() for an INSERT as if it was directly executed on a given foreign table/partition. Did you face any problems in doing the latter itself in the core code? Doing it that way would mean no changes to a FDW itself will be required (and hence no need for additional APIs), but I may be missing something. One thing that seems good about your approach is that newly added support for COPY FROM on foreign tables/partitions takes minimal changes as implemented by using the new API, whereas with the alternative approach it would require duplicated code (same code would have to be present in both copy.c and execPartition.c, but it could perhaps be avoided). Thanks, Amit
Re: [HACKERS] [PATCH] Lockable views
On Thu, 29 Mar 2018 17:26:36 -0700 Andres Freund wrote: Thank you for your comments. I attach a patch to fix issues you've pointed out. > Hi, > > On 2018-03-28 20:26:48 +0900, Yugo Nagata wrote: > > diff --git a/doc/src/sgml/ref/lock.sgml b/doc/src/sgml/ref/lock.sgml > > index b2c7203..96d477c 100644 > > --- a/doc/src/sgml/ref/lock.sgml > > +++ b/doc/src/sgml/ref/lock.sgml > > @@ -46,6 +46,11 @@ LOCK [ TABLE ] [ ONLY ] > class="parameter">name [ * ] > > > > > > > > + When a view is specified to be locked, all relations appearing in the > > view > > + definition query are also locked recursively with the same lock mode. > > + > > Trailing space added. I'd remove "specified to be" from the sentence. Fixed. > > I think mentioning how this interacts with permissions would be a good > idea too. Given how operations use the view's owner to recurse, that's > not obvious. Should also explain what permissions are required to do the > operation on the view. Added a description about permissions into the documentation. > > > > @@ -86,15 +92,17 @@ RangeVarCallbackForLockTable(const RangeVar *rv, Oid > > relid, Oid oldrelid, > > return; /* woops, concurrently > > dropped; no permissions > > * check */ > > > > - /* Currently, we only allow plain tables to be locked */ > > - if (relkind != RELKIND_RELATION && relkind != RELKIND_PARTITIONED_TABLE) > > + > > This newline looks spurious to me. Removed. > > > > /* > > + * Apply LOCK TABLE recursively over a view > > + * > > + * All tables and views appearing in the view definition query are locked > > + * recursively with the same lock mode. > > + */ > > + > > +typedef struct > > +{ > > + Oid root_reloid; > > + LOCKMODE lockmode; > > + bool nowait; > > + Oid viewowner; > > + Oid viewoid; > > +} LockViewRecurse_context; > > Probably wouldn't hurt to pgindent the larger changes in the patch. > > > > +static bool > > +LockViewRecurse_walker(Node *node, LockViewRecurse_context *context) > > +{ > > + if (node == NULL) > > + return false; > > + > > + if (IsA(node, Query)) > > + { > > + Query *query = (Query *) node; > > + ListCell*rtable; > > + > > + foreach(rtable, query->rtable) > > + { > > + RangeTblEntry *rte = lfirst(rtable); > > + AclResultaclresult; > > + > > + Oid relid = rte->relid; > > + char relkind = rte->relkind; > > + char *relname = get_rel_name(relid); > > + > > + /* The OLD and NEW placeholder entries in the view's > > rtable are skipped. */ > > + if (relid == context->viewoid && > > + (!strcmp(rte->eref->aliasname, "old") || > > !strcmp(rte->eref->aliasname, "new"))) > > + continue; > > + > > + /* Currently, we only allow plain tables or views to be > > locked. */ > > + if (relkind != RELKIND_RELATION && relkind != > > RELKIND_PARTITIONED_TABLE && > > + relkind != RELKIND_VIEW) > > + continue; > > + > > + /* Check infinite recursion in the view definition. */ > > + if (relid == context->root_reloid) > > + ereport(ERROR, > > + > > (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), > > + errmsg("infinite recursion > > detected in rules for relation \"%s\"", > > + > > get_rel_name(context->root_reloid; > > Hm, how can that happen? And if it can happen, why can it only happen > with the root relation? For example, the following queries cause the infinite recursion of views. This is detected and the error is raised. 
create table t (i int); create view v1 as select 1; create view v2 as select * from v1; create or replace view v1 as select * from v2; begin; lock v1; abort; However, I found that the previous patch could not handle the following situation, in which the root relation itself doesn't have infinite recursion. create view v3 as select * from v1; begin; lock v3; abort; This is fixed in the attached patch. Regards, > > Greetings, > > Andres Freund -- Yugo Nagata diff --git a/doc/src/sgml/ref/lock.sgml b/doc/src/sgml/ref/lock.sgml index 96d477c..12303fe 100644 --- a/doc/src/sgml/ref/lock.sgml +++ b/doc/src/sgml/ref/lock.sgml @@ -46,8 +46,8 @@ LOCK [ TABLE ] [ ONLY ] name [ * ] - When a view is specified to be locked, all relations appearing in the view - definition query are also locked recursively with the same lock mode. + When a view is locked, all relations appearing in the view definition + query are a
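For what it's worth, a natural way to catch cycles that do not involve the root relation is to carry the whole chain of view OIDs walked so far instead of a single root_reloid, and to raise the error whenever the current relation already appears in that chain. The sketch below only illustrates the idea; the struct and field names are guesses and not necessarily what the attached patch does:

#include "postgres.h"
#include "nodes/pg_list.h"
#include "storage/lockdefs.h"
#include "utils/lsyscache.h"

/* Context carrying the path of views walked so far, not just the root. */
typedef struct
{
    List       *ancestor_views; /* OIDs of all views above the current one */
    LOCKMODE    lockmode;
    bool        nowait;
    Oid         viewowner;
    Oid         viewoid;
} LockViewRecurseSketchContext;

/*
 * A relation that already appears among its own ancestors closes a cycle
 * at whatever depth it occurs, so "LOCK v3" in the example above is caught
 * even though v3 itself is not part of the v1/v2 loop.
 */
static void
check_view_recursion_sketch(Oid relid, LockViewRecurseSketchContext *context)
{
    if (list_member_oid(context->ancestor_views, relid))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
                 errmsg("infinite recursion detected in rules for relation \"%s\"",
                        get_rel_name(relid))));
}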
Re: json(b)_to_tsvector with numeric values
On Mon, Apr 02, 2018 at 11:41:12AM +0300, Oleg Bartunov wrote: > On Mon, Apr 2, 2018 at 9:45 AM, Arthur Zakirov > wrote: > I found this bug when working on a presentation about FTS and it looked > annoying, since it violates > the consistency of FTS. I think this is a bug, which needs to be fixed, > else the inconsistency with the existing full text search will get > deeper. > > The fix looks trivial, but needs a review, of course. Oh, I understood. The code looks good, tests passed. But maybe it is better to use NumericGetDatum() instead of PointerGetDatum()? -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
make installcheck-world in a clean environment
Hello, I am trying to perform "make installcheck-world" after "make clean" (or after installing precompiled binaries), but it fails. The error I get is related to the ECPG regression tests: make -C test installcheck make[2]: Entering directory `/home/postgres/postgres/src/interfaces/ecpg/test' gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 '-I../../../../src/port' '-I../../../../src/test/regress' '-DHOST_TUPLE="x86_64-pc-linux-gnu"' '-DSHELLPROG="/bin/sh"' '-DDLSUFFIX=".so"' -I../../../../src/include -D_GNU_SOURCE -c -o pg_regress_ecpg.o pg_regress_ecpg.c gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 -L../../../../src/port -L../../../../src/common -Wl,--as-needed -Wl,-rpath,'/usr/local/pgsql/lib',--enable-new-dtags pg_regress_ecpg.o ../../../../src/test/regress/pg_regress.o -lpgcommon -lpgport -lpthread -lz -lreadline -lrt -lcrypt -ldl -lm -o pg_regress make -C connect all make[3]: Entering directory `/home/postgres/postgres/src/interfaces/ecpg/test/connect' make[3]: *** No rule to make target `test1.c', needed by `test1.o'. Stop. make[3]: Leaving directory `/home/postgres/postgres/src/interfaces/ecpg/test/connect' Is it a supported scenario to run "make installcheck-world" without performing "make" first? (If I do "make -C src/interfaces/ecpg" and then "make installcheck-world", then this error is gone. And when I set up all the extensions, all tests passed successfully.) And even if we need to perform make, I wonder: should the recompiled ecpg binary be checked instead of the installed one? I tried to modify the Makefile to target the installed ecpg binary and its libs (see the patch attached); it works, but this fix is only suited for installcheck. So if this scenario should be supported, a more elaborate fix is needed. Best regards, -- Alexander Lakhin Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/interfaces/ecpg/test/Makefile.regress b/src/interfaces/ecpg/test/Makefile.regress index b3d7c1e..5df1a44 100644 --- a/src/interfaces/ecpg/test/Makefile.regress +++ b/src/interfaces/ecpg/test/Makefile.regress @@ -5,14 +5,14 @@ override CPPFLAGS := -I../../include -I$(top_srcdir)/src/interfaces/ecpg/include -I$(libpq_srcdir) $(CPPFLAGS) override CFLAGS += $(PTHREAD_CFLAGS) -override LDFLAGS := -L../../ecpglib -L../../pgtypeslib $(filter-out -l%, $(libpq)) $(LDFLAGS) +override LDFLAGS := -L'$(DESTDIR)$(libdir)/' $(filter-out -l%, $(libpq)) $(LDFLAGS) override LIBS := -lecpg -lpgtypes $(filter -l%, $(libpq)) $(LIBS) $(PTHREAD_LIBS) # Standard way to invoke the ecpg preprocessor -ECPG = ../../preproc/ecpg --regression -I$(srcdir)/../../include -I$(srcdir) +ECPG = '$(DESTDIR)$(bindir)/ecpg' --regression -I$(srcdir)/../../include -I$(srcdir) # Files that most or all ecpg preprocessor test outputs depend on -ECPG_TEST_DEPENDENCIES = ../../preproc/ecpg$(X) \ $(srcdir)/../regression.h \ $(srcdir)/../../include/sqlca.h \ $(srcdir)/../../include/sqlda.h \
Re: json(b)_to_tsvector with numeric values
On Mon, Apr 2, 2018 at 9:45 AM, Arthur Zakirov wrote: > Hello Dmitry, > Dmitry Dolgov <9erthali...@gmail.com> wrote: >> Any opinions about this suggestion? Can it be considered as a bug fix and included into this release? > I think there is no chance to include it into v11. You can add the patch to the 2018-09 commitfest. I found this bug when working on a presentation about FTS and it looked annoying, since it violates the consistency of FTS. I think this is a bug, which needs to be fixed, else the inconsistency with the existing full text search will get deeper. The fix looks trivial, but needs a review, of course. > > -- > Arthur Zakirov > Postgres Professional: http://www.postgrespro.com > Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
On 29 March 2018 at 23:30, Petr Jelinek wrote: > On 29/03/18 23:58, Andres Freund wrote: >> On 2018-03-29 23:52:18 +0200, Tomas Vondra wrote: >>>> I have added details about this in src/backend/storage/lmgr/README as suggested by you. >>> >>> Thanks. I think the README is a good start, but I think we also need to >>> improve the comments, which are usually more detailed than the README. >>> For example, it's not quite acceptable that LogicalLockTransaction and >>> LogicalUnlockTransaction have about no comments, especially when it's >>> meant to be public API for decoding plugins. >> >> FWIW, for me that's ground to not accept the feature. Burdening output >> plugins with this will make their development painful (because they'll >> have to adapt regularly) and correctness doubtful (there's nothing >> checking for the lock being skipped). Another way needs to be found. >> > > I have to agree with Andres here. It's also visible in the latter > patches. The pgoutput patch forgets to call these new APIs completely. > The test_decoding plugin calls them, but it does so even when it's processing > changes for committed transactions. I think that should be avoided as it > means potentially doing an SLRU lookup for every change. So doing it right > is indeed not easy. Yet you spotted these problems easily enough. Similar to finding missing LWLocks. > I was wondering how to hide this. The best idea I've had so far would be to put > it in heap_beginscan (and index_beginscan, given that catalog scans use > it as well) behind some condition. That would also improve performance > because locking would not need to happen for syscache hits. The problem, > however, is how to inform heap_beginscan about the fact that we are > in 2PC decoding. We definitely don't want to change all the scan APIs > for this. I wonder if we could add some kind of property to Snapshot > which would indicate this fact - logical decoding is using its own > snapshots, so it could inject the information about being inside the 2PC > decoding. Perhaps, but how do we know we've covered all the right places? We don't know what every plugin will require, do we? The plugin needs to take responsibility for its own correctness, whether we make it easier or not. It seems clear that we would need a generalized API (the proposed locking approach) to cover all requirements. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
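To make the burden on plugin authors concrete, the usage pattern under discussion looks roughly like the sketch below. The LogicalLockTransaction()/LogicalUnlockTransaction() prototypes are assumed here for illustration (the proposed patch may define them differently, and they are not part of core); only the change-callback signature is the standard output-plugin one:

#include "postgres.h"
#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "utils/relcache.h"

/* assumed prototypes from the proposed patch -- not in core */
extern bool LogicalLockTransaction(ReorderBufferTXN *txn);
extern void LogicalUnlockTransaction(ReorderBufferTXN *txn);

/*
 * Sketch of an output plugin change callback under the proposed scheme.
 * While decoding a still-in-progress (prepared) transaction, the plugin
 * itself must take the "logical lock" before doing anything that could hit
 * the catalogs (type output functions, syscache lookups, ...) and release
 * it afterwards -- and nothing verifies that it actually did so.
 */
static void
sketch_change_cb(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                 Relation relation, ReorderBufferChange *change)
{
    /* assumed to return false if the transaction aborted concurrently */
    if (!LogicalLockTransaction(txn))
        return;

    /* ... look up output functions and emit the change here ... */

    /* easy to forget, particularly on early-return and error paths */
    LogicalUnlockTransaction(txn);
}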
Re: [HACKERS] logical decoding of two-phase transactions
On 29 March 2018 at 23:24, Andres Freund wrote: >> I agree with the former, of course - docs are a must. I disagree with >> the latter, though - there have been about no proposals how to do it >> without the locking. If there are, I'd like to hear about it. > > I don't care. Either another solution needs to be found, or the locking > needs to be automatically performed when necessary. That seems unreasonable. It's certainly a nice future goal to have it all happen automatically, but we don't know what the plugin will do. How can we ever make an unknown task happen automatically? We can't. We have a reasonable approach here. Locking shared resources before using them is not a radical new approach, it's just standard development. If we find a better way in the future, we can use that, but requiring a better solution when there isn't one is unreasonable. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services