Re: ALTER TABLE ADD COLUMN fast default

2018-04-02 Thread Andrew Dunstan
On Fri, Mar 30, 2018 at 10:08 AM, Andrew Dunstan
 wrote:
> On Fri, Mar 30, 2018 at 5:00 AM, Andres Freund  wrote:

[ missing values are loaded in the TupleDesc by RelationBuildTupleDesc ]

>>> > I still strongly object to doing this unconditionally for queries
>>> > that never need any of this.  We can end up with a significant number
>>> > of large tuples in memory here, and then copy that through dozens of
>>> > tupledesc copies in queries.
>>
>>> We're just doing the same thing we do for default values.
>>
>> That's really not a defense. It's hurting us there too.
>>
>
>
> I will take a look and see if it can be avoided easily.
>
>


For missing values there are three places where we would need to load
them on demand: getmissingattr(), slot_getmissingattrs() and
expand_tuple(). However, none of those functions have obvious access
to the information required to load the missing values. We could
probably do something very ugly like getting the relid from the
TupleDesc's first attribute. Maybe someone else has a better option.
But I can't see a simple and clean way to do this.
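
For illustration, the "ugly" option above would amount to something
like this (the helper name is invented; attrelid is a real pg_attribute
field, but it is only populated in TupleDescs built from a relation's
pg_attribute rows, which is part of why this is ugly):

static Oid
tupdesc_guess_relid(TupleDesc tupleDesc)
{
	/* Transient descriptors have attrelid set to InvalidOid. */
	Assert(tupleDesc->natts > 0);
	return TupleDescAttr(tupleDesc, 0)->attrelid;
}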


cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Rewriting the test of pg_upgrade as a TAP test - take two

2018-04-02 Thread Michael Paquier
On Mon, Apr 02, 2018 at 10:35:09AM -0400, Peter Eisentraut wrote:
> I took a quick look at that part.  It appears to be quite invasive, more
> than I would have hoped.  Basically, it imposes that from now on all
> program invocations must observe the bindir setting, which someone is
> surely going to forget.  The pg_upgrade tests aren't going to exercise
> all possible paths where programs are called, so this is going to lead
> to omissions and inconsistencies -- which will then possibly only be
> found much later by the extensions that Craig was talking about.  I'd
> like to see this more isolated, maybe via a path change, or something
> inside system_or_bail or something less invasive and more maintainable.

Hm.  PostgresNode.pm now uses a couple of different ways to trigger
commands, which makes the problem harder than it seems:
- system_or_bail
- system_log
- IPC::Run::timeout
- IPC::Run::run

Because of that, if one wants to minimize the number of places where
bindir is referenced, the only way I can see would be to modify the
command strings themselves.  What I could think of would be a wrapper
like build_bin_path which prepends $bindir to the binary name, but that
would still require all the callers to use this wrapper, and using it
could still be forgotten.

Enforcing PATH temporarily in one code path or another implies that the
previous value of PATH needs to be restored afterwards, which could be
forgotten as well.

In the end, it seems to me that what I am proposing is the most readable
approach, and with proper documentation things should be handled
fine...  Or is there an approach you have in mind that I do not foresee?
--
Michael




Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

2018-04-02 Thread Kyotaro HORIGUCHI
Hello.

At Mon, 2 Apr 2018 16:11:12 -0300, Alvaro Herrera  
wrote in <20180402191112.wneiyj4v5upnfjst@alvherre.pgsql>
> Why do we need AccessExclusiveLock on all children of a relation that we
> want to scan to search for rows not satisfying the constraint?  I think
> it should be enough to get ShareLock, which prevents INSERT, no?  I have
> a feeling I'm missing something here, but I don't know what, and all
> tests pass with that change.

Mmm. I'm not sure if there's a lock-upgrade case, but the
following sentence is left at the end of one of the modified
comments. ATRewriteTables is called with AEL after that (which
ultimately has no effect in this case).

|   But we cannot risk a deadlock by taking
| * a weaker lock now and the stronger one only when needed.

I can't actually find places where the children's lock would be
upgraded, but this seems to just follow the lock-the-parent-first
principle.

By the way, check_default_allows_bound() (the CREATE TABLE case)
contains a similar usage of find_all_inheritors(default_rel,
AEL).

> Also: the change proposed to remove the get_default_oid_from_partdesc()
> call actually fixes the bug, but to me it was not at all obvious why.

CommandCounterIncrement updates the content of a relcache entry
via invalidation, which can be surprising for callers of a function
like StorePartitionBound.

CommandCounterIncrement
  ...
    RelationCacheInvalidateEntry
      RelationFlushRelation
        RelationClearRelation

> To figure out why, I first had to realize that
> ValidatePartitionConstraints was lying to me, both in name and in
> comments: it doesn't actually validate anything, it merely queues
> entries so that alter table's phase 3 would do the validation.  I found
> this extremely confusing, so I fixed the comments to match reality, and
> later decided to rename the function also.

That is reasonable. Removing the excessive expansion of lower-level
partitions is also reasonable. The previous code expanded
inheritances in a different manner for levels at odd and even
depths.

> At that point I was able to understand what the problem was: when
> attaching the default partition, we were setting the constraint to be
> validated for that partition to the correctly computed partition
> constraint; and later in the same function we would set the constraint
> to "the immature constraint" because we were now seeing that the default
> partition OID was not invalid.  So it was rather a bug in
> ValidatePartitionConstraints, in that it was accepting to set the
> expression to validate when another expression had already been set!  I
> added an assert to protect against this.  And then changed the decision
> of whether or not to run this block based on the attached partition
> being the default one or not; because as the "if" test was, it was just
> a recipe for confusion.  (Now, if you look carefully you realize that
> the new test for bound->is_default I added is kinda redundant: it can
> only be set if there was a default partition OID at start of the
> function, and therefore we know we're not adding a default partition.
> And for the case where we *are* adding the default partition, it is
> still Invalid because we don't re-read its own OID.  But relying on that
> seems brittle because it breaks if we happen to update the default
> partition OID in the middle of that function, which is what we were
> doing.  Hence the new is_default test.)

Seems reasonable. But even so, rereading defaultPartOid still
breaks the assumption that defaultPartOid holds the value from the
time we entered this function, and the added condition merely
conceals that fact.

So I think it should be an assertion.

| if (OidIsValid(defaultPartOid))
| {
|  /*
|   *  The command cannot be adding default partition if the
|   *  defaultPartOid is valid.
|   */
|  Assert(!cmd->bound->is_default);


> I looked at the implementation of ValidatePartitionConstraints and
> didn't like it, so I rewrote it also.
> 
> This comment is mistaken, too:
> -   /*
> -* Skip if the partition is itself a partitioned table.  We can only
> -* ever scan RELKIND_RELATION relations.
> -*/
> ... because it ignores the possibility of a partition being a foreign
> table.  The code seems to work, but I think there is no test to cover
> the case that a foreign table containing data that doesn't satisfy the
> constraint is attached, because when I set that case to return doing
> nothing (ie. ATGetQueueEntry is not called for a foreign partition), no
> test failed.

Foreign tables are intentionally not verified on attach (in my
understanding). ATRewriteTables ignores foreign tables, so verifying
them here would contradict what ATRewriteTables assumes. As I
understand it, foreign tables are assumed not to be restrictable by
local constraints after attaching, so the users are responsible for
the consistency.

> Generally speaking, I didn't like ATExecAttachPartition; it's doing too
> much that should have b

Re: Optimizing nested ConvertRowtypeExpr execution

2018-04-02 Thread Ashutosh Bapat
On Mon, Apr 2, 2018 at 1:40 AM, Andres Freund  wrote:
> Hi,
>
> On 2018-02-26 17:20:05 +0530, Ashutosh Bapat wrote:
>> In a multi-level partitioned table, a parent whole-row reference gets
>> translated into nested ConvertRowtypeExpr with child whole-row
>> reference as the leaf. During the execution, the child whole-row
>> reference gets translated into all the intermediate parents' whole-row
>> references, ultimately represented as parent's whole-row reference.
>> AFAIU, the intermediate translations are unnecessary. The leaf child
>> whole-row can be directly translated into top parent's whole-row
>> reference. Here's a WIP patch which does that by eliminating
>> intermediate ConvertRowtypeExprs during ExecInitExprRec().
>
> Why is this done appropriately at ExecInitExpr() time, rather than at
> plan time? Seems like eval_const_expressions() would be a bit more
> appropriate (being badly named aside...)?

That seems to be a better idea.  Here's a patch.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
From b99ae826931516a7c6ad981435e8ac9ef754f15e Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Tue, 3 Apr 2018 09:00:19 +0530
Subject: [PATCH] Optimize nested ConvertRowtypeExprs

A ConvertRowtypeExpr is used to translate a whole-row reference of a
child to that of a parent. The planner produces nested
ConvertRowtypeExprs while translating a whole-row reference of a leaf
partition in a multi-level partition hierarchy. The executor then
translates the whole-row reference from the leaf partition into all
the intermediate parents' whole-row references before arriving at the
final whole-row reference. It could instead translate the whole-row
reference from the leaf partition directly to the top-most parent's
whole-row reference, skipping any intermediate translations.

Ashutosh Bapat, reviewed by Andres Freund.
---
 src/backend/optimizer/util/clauses.c |   25 +
 1 file changed, 25 insertions(+)

diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index ed6b680..3af5033 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -3657,6 +3657,31 @@ eval_const_expressions_mutator(Node *node,
 	  context);
 			}
 			break;
+		case T_ConvertRowtypeExpr:
+			{
+				ConvertRowtypeExpr *cre = castNode(ConvertRowtypeExpr, node);
+				Expr	   *arg = cre->arg;
+				ConvertRowtypeExpr *newcre;
+
+				/*
+				 * In case of a nested ConvertRowtypeExpr, we can convert the
+				 * leaf row directly to the topmost row format without any
+				 * intermediate conversions.
+				 */
+				Assert(arg != NULL);
+				while (IsA(arg, ConvertRowtypeExpr))
+					arg = castNode(ConvertRowtypeExpr, arg)->arg;
+
+				newcre = makeNode(ConvertRowtypeExpr);
+				newcre->arg = arg;
+				newcre->resulttype = cre->resulttype;
+				newcre->convertformat = cre->convertformat;
+				newcre->location = cre->location;
+
+				if (IsA(arg, Const))
+					return ece_evaluate_expr((Node *) newcre);
+				return (Node *) newcre;
+			}
 		default:
 			break;
 	}
-- 
1.7.9.5



Re: BUG #14941: Vacuum crashes

2018-04-02 Thread Michael Paquier
On Fri, Mar 30, 2018 at 05:19:46PM -0700, Andres Freund wrote:
> On 2018-03-30 18:39:01 +, Bossart, Nathan wrote:
>> I think the most straightforward approach for fixing this is to add
>> skip-locked functionality in find_all_inheritors(),
>> find_inheritance_children(), and vac_open_indexes(), but I am curious
>> what others think.

For vac_open_indexes() it may make the most sense to introduce a new
try_open_index() which wraps try_relation_open() with a check on
the opened relation's relkind.
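
Roughly, such a wrapper could look like the sketch below (the name
follows the suggestion above; whether returning NULL on a relkind
mismatch is the right reporting is an open question):

static Relation
try_open_index(Oid indexId, LOCKMODE lockmode)
{
	Relation	rel = try_relation_open(indexId, lockmode);

	/* Relation went away concurrently */
	if (rel == NULL)
		return NULL;

	/* Not (or no longer) an index: treat it as not openable */
	if (rel->rd_rel->relkind != RELKIND_INDEX)
	{
		relation_close(rel, lockmode);
		return NULL;
	}

	return rel;
}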

> I'm actually wondering if we shouldn't just ignore this problem. While
> the other cases VACUUM (SKIP LOCKED) are intended to solve seem common,
> these seem less so. But it'd be a bit weird, too..

If simply ignored, wouldn't that cause deadlocks when one session runs
VACUUM (SKIP_LOCKED) while another session is running vacuum or
autovacuum?  That looks more dangerous to me than simply skipping a
lock that cannot be taken.
--
Michael




Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-04-02 Thread Amit Langote
Fujita-san,

On 2018/04/02 21:29, Etsuro Fujita wrote:
> (2018/04/02 18:49), Amit Langote wrote:
>> I looked at the new patch.  It looks good overall, although I have one
>> question -- IIUC, BeginForeignInsert() performs actions that are
>> equivalent of performing PlanForeignModify() + BeginForeignModify() for an
>> INSERT as if it was directly executed on a given foreign table/partition.
>> Did you face any problems in doing the latter itself in the core code?
>> Doing it that way would mean no changes to a FDW itself will be required
>> (and hence no need for additional APIs), but I may be missing something.
> 
> The biggest issue in performing PlanForeignModify() plus
> BeginForeignModify() instead of the new FDW API would be: can the core
> code create a valid-looking planner state passed to PlanForeignModify()
> such as the ModifyTable plan node or the query tree stored in PlannerInfo?

Hmm, I can see the point.  Passing mostly-dummy (garbage) PlannerInfo and
query tree from the core code to FDW seems bad.  By defining the new API
with a clean interface (receiving fully valid ModifyTableState), we're not
required to do that across the interface, but only inside the FDW's
implementation.  I was just slightly concerned that the new FDW function
would have to implement what would normally be carried out across multiple
special purpose callbacks, but maybe that's OK as long as it's clearly
documented what its job is.

Noticed a couple of things in the patch:

+ 
+  When this is called by a COPY FROM command, the
+  plan-related global data in mtstate is not provided
+  and the planSlot parameter of
+  ExecForeignInsert called for each inserted
tuple is

How about s/called/subsequently called/g?

+  NULL, wether the foreign table is the partition
chosen

Typo: s/wether/whether/g

Thanks,
Amit




Re: Add default role 'pg_access_server_files'

2018-04-02 Thread Michael Paquier
On Mon, Apr 02, 2018 at 05:09:21PM -0400, Stephen Frost wrote:
> * Michael Paquier (mich...@paquier.xyz) wrote:
>> No refactoring for pg_file_unlink and its v1.1?
> 
> I considered each function and thought about if it'd make sense to
> refactor them or if they were simple enough that the additional function
> wouldn't really be all that useful.  I'm open to revisiting that, but
> just want to make it clear that it was something I thought about and
> considered.  Since pg_file_unlink is basically two function calls, I
> didn't think it worthwhile to refactor those into their own function.

I don't mind if this is done your way.

>> The argument checks are exactly the same for pg_file_rename and
>> pg_file_rename_v1_1.  What about just passing fcinfo around to simplify
>> the patch?
> 
> In general, I prefer to keep the PG_FUNCTION_ARGS abstraction when we
> can.  Unfortunately, that does fall apart when wrapping an SRF as in
> pg_logdir_ls(), but with pg_file_rename we can maintain it and it's
> really not that much code to do so.  As with the refactoring of
> pg_file_unlink, this is something which could really go either way.

Okay...

> I'm not sure how useful it is to REVOKE the rights on the simple SQL
> function; that would just mean that an admin has to go GRANT the rights
> on that function as well as the three-argument version.

Indeed, I had a brain fade here.

> The more I think about it, the more I like the approach of just dropping
> these deprecated "alternative names for things in core" with the new
> adminpack version.  In terms of applications, as I understand it, they
> aren't used in the latest version of pgAdmin3 and they also aren't used
> with pgAdmin4, so I don't think we need to be worrying about supporting
> them in v11.

+1 to simplify the code a bit.
--
Michael




Re: [HACKERS] [PATCH] Lockable views

2018-04-02 Thread Yugo Nagata
On Mon, 2 Apr 2018 18:32:53 +0900
Yugo Nagata  wrote:

> On Thu, 29 Mar 2018 17:26:36 -0700
> Andres Freund  wrote:
> 
> Thank you for your comments. I attach a patch to fix the issues
> you've pointed out.

I found a typo in the documentation and attach the updated patch.

Regards,

> 
> > Hi,
> > 
> > On 2018-03-28 20:26:48 +0900, Yugo Nagata wrote:
> > > diff --git a/doc/src/sgml/ref/lock.sgml b/doc/src/sgml/ref/lock.sgml
> > > index b2c7203..96d477c 100644
> > > --- a/doc/src/sgml/ref/lock.sgml
> > > +++ b/doc/src/sgml/ref/lock.sgml
> > > @@ -46,6 +46,11 @@ LOCK [ TABLE ] [ ONLY ]  > > class="parameter">name [ * ]
> > >
> > >  
> > >
> > > +   When a view is specified to be locked, all relations appearing in the 
> > > view
> > > +   definition query are also locked recursively with the same lock mode. 
> > > +  
> > 
> > Trailing space added. I'd remove "specified to be" from the sentence.
> 
> Fixed.
> 
> > 
> > I think mentioning how this interacts with permissions would be a good
> > idea too. Given how operations use the view's owner to recurse, that's
> > not obvious. Should also explain what permissions are required to do the
> > operation on the view.
> 
> Added a description about permissions into the documentation.
> 
> > 
> > 
> > > @@ -86,15 +92,17 @@ RangeVarCallbackForLockTable(const RangeVar *rv, Oid 
> > > relid, Oid oldrelid,
> > >   return; /* woops, concurrently 
> > > dropped; no permissions
> > >* check */
> > >  
> > > - /* Currently, we only allow plain tables to be locked */
> > > - if (relkind != RELKIND_RELATION && relkind != RELKIND_PARTITIONED_TABLE)
> > > +
> > 
> > This newline looks spurious to me.
> 
> Removed.
> 
> > 
> > 
> > >  /*
> > > + * Apply LOCK TABLE recursively over a view
> > > + *
> > > + * All tables and views appearing in the view definition query are locked
> > > + * recursively with the same lock mode.
> > > + */
> > > +
> > > +typedef struct
> > > +{
> > > + Oid root_reloid;
> > > + LOCKMODE lockmode;
> > > + bool nowait;
> > > + Oid viewowner;
> > > + Oid viewoid;
> > > +} LockViewRecurse_context;
> > 
> > Probably wouldn't hurt to pgindent the larger changes in the patch.
> > 
> > 
> > > +static bool
> > > +LockViewRecurse_walker(Node *node, LockViewRecurse_context *context)
> > > +{
> > > + if (node == NULL)
> > > + return false;
> > > +
> > > + if (IsA(node, Query))
> > > + {
> > > + Query   *query = (Query *) node;
> > > + ListCell*rtable;
> > > +
> > > + foreach(rtable, query->rtable)
> > > + {
> > > + RangeTblEntry   *rte = lfirst(rtable);
> > > + AclResultaclresult;
> > > +
> > > + Oid relid = rte->relid;
> > > + char relkind = rte->relkind;
> > > + char *relname = get_rel_name(relid);
> > > +
> > > + /* The OLD and NEW placeholder entries in the view's 
> > > rtable are skipped. */
> > > + if (relid == context->viewoid &&
> > > + (!strcmp(rte->eref->aliasname, "old") || 
> > > !strcmp(rte->eref->aliasname, "new")))
> > > + continue;
> > > +
> > > + /* Currently, we only allow plain tables or views to be 
> > > locked. */
> > > + if (relkind != RELKIND_RELATION && relkind != 
> > > RELKIND_PARTITIONED_TABLE &&
> > > + relkind != RELKIND_VIEW)
> > > + continue;
> > > +
> > > + /* Check infinite recursion in the view definition. */
> > > + if (relid == context->root_reloid)
> > > + ereport(ERROR,
> > > + 
> > > (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> > > + errmsg("infinite recursion 
> > > detected in rules for relation \"%s\"",
> > > + 
> > > get_rel_name(context->root_reloid;
> > 
> > Hm, how can that happen? And if it can happen, why can it only happen
> > with the root relation?
> 
> For example, the following queries create infinite recursion of views.
> This is detected and an error is raised.
> 
>  create table t (i int);
>  create view v1 as select 1;
>  create view v2 as select * from v1;
>  create or replace view v1 as select * from v2;
>  begin;
>  lock v1;
>  abort;
> 
> However, I found that the previous patch could not handle the following
> situation in which the root relation itself doesn't have infinite recursion.
> 
>  create view v3 as select * from v1;
>  begin;
>  lock v3;
>  abort;
> 
> This is fixed in the attached patch.
> 
> Regards,
> 
> > 
> > Greetings,
> > 
> > Andres Freund
> 
> 
> -- 
> Yugo Nagata 


-- 
Yugo Nagata 
diff --git a/doc/src/sgml/ref/lock.sgml b/doc/src/sgml/ref/lock.sgml
index 96d477c..a225cea

Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-04-02 Thread Amit Langote
On 2018/04/02 21:26, Etsuro Fujita wrote:
> (2018/03/30 22:28), Alvaro Herrera wrote:
>> Etsuro Fujita wrote:
>>
>>> Now we have ON CONFLICT for partitioned tables, which requires the
>>> conversion map to be computed in ExecInitPartitionInfo, so I updated the
>>> patch so that we keep the former step in ExecInitPartitionInfo and
>>> ExecSetupPartitionTupleRouting and so that we just init the FDW in
>>> ExecInitExtraPartitionInfo (changed to ExecInitForeignRouting).  And I
>>> added
>>> comments to ExecInitForeignRouting and ExecPrepareTupleRouting.
>>
>> Hmm, I may be misreading what you mean here ... but as far as I
>> understand, ON CONFLICT does not support foreign tables, does it?
> 
> It doesn't, except for ON CONFLICT DO NOTHING without an inference
> specification.
> 
>> If
>> that is so, I'm not sure why the conversion map would be necessary, if
>> it is for ON CONFLICT alone.
> 
> We wouldn't need that for foreign partitions (When DO NOTHING with an
> inference specification or DO UPDATE on a partitioned table containing
> foreign partitions, the planner would throw an error before we get to
> ExecInitPartitionInfo).

Actually, as you might know, since it is not possible to create an index
on a partitioned table that has a foreign partition, there is no
possibility of supporting any form of ON CONFLICT that requires an
inference specification.

> The reason why I updated the patch that way is
> just to make the patch simple, but on reflection I don't think that's a
> good idea; I think we should delay the map-creation step as well as the
> FDW-initialization step for UPDATE subplan partitions as before for
> improved efficiency for UPDATE-of-partition-key.  However, I don't think
> it'd still be a good idea to delay those steps for partitions created by
> ExecInitPartitionInfo the same way as for the subplan partitions, because
> in that case it'd be better to perform CheckValidResultRel against a
> partition right after we do InitResultRelInfo for the partition in
> ExecInitPartitionInfo, as that would save cycles in cases when the
> partition is considered nonvalid by that check.

It seems like a good idea to perform CheckValidResultRel right after the
InitResultRelInfo in ExecInitPartitionInfo.  However, if the partition is
considered non-valid, that results in an error afaik, so I don't
understand the point about saving cycles.

> So, What I'm thinking is:
> a) for the subplan partitions, we delay those steps as before, and b) for
> the partitions created by ExecInitPartitionInfo, we do that check for a
> partition right after InitResultRelInfo in that function, (and if valid,
> we create a map and initialize the FDW for the partition in that function).

Sounds good to me.  I'm assuming that you're talking about delaying the
is-valid-for-insert-routing check (using CheckValidResultRel) and
parent-to-child map creation for a sub-plan result relation until
ExecPrepareTupleRouting().

On a related note, I wonder if it'd be a good idea to rename the flag
ri_PartitionIsValid to something that signifies that we only need it to be
true if we want to do tuple routing (aka insert) using it.  Maybe,
ri_PartitionReadyForRouting or ri_PartitionReadyForInsert.  I'm afraid
that ri_PartitionIsValid is a bit ambiguous, although I'm not saying the
name options I'm suggesting are the best.

Thanks,
Amit




Re: WIP: Covering + unique indexes.

2018-04-02 Thread Peter Geoghegan
On Mon, Apr 2, 2018 at 4:27 PM, Alexander Korotkov
 wrote:
> I thought about that some more and decided that it would be safer
> to use the 13th bit in the index tuple flags.  There is already an attempt to
> use the whole 6 bytes of the TID for non-heap-pointer information [1].  Thus, it
> would be safe to use the 13th bit to indicate an alternative offset meaning
> in pivot tuples, because it wouldn't block further work.  The revised patchset
> in the attachment implements it.

This is definitely not the only time someone has talked about this
13th bit -- it's quite coveted. It also came up with UPSERT, and with
WARM. That's just the cases that I can personally remember.

I'm glad that you found a way to make this work, that will keep things
flexible for future patches, and make testing easier. I think that we
can find a flexible representation that makes almost everyone happy.

> I don't know.  We still need an offset number to check the expected number
> of attributes.  Passing the index tuple as a separate attribute would be
> redundant and would open the door to extra possible errors.

You're right. I must have been tired when I wrote that. :-)

>> Do you store an attribute number in the "minus infinity" item (the
>> leftmost one of internal pages)? I guess that that should be zero,
>> because it's totally truncated.
>
>
> Yes, I store zero attributes in the "minus infinity" item.  See this
> part of the patch.

> However, note that I have to store (number_of_attributes + 1) in the offset
> in order to correctly store zero attributes.  Otherwise, an assertion
> is raised in the ItemPointerIsValid() macro.

Makes sense.
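
As an illustration of that (natts + 1) convention, the encoding could
be expressed with a pair of macros along these lines (the macro names
are invented; the ItemPointer accessors are the existing ones):

#include "access/itup.h"
#include "storage/itemptr.h"

/* Store natts + 1 in the pivot tuple's t_tid offset, so that "zero
 * attributes" is still a valid (non-zero) offset number. */
#define PivotTupleSetNAtts(itup, natts) \
	ItemPointerSetOffsetNumber(&(itup)->t_tid, (natts) + 1)
#define PivotTupleGetNAtts(itup) \
	(ItemPointerGetOffsetNumber(&(itup)->t_tid) - 1)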

> Yes.  But that depends on how the difficulty of adapting the patch to the
> big picture correlates with the difficulty that a non-adapted patch creates
> for that big picture.  My point was that the second difficulty isn't high.
> But if we can be satisfied with the implementation in the attached patchset
> (probably some small enhancements are still required), then the first
> difficulty isn't high either.

I think it's possible.

I didn't have time to look at this properly today, but I will try to
do so tomorrow.

Thanks
-- 
Peter Geoghegan



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Peter Geoghegan
On Mon, Apr 2, 2018 at 7:54 PM, Robert Haas  wrote:
> Also, this really does make it impossible to write reliable programs.
> Imagine that, while the server is running, somebody runs a program
> which opens a file in the data directory, calls fsync() on it, and
> closes it.  If the fsync() fails, postgres is now borked and has no
> way of being aware of the problem.  If we knew, we could PANIC, but
> we'll never find out, because the unrelated process ate the error.
> This is exactly the sort of ill-considered behavior that makes fcntl()
> locking nearly useless.

I fear that the conventional wisdom from the Kernel people is now "you
should be using O_DIRECT for granular control".  The LWN article
Thomas linked (https://lwn.net/Articles/718734) cites Ted Ts'o:

"Monakhov asked why a counter was needed; Layton said it was to handle
multiple overlapping writebacks. Effectively, the counter would record
whether a writeback had failed since the file was opened or since the
last fsync(). Ts'o said that should be fine; applications that want
more information should use O_DIRECT. For most applications, knowledge
that an error occurred somewhere in the file is all that is necessary;
applications that require better granularity already use O_DIRECT."
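
As a rough sketch of the O_DIRECT route being pointed to here
(Linux-specific; the 4096-byte alignment is just a common safe choice,
and a final fdatasync() is still needed for volatile device caches):

#define _GNU_SOURCE				/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	void	   *buf;
	int			fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0600);

	if (fd < 0 || posix_memalign(&buf, 4096, 4096) != 0)
		return 1;
	memset(buf, 0, 4096);

	/*
	 * The write bypasses the page cache, so an I/O failure surfaces as an
	 * error on this very call rather than at some later fsync().
	 */
	if (pwrite(fd, buf, 4096, 0) != 4096)
		return 1;

	(void) fdatasync(fd);		/* flush any volatile device write cache */
	free(buf);
	close(fd);
	return 0;
}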

-- 
Peter Geoghegan



Re: [HACKERS] MERGE SQL Statement for PG11

2018-04-02 Thread Andres Freund
Hi,

On 2018-04-02 19:40:12 -0700, Peter Geoghegan wrote:
> > So what happens if there's a concurrent insertion of a potentially
> > matching tuple?
> 
> It's not a special case. In all likelihood, you get a dup violation.
> This is a behavior that I argued for from an early stage.

Right. I think that should be mentioned in the comment...

Greetings,

Andres Freund



Re: [HACKERS] MERGE SQL Statement for PG11

2018-04-02 Thread Peter Geoghegan
On Mon, Apr 2, 2018 at 8:11 PM, Pavan Deolasee  wrote:
> Honestly I don't think Peter ever raised concerns about the join, though I
> could be missing early discussions when I wasn't paying attention. It's
> there from day 1. Peter raised concerns about the two RTE stuff which was
> necessitated when we added support for partitioned table. We discussed that
> at some length, with your inputs and agreed that it's not necessarily a bad
> thing and probably the only way to deal with partitioned tables.
>
> Personally, I don't see why an internal join is bad. That's what MERGE is
> doing anyways, so it closely matches with the overall procedure.

The issue is not that there is a join as such. It's how it's
represented in the parser, and how that affects other things. There is
a lot of special case logic to make it work.


-- 
Peter Geoghegan



Re: [HACKERS] MERGE SQL Statement for PG11

2018-04-02 Thread Pavan Deolasee
On Tue, Apr 3, 2018 at 8:31 AM, Robert Haas  wrote:

> On Mon, Apr 2, 2018 at 10:40 PM, Peter Geoghegan  wrote:
> > On Mon, Apr 2, 2018 at 7:18 PM, Andres Freund 
> wrote:
> >> I did a scan through this, as I hadn't been able to keep with the thread
> >> previously. Sorry if some of the things mentioned here have been
> >> discussed previously. I am just reading through the patch in its own
> >> order, so please excuse if there's things I remark on that only later
> >> fully make sense.
> >>
> >>
> >> later update: TL;DR: I don't think the parser / executor implementation
> >> of MERGE is architecturally sound.  I think creating hidden joins during
> >> parse-analysis to implement MERGE is a seriously bad idea and it needs
> >> to be replaced by a different executor structure.
> >
> > +1. I continue to have significant misgivings about this. It has many
> > consequences that we know about, and likely quite a few more that we
> > don't.
>
> +1.  I didn't understand from Peter's earlier comments that we were
> doing that, and I agree that it isn't a good design choice.
>
>
Honestly I don't think Peter ever raised concerns about the join, though I
could be missing early discussions when I wasn't paying attention. It's
there from day 1. Peter raised concerns about the two RTE stuff which was
necessitated when we added support for partitioned table. We discussed that
at some length, with your inputs and agreed that it's not necessarily a bad
thing and probably the only way to deal with partitioned tables.

Personally, I don't see why an internal join is bad. That's what MERGE is
doing anyways, so it closely matches with the overall procedure.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-04-02 Thread Robert Haas
On Mon, Apr 2, 2018 at 10:40 PM, Peter Geoghegan  wrote:
> On Mon, Apr 2, 2018 at 7:18 PM, Andres Freund  wrote:
>> I did a scan through this, as I hadn't been able to keep with the thread
>> previously. Sorry if some of the things mentioned here have been
>> discussed previously. I am just reading through the patch in its own
>> order, so please excuse if there's things I remark on that only later
>> fully make sense.
>>
>>
>> later update: TL;DR: I don't think the parser / executor implementation
>> of MERGE is architecturally sound.  I think creating hidden joins during
>> parse-analysis to implement MERGE is a seriously bad idea and it needs
>> to be replaced by a different executor structure.
>
> +1. I continue to have significant misgivings about this. It has many
> consequences that we know about, and likely quite a few more that we
> don't.

+1.  I didn't understand from Peter's earlier comments that we were
doing that, and I agree that it isn't a good design choice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Robert Haas
On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos  wrote:
> Given precisely that the dirty pages which cannot been written-out are
> practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> are essentially correct: the first call indicates that a writeback error
> indeed occurred, while subsequent calls have no reason to indicate an error
> (assuming no other errors occurred in the meantime).

Like other people here, I think this is 100% unreasonable, starting
with "the dirty pages which cannot been written out are practically
thrown away".  Who decided that was OK, and on the basis of what
wording in what specification?  I think it's always unreasonable to
throw away the user's data.  If the writes are going to fail, then let
them keep on failing every time.  *That* wouldn't cause any data loss,
because we'd never be able to checkpoint, and eventually the user
would have to kill the server uncleanly, and that would trigger
recovery.

Also, this really does make it impossible to write reliable programs.
Imagine that, while the server is running, somebody runs a program
which opens a file in the data directory, calls fsync() on it, and
closes it.  If the fsync() fails, postgres is now borked and has no
way of being aware of the problem.  If we knew, we could PANIC, but
we'll never find out, because the unrelated process ate the error.
This is exactly the sort of ill-considered behavior that makes fcntl()
locking nearly useless.

Even leaving that aside, a PANIC means a prolonged outage on a
production system - it could easily take tens of minutes or longer to
run recovery.  So saying "oh, just do that" is not really an answer.
Sure, we can do it, but it's like trying to lose weight by
intentionally eating a tapeworm.  Now, it's possible to shorten the
checkpoint_timeout so that recovery runs faster, but then performance
drops because data has to be fsync()'d more often instead of getting
buffered in the OS cache for the maximum possible time.  We could also
dodge this issue in another way: suppose that when we write a page
out, we don't consider it really written until fsync() succeeds.  Then
we wouldn't need to PANIC if an fsync() fails; we could just re-write
the page.  Unfortunately, this would also be terrible for performance,
for pretty much the same reasons: letting the OS cache absorb lots of
dirty blocks and do write-combining is *necessary* for good
performance.

> The error reporting is thus consistent with the intended semantics (which
> are sadly not properly documented). Repeated calls to fsync() simply do not
> imply that the kernel will retry to writeback the previously-failed pages,
> so the application needs to be aware of that. Persisting the error at the
> fsync() level would essentially mean moving application policy into the
> kernel.

I might accept this argument if I accepted that it was OK to decide
that an fsync() failure means you can forget that the write() ever
happened in the first place, but it's hard to imagine an application
that wants that behavior.  If the application didn't care about
whether the bytes really got to disk or not, it would not have called
fsync() in the first place.  If it does care, reporting the error only
once is never an improvement.
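
To make the failure mode concrete, here is a minimal standalone sketch
of the write/fsync/retry pattern at issue (it obviously won't fail on a
healthy disk; the point is what the second fsync() return value does
and does not mean on kernels behaving as described):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char		buf[8192];
	int			fd = open("datafile", O_WRONLY | O_CREAT, 0600);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	memset(buf, 0, sizeof(buf));
	if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
		perror("write");

	if (fsync(fd) != 0)
	{
		perror("first fsync");	/* the writeback error is reported here... */

		if (fsync(fd) == 0)		/* ...but a retry can report success */
			printf("retry succeeded; the failed pages may nonetheless be gone\n");
	}

	close(fd);
	return 0;
}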

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: Speedup of relation deletes during recovery

2018-04-02 Thread Tsunakawa, Takayuki
From: Jerry Sievers [mailto:gsiever...@comcast.net]
> Wonder if this is the case for streaming standbys replaying truncates
> also?

Yes, as I wrote in my previous mail, TRUNCATE is worse than DROP TABLE.

Regards
Takayuki Tsunakawa





Re: Diagonal storage model

2018-04-02 Thread Michael Paquier
On Sun, Apr 01, 2018 at 03:48:07PM +0300, Konstantin Knizhnik wrote:
> Attach please find patch with first prototype implementation. It
> provides about 3.14 times improvement of performance at most of TPC-H
> queries. 

Congratulations on finding a patch able to improve all workloads of
Postgres in such a simple and magic way, especially on this particular
date.  I would have used M_PI if I were you.
--
Michael




Re: [HACKERS] MERGE SQL Statement for PG11

2018-04-02 Thread Peter Geoghegan
On Mon, Apr 2, 2018 at 7:18 PM, Andres Freund  wrote:
> I did a scan through this, as I hadn't been able to keep with the thread
> previously. Sorry if some of the things mentioned here have been
> discussed previously. I am just reading through the patch in its own
> order, so please excuse if there's things I remark on that only later
> fully make sense.
>
>
> later update: TL;DR: I don't think the parser / executor implementation
> of MERGE is architecturally sound.  I think creating hidden joins during
> parse-analysis to implement MERGE is a seriously bad idea and it needs
> to be replaced by a different executor structure.

+1. I continue to have significant misgivings about this. It has many
consequences that we know about, and likely quite a few more that we
don't.

> So what happens if there's a concurrent insertion of a potentially
> matching tuple?

It's not a special case. In all likelihood, you get a dup violation.
This is a behavior that I argued for from an early stage.

> It seems fairly bad architecturally to me that we now have
> EvalPlanQual() loops in both this routine *and*
> ExecUpdate()/ExecDelete().

I think that that makes sense, because ExecMerge() doesn't use any of
the EPQ stuff from ExecUpdate() and ExecDelete(). It did at one point,
in fact, and it looked quite a lot worse IMV.

> *Gulp*.  I'm extremely doubtful this is a sane approach. At the very
> least the amount of hackery required to make existing code cope with
> the approach / duplicate code seems indicative of that.

I made a lot of the fact that there are two RTEs for the target, since
that has fairly obvious consequences during every stage of query
processing. However, I think you're right that the problem is broader
than that.

> +   if (pstate->p_target_relation->rd_rel->relhasrules)
> +   ereport(ERROR,
> +   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +errmsg("MERGE is not supported for relations with rules")));
>
> I guess we can add that later, but it's a bit sad that this won't work
> against views with INSTEAD OF triggers.

It also makes it hard to test deparsing support, which actually made
it in, in the end. That must be untested.

-- 
Peter Geoghegan



Re: [HACKERS] MERGE SQL Statement for PG11

2018-04-02 Thread Andres Freund
Hi,

On 2018-03-30 12:10:04 +0100, Simon Riggs wrote:
> On 29 March 2018 at 10:50, Simon Riggs  wrote:
> > On 28 March 2018 at 12:00, Pavan Deolasee  wrote:
> >
> >> v27 attached, though review changes are in
> >> the add-on 0005 patch.
> >
> > This all looks good now, thanks for making all of those changes.
> >
> > I propose [v27 patch1+patch3+patch5] as the initial commit candidate
> > for MERGE, with other patches following later before end CF.
> >
> > I propose to commit this tomorrow, 30 March, about 26 hours from now.
> > That will allow some time for buildfarm fixing/reversion before the
> > Easter weekend, then other patches to follow starting 2 April. That
> > then gives reasonable time to follow up on other issues that we will
> > no doubt discover fairly soon after commit, such as additional runs by
> > SQLsmith and more eyeballs.
>
> No problems found, but moving proposed commit to 2 April pm

I did a scan through this, as I hadn't been able to keep with the thread
previously. Sorry if some of the things mentioned here have been
discussed previously. I am just reading through the patch in its own
order, so please excuse if there's things I remark on that only later
fully make sense.


later update: TL;DR: I don't think the parser / executor implementation
of MERGE is architecturally sound.  I think creating hidden joins during
parse-analysis to implement MERGE is a seriously bad idea and it needs
to be replaced by a different executor structure.


@@ -3828,8 +3846,14 @@ struct AfterTriggersTableData
 	bool		before_trig_done;	/* did we already queue BS triggers? */
 	bool		after_trig_done;	/* did we already queue AS triggers? */
 	AfterTriggerEventList after_trig_events;	/* if so, saved list pointer */
-	Tuplestorestate *old_tuplestore;	/* "old" transition table, if any */
-	Tuplestorestate *new_tuplestore;	/* "new" transition table, if any */
+	/* "old" transition table for UPDATE, if any */
+	Tuplestorestate *old_upd_tuplestore;
+	/* "new" transition table for UPDATE, if any */
+	Tuplestorestate *new_upd_tuplestore;
+	/* "old" transition table for DELETE, if any */
+	Tuplestorestate *old_del_tuplestore;
+	/* "new" transition table INSERT, if any */
+	Tuplestorestate *new_ins_tuplestore;
 };

A comment somewhere explaining why we have all of these (presumably
because they can now all happen in the context of a single statement)
would be good.


@@ -5744,12 +5796,28 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			   newtup == NULL));
 
 	if (oldtup != NULL &&
-		((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-		 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
+		(event == TRIGGER_EVENT_DELETE && delete_old_table))
 	{
 		Tuplestorestate *old_tuplestore;
 
-		old_tuplestore = transition_capture->tcs_private->old_tuplestore;
+		old_tuplestore = transition_capture->tcs_private->old_del_tuplestore;
+
+		if (map != NULL)
+		{
+			HeapTuple	converted = do_convert_tuple(oldtup, map);
+
+			tuplestore_puttuple(old_tuplestore, converted);
+			pfree(converted);
+		}
+		else
+			tuplestore_puttuple(old_tuplestore, oldtup);
+	}

Very similar code is now repeated four times. Could you abstract this
into a separate function?
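
Something along these lines, perhaps (the helper name is made up here;
do_convert_tuple() and tuplestore_puttuple() are the calls already used
in the hunk above):

static void
TransitionTableAddTuple(Tuplestorestate *tuplestore, HeapTuple tuple,
						TupleConversionMap *map)
{
	if (map != NULL)
	{
		HeapTuple	converted = do_convert_tuple(tuple, map);

		tuplestore_puttuple(tuplestore, converted);
		pfree(converted);
	}
	else
		tuplestore_puttuple(tuplestore, tuple);
}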



--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -37,6 +37,16 @@ the plan tree returns the computed tuples to be updated, 
plus a "junk"
 one.  For DELETE, the plan tree need only deliver a CTID column, and the
 ModifyTable node visits each of those rows and marks the row deleted.

+MERGE runs one generic plan that returns candidate target rows. Each row
+consists of a super-row that contains all the columns needed by any of the
+individual actions, plus a CTID and a TABLEOID junk columns. The CTID column is

"plus a CTID and a TABLEOID junk columns" sounds a bit awkward?


+required to know if a matching target row was found or not and the TABLEOID
+column is needed to find the underlying target partition, in case when the
+target table is a partition table. If the CTID column is set we attempt to
+activate WHEN MATCHED actions, or if it is NULL then we will attempt to

"if it is NULL then" sounds wrong.

+activate WHEN NOT MATCHED actions. Once we know which action is activated we
+form the final result row and apply only those changes.
+
 XXX a great deal more documentation needs to be written here...

Are we using tableoids in a similar way in other places?



+/*
+ * Given OID of the partition leaf, return the index of the leaf in the
+ * partition hierarchy.
+ */
+int
+ExecFindPartitionByOid(PartitionTupleRouting *proute, Oid partoid)
+{
+   int i;
+
+   for (i = 0; i < proute->num_partitions; i++)
+   {
+   if (proute->partition_oids[i] == partoid)
+   break;
+   }
+
+   Assert(i < proute->num_partitions);
+ 

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Thomas Munro
On Tue, Apr 3, 2018 at 3:03 AM, Craig Ringer  wrote:
> I see little benefit to not just PANICing unconditionally on EIO, really. It
> shouldn't happen, and if it does, we want to be pretty conservative and
> adopt a data-protective approach.
>
> I'm rather more worried by doing it on ENOSPC. Which looks like it might be
> necessary from what I recall finding in my test case + kernel code reading.
> I really don't want to respond to a possibly-transient ENOSPC by PANICing
> the whole server unnecessarily.

Yeah, it'd be nice to give an administrator the chance to free up some
disk space after ENOSPC is reported, and stay up.  Running out of
space really shouldn't take down the database without warning!  The
question is whether the data remains in cache and marked dirty, so
that retrying is a safe option (since it's potentially gone from our
own buffers, so if the OS doesn't have it the only place your
committed data can definitely still be found is the WAL... recovery
time).  Who can tell us?  Do we need a per-filesystem answer?  Delayed
allocation is a somewhat filesystem-specific thing, so maybe.
Interestingly, there don't seem to be many operating systems that can
report ENOSPC from fsync(), based on a quick scan through some
documentation:

POSIX, AIX, HP-UX, FreeBSD, OpenBSD, NetBSD: no
Illumos/Solaris, Linux, macOS: yes

I don't know if macOS really means it or not; it just tells you to see
the errors for read(2) and write(2).  By the way, speaking of macOS, I
was curious to see if the common BSD heritage would show here.  Yeah,
somewhat.  It doesn't appear to keep buffers on writeback error, if
this is the right code[1] (though it could be handling it somewhere
else for all I know).

[1] https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695

-- 
Thomas Munro
http://www.enterprisedb.com



Re: 2018-03 Commitfest Summary (Andres #1)

2018-04-02 Thread Bruce Momjian
On Sat, Mar 31, 2018 at 10:08:34AM +0200, Fabien COELHO wrote:
> However, it helps explain the disagreements about pgbench features: pgbench
> internal developer-oriented use for postgres is somehow limited, and a lot
> of new features are suggested with an external performance benchmarking use
> in mind, about which core committers do not seem much interested.

Agreed.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Peter Geoghegan
On Mon, Apr 2, 2018 at 5:05 PM, Andres Freund  wrote:
>>Even accepting that (I personally go with surprising over violation, as
>>if my vote counted), it is highly unlikely that we will convince every
>>kernel team to declare "What fools we've been!" and push a change...
>>and even if they did, PostgreSQL can look forward to many years of
>>running on kernels with the broken semantics.  Given that, I think the
>>PANIC option is the soundest one, as unappetizing as it is.
>
> Don't we pretty much already have agreement in that? And Craig is the main 
> proponent of it?

I wonder how bad it will be in practice if we PANIC. Craig said "This
isn't as bad as it seems because AFAICS fsync only returns EIO in
cases where we should be stopping the world anyway, and many FSes will
do that for us". It would be nice to get more information on that.

-- 
Peter Geoghegan



Re: check_ssl_key_file_permissions should be in be-secure-common.c

2018-04-02 Thread Michael Paquier
On Mon, Apr 02, 2018 at 11:39:57AM -0400, Peter Eisentraut wrote:
> committed

Thanks for the commit, Peter.
--
Michael




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Christophe Pettus

> On Apr 2, 2018, at 17:05, Andres Freund  wrote:
> 
> Don't we pretty much already have agreement in that? And Craig is the main 
> proponent of it?

For sure on the second sentence; the first was not clear to me.

--
-- Christophe Pettus
   x...@thebuild.com




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Andres Freund


On April 2, 2018 5:03:39 PM PDT, Christophe Pettus  wrote:
>
>> On Apr 2, 2018, at 16:27, Craig Ringer  wrote:
>> 
>> They're undocumented and extremely surprising semantics that are
>arguably a violation of the POSIX spec for fsync(), or at least a
>surprising interpretation of it.
>
>Even accepting that (I personally go with surprising over violation, as
>if my vote counted), it is highly unlikely that we will convince every
>kernel team to declare "What fools we've been!" and push a change...
>and even if they did, PostgreSQL can look forward to many years of
>running on kernels with the broken semantics.  Given that, I think the
>PANIC option is the soundest one, as unappetizing as it is.

Don't we pretty much already have agreement in that? And Craig is the main 
proponent of it?

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Christophe Pettus

> On Apr 2, 2018, at 16:27, Craig Ringer  wrote:
> 
> They're undocumented and extremely surprising semantics that are arguably a 
> violation of the POSIX spec for fsync(), or at least a surprising 
> interpretation of it.

Even accepting that (I personally go with surprising over violation, as if my 
vote counted), it is highly unlikely that we will convince every kernel team to 
declare "What fools we've been!" and push a change... and even if they did, 
PostgreSQL can look forward to many years of running on kernels with the broken 
semantics.  Given that, I think the PANIC option is the soundest one, as 
unappetizing as it is.

--
-- Christophe Pettus
   x...@thebuild.com




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Craig Ringer
On 3 April 2018 at 07:05, Anthony Iliopoulos  wrote:

> Hi Stephen,
>
> On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:
> >
> > fsync() doesn't reflect the status of given pages, however, it reflects
> > the status of the file descriptor, and as such the file, on which it's
> > called.  This notion that fsync() is actually only responsible for the
> > changes which were made to a file since the last fsync() call is pure
> > foolishness.  If we were able to pass a list of pages or data ranges to
> > fsync() for it to verify they're on disk then perhaps things would be
> > different, but we can't, all we can do is ask to "please flush all the
> > dirty pages associated with this file descriptor, which represents this
> > file we opened, to disk, and let us know if you were successful."
> >
> > Give us a way to ask "are these specific pages written out to persistant
> > storage?" and we would certainly be happy to use it, and to repeatedly
> > try to flush out pages which weren't synced to disk due to some
> > transient error, and to track those cases and make sure that we don't
> > incorrectly assume that they've been transferred to persistent storage.
>
> Indeed fsync() is simply a rather blunt instrument and a narrow legacy
> interface but further changing its established semantics (no matter how
> unreasonable they may be) is probably not the way to go.
>

They're undocumented and extremely surprising semantics that are arguably a
violation of the POSIX spec for fsync(), or at least a surprising
interpretation of it.

So I don't buy this argument.

> It really turns out that this is not how the fsync() semantics work
> though, exactly because the nature of the errors: even if the kernel
> retained the dirty bits on the failed pages, retrying persisting them
> on the same disk location would simply fail.


*might* simply fail.

It depends on why the error ocurred.

I originally identified this behaviour on a multipath system. Multipath
defaults to "throw the writes away, nobody really cares anyway" on error.
It seems to figure a higher level will retry, or the application will
receive the error and retry.

(See no_path_retry in multipath config. AFAICS the default is insanely
dangerous and only suitable for specialist apps that understand the quirks;
you should use no_path_retry=queue).

> Instead the kernel opts for marking those pages clean (since there is
> no other recovery strategy), and reporting once to the caller who can
> potentially deal with it in some manner. It is sadly a bad and
> undocumented convention.
>


It could mark the FD.

It's not just undocumented, it's a slightly creative interpretation of the
POSIX spec for fsync.


> > Consider two independent programs where the first one writes to a file
> > and then calls the second one whose job it is to go out and fsync(),
> > perhaps async from the first, those files.  Is the second program
> > supposed to go write to each page that the first one wrote to, in order
> > to ensure that all the dirty bits are set so that the fsync() will
> > actually return if all the dirty pages are written?
>
> I think what you have in mind are the semantics of sync() rather
> than fsync(), but as long as an application needs to ensure data
> are persisted to storage, it needs to retain those data in its heap
> until fsync() is successful instead of discarding them and relying
> on the kernel after write().


This is almost exactly what we tell application authors using PostgreSQL:
the data isn't written until you receive a successful commit confirmation,
so you'd better not forget it.

We provide applications with *clear boundaries* so they can know *exactly*
what was, and was not, written. I guess the argument from the kernel is the
same is true: whatever was written since the last *successful* fsync is
potentially lost and must be redone.

But the fsync behaviour is utterly undocumented and dubiously standard.


> I think we are conflating a few issues here: having the OS kernel being
> responsible for error recovery (so that subsequent fsync() would fix
> the problems) is one. This clearly is a design which most kernels have
> not really adopted for reasons outlined above


[citation needed]

What do other major platforms do here? The post above suggests it's a bit
of a mix of behaviours.

> Now, there is the issue of granularity of
> error reporting: userspace could benefit from a fine-grained indication
> of failed pages (or file ranges).


Yep. I looked at AIO in the hopes that, if we used AIO, we'd be able to map
a sync failure back to an individual AIO write.

But it seems AIO just adds more problems and fixes none. Flush behaviour
with AIO from what I can tell is inconsistent version to version and
generally unhelpful. The kernel should really report such sync failures
back to the app on its AIO write mapping, but it seems nothing of the sort
happens.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Developmen

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Andres Freund
On 2018-04-03 01:05:44 +0200, Anthony Iliopoulos wrote:
> Would using sync_file_range() be helpful? Potential errors would only
> apply to pages that cover the requested file ranges. There are a few
> caveats though:

To quote sync_file_range(2):
   Warning
   This  system  call  is  extremely  dangerous and should not be used in 
portable programs.  None of these operations writes out the
   file's metadata.  Therefore, unless the application is strictly 
performing overwrites of already-instantiated disk  blocks,  there
   are no guarantees that the data will be available after a crash.  There 
is no user interface to know if a write is purely an over‐
   write.  On filesystems using copy-on-write semantics (e.g., btrfs) an 
overwrite of existing allocated blocks is impossible.   When
   writing  into  preallocated  space,  many filesystems also require calls 
into the block allocator, which this system call does not
   sync out to disk.  This system call does not flush disk write caches and 
thus does not provide any data integrity on systems  with
   volatile disk write caches.

Given the lack of metadata safety, that seems entirely a no-go.  We use
sfr(2), but only to force the kernel's hand into writing back earlier
without throwing away cache contents.
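
For reference, that non-durable use looks roughly like this (this
mirrors how pg_flush_data() drives writeback; it is only a hint, and an
fsync() is still required for durability):

#define _GNU_SOURCE				/* for sync_file_range() */
#include <fcntl.h>

static void
hint_writeback(int fd, off_t offset, off_t nbytes)
{
	/* Start writeback of the range, but don't wait for it and don't
	 * treat it as an integrity operation. */
	(void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}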


> > > The application will need to deal with that first error irrespective of
> > > subsequent return codes from fsync(). Conceptually every fsync() 
> > > invocation
> > > demarcates an epoch for which it reports potential errors, so the caller
> > > needs to take responsibility for that particular epoch.
> > 
> > We do deal with that error- by realizing that it failed and later
> > *retrying* the fsync(), which is when we get back an "all good!
> > everything with this file descriptor you've opened is sync'd!" and
> > happily expect that to be truth, when, in reality, it's an unfortunate
> > lie and there are still pages associated with that file descriptor which
> > are, in reality, dirty and not sync'd to disk.
> 
> It really turns out that this is not how the fsync() semantics work
> though

Except on freebsd and solaris, and perhaps others.


>, exactly because of the nature of the errors: even if the kernel
> retained the dirty bits on the failed pages, retrying persisting them
> on the same disk location would simply fail.

That's not guaranteed at all, think NFS.


> Instead the kernel opts for marking those pages clean (since there is
> no other recovery strategy), and reporting once to the caller who can
> potentially deal with it in some manner. It is sadly a bad and
> undocumented convention.

It's broken behaviour justified post facto with the only rationale that
was available, which explains why it's so unconvincing. You could just
say "this ship has sailed, and it's too onerous to change because xxx"
and this'd be a done deal. But claiming this is reasonable behaviour is
ridiculous.

Again, you could just continue to error for this fd and still throw away
the data.


> > Consider two independent programs where the first one writes to a file
> > and then calls the second one whose job it is to go out and fsync(),
> > perhaps async from the first, those files.  Is the second program
> > supposed to go write to each page that the first one wrote to, in order
> > to ensure that all the dirty bits are set so that the fsync() will
> > actually return if all the dirty pages are written?
> 
> I think what you have in mind are the semantics of sync() rather
> than fsync()

If you open the same file with two fds, and write with one, and fsync
with another that's definitely supposed to work. And sync() isn't a
realistic replacement in any sort of way because it's obviously
systemwide, and thus entirely and completely unsuitable. Nor does it
have any sort of better error reporting behaviour, does it?
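
To make the expectation concrete, a minimal sketch (the file name is made
up, error handling kept to the bare minimum):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	int		wfd = open("datafile", O_WRONLY | O_CREAT, 0600);
	int		sfd = open("datafile", O_WRONLY);	/* second fd, same file */

	if (wfd < 0 || sfd < 0)
		exit(1);
	if (write(wfd, "hello", 5) != 5)	/* dirty the pages via wfd */
		exit(1);
	if (fsync(sfd) != 0)				/* flush the *file* via sfd */
		perror("fsync");
	return 0;
}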

Greetings,

Andres Freund



Re: Creating streaming replication standby

2018-04-02 Thread Tatsuo Ishii
> A complaint that I have about such concepts is that it creates a
> dependency between utilities which are usually classified by
> distributions as client-side (pg_basebackup goes to postgresql-client
> for Debian), and the server itself.  There is perhaps room for an
> extension though.  If many vendors are providing such an option at SQL
> level, it would mean that there is demand for it.

I think the new function/SQL would not necessarily be bound to
pg_basebackup. Rather, a new GUC would be invented to specify which tool
is executed by the function/SQL for the backup. For example,

create_streaming_standby_command = '/path/to/create_standby.sh %h %p %d'

where %h is the host, %p is the port on which the postmaster is
listening, and %d is the database cluster directory for the standby
server to be created.

Then the new function (suppose it is called
"create_streaming_standby") would be called something like:

create_streaming_standby('standby_host', 5432, '/usr/local/pgsql/data')

and it executes create_standby.sh with the arguments.
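
A rough sketch of the placeholder expansion, along the lines of how
archive_command is handled today (all names are hypothetical and buffer
bounds checking is omitted to keep the sketch short; this is not a
worked-out proposal):

/*
 * Expand %h, %p and %d in the configured command string and hand the
 * result to the shell.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int
run_create_standby(const char *cmd_template,
				   const char *host, int port, const char *datadir)
{
	char		cmd[8192];
	char		portstr[16];
	char	   *dp = cmd;
	const char *sp;

	snprintf(portstr, sizeof(portstr), "%d", port);
	for (sp = cmd_template; *sp; sp++)
	{
		const char *sub = NULL;

		if (sp[0] == '%' && sp[1] == 'h')
			sub = host;
		else if (sp[0] == '%' && sp[1] == 'p')
			sub = portstr;
		else if (sp[0] == '%' && sp[1] == 'd')
			sub = datadir;

		if (sub)
		{
			strcpy(dp, sub);
			dp += strlen(sub);
			sp++;				/* skip the format character */
		}
		else
			*dp++ = *sp;
	}
	*dp = '\0';

	return system(cmd);			/* nonzero means the script failed */
}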

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Anthony Iliopoulos
Hi Stephen,

On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:
>
> fsync() doesn't reflect the status of given pages, however, it reflects
> the status of the file descriptor, and as such the file, on which it's
> called.  This notion that fsync() is actually only responsible for the
> changes which were made to a file since the last fsync() call is pure
> foolishness.  If we were able to pass a list of pages or data ranges to
> fsync() for it to verify they're on disk then perhaps things would be
> different, but we can't, all we can do is ask to "please flush all the
> dirty pages associated with this file descriptor, which represents this
> file we opened, to disk, and let us know if you were successful."
>
> Give us a way to ask "are these specific pages written out to persistant
> storage?" and we would certainly be happy to use it, and to repeatedly
> try to flush out pages which weren't synced to disk due to some
> transient error, and to track those cases and make sure that we don't
> incorrectly assume that they've been transferred to persistent storage.

Indeed fsync() is simply a rather blunt instrument and a narrow legacy
interface but further changing its established semantics (no matter how
unreasonable they may be) is probably not the way to go.

Would using sync_file_range() be helpful? Potential errors would only
apply to pages that cover the requested file ranges. There are a few
caveats though:

(a) it still messes with the top-level error reporting so mixing it
with callers that use fsync() and do care about errors will produce
the same issue (clearing the error status).

(b) the error-reporting granularity is coarse (failure reporting applies
to the entire requested range so you still don't know which particular
pages/file sub-ranges failed writeback)

(c) the same "report and forget" semantics apply to repeated invocations
of the sync_file_range() call, so again action will need to be taken
upon first error encountered for the particular ranges.

> > The application will need to deal with that first error irrespective of
> > subsequent return codes from fsync(). Conceptually every fsync() invocation
> > demarcates an epoch for which it reports potential errors, so the caller
> > needs to take responsibility for that particular epoch.
> 
> We do deal with that error- by realizing that it failed and later
> *retrying* the fsync(), which is when we get back an "all good!
> everything with this file descriptor you've opened is sync'd!" and
> happily expect that to be truth, when, in reality, it's an unfortunate
> lie and there are still pages associated with that file descriptor which
> are, in reality, dirty and not sync'd to disk.

It really turns out that this is not how the fsync() semantics work
though, exactly because of the nature of the errors: even if the kernel
retained the dirty bits on the failed pages, retrying persisting them
on the same disk location would simply fail. Instead the kernel opts
for marking those pages clean (since there is no other recovery
strategy), and reporting once to the caller who can potentially deal
with it in some manner. It is sadly a bad and undocumented convention.

> Consider two independent programs where the first one writes to a file
> and then calls the second one whose job it is to go out and fsync(),
> perhaps async from the first, those files.  Is the second program
> supposed to go write to each page that the first one wrote to, in order
> to ensure that all the dirty bits are set so that the fsync() will
> actually return if all the dirty pages are written?

I think what you have in mind are the semantics of sync() rather
than fsync(), but as long as an application needs to ensure data
are persisted to storage, it needs to retain those data in its heap
until fsync() is successful instead of discarding them and relying
on the kernel after write(). The pattern should be roughly like:
write() -> fsync() -> free(), rather than write() -> free() -> fsync().
For example, if a partition gets full upon fsync(), then the application
has a chance to persist the data in a different location, while
the kernel cannot possibly make this decision and recover.
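
In code, that pattern amounts to something like this sketch:

#include <unistd.h>

/*
 * Keep the application's copy of the data until fsync() has succeeded, so
 * a failure can be handled (e.g. retried to a different location) rather
 * than the data being silently lost.
 */
static int
durable_write(int fd, const void *buf, size_t len)
{
	if (write(fd, buf, len) != (ssize_t) len)
		return -1;				/* keep buf and deal with the error */
	if (fsync(fd) != 0)
		return -1;				/* keep buf: the kernel may have dropped the pages */
	return 0;					/* only now is it safe to free/reuse buf */
}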

> > Callers that are not affected by the potential outcome of fsync() and
> > do not react on errors, have no reason for calling it in the first place
> > (and thus masking failure from subsequent callers that may indeed care).
> 
> Reacting on an error from an fsync() call could, based on how it's
> documented and actually implemented in other OS's, mean "run another
> fsync() to see if the error has resolved itself."  Requiring that to
> mean "you have to go dirty all of the pages you previously dirtied to
> actually get a subsequent fsync() to do anything" is really just not
> reasonable- a given program may have no idea what was written to
> previously nor any particular reason to need to know, on the expectation
> that the fsync() call will flush any dirty pages, as it's documented to do.

Re: Disabling memory display in EXPLAIN ANALYZE

2018-04-02 Thread Andres Freund
Hi,

On 2018-04-02 17:01:08 -0400, Stephen Frost wrote:
> * Andres Freund (and...@anarazel.de) wrote:
> > i.e. memory consumption differs between environments. Which isn't
> > surprising.
> > 
> > I wonder if we could disable the display with a separate switch or tie
> > it to !'COSTS OFF && TIMING OFF' or such?
> 
> Yeah, I agree with having these suppressed when we're running the
> regression tests.  I don't particularly care for having them suppressed
> under either costs or timing off though, as those are really pretty
> specific and independent flags.

We already put a number of loosely related things in there, so I don't
think that'd be that weird to continue along those lines.


> Perhaps what we need is a flag that says 'regression mode'...

DWIM mode? reproducible mode?

Greetings,

Andres Freund



Re: Rethinking -L switch handling and construction of LDFLAGS

2018-04-02 Thread Tom Lane
So here's a draft patch for this.  It fixes the originally complained-of
issue about --with-libxml breaking the build on Macs.  I've not tested
it very far beyond that plus a sanity cross-check on Linux, but I doubt
there's any point in further manual testing; we might as well just
throw it at the buildfarm.

I'm pretty happy with the way this turned out.  It's not a complicated
change, and we're no longer breaking our own rules about how to manage
LDFLAGS adjustments.

Some notes:

* I ended up adding not only LDFLAGS_INTERNAL and SHLIB_LINK_INTERNAL,
but also PG_LIBS_INTERNAL.  As things stood, the only uses of PG_LIBS
were for within-tree library references, so that it was fine that the
make rule that uses this variable put it before LDFLAGS.  But if anyone
ever tried to use PG_LIBS for the seemingly obvious purpose of referencing
an external library, we'd be right back in hazard land.  So that variable
needs to be split as well.

* I documented SHLIB_LINK_INTERNAL and PG_LIBS_INTERNAL in the comments
in pgxs.mk, but not in the user-facing documentation about using PGXS.
This is on the theory that only within-tree modules need either of those
variables, so users don't need them.  I could be convinced otherwise,
though.

* Some of these changes aren't really essential, eg the change in
hstore_plperl/Makefile to use SHLIB_LINK_INTERNAL instead of SHLIB_LINK;
since there's no -L switch there, it'd work either way.  But I thought
it best to maintain a consistent rule of using the _INTERNAL variable
for within-tree references.

* The change to STD_LDFLAGS in src/common/Makefile is because I noticed
that "pg_config --ldflags" has been printing "-L../../../src/common",
which it surely should not.  Apparently we forgot to fix this rule when
we rearranged things to create src/common/.  I'm tempted to make the
rule be

STD_LDFLAGS := $(filter-out $(LDFLAGS_INTERNAL),$(LDFLAGS))

in hopes of preventing a repetition of such silliness, but desisted
for the moment, because I'm not quite sure if this is a good idea,
or if it risks taking too much out.  Anybody have an opinion about it?

regards, tom lane

diff --git a/contrib/dblink/Makefile b/contrib/dblink/Makefile
index 5189758..b1a5e06 100644
--- a/contrib/dblink/Makefile
+++ b/contrib/dblink/Makefile
@@ -3,7 +3,7 @@
 MODULE_big = dblink
 OBJS	= dblink.o $(WIN32RES)
 PG_CPPFLAGS = -I$(libpq_srcdir)
-SHLIB_LINK = $(libpq)
+SHLIB_LINK_INTERNAL = $(libpq)
 
 EXTENSION = dblink
 DATA = dblink--1.2.sql dblink--1.1--1.2.sql dblink--1.0--1.1.sql \
diff --git a/contrib/hstore_plperl/Makefile b/contrib/hstore_plperl/Makefile
index c0906db..f63cba2 100644
--- a/contrib/hstore_plperl/Makefile
+++ b/contrib/hstore_plperl/Makefile
@@ -28,7 +28,7 @@ ifeq ($(PORTNAME), win32)
 # these settings are the same as for plperl
 override CPPFLAGS += -DPLPERL_HAVE_UID_GID -Wno-comment
 # ... see silliness in plperl Makefile ...
-SHLIB_LINK += $(sort $(wildcard ../../src/pl/plperl/libperl*.a))
+SHLIB_LINK_INTERNAL += $(sort $(wildcard ../../src/pl/plperl/libperl*.a))
 else
 rpathdir = $(perl_archlibexp)/CORE
 SHLIB_LINK += $(perl_embed_ldflags)
diff --git a/contrib/hstore_plpython/Makefile b/contrib/hstore_plpython/Makefile
index 7ff787a..b81735a 100644
--- a/contrib/hstore_plpython/Makefile
+++ b/contrib/hstore_plpython/Makefile
@@ -26,7 +26,7 @@ endif
 # We must link libpython explicitly
 ifeq ($(PORTNAME), win32)
 # ... see silliness in plpython Makefile ...
-SHLIB_LINK += $(sort $(wildcard ../../src/pl/plpython/libpython*.a))
+SHLIB_LINK_INTERNAL += $(sort $(wildcard ../../src/pl/plpython/libpython*.a))
 else
 rpathdir = $(python_libdir)
 SHLIB_LINK += $(python_libspec) $(python_additional_libs)
diff --git a/contrib/jsonb_plpython/Makefile b/contrib/jsonb_plpython/Makefile
index 8c7090a..b3c98e6 100644
--- a/contrib/jsonb_plpython/Makefile
+++ b/contrib/jsonb_plpython/Makefile
@@ -26,7 +26,7 @@ endif
 # We must link libpython explicitly
 ifeq ($(PORTNAME), win32)
 # ... see silliness in plpython Makefile ...
-SHLIB_LINK += $(sort $(wildcard ../../src/pl/plpython/libpython*.a))
+SHLIB_LINK_INTERNAL += $(sort $(wildcard ../../src/pl/plpython/libpython*.a))
 else
 rpathdir = $(python_libdir)
 SHLIB_LINK += $(python_libspec) $(python_additional_libs)
diff --git a/contrib/ltree_plpython/Makefile b/contrib/ltree_plpython/Makefile
index bc7502b..7e988c7 100644
--- a/contrib/ltree_plpython/Makefile
+++ b/contrib/ltree_plpython/Makefile
@@ -26,7 +26,7 @@ endif
 # We must link libpython explicitly
 ifeq ($(PORTNAME), win32)
 # ... see silliness in plpython Makefile ...
-SHLIB_LINK += $(sort $(wildcard ../../src/pl/plpython/libpython*.a))
+SHLIB_LINK_INTERNAL += $(sort $(wildcard ../../src/pl/plpython/libpython*.a))
 else
 rpathdir = $(python_libdir)
 SHLIB_LINK += $(python_libspec) $(python_additional_libs)
diff --git a/contrib/oid2name/Makefile b/contrib/oid2name/Makefile
index 3414b4a..3eef8f6 100644
--- a/contrib/oid2name/Makefile
+++ b/contrib/oid2name/Makefile
@@ -7

Re: Add default role 'pg_access_server_files'

2018-04-02 Thread Stephen Frost
Michael, all,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Sun, Apr 01, 2018 at 09:39:02AM -0400, Stephen Frost wrote:
> > Thanks for checking.  Attached is an updated version which also includes
> > the changes for adminpack, done in a similar manner to how pgstattuple
> > was updated, as discussed.  Regression tests updated and extended a bit,
> > doc updates also included.
> > 
> > If you get a chance to take a look, that'd be great.  I'll do my own
> > review of it again also after stepping away for a day or so.
> 
> I have spotted some issues mainly in patch 3.

Thanks for taking a look.

> I am not sure what has happened to your editor, but git diff --check is
> throwing a dozen warnings coming from adminpack.c.

Ahhh, those came from switching over to tmux..  I need to figure out how
to get it to copy/paste like I had set up before with screen.  Anyhow,
they were all in patch 3 and I've fixed them all.

> c0cbe00 has stolen the OIDs your patch is using for the new roles, so
> patch 2 needs a refresh.

Fixed and generally rebased.

> @@ -68,6 +77,15 @@ convert_and_check_filename(text *arg, bool logAllowed)
> [...]
> +   /*
> +* Members of the 'pg_read_server_files' role are allowed to access any
> +* files on the server as the PG user, so no need to do any further checks
> +* here.
> +*/
> +   if (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
> +   return filename;

> So...  If a user has loaded adminpack v1.0 with Postgres v11, then
> convert_and_check_filename would actually be able to read paths out of
> the data folder for a user within pg_read_server_files, while with
> Postgres v10 then only paths within the data folder were allowed.  And
> that7s actually fine because a superuser check happens before entering
> in this code path.

Yes, all of the adminpack v1.0 code paths still have superuser checks,
similar to how the older pgstattuple code paths do.  When an upgrade to
adminpack v1.1 is done, the new v1_1 functions don't have those
superuser checks but the upgrade script REVOKE's execute rights from
public, so the right to execute the functions has to be explicitly
GRANT'd for non-superusers.

>  pg_file_rename(PG_FUNCTION_ARGS)
> +{
> +   text   *file1;
> +   text   *file2;
> +   text   *file3;
> +   boolresult;
> +
> +   if (PG_ARGISNULL(0) || PG_ARGISNULL(1))
> +   PG_RETURN_NULL();
> +
> +   file1 = PG_GETARG_TEXT_PP(0);
> +   file2 = PG_GETARG_TEXT_PP(1);
> +
> +   if (PG_ARGISNULL(2))
> +   file3 = NULL;
> +   else
> +   file3 = PG_GETARG_TEXT_PP(2);
> +
> +   requireSuperuser();

> Here requireSuperuser() should be called before looking at the
> argument values.

Moved up.

> No refactoring for pg_file_unlink and its v1.1?

I considered each function and thought about if it'd make sense to
refactor them or if they were simple enough that the additional function
wouldn't really be all that useful.  I'm open to revisiting that, but
just want to make it clear that it was something I thought about and
considered.  Since pg_file_unlink is basically two function calls, I
didn't think it worthwhile to refactor those into their own function.

> The argument checks are exactly the same for +pg_file_rename and
> pg_file_rename_v1_1.  What about just passing fcinfo around and simplifying
> the patch?

In general, I prefer to keep the PG_FUNCTION_ARGS abstraction when we
can.  Unfortunately, that does fall apart when wrapping an SRF as in
pg_logdir_ls(), but with pg_file_rename we can maintain it and it's
really not that much code to do so.  As with the refactoring of
pg_file_unlink, this is something which could really go either way.

> +CREATE OR REPLACE FUNCTION pg_catalog.pg_file_rename(text, text)
> +RETURNS bool
> +AS 'SELECT pg_catalog.pg_file_rename($1, $2, NULL::pg_catalog.text);'
> +LANGUAGE SQL VOLATILE STRICT;

> You forgot a REVOKE clause for pg_file_rename(text, text).

No, I explicitly didn't include it because that's a security-invoker SQL
function that basically doesn't do anything but turn around and call
pg_file_rename(), which will handle the privilege check:

=*> select pg_file_rename('abc','123');
ERROR:  permission denied for function pg_file_rename
CONTEXT:  SQL function "pg_file_rename" statement 1

I'm not sure how useful it is to REVOKE the rights on the simple SQL
function; that would just mean that an admin has to go GRANT the rights
on that function as well as the three-argument version.

> In adminpack.c, convert_and_check_filename is always called with false
> as second argument.  Why not drop it and use the version in
> genfile.c instead?  As far as I can see, both functions are the same.

Hmm.  I'm pretty sure that function was actually copied from adminpack
into core, so I'm not surprised that they're basically the same, but it
was implemented in core as static and I'm not really sure that we want
to export it- it wasn't when it was first copied into core, after all.

Also, they act

Re: [PATCH] Logical decoding of TRUNCATE

2018-04-02 Thread Christophe Pettus

> On Apr 2, 2018, at 11:50, Andres Freund  wrote:
> I'm not convinced. I think it's perfectly reasonable to want to truncate
> away data on the live node, but maintain it on an archival node.

+1.  We've had at least one specific request for this, in maintaining a data 
warehouse from an OLTP system.

--
-- Christophe Pettus
   x...@thebuild.com




Re: Disabling memory display in EXPLAIN ANALYZE

2018-04-02 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> i.e. memory consumption differs between environments. Which isn't
> surprising.
> 
> I wonder if we could disable the display with a separate switch or tie
> it to !'COSTS OFF && TIMING OFF' or such?

Yeah, I agree with having these suppressed when we're running the
regression tests.  I don't particularly care for having them suppressed
under either costs or timing off though, as those are really pretty
specific and independent flags.  Perhaps what we need is a flag that
says 'regression mode'...

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Stephen Frost
Greetings,

* Anthony Iliopoulos (ail...@altatus.com) wrote:
> On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:
> > On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:
> > > On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:
> > > > Throwing away the dirty pages *and* persisting the error seems a lot
> > > > more reasonable. Then provide a fcntl (or whatever) extension that can
> > > clear the error status in the few cases where the application wants
> > > > to gracefully deal with the case.
> > > 
> > > Given precisely that the dirty pages which cannot be written out are
> > > practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> > > are essentially correct: the first call indicates that a writeback error
> > > indeed occurred, while subsequent calls have no reason to indicate an 
> > > error
> > > (assuming no other errors occurred in the meantime).
> > 
> > Meh^2.
> > 
> > "no reason" - except that there's absolutely no way to know what state
> > the data is in. And that your application needs explicit handling of
> > such failures. And that one FD might be used in a lots of different
> > parts of the application, that fsyncs in one part of the application
> > might be an ok failure, and in another not.  Requiring explicit actions
> > to acknowledge "we've thrown away your data for unknown reason" seems
> > entirely reasonable.
> 
> As long as fsync() indicates error on first invocation, the application
> is fully aware that between this point of time and the last call to fsync()
> data has been lost. Persisting this error any further does not change this
> or add any new info - on the contrary it adds confusion as subsequent write()s
> and fsync()s on other pages can succeed, but will be reported as failures.

fsync() doesn't reflect the status of given pages, however, it reflects
the status of the file descriptor, and as such the file, on which it's
called.  This notion that fsync() is actually only responsible for the
changes which were made to a file since the last fsync() call is pure
foolishness.  If we were able to pass a list of pages or data ranges to
fsync() for it to verify they're on disk then perhaps things would be
different, but we can't, all we can do is ask to "please flush all the
dirty pages associated with this file descriptor, which represents this
file we opened, to disk, and let us know if you were successful."

Give us a way to ask "are these specific pages written out to persistant
storage?" and we would certainly be happy to use it, and to repeatedly
try to flush out pages which weren't synced to disk due to some
transient error, and to track those cases and make sure that we don't
incorrectly assume that they've been transferred to persistent storage.

> The application will need to deal with that first error irrespective of
> subsequent return codes from fsync(). Conceptually every fsync() invocation
> demarcates an epoch for which it reports potential errors, so the caller
> needs to take responsibility for that particular epoch.

We do deal with that error- by realizing that it failed and later
*retrying* the fsync(), which is when we get back an "all good!
everything with this file descriptor you've opened is sync'd!" and
happily expect that to be truth, when, in reality, it's an unfortunate
lie and there are still pages associated with that file descriptor which
are, in reality, dirty and not sync'd to disk.

Consider two independent programs where the first one writes to a file
and then calls the second one whose job it is to go out and fsync(),
perhaps async from the first, those files.  Is the second program
supposed to go write to each page that the first one wrote to, in order
to ensure that all the dirty bits are set so that the fsync() will
actually return if all the dirty pages are written?

> Callers that are not affected by the potential outcome of fsync() and
> do not react on errors, have no reason for calling it in the first place
> (and thus masking failure from subsequent callers that may indeed care).

Reacting on an error from an fsync() call could, based on how it's
documented and actually implemented in other OS's, mean "run another
fsync() to see if the error has resolved itself."  Requiring that to
mean "you have to go dirty all of the pages you previously dirtied to
actually get a subsequent fsync() to do anything" is really just not
reasonable- a given program may have no idea what was written to
previously nor any particular reason to need to know, on the expectation
that the fsync() call will flush any dirty pages, as it's documented to
do.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: BRIN FSM vacuuming questions

2018-04-02 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> I noticed that several places in brin_pageops.c do this sort of thing:
>> ...
>> ie they put *one* page into the index's free space map and then invoke a
>> scan of the index's entire FSM to ensure that that info is bubbled up to
>> the top immediately.  I wonder how carefully this was thought through.
>> We don't generally think that it's worth doing an FSM vacuum for a single
>> page worth of free space.

> Yeah, there was no precedent for doing that, but it was actually
> intentional: there were enough cases where we would extend the relation,
> only to have the operation fail for unrelated reasons (things like a
> concurrent heap insertion), so it was possible to lose free space very
> fast, and since BRIN indexes are slow to grow it would become
> noticeable.  We could have waited for FreeSpaceMapVacuum to occur
> naturally, but this would cause the index to bloat until next vacuum,
> which could happen a long time later.  Also, the idea behind BRIN is
> that the indexed tables are very large, so vacuuming is infrequent.

Fair enough.  It seemed like an odd thing, but as long as it was
intentional I'm good with it.

>> some places that do RecordPageWithFreeSpace follow it up with an immediate
>> FreeSpaceMapVacuum, and some don't.

> Hmm I analyzed the callsites carefully to ensure it wasn't possible to
> bail out without vacuuming in the normal corner cases.  Maybe you
> spotted some gap in my analysis -- or worse, maybe I had to change
> nearby code for unrelated reasons and broke it afterwards inadvertently.

Well, I did not trace the logic fully, but there are places that record
free space and don't have an obviously connected FreeSpaceMapVacuum
call.  Maybe those all expect the caller to do it.

> I agree that brin_getinsertbuffer should have a better comment to
> explain the intention (and that routine seems to carry the most guilt in
> failing to do the FSM vacuuming).

Got a proposal?

I can take care of the other stuff, unless you wanted to.

regards, tom lane



Disabling memory display in EXPLAIN ANALYZE

2018-04-02 Thread Andres Freund
Hi,

The, now reverted, MERGE patch used EXPLAIN (ANALYZE, COSTS OFF, TIMING
OFF) on a few queries. That doesn't seem like a bad idea when it's
interesting to display the plan and run a query without (interesting)
results.  Otherwise one has to duplicate the query.

But right now it triggers differences like:

***
*** 887,893 
   Hash Cond: (s.sid = t_1.tid)
   ->  Seq Scan on source s (actual rows=3 loops=1)
   ->  Hash (actual rows=3 loops=1)
!Buckets: 1024  Batches: 1  Memory Usage: 9kB
 ->  Seq Scan on target t_1 (actual rows=3 loops=1)
   Trigger merge_ard: calls=1
   Trigger merge_ari: calls=1
--- 887,893 
   Hash Cond: (s.sid = t_1.tid)
   ->  Seq Scan on source s (actual rows=3 loops=1)
   ->  Hash (actual rows=3 loops=1)
!Buckets: 1024  Batches: 1  Memory Usage: 5kB
 ->  Seq Scan on target t_1 (actual rows=3 loops=1)
   Trigger merge_ard: calls=1
   Trigger merge_ari: calls=1

***
*** 1320,1330 
   Merge Cond: (t_1.a = s.a)
   ->  Sort (actual rows=50 loops=1)
 Sort Key: t_1.a
!Sort Method: quicksort  Memory: 27kB
 ->  Seq Scan on ex_mtarget t_1 (actual rows=50 loops=1)
   ->  Sort (actual rows=100 loops=1)
 Sort Key: s.a
!Sort Method: quicksort  Memory: 33kB
 ->  Seq Scan on ex_msource s (actual rows=100 loops=1)
  (15 rows)
  
--- 1320,1330 
   Merge Cond: (t_1.a = s.a)
   ->  Sort (actual rows=50 loops=1)
 Sort Key: t_1.a
!Sort Method: quicksort  Memory: 19kB
 ->  Seq Scan on ex_mtarget t_1 (actual rows=50 loops=1)
   ->  Sort (actual rows=100 loops=1)
 Sort Key: s.a
!Sort Method: quicksort  Memory: 24kB
 ->  Seq Scan on ex_msource s (actual rows=100 loops=1)
  (15 rows)

i.e. memory consumption differs between environments. Which isn't
surprising.

I wonder if we could disable the display with a separate switch or tie
it to !'COSTS OFF && TIMING OFF' or such?
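
For illustration, in show_sort_info()/show_hash_info() the text output
could be gated roughly like this (a sketch only; whether to reuse the
existing es->costs/es->timing flags or add a dedicated option is exactly
the open question):

	/* sketch: only print environment-dependent memory figures when wanted */
	if (es->costs && es->timing)	/* or: if (es->memory), with a new option */
		appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
						 sortMethod, spaceType, spaceUsed);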

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Anthony Iliopoulos
On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:
> On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:
> > On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:
> > > Throwing away the dirty pages *and* persisting the error seems a lot
> > > more reasonable. Then provide a fcntl (or whatever) extension that can
> > clear the error status in the few cases where the application wants
> > > to gracefully deal with the case.
> > 
> > Given precisely that the dirty pages which cannot be written out are
> > practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> > are essentially correct: the first call indicates that a writeback error
> > indeed occurred, while subsequent calls have no reason to indicate an error
> > (assuming no other errors occurred in the meantime).
> 
> Meh^2.
> 
> "no reason" - except that there's absolutely no way to know what state
> the data is in. And that your application needs explicit handling of
> such failures. And that one FD might be used in a lots of different
> parts of the application, that fsyncs in one part of the application
> might be an ok failure, and in another not.  Requiring explicit actions
> to acknowledge "we've thrown away your data for unknown reason" seems
> entirely reasonable.

As long as fsync() indicates error on first invocation, the application
is fully aware that between this point of time and the last call to fsync()
data has been lost. Persisting this error any further does not change this
or add any new info - on the contrary it adds confusion as subsequent write()s
and fsync()s on other pages can succeed, but will be reported as failures.

The application will need to deal with that first error irrespective of
subsequent return codes from fsync(). Conceptually every fsync() invocation
demarcates an epoch for which it reports potential errors, so the caller
needs to take responsibility for that particular epoch.

Callers that are not affected by the potential outcome of fsync() and
do not react on errors, have no reason for calling it in the first place
(and thus masking failure from subsequent callers that may indeed care).

Best regards,
Anthony



Re: BRIN FSM vacuuming questions

2018-04-02 Thread Alvaro Herrera
Tom Lane wrote:
> I noticed that several places in brin_pageops.c do this sort of thing:
> 
> if (extended)
> {
> Assert(BlockNumberIsValid(newblk));
> RecordPageWithFreeSpace(idxrel, newblk, freespace);
> FreeSpaceMapVacuum(idxrel);
> }
> 
> ie they put *one* page into the index's free space map and then invoke a
> scan of the index's entire FSM to ensure that that info is bubbled up to
> the top immediately.  I wonder how carefully this was thought through.
> We don't generally think that it's worth doing an FSM vacuum for a single
> page worth of free space.

Yeah, there was no precedent for doing that, but it was actually
intentional: there were enough cases where we would extend the relation,
only to have the operation fail for unrelated reasons (things like a
concurrent heap insertion), so it was possible to lose free space very
fast, and since BRIN indexes are slow to grow it would become
noticeable.  We could have waited for FreeSpaceMapVacuum to occur
naturally, but this would cause the index to bloat until next vacuum,
which could happen a long time later.  Also, the idea behind BRIN is
that the indexed tables are very large, so vacuuming is infrequent.

I also considered the fact that we can save the newly acquired page in
relcache's "target block", but I was scared of relying just on that
because AFAIR it's easily lost on relcache invalidations.

> We could make these operations somewhat cheaper, in the wake of 851a26e26,
> by doing
> 
> -   FreeSpaceMapVacuum(idxrel);
> +   FreeSpaceMapVacuumRange(idxrel, newblk, newblk + 1);
> 
> so that only the relevant subtree of the FSM gets scanned.  But it seems
> at least as plausible to just drop the FreeSpaceMapVacuum calls entirely
> and let the next VACUUM fix up the map, like the entire rest of the
> system does.  This is particularly true because the code is inconsistent:
> some places that do RecordPageWithFreeSpace follow it up with an immediate
> FreeSpaceMapVacuum, and some don't.

Hmm I analyzed the callsites carefully to ensure it wasn't possible to
bail out without vacuuming in the normal corner cases.  Maybe you
spotted some gap in my analysis -- or worse, maybe I had to change
nearby code for unrelated reasons and broke it afterwards inadvertently.


> Or, perhaps, there's actually some method behind this madness and there's
> good reason to think that FreeSpaceMapVacuum calls in these specific
> places are worth the cost?  If so, a comment documenting the reasoning
> would be appropriate.

Range-vacuuming the FSM seems a good idea for these places -- no need to
trawl the rest of the tree.  I'm not too excited about dropping the FSM
vacuuming completely and waiting till regular VACUUM.  (Why do we call
this FSM operation "vacuum"?  It seems pretty confusing.)

I agree that brin_getinsertbuffer should have a better comment to
explain the intention (and that routine seems to carry the most guilt in
failing to do the FSM vacuuming).

> I'm also finding this code in brin_page_cleanup a bit implausible:
> 
> /* Measure free space and record it */
> freespace = br_page_get_freespace(page);
> if (freespace > GetRecordedFreeSpace(idxrel, BufferGetBlockNumber(buf)))
> {
> RecordPageWithFreeSpace(idxrel, BufferGetBlockNumber(buf), freespace);
> return true;
> }
> 
> because of the fact that GetRecordedFreeSpace will report a rounded-down
> result owing to the limited resolution of the FSM's numbers.  If we
> assume that the result of br_page_get_freespace is always maxaligned,
> then by my math there's a 75% chance (on maxalign-8 machines; more if
> maxalign is 4) that this will think there's a discrepancy even when
> there is none.  It seems questionable to me that it's worth checking
> GetRecordedFreeSpace at all.  If it really is, we'd need to get the
> recorded free space, invoke RecordPageWithFreeSpace, and then see if
> the result of GetRecordedFreeSpace has changed in order to decide whether
> anything really changed.  But that seems overly complicated, and again
> unlike anything done elsewhere.  Also, the result value is used to
> decide whether FreeSpaceMapVacuum is needed, but I'm unconvinced that
> it's a good idea to skip said call just because the leaf FSM values did
> not change; that loses the opportunity to clean up any upper-level FSM
> errors that may exist.
> 
> In short, I'd suggest making this RecordPageWithFreeSpace call
> unconditional, dropping the result value of brin_page_cleanup,
> and doing FreeSpaceMapVacuum unconditionally too, back in
> brin_vacuum_scan.  I don't see a reason for brin to be unlike
> the rest of the system here.

As I recall, this was just trying to save unnecessary FSM updates, and
failed to make the connection with lossiness in GetRecordedFreeSpace.
But since FSM updates are quite cheap anyway (since they're not WAL
logged), I agree with your proposed fix.
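
To spell out the shape we seem to be agreeing on (just a sketch, using the
functions quoted above, not a polished patch):

	/* in brin_page_cleanup(): record the page's free space unconditionally */
	freespace = br_page_get_freespace(page);
	RecordPageWithFreeSpace(idxrel, BufferGetBlockNumber(buf), freespace);

	/* ... and back in brin_vacuum_scan(), once per scan instead of per page: */
	FreeSpaceMapVacuum(idxrel);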

> Lastly, I notice that brin_va

Re: BUG #15112: Unable to run pg_upgrade with earthdistance extension

2018-04-02 Thread Bruce Momjian
On Fri, Mar 30, 2018 at 04:52:29PM -0400, Bruce Momjian wrote:
> On Wed, Mar 14, 2018 at 11:12:26PM +, PG Bug reporting form wrote:
> > The following bug has been logged on the website:
> > 
> > Bug reference:  15112
> > Logged by:  Keiko Oda
> > Email address:  keiko...@gmail.com
> > PostgreSQL version: 10.3
> > Operating system:   Ubuntu
> > Description:
> > 
> > This is similar to the bug report from 2016
> > (https://www.postgresql.org/message-id/20161015001424.1413.93...@wrigleys.postgresql.org).
> > where earthdistance extension is assuming that it's executed with public
> > schema in search_path.
> > 
> > With the 10.3 release, Postgres has tightened the search_path during pg_upgrade
> > only to pg_catalog. This is problematic as it's now impossible to run
> > pg_upgrade with earthdistance extension (it fails with create index with
> > ll_to_earth function).
> > Prior to this, we were at least able to workaround by adding public schema
> > in search_path with pg_upgrade.
> > 
> > As it's recommended in the release note, earthdistance should adjust these
> > functions to not assume anything about what search path they are invoked
> > under.
> 
> Uh, I can reproduce this failure.  :-(
> 
> I tested it by installing earthdistance (and cube) and an index using an
> earthdistance function in the old cluster and ran pg_upgrade.  I used
> the instructions in the referenced email:
> 
>   CREATE TABLE zip_code_geo_poly_data (
>   id serial   NOT NULL PRIMARY KEY,
>   zip_codeTEXT,
>   latitudeNUMERIC,
>   longitude   NUMERIC
>   );
>   
>   CREATE INDEX zip_code_geo_poly_data_ll_to_earth 
>   ON zip_code_geo_poly_data 
>   USING gist(ll_to_earth(latitude, longitude));
> 
> The failure is actually in pg_dump/restore, but pg_upgrade relies on
> that, so it fails too.  The failure output is:
> 
>   LINE 1: ...ians($1))*sin(radians($2))),earth()*sin(radians($1)))::earth
> ^
>   QUERY:  SELECT 
> cube(cube(cube(earth()*cos(radians($1))*cos(radians($2))),earth()*cos(radians($1))*sin(radians($2))),earth()*sin(radians($1)))::earth
>   CONTEXT:  SQL function "ll_to_earth" during inlining
>   Command was:
>   -- For binary upgrade, must preserve pg_class oids
>   SELECT 
> pg_catalog.binary_upgrade_set_next_index_pg_class_oid('16498'::pg_catalog.oid);
>   
>   CREATE INDEX "zip_code_geo_poly_data_ll_to_earth" ON 
> "public"."zip_code_geo_poly_data" USING "gist" 
> ("public"."ll_to_earth"(("latitude")::double precision, ("longitude")::double 
> precision));
> 
> The function definition of ll_to_earth() is (as defined in
> earthdistance--1.1.sql):
> 
>   CREATE FUNCTION ll_to_earth(float8, float8)
>   RETURNS earth
>   LANGUAGE SQL
>   IMMUTABLE STRICT
>   PARALLEL SAFE
>   AS 'SELECT 
> cube(cube(cube(earth()*cos(radians($1))*cos(radians($2))),earth()*cos(radians($1))*sin(radians($2))),earth()*sin(radians($1)))::earth';
> 
> As you can see, the SQL function is doing a cast to ::earth, but in this
> test case the earth cast is stored in the public schema, and the restore
> doesn't reference it.
> 
> I am not sure we can fix this without requiring people to drop and
> recreate such indexes.  However, I am even at a loss in how to fix the
> CREATE FUNCTION to reference a cast in the same schema as the function,
> in this case 'public'.  We can rewrite the cast to not use :: and use a
> function call with schema qualification. e.g. public.earth(), but how do
> we know what schema that is in, i.e. what if the extension is loaded
> into a schema other than public?  
> 
> FYI, earthdistance is certainly not the only case of this problem.  
> 
> This is also part of a larger problem with our new schema qualification
> approach.  While we are qualifying where the object is to be created, we
> are not always qualifying the object's contents, i.e. in this case, the
> SQL function body.

I have not received a reply to this so I am CCing the hackers list for
more input.  (If this is not the proper procedure, please let me know.)

The only additional thoughts I have are that when we removed public from
pg_dump's search_path and decided to instead fully qualify object
creation names, I was concerned about function creation failing because
of 'public' schema object references in the function body, but someone
pointed out that we don't check function bodies during restore because
of:

SET check_function_bodies = false;

However, what I did not consider was creation of indexes based on
function bodies with 'public' schema references.

I have run some tests and found out that only SQL functions cause such
failures.  I believe this is because they are inlined during CREATE
INDEX.  The attached pg_dump output file recreates the failure, and
shows that prefixing it with "public" avoids the failure, and shows that
index

fdw_startup/tuple_cost vs. parallel_startup/tuple_cost

2018-04-02 Thread Robert Haas
postgres_fdw has values called fdw_startup_cost and fdw_tuple_cost
that you can adjust in order to control the perceived cost of starting
a remote query and schlepping a tuple over the network.  Parallel
query, similarly, has parallel_startup_cost and parallel_tuple_cost.
Today, I noticed that on a relative basis, the defaults are insane.

parallel_startup_cost = 1000
fdw_startup_cost = 100

parallel_tuple_cost = 0.1
fdw_tuple_cost = 0.01

So, we think that starting a query on a remote node is 10x cheaper
than starting a worker within the same cluster, and similarly we think
that the cost of moving a tuple from the worker to the leader is 10x
more than the cost of moving a tuple from a remote node to the local
node.  I think that's gotta be wrong.  Communication within the same
database should be considerably *cheaper*, because we don't have to
convert tuples to text or fit them through the network pipe.

Another useful point of comparison is the cost of an Append node,
which is currently defined to be one-half of cpu_tuple_cost, or by
default 0.005 - or in other words half the cost of sending a tuple
over the network.  It's good that it's less, but I think the actual
cost of sending a tuple over the network is considerably more than
twice the cost of passing it through an Append node.  I suspect it's
at least 10x more expensive.

I don't think all of this mattered very much before we had join
pushdown, aggregate pushdown, and partition-wise join and aggregate,
but it does matter now.  I'm seeing the planner fail to push down
aggregates in cases that are obvious wins unless I crank up
fdw_tuple_cost.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Passing current_database to BackgroundWorkerInitializeConnection

2018-04-02 Thread Andres Freund
Hi,

On 2018-04-02 14:33:54 -0500, Jeremy Finzel wrote:
> Hmmm... not sure if I follow.  My goal is to run a SQL statement every 10
> seconds (or what value is chosen) in a particular database, using a
> background worker.  Those are the two arguments.  Am I missing some way to
> implement this apart from passing those 2 arguments into the launcher
> function?  Is the way to do this properly then to allocate shared memory
> for it, as opposed to trying to pass args into the main function?

Yes, that's the proper way. Allocate shared memory and pass a pointer to
that as the argument.
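
Roughly along these lines (illustrative sketch only; the struct and names
are made up, and the shared memory must have been reserved beforehand,
e.g. via RequestAddinShmemSpace() in _PG_init(), or placed in a DSM
segment).  The worker can then connect with
BackgroundWorkerInitializeConnectionByOid() using args->dboid:

#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/shmem.h"

typedef struct WorkerArgs
{
	Oid			dboid;			/* database to connect to */
	int			naptime;		/* seconds between executions */
} WorkerArgs;

static void
launch_worker(int naptime)
{
	WorkerArgs *args = (WorkerArgs *) ShmemAlloc(sizeof(WorkerArgs));
	BackgroundWorker worker;

	/*
	 * The main shared memory segment is mapped at the same address in every
	 * backend, so handing the worker a pointer (or an index/DSM handle)
	 * through bgw_main_arg works.
	 */
	args->dboid = MyDatabaseId;
	args->naptime = naptime;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
		BGWORKER_BACKEND_DATABASE_CONNECTION;
	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
	worker.bgw_restart_time = BGW_NEVER_RESTART;
	snprintf(worker.bgw_name, BGW_MAXLEN, "my worker");
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_extension");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "my_worker_main");
	worker.bgw_main_arg = PointerGetDatum(args);
	worker.bgw_notify_pid = MyProcPid;

	RegisterDynamicBackgroundWorker(&worker, NULL);
}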

Greetings,

Andres Freund



Re: Passing current_database to BackgroundWorkerInitializeConnection

2018-04-02 Thread Jeremy Finzel
On Mon, Apr 2, 2018 at 2:27 PM, Andres Freund  wrote:

> Hi,
>
> On 2018-04-02 14:24:53 -0500, Jeremy Finzel wrote:
> > Thank you, this makes sense.  However, how can this be done since I can
> > only pass one argument to bgw_main?  Is there any way to do this without
> > having to store the value in shared memory?
>
> No (I mean you can store it in the filesystem or such as well, but
> ...). Pretty fundamentally sharing data between concurrently running
> processes needs a medium to share the data over. The bgw infrastructure
> allocates just enough so you can put an index to it into
> shmem. Allocating more would be wasteful and/or not enough for some
> users.
>
> Greetings,
>
> Andres Freund
>

Hmmm... not sure if I follow.  My goal is to run a SQL statement every 10
seconds (or what value is chosen) in a particular database, using a
background worker.  Those are the two arguments.  Am I missing some way to
implement this apart from passing those 2 arguments into the launcher
function?  Is the way to do this properly then to allocate shared memory
for it, as opposed to trying to pass args into the main function?

Thanks,
Jeremy


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Andres Freund
On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:
> On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:
> > Throwing away the dirty pages *and* persisting the error seems a lot
> > more reasonable. Then provide a fcntl (or whatever) extension that can
> clear the error status in the few cases where the application wants
> > to gracefully deal with the case.
> 
> Given precisely that the dirty pages which cannot be written out are
> practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> are essentially correct: the first call indicates that a writeback error
> indeed occurred, while subsequent calls have no reason to indicate an error
> (assuming no other errors occurred in the meantime).

Meh^2.

"no reason" - except that there's absolutely no way to know what state
the data is in. And that your application needs explicit handling of
such failures. And that one FD might be used in a lots of different
parts of the application, that fsyncs in one part of the application
might be an ok failure, and in another not.  Requiring explicit actions
to acknowledge "we've thrown away your data for unknown reason" seems
entirely reasonable.


> The error reporting is thus consistent with the intended semantics (which
> are sadly not properly documented). Repeated calls to fsync() simply do not
> imply that the kernel will retry to writeback the previously-failed pages,
> so the application needs to be aware of that.

Which isn't what I've suggested.


> Persisting the error at the fsync() level would essentially mean
> moving application policy into the kernel.

Meh.

Greetings,

Andres Freund



Re: Passing current_database to BackgroundWorkerInitializeConnection

2018-04-02 Thread Andres Freund
Hi,

On 2018-04-02 14:24:53 -0500, Jeremy Finzel wrote:
> Thank you, this makes sense.  However, how can this be done since I can
> only pass one argument to bgw_main?  Is there any way to do this without
> having to store the value in shared memory?

No (I mean you can store it in the filesystem or such as well, but
...). Pretty fundamentally sharing data between concurrently running
processes needs a medium to share the data over. The bgw infrastructure
allocates just enough so you can put an index to it into
shmem. Allocating more would be wasteful and/or not enough for some
users.

Greetings,

Andres Freund



Re: Passing current_database to BackgroundWorkerInitializeConnection

2018-04-02 Thread Jeremy Finzel
On Fri, Mar 30, 2018 at 5:37 PM, Andres Freund  wrote:

>
>
> On March 30, 2018 3:16:31 PM PDT, Jeremy Finzel  wrote:
> >> What do you mean with "current database"? Before you
> >> BackgroundWorkerInitializeConnection() there is no such thing?
> >
> >
> >My module is based directly off the worker_spi example. The worker is
> >dynamically launched via SQL command. But in the worker_spi example,
> >the
> >database postgres is just hardcoded as the database in which to start
> >the
> >background worker process. Instead, I want to start it in the database
> >in
> >which I run the SQL command.
>
> The started worker isn't associated with the original database. You can
> pass the database oid as an argument to the launched bgworker.
>
>
Thank you, this makes sense.  However, how can this be done since I can
only pass one argument to bgw_main?  Is there any way to do this without
having to store the value in shared memory?  I was going to try passing an
array instead of an int, but I'm not liking that much.  I am trying to pass
naptime and database_name (or oid).

Thanks,
Jeremy


Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

2018-04-02 Thread Alvaro Herrera
Why do we need AccessExclusiveLock on all children of a relation that we
want to scan to search for rows not satisfying the constraint?  I think
it should be enough to get ShareLock, which prevents INSERT, no?  I have
a feeling I'm missing something here, but I don't know what, and all
tests pass with that change.

Also: the change proposed to remove the get_default_oid_from_partdesc()
call actually fixes the bug, but to me it was not at all obvious why.

To figure out why, I first had to realize that
ValidatePartitionConstraints was lying to me, both in name and in
comments: it doesn't actually validate anything, it merely queues
entries so that alter table's phase 3 would do the validation.  I found
this extremely confusing, so I fixed the comments to match reality, and
later decided to rename the function also.

At that point I was able to understand what the problem was: when
attaching the default partition, we were setting the constraint to be
validated for that partition to the correctly computed partition
constraint; and later in the same function we would set the constraint
to "the immature constraint" because we were now seeing that the default
partition OID was not invalid.  So it was rather a bug in
ValidatePartitionConstraints, in that it was accepting to set the
expression to validate when another expression had already been set!  I
added an assert to protect against this.  And then changed the decision
of whether or not to run this block based on the attached partition
being the default one or not; because as the "if" test was, it was just
a recipe for confusion.  (Now, if you look carefully you realize that
the new test for bound->is_default I added is kinda redundant: it can
only be set if there was a default partition OID at start of the
function, and therefore we know we're not adding a default partition.
And for the case where we *are* adding the default partition, it is
still Invalid because we don't re-read its own OID.  But relying on that
seems brittle because it breaks if we happen to update the default
partition OID in the middle of that function, which is what we were
doing.  Hence the new is_default test.)

I looked at the implementation of ValidatePartitionConstraints and
didn't like it, so I rewrote it also.

This comment is mistaken, too:
-   /*
-* Skip if the partition is itself a partitioned table.  We can only
-* ever scan RELKIND_RELATION relations.
-*/
... because it ignores the possibility of a partition being a foreign
table.  The code seems to work, but I think there is no test to cover
the case that a foreign table containing data that doesn't satisfy the
constraint is attached, because when I set that case to return doing
nothing (ie. ATGetQueueEntry is not called for a foreign partition), no
test failed.


Generally speaking, I didn't like ATExecAttachPartition; it's doing too
much that should have been done in ATPrepAttachPartition.  The only
reason that's not broken is because we don't allow ATTACH PARTITION to
appear together with other commands in alter table, which seems
disappointing ... for example, when attaching multiple tables and a
default partition exist, you have to scan the default one multiple
times, which is lame.

(Speaking of lame: I noticed that RelationGetPartitionQual obtains lock
on the parent relation ... I wonder if this can be used to cause a
deadlock during InitResultRelationInfo.)

BTW, I think this is already broken for the case where the default
partition is partitioned and you attach a partition colliding with a row
that's being concurrently inserted in a partition of the default
partition, though I didn't bother to write a test for that.

Anyway, I'm just an onlooker fixing a CommandCounterIncrement change.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c8da82217d..357b1078f8 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -483,10 +483,9 @@ static void RemoveInheritance(Relation child_rel, Relation parent_rel);
 static ObjectAddress ATExecAttachPartition(List **wqueue, Relation rel,
 				  PartitionCmd *cmd);
 static void AttachPartitionEnsureIndexes(Relation rel, Relation attachrel);
-static void ValidatePartitionConstraints(List **wqueue, Relation scanrel,
-					 List *scanrel_children,
-					 List *partConstraint,
-					 bool validate_default);
+static void QueuePartitionConstraintValidation(List **wqueue, Relation scanrel,
-					       List *partConstraint,
+					       bool validate_default);
 static vo

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Anthony Iliopoulos
On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:
> > On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:
> > > Craig Ringer  writes:
> > > > So we should just use the big hammer here.
> > >
> > > And bitch, loudly and publicly, about how broken this kernel behavior is.
> > > If we make enough of a stink maybe it'll get fixed.
> > 
> > It is not likely to be fixed (beyond what has been done already with the
> > manpage patches and errseq_t fixes on the reporting level). The issue is,
> > the kernel needs to deal with hard IO errors at that level somehow, and
> > since those errors typically persist, re-dirtying the pages would not
> > really solve the problem (unless some filesystem remaps the request to a
> > different block, assuming the device is alive).
> 
> Throwing away the dirty pages *and* persisting the error seems a lot
> more reasonable. Then provide a fcntl (or whatever) extension that can
> clear the error status in the few cases where the application wants
> to gracefully deal with the case.

Given precisely that the dirty pages which cannot be written out are
practically thrown away, the semantics of fsync() (after the 4.13 fixes)
are essentially correct: the first call indicates that a writeback error
indeed occurred, while subsequent calls have no reason to indicate an error
(assuming no other errors occurred in the meantime).

The error reporting is thus consistent with the intended semantics (which
are sadly not properly documented). Repeated calls to fsync() simply do not
imply that the kernel will retry to writeback the previously-failed pages,
so the application needs to be aware of that. Persisting the error at the
fsync() level would essentially mean moving application policy into the
kernel.

Best regards,
Anthony



Re: Feature Request - DDL deployment with logical replication

2018-04-02 Thread Andres Freund
Hi,

On 2018-04-02 23:07:17 +0800, Craig Ringer wrote:
> We then lack any mechanism by which you can NACK, saying "I can't apply
> this".

Sure, but nothing forces this mechanism to be in-band.


> So upstream will wait indefinitely. I guess we just expect the user to
> intervene and ROLLBACK if they decide a replica isn't going to get the job
> done, or have checked the replica's logs and found it can't apply it for
> some hopefully-sane reason.
> 
> It's not like we'd auto-ROLLBACK PREPARED in response to a nack from a
> downstream anyway, so all we're missing is probably info in the upstream
> logs about which replica(s) cannot apply it and why.
> 
> OK. So it'd be a nice-to-have, but not vital.

I'm not sure that an in-band mechanism that's the same for all potential
users is flexible enough (actually unsure, not intimating it's wrong).
It doesn't seem crazy to do these checks over a separate
connection. That'd allow more flexible error handling etc.

Greetings,

Andres Freund



Re: [PATCH] Logical decoding of TRUNCATE

2018-04-02 Thread Andres Freund
Hi,

On 2018-04-02 07:39:57 +0100, Simon Riggs wrote:
> On 1 April 2018 at 21:01, Andres Freund  wrote:
> 
> >> ***
> >> *** 111,116  CREATE PUBLICATION  >> class="parameter">name
> >> --- 111,121 
> >> and so the default value for this option is
> >> 'insert, update, delete'.
> >>
> >> +  
> >> +TRUNCATE is treated as a form of
> >> +DELETE for the purpose of deciding whether
> >> +to publish, or not.
> >> +  
> >>   
> >>  
> >> 
> >
> > Why is this a good idea?
> 
> TRUNCATE removes rows, just as DELETE does, so anybody that wants to
> publish the removal of rows will be interested in this.

I'm not convinced. I think it's perfectly reasonable to want to truncate
away data on the live node, but maintain it on an archival node.  In
that case one commonly wants to continue to maintain OLTP changes (hence
DELETE), but not the bulk truncation.  I think there's a reasonable
counter-argument in that this isn't fine grained enough.


> >> +  * Write a WAL record to allow this set of actions to be logically 
> >> decoded.
> >> +  * We could optimize this away when !RelationIsLogicallyLogged(rel)
> >> +  * but that doesn't save much space or time.
> >
> > What you're saying isn't that you're not logging anything, but that
> > you're allocating the header regardless? Because this certainly sounds
> > like you unconditionally log a WAL record.
> 
> It says that, yes, my idea - as explained.

My complaint is that the comment and the actual implementation
differ. The above sounds like it's unconditionally WAL logging, but:

+   /*
+* Write a WAL record to allow this set of actions to be logically 
decoded.
+* We could optimize this away when !RelationIsLogicallyLogged(rel)
+* but that doesn't save much space or time.
+*
+* Assemble an array of relids, then an array of seqrelids so we can 
write
+* a single WAL record for the whole action.
+*/
+   logrelids = palloc(maxrelids * sizeof(Oid));
+   foreach (cell, relids_logged)
+   {
+   nrelids++;
+   if (nrelids > maxrelids)
+   {
+   maxrelids *= 2;
+   logrelids = repalloc(logrelids, maxrelids * 
sizeof(Oid));
+   }
+   logrelids[nrelids - 1] = lfirst_oid(cell);
+   }
+ 
+   foreach (cell, seq_relids_logged)
+   {
+   nseqrelids++;
+   if ((nrelids + nseqrelids) > maxrelids)
+   {
+   maxrelids *= 2;
+   logrelids = repalloc(logrelids, maxrelids * 
sizeof(Oid));
+   }
+   logrelids[nrelids + nseqrelids - 1] = lfirst_oid(cell);
+   }
+ 
+   if ((nrelids + nseqrelids) > 0)
+   {
+   xl_heap_truncate xlrec;
+ 
+   xlrec.dbId = MyDatabaseId;
+   xlrec.nrelids = nrelids;
+   xlrec.nseqrelids = nseqrelids;
+   xlrec.flags = 0;
+   if (behavior == DROP_CASCADE)
+   xlrec.flags |= XLH_TRUNCATE_CASCADE;
+   if (restart_seqs)
+   xlrec.flags |= XLH_TRUNCATE_RESTART_SEQS;
+ 
+   XLogBeginInsert();
+   XLogRegisterData((char *) &xlrec, SizeOfHeapTruncate);
+   XLogRegisterData((char *) logrelids,
+   (nrelids + nseqrelids) 
* sizeof(Oid));
+ 
+   XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+ 
+   (void) XLogInsert(RM_HEAP_ID, XLOG_HEAP_TRUNCATE);
+   }

Note that the XLogInsert() is in an if branch that's only executed if
there's either logged relids or sequence relids.


I think logrelids should be allocated at the max size immediately, and
the comment rewritten to explicitly only talk about the allocation,
rather than sounding like it's about WAL logging as well.
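
Roughly this kind of shape (just an untested sketch against the variables
visible in the hunk above, not a replacement patch):

	int			nrelids = list_length(relids_logged);
	int			nseqrelids = list_length(seq_relids_logged);
	Oid		   *logrelids = palloc((nrelids + nseqrelids) * sizeof(Oid));
	int			i = 0;
	ListCell   *cell;

	/* size the array once from the list lengths instead of doubling it */
	foreach(cell, relids_logged)
		logrelids[i++] = lfirst_oid(cell);
	foreach(cell, seq_relids_logged)
		logrelids[i++] = lfirst_oid(cell);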

Greetings,

Andres Freund



Re: [PATCH] Logical decoding of TRUNCATE

2018-04-02 Thread Andres Freund
Hi,

On 2018-04-02 14:30:50 -0400, Peter Eisentraut wrote:
> On 4/1/18 16:01, Andres Freund wrote:
> > On 2018-01-25 14:21:15 +0100, Marco Nenciarini wrote:
> >> +  if (SessionReplicationRole != SESSION_REPLICATION_ROLE_REPLICA)
> >> +  {
> >> + 
> >> +  /*
> >> +   * Check foreign key references.  In CASCADE mode, this should 
> >> be
> >> +   * unnecessary since we just pulled in all the references; but 
> >> as a
> >> +   * cross-check, do it anyway if in an Assert-enabled build.
> >> +   */
> >>   #ifdef USE_ASSERT_CHECKING
> >>heap_truncate_check_FKs(rels, false);
> >> + #else
> >> +  if (stmt->behavior == DROP_RESTRICT)
> >> +  heap_truncate_check_FKs(rels, false);
> >>   #endif
> >> +  }
> > 
> > That *can't* be right.
> 
> You mean the part that ignores this in session_replication_role =
> replica?  I don't like that either.  And it's also not clear why it's
> needed for this feature.

I primarily meant the way the code is written. For me code differing like this
between USE_ASSERT_CHECKING and not seems like a recipe for disaster.
And yea, I'm not sure why the session_replication_role bit is here either.


> > I know this has been discussed in the thread already, but it really
> > strikes me as wrong to basically do some mini DDL replication feature
> > via per-command WAL records.
> 
> The problem is that the interaction of TRUNCATE and foreign keys is a mess.
> 
> Let's say I have a provider database with table1, table2, and table3,
> all connected by foreign keys, and a replica database with table1,
> table2, and table4, all connected by foreign keys.  I run TRUNCATE
> table1 CASCADE on the provider.  What should replication do?
> 
> The proposed patch applies the TRUNCATE table1 CASCADE on the replica,
> which ends up truncating table1, table2, and table4, which is different
> from what happened on the origin.

I actually think that's a fairly sane behaviour because it allows you to
have different schemas on both sides and still deal in a reasonably sane
manner.  What I'm concerned about is more that we're developing a one-off
DDL replication feature w/ corresponding bespoke WAL-logging.


> An alternative might be to make the replication protocol message say "I
> truncated table1, table2" (let's say table3 is not replicated).  (This
> sounds more like logical replication rather than DDL replication.)  That
> will then fail to apply on the replica, because it requires that you
> include all tables connected by foreign keys.

And entirely fails if there's schema differences.


> We could then do what we do in the DELETE case and just ignore foreign
> keys altogether, which is obviously bad and not a forward-looking behavior.
> 
> Or we could do what we *should* be doing in the DELETE case and apply
> cascading actions on the replica to table4, but that kind of row-by-row
> processing is what we want to avoid by using TRUNCATE in the first place.

Well, you could also queue a re-check at the end of the processed
message, doing a bulk re-check at the end. But that's obviously more
work.


> >> +  
> >> +TRUNCATE is treated as a form of
> >> +DELETE for the purpose of deciding whether
> >> +to publish, or not.
> >> +  
> >>   
> >>  
> >> 
> > 
> > Why is this a good idea?
> 
> I think it seemed like a good idea at the time, so to speak, but several
> people have argued against it, so we should probably change this in the
> final version.

I think it's e.g. perfectly reasonable to want OLTP changes to be
replicated, but not to truncate historical data. So for me those actions
should be separate...

Greetings,

Andres Freund



Re: [PATCH] Logical decoding of TRUNCATE

2018-04-02 Thread Simon Riggs
On 2 April 2018 at 19:30, Peter Eisentraut
 wrote:

>>> ***
>>> *** 111,116  CREATE PUBLICATION >> class="parameter">name
>>> --- 111,121 
>>> and so the default value for this option is
>>> 'insert, update, delete'.
>>>
>>> +  
>>> +TRUNCATE is treated as a form of
>>> +DELETE for the purpose of deciding whether
>>> +to publish, or not.
>>> +  
>>>   
>>>  
>>> 
>>
>> Why is this a good idea?
>
> I think it seemed like a good idea at the time, so to speak, but several
> people have argued against it, so we should probably change this in the
> final version.

Who has argued against it?

I see that Andres has asked twice about it and been answered twice,
but expressed no opinion.
Petr has said he thinks that's right.
Nobody else has commented.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: disable SSL compression?

2018-04-02 Thread Andres Freund
Hi,

On 2018-04-02 12:46:25 -0400, Tom Lane wrote:
> It seemed like the attack you described wasn't all that dependent on
> whether the data is compressed or not: if you can see the size of the
> server's reply to "select ... where account_number = x", you can pretty
> well tell the difference between 0 and 1 rows, with or without
> compression.  So I'm still not very clear on what the threat model is.

Imagine that the attacker has control over *parts* of a query. The
server always sends a query like
SELECT mysecret FROM secrets WHERE ...;
but somehow, and there's definitely sneaky ways, there's a second query
sent alongside.
SELECT '$user-defined-string';

If those are compressed together *and* the attacker can observe the size
of the returned result, the attacker can change $user-defined-string
iteratively and infer knowledge about what the contents of mysecret
are. If $user-defined-string and mysecret are the same you're going to
get a smaller total response packet.

One such injection vector that's commonly changeable can just be a
username or such.
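
To make the mechanism concrete, a toy standalone demonstration of the
principle (this uses zlib directly and has nothing to do with the libpq/SSL
code paths; the query strings and the "secret" are made up; build with -lz):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* compressed size of a NUL-terminated string at maximum compression */
static unsigned long
deflated_len(const char *s)
{
	unsigned char out[4096];
	uLongf		outlen = sizeof(out);

	if (compress2(out, &outlen, (const unsigned char *) s,
				  (uLong) strlen(s), 9) != Z_OK)
		return 0;
	return (unsigned long) outlen;
}

int
main(void)
{
	const char *secret =
		"SELECT mysecret FROM secrets WHERE id = 1; -- mysecret is 'hunter2'";
	const char *guesses[] = {"SELECT 'abcdefg';", "SELECT 'hunter2';"};
	char		buf[1024];
	int			i;

	for (i = 0; i < 2; i++)
	{
		/* attacker-controlled text compressed together with the secret */
		snprintf(buf, sizeof(buf), "%s\n%s", secret, guesses[i]);
		printf("%-20s -> %lu bytes compressed\n",
			   guesses[i], deflated_len(buf));
	}
	/* the matching guess typically compresses a few bytes smaller */
	return 0;
}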

Greetings,

Andres Freund



Re: [PATCH] Logical decoding of TRUNCATE

2018-04-02 Thread Peter Eisentraut
On 4/1/18 16:01, Andres Freund wrote:
> On 2018-01-25 14:21:15 +0100, Marco Nenciarini wrote:
>> +if (SessionReplicationRole != SESSION_REPLICATION_ROLE_REPLICA)
>> +{
>> + 
>> +/*
>> + * Check foreign key references.  In CASCADE mode, this should 
>> be
>> + * unnecessary since we just pulled in all the references; but 
>> as a
>> + * cross-check, do it anyway if in an Assert-enabled build.
>> + */
>>   #ifdef USE_ASSERT_CHECKING
>>  heap_truncate_check_FKs(rels, false);
>> + #else
>> +if (stmt->behavior == DROP_RESTRICT)
>> +heap_truncate_check_FKs(rels, false);
>>   #endif
>> +}
> 
> That *can't* be right.

You mean the part that ignores this in session_replication_role =
replica?  I don't like that either.  And it's also not clear why it's
needed for this feature.

>> +case REORDER_BUFFER_CHANGE_TRUNCATE:
>> +appendStringInfoString(ctx->out, " TRUNCATE:");
>> + 
>> +if (change->data.truncate_msg.restart_seqs
>> +|| change->data.truncate_msg.cascade)
>> +{
>> +if (change->data.truncate_msg.restart_seqs)
>> +appendStringInfo(ctx->out, " 
>> restart_seqs");
>> +if (change->data.truncate_msg.cascade)
>> +appendStringInfo(ctx->out, " cascade");
>> +}
>> +else
>> +appendStringInfoString(ctx->out, " (no-flags)");
>> +break;
>>  default:
>>  Assert(false);
>>  }
> 
> I know this has been discussed in the thread already, but it really
> strikes me as wrong to basically do some mini DDL replication feature
> via per-command WAL records.

The problem is that the interaction of TRUNCATE and foreign keys is a mess.

Let's say I have a provider database with table1, table2, and table3,
all connected by foreign keys, and a replica database with table1,
table2, and table4, all connected by foreign keys.  I run TRUNCATE
table1 CASCADE on the provider.  What should replication do?

The proposed patch applies the TRUNCATE table1 CASCADE on the replica,
which ends up truncating table1, table2, and table4, which is different
from what happened on the origin.

An alternative might be to make the replication protocol message say "I
truncated table1, table2" (let's say table3 is not replicated).  (This
sounds more like logical replication rather than DDL replication.)  That
will then fail to apply on the replica, because it requires that you
include all tables connected by foreign keys.

We could then do what we do in the DELETE case and just ignore foreign
keys altogether, which is obviously bad and not a forward-looking behavior.

Or we could do what we *should* be doing in the DELETE case and apply
cascading actions on the replica to table4, but that kind of row-by-row
processing is what we want to avoid by using TRUNCATE in the first place.

None of these solutions are good.  I don't have any other ideas, though.


>> ***
>> *** 111,116  CREATE PUBLICATION > class="parameter">name
>> --- 111,121 
>> and so the default value for this option is
>> 'insert, update, delete'.
>>
>> +  
>> +TRUNCATE is treated as a form of
>> +DELETE for the purpose of deciding whether
>> +to publish, or not.
>> +  
>>   
>>  
>> 
> 
> Why is this a good idea?

I think it seemed like a good idea at the time, so to speak, but several
people have argued against it, so we should probably change this in the
final version.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: disable SSL compression?

2018-04-02 Thread Andres Freund
Hi,

On 2018-04-02 10:25:04 -0400, Robert Haas wrote:
> In general, I'd expect compressing data to be beneficial for the
> security of encryption because it should increase the entropy of the
> encrypted bytes, but obviously it's not hard to hypothesize cases
> where the opposite is true for one reason or another.

I don't think it's actually ever a really positive thing for security to
compress before encrypting, and compressing after encrypting should always
be useless.  The problem is that that opens one hell of a sidechannel
attack, because you're suddenly leaking information about the
compressibility of the transferred data. If attackers have any knowledge
of, or worse influence over, any of the transported data, that allows them
to make inferences about the content and potentially the key.

Whereas there should never be an observable difference in the encrypted
stream, if you use a sane cipher mode (i.e. NOT ECB).

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Andres Freund
Hi,

On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:
> On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:
> > Craig Ringer  writes:
> > > So we should just use the big hammer here.
> >
> > And bitch, loudly and publicly, about how broken this kernel behavior is.
> > If we make enough of a stink maybe it'll get fixed.
> 
> It is not likely to be fixed (beyond what has been done already with the
> manpage patches and errseq_t fixes on the reporting level). The issue is,
> the kernel needs to deal with hard IO errors at that level somehow, and
> since those errors typically persist, re-dirtying the pages would not
> really solve the problem (unless some filesystem remaps the request to a
> different block, assuming the device is alive).

Throwing away the dirty pages *and* persisting the error seems a lot
more reasonable. Then provide an fcntl (or whatever) extension that can
clear the error status in the few cases where the application wants
to gracefully deal with the case.


> Keeping around dirty
> pages that cannot possibly be written out is essentially a memory leak,
> as those pages would stay around even after the application has exited.

Why do dirty pages need to be kept around in the case of persistent
errors? I don't think the lack of automatic recovery in that case is
what anybody is complaining about. It's that the error goes away and
there's no reasonable way to separate out such an error from some
potential transient errors.

Greetings,

Andres Freund



Re: [HACKERS] logical decoding of two-phase transactions

2018-04-02 Thread Andres Freund
On 2018-04-02 09:23:10 +0100, Simon Riggs wrote:
> On 29 March 2018 at 23:24, Andres Freund  wrote:
> 
> >> I agree with the former, of course - docs are a must. I disagree with
> >> the latter, though - there have been about no proposals how to do it
> >> without the locking. If there are, I'd like to hear about it.
> >
> > I don't care. Either another solution needs to be found, or the locking
> > needs to be automatically performed when necessary.
> 
> That seems unreasonable.

> It's certainly a nice future goal to have it all happen automatically,
> but we don't know what the plugin will do.

No, fighting too complicated APIs is not unreasonable. And we've found
an alternative. 


> How can we ever make an unknown task happen automatically? We can't.

The task isn't unknown, so this just seems like a non sequitur.


Greetings,

Andres Freund



Re: disable SSL compression?

2018-04-02 Thread Garick Hamlin
On Mon, Apr 02, 2018 at 12:46:25PM -0400, Tom Lane wrote:
> Peter Eisentraut  writes:
> > I agree the attack is less likely to be applicable in typical database
> > installations.  I think we should move forward with considering protocol
> > compression proposals, but any final result should put a warning in the
> > documentation that using compression is potentially insecure.
> 
> It seemed like the attack you described wasn't all that dependent on
> whether the data is compressed or not: 

I think it is.  
I wrote something longer about this in 2013.
https://www.postgresql.org/message-id/20130115172253.ga6...@isc.upenn.edu

The key idea is you can now do a radix search of a secret rather than a
single 1-bit match/doesn't-match oracle type attack, which is not
feasible when you have a high entropy, low density value like a big
ascii number or hex string.

I do something like set my preferred email (or anything an application
lets a user change) to something like a3bf and keep extending the guess
as I go, since it gets compressed along with the secret. Rather than
having to guess a whole application session id all at once, with a work
factor of O(2^128) or something, I've got a very tractable search.

It might not be something that's a problem for everyone, but in some
situations it's attackable.

Garick




Re: disable SSL compression?

2018-04-02 Thread Peter Eisentraut
On 4/2/18 12:46, Tom Lane wrote:
> Peter Eisentraut  writes:
>> I agree the attack is less likely to be applicable in typical database
>> installations.  I think we should move forward with considering protocol
>> compression proposals, but any final result should put a warning in the
>> documentation that using compression is potentially insecure.
> 
> It seemed like the attack you described wasn't all that dependent on
> whether the data is compressed or not: if you can see the size of the
> server's reply to "select ... where account_number = x", you can pretty
> well tell the difference between 0 and 1 rows, with or without
> compression.  So I'm still not very clear on what the threat model is.

Well these could also be update commands or procedure calls with a
constant response size.  Also, it doesn't matter whether the select
returns anything.  Maybe it's not querying the main accounts table.  But
it already shows that the client thinks that the account number is a
real one.

There are probably even better examples.  But the main point is that if
an attacker can control part of what you send alongside some secret
thing, compression is going to be a security concern for some.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: tab complete for procedures for \sf and \ef commands

2018-04-02 Thread Peter Eisentraut
On 4/1/18 04:14, Pavel Stehule wrote:
> Hi
> 
> small bugfix patch

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: disable SSL compression?

2018-04-02 Thread Tom Lane
Peter Eisentraut  writes:
> I agree the attack is less likely to be applicable in typical database
> installations.  I think we should move forward with considering protocol
> compression proposals, but any final result should put a warning in the
> documentation that using compression is potentially insecure.

It seemed like the attack you described wasn't all that dependent on
whether the data is compressed or not: if you can see the size of the
server's reply to "select ... where account_number = x", you can pretty
well tell the difference between 0 and 1 rows, with or without
compression.  So I'm still not very clear on what the threat model is.

regards, tom lane



Re: new function for tsquery creartion

2018-04-02 Thread Dmitry Ivanov

I've fixed a bug and added some tests and documentation.

--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Companydiff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5abb1c46fb..c3b7be6e4e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9609,6 +9609,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
 phraseto_tsquery('english', 'The Fat Rats')
 'fat' <-> 'rat'

+   
+
+ 
+  websearch_to_tsquery
+ 
+  websearch_to_tsquery( config regconfig ,  query text)
+ 
+tsquery
+produce tsquery from a web search style query
+websearch_to_tsquery('english', '"fat rat" or rat')
+'fat' <-> 'rat' | 'rat'
+   

 
  
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 610b7bf033..19f58511c8 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -797,13 +797,16 @@ UPDATE tt SET ti =

 PostgreSQL provides the
 functions to_tsquery,
-plainto_tsquery, and
-phraseto_tsquery
+plainto_tsquery,
+phraseto_tsquery and
+websearch_to_tsquery
 for converting a query to the tsquery data type.
 to_tsquery offers access to more features
 than either plainto_tsquery or
-phraseto_tsquery, but it is less forgiving
-about its input.
+phraseto_tsquery, but it is less forgiving about its
+input. websearch_to_tsquery is a simplified version
+of to_tsquery with an alternative syntax, similar
+to the one used by web search engines.

 

@@ -962,6 +965,87 @@ SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
 

 
+
+websearch_to_tsquery( config regconfig,  querytext text) returns tsquery
+
+
+   
+websearch_to_tsquery creates a tsquery
+value from querytext using an alternative
+syntax in which simple unformatted text is a valid query.
+Unlike plainto_tsquery
+and phraseto_tsquery, it also recognizes certain
+operators. Moreover, this function should never raise syntax errors,
+which makes it possible to use raw user-supplied input for search.
+The following syntax is supported:
+
+ 
+   
+unquoted text: text not inside quote marks will be
+converted to terms separated by & operators, as
+if processed by
+plainto_tsquery.
+  
+ 
+ 
+   
+"quoted text": text inside quote marks will be
+converted to terms separated by <->
+operators, as if processed by phraseto_tsquery.
+  
+ 
+ 
+  
+   OR: logical or will be converted to
+   the | operator.
+  
+ 
+ 
+  
+   -: the logical not operator, converted to
+   the ! operator.
+  
+ 
+
+   
+   
+Examples:
+
+  select websearch_to_tsquery('english', 'The fat rats');
+   websearch_to_tsquery
+  -
+   'fat' & 'rat'
+  (1 row)
+
+
+  select websearch_to_tsquery('english', '"supernovae stars" -crab');
+ websearch_to_tsquery
+  --
+   'supernova' <-> 'star' & !'crab'
+  (1 row)
+
+
+  select websearch_to_tsquery('english', '"sad cat" or "fat rat"');
+ websearch_to_tsquery
+  ---
+   'sad' <-> 'cat' | 'fat' <-> 'rat'
+  (1 row)
+
+
+  select websearch_to_tsquery('english', 'signal -"segmentation fault"');
+   websearch_to_tsquery
+  ---
+   'signal' & !( 'segment' <-> 'fault' )
+  (1 row)
+
+
+  select websearch_to_tsquery('english', '""" )( dummy \\ query <->');
+   websearch_to_tsquery
+  --
+   'dummi' & 'queri'
+  (1 row)
+
+
   
 
   
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index ea5947a3a8..6055fb6b4e 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -490,7 +490,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS)
 	query = parse_tsquery(text_to_cstring(in),
 		  pushval_morph,
 		  PointerGetDatum(&data),
-		  false);
+		  0);
 
 	PG_RETURN_TSQUERY(query);
 }
@@ -520,7 +520,7 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS)
 	query = parse_tsquery(text_to_cstring(in),
 		  pushval_morph,
 		  PointerGetDatum(&data),
-		  true);
+		  P_TSQ_PLAIN);
 
 	PG_RETURN_POINTER(query);
 }
@@ -551,7 +551,7 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS)
 	query = parse_tsquery(text_to_cstring(in),
 		  pushval_morph,
 		  PointerGetDatum(&data),
-		  true);
+		  P_TSQ_PLAIN);
 
 	PG_RETURN_TSQUERY(query);
 }
@@ -567,3 +567,35 @@ phraseto_tsquery(PG_FUNCTION_ARGS)
 		ObjectIdGetDatum(cfgId),
 		PointerGetDatum(in)));
 }
+
+Datum
+websearch_to_tsquery_byid(P

Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

2018-04-02 Thread Alvaro Herrera
>   /*
> -  * Check whether default partition has a row that would fit the 
> partition
> -  * being attached.
> +  * Check if the default partition contains a row that would belong in 
> the
> +  * partition being attached.
>*/
> - defaultPartOid =
> - get_default_oid_from_partdesc(RelationGetPartitionDesc(rel));
>   if (OidIsValid(defaultPartOid))

Oh my.  This code is terrible, and I think this patch is wrong.  More
later.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: disable SSL compression?

2018-04-02 Thread Peter Eisentraut
On 4/2/18 11:05, Robert Haas wrote:
> Ah.  Yeah, that makes sense.  Although to use PG as a vector, it seems
> like the attacker would need the ability to snoop network traffic
> between the application server and PG.  If both of those are
> presumably inside the bank's network and yet an attacker can sniff
> them, to some degree you've already lost.  Now it could be that a
> rogue bank employee is trying to gain access from within the bank, or
> maybe your bank exposes application-to-database traffic on the public
> Internet.  But in general that seems like traffic that should be
> well-secured anyway for lots of reasons, as opposed to the case where
> one part of your browser is trying to hide information from another
> part of your browser, which is a lot harder to isolate thoroughly.

I agree the attack is less likely to be applicable in typical database
installations.  I think we should move forward with considering protocol
compression proposals, but any final result should put a warning in the
documentation that using compression is potentially insecure.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: check_ssl_key_file_permissions should be in be-secure-common.c

2018-04-02 Thread Peter Eisentraut
On 4/2/18 02:51, Michael Paquier wrote:
> It seems to me that this routine should be logically put into
> be-secure-common.c so as future SSL implementations can use it.  This
> makes the code more consistent with the frontend refactoring that has
> happened in f75a959.  I would not have bothered about this refactoring
> if be-secure-openssl.c did not exist yet, but as it does I think that we
> should bite the bullet, and do that for v11 so as a good base is in
> place for the future.

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-04-02 Thread Robert Haas
On Wed, Mar 28, 2018 at 2:12 PM, Andres Freund  wrote:
> How will it break it? They'll see an invalid ctid and conclude that the
> tuple is dead? Without any changes that's already something that can
> happen if a later tuple in the chain has been pruned away.  Sure, that
> code won't realize it should error out because the tuple is now in a
> different partition, but neither would a infomask bit.
>
> I think my big problem is that I just don't see what the worst that can
> happen is. We'd potentially see a broken ctid chain, something that very
> commonly happens, and consider the tuple to be invisible.  That seems
> pretty sane behaviour for unadapted code, and not any worse than other
> potential solutions.

This is more or less my feeling as well.  I think it's better to
conserve our limited supply of infomask bits as much as we can, and I
do think that we should try to reclaim HEAP_MOVED_IN and
HEAP_MOVED_OFF in the future instead of defining the combination of
the two of them to mean something now.

The only scenario in which I can see this patch really leading to
disaster is if there's some previous release out there where the bit
pattern chosen by this patch has some other, incompatible meaning.  As
far as we know, that's not the case: this bit pattern was previously
unused.  Code seeing that bit pattern could potentially therefore (1)
barf on the valid CTID, but the whole point of this is to throw an
ERROR anyway, so if that happens then we're getting basically the
right behavior with the wrong error message or (2) just treat it as a
broken CTID link, in which case the result should be pretty much the
same as if this patch hadn't been committed in the first place.

Where the multixact patch really caused us a lot of trouble is that
the implications weren't just for the heap itself -- the relevant
SLRUs became subject to new retention requirements which in turn
affected vacuum, autovacuum, and checkpoint behavior.  There is no
similar problem here -- the flag indicating the problematic situation,
however it ends up being stored, doesn't point to any external data.
Now, that doesn't mean there can't be some other kind of problem with
this patch, but I don't think that we should block the patch on the
theory that it might have an undiscovered problem that destroys the
entire PostgreSQL ecosystem with no theory as to what that problem
might actually be.  Modulo implementation quality, I think the risk
level of this patch is somewhat but not vastly higher than
37484ad2aacef5ec794f4dd3d5cf814475180a78, which similarly defined a
previously-unused bit pattern in the tuple header.  The reason I think
this one might be somewhat riskier is because AFAICS it's not so easy
to make sure we've found all the code, even in core, that might care,
as it was in that case; and also because updates happen more than
freezing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Feature Request - DDL deployment with logical replication

2018-04-02 Thread Craig Ringer
On 1 April 2018 at 00:57, Andres Freund  wrote:

> On 2018-03-31 22:13:42 +0800, Craig Ringer wrote:
> > We'll still need a mechanism to transport them to downstreams (like WAL
> > messages) and to send responses upstream. For responses I think we will
> > finally want to add a backchannel to the logical replication protocol as
> > I've wanted for a long while: downstream can send a COPY message on COPY
> > BOTH proto back to upstream, which passes it to a callback on the output
> > plugin for the output plugin to act on.
>
> Not necessarily?  You can just send out the prepare, wait for all
> clients to ack it, and then commit/rollback prepared.
>

We then lack any mechanism by which you can NACK, saying "I can't apply
this".

So upstream will wait indefinitely. I guess we just expect the user to
intervene and ROLLBACK if they decide a replica isn't going to get the job
done, or have checked the replica's logs and found it can't apply it for
some hopefully-sane reason.

It's not like we'd auto-ROLLBACK PREPARED in response to a nack from a
downstream anyway, so all we're missing is probably info in the upstream
logs about which replica(s) cannot apply it and why.

OK. So it'd be a nice-to-have, but not vital.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: disable SSL compression?

2018-04-02 Thread Robert Haas
On Mon, Apr 2, 2018 at 10:52 AM, Peter Eisentraut
 wrote:
> On 4/2/18 10:25, Robert Haas wrote:
>> As I understand it on a brief review of the Google search
>> results^W^W^Wliterature, the reason that was done was to prevent
>> things like the CRIME attack, which apparently involves Javascript
>> running in your browser from deducing information that it shouldn't be
>> able to get, like the Cookies that are sent along with the requests it
>> initiates.  No similar attack should be possible against PostgreSQL
>> because there's no similar kind of privilege separation.  Your PG
>> driver doesn't have untrusted Javascript running inside of it, we
>> hope.
>
> I think the attack is much more general involving two server end points,
> one of which is under the control of the attacker and provides
> information that the client is using to query the second server,
> independent of the protocols.
>
> For example, if your application code does maybe a geoip lookup and then
> does
>
> select * from sometable
> where a = $geo_data and b = 'secret bank account number';
>
> then the operator of the geoip service (or someone impersonating it, of
> course) can just rotate the lookup results through the bank account
> number space until they notice that the compression result changes, in
> which case they have matched the bank account number.
>
> In the web space, that is easier because the client code is typically
> viewable by the attacker, and this kind of query is more common (the
> "bank account number" is typically a session key), but the principle is
> the same.

Ah.  Yeah, that makes sense.  Although to use PG as a vector, it seems
like the attacker would need the ability to snoop network traffic
between the application server and PG.  If both of those are
presumably inside the bank's network and yet an attacker can sniff
them, to some degree you've already lost.  Now it could be that a
rogue bank employee is trying to gain access from within the bank, or
maybe your bank exposes application-to-database traffic on the public
Internet.  But in general that seems like traffic that should be
well-secured anyway for lots of reasons, as opposed to the case where
one part of your browser is trying to hide information from another
part of your browser, which is a lot harder to isolate thoroughly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Change RangeVarGetRelidExtended() to take flags argument?

2018-04-02 Thread Bossart, Nathan
On 3/30/18, 7:09 PM, "Andres Freund"  wrote:
> Pushed.

Thanks!

Nathan



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Craig Ringer
On 2 April 2018 at 02:24, Thomas Munro 
wrote:


>
> Maybe my drive-by assessment of those kernel routines is wrong and
> someone will correct me, but I'm starting to think you might be better
> to assume the worst on all systems.  Perhaps a GUC that defaults to
> panicking, so that users on those rare OSes could turn that off?  Even
> then I'm not sure if the failure mode will be that great anyway or if
> it's worth having two behaviours.  Thoughts?
>
>
I see little benefit to not just PANICing unconditionally on EIO, really.
It shouldn't happen, and if it does, we want to be pretty conservative and
adopt a data-protective approach.

I'm rather more worried by doing it on ENOSPC. Which looks like it might be
necessary from what I recall finding in my test case + kernel code reading.
I really don't want to respond to a possibly-transient ENOSPC by PANICing
the whole server unnecessarily.

BTW, the support team at 2ndQ is presently working on two separate issues
where ENOSPC resulted in DB corruption, though neither of them involve logs
of lost page writes. I'm planning on taking some time tomorrow to write a
torture tester for Pg's ENOSPC handling and to verify ENOSPC handling in
the test case I linked to in my original StackOverflow post.

If this is just an EIO issue then I see no point doing anything other than
PANICing unconditionally.

If it's a concern for ENOSPC too, we should try harder to fail more nicely
whenever we possibly can.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Partition-wise aggregation/grouping

2018-04-02 Thread Robert Haas
On Thu, Mar 29, 2018 at 9:02 AM, Jeevan Chalke
 wrote:
> Yep, I see the risk.

Committed 0001 last week and 0002 just now.  I don't really see 0003 a
a critical need.  If somebody demonstrates that this saves a
meaningful amount of planning time, we can consider that part for a
future release.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: disable SSL compression?

2018-04-02 Thread Peter Eisentraut
On 4/2/18 10:25, Robert Haas wrote:
> As I understand it on a brief review of the Google search
> results^W^W^Wliterature, the reason that was done was to prevent
> things like the CRIME attack, which apparently involves Javascript
> running in your browser from deducing information that it shouldn't be
> able to get, like the Cookies that are sent along with the requests it
> initiates.  No similar attack should be possible against PostgreSQL
> because there's no similar kind of privilege separation.  Your PG
> driver doesn't have untrusted Javascript running inside of it, we
> hope.

I think the attack is much more general involving two server end points,
one of which is under the control of the attacker and provides
information that the client is using to query the second server,
independent of the protocols.

For example, if your application code does maybe a geoip lookup and then
does

select * from sometable
where a = $geo_data and b = 'secret bank account number';

then the operator of the geoip service (or someone impersonating it, of
course) can just rotate the lookup results through the bank account
number space until they notice that the compression result changes, in
which case they have matched the bank account number.

In the web space, that is easier because the client code is typically
viewable by the attacker, and this kind of query is more common (the
"bank account number" is typically a session key), but the principle is
the same.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Rewriting the test of pg_upgrade as a TAP test - take two

2018-04-02 Thread Peter Eisentraut
On 3/21/18 21:49, Michael Paquier wrote:
> On Thu, Mar 22, 2018 at 09:42:35AM +0800, Craig Ringer wrote:
>> I'm super excited by the idea of multi-version support in TAP, if that's
>> what you mean.
>>
>> Why? Because I use TAP heavily in extensions. Especially replication
>> extensions. Which like to talk across multiple versions. I currently need
>> external test frameworks and some hideous hacks to do this.
> 
> Okay, in front of such enthusiasm we could keep at least the refactoring
> part of PostgresNode.pm :)

I took a quick look at that part.  It appears to be quite invasive, more
than I would have hoped.  Basically, it imposes that from now on all
program invocations must observe the bindir setting, which someone is
surely going to forget.  The pg_upgrade tests aren't going to exercise
all possible paths where programs are called, so this is going to lead
to omissions and inconsistencies -- which will then possibly only be
found much later by the extensions that Craig was talking about.  I'd
like to see this more isolated, maybe via a path change, or something
inside system_or_bail or something less invasive and more maintainable.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: disable SSL compression?

2018-04-02 Thread Robert Haas
On Wed, Mar 28, 2018 at 7:16 PM, Andres Freund  wrote:
> +analysis of whether that's safe to do from a cryptographic POV. There's a 
> reason compression has been disabled for just about all SSL/TLS libraries.

As I understand it on a brief review of the Google search
results^W^W^Wliterature, the reason that was done was to prevent
things like the CRIME attack, which apparently involves Javascript
running in your browser from deducing information that it shouldn't be
able to get, like the Cookies that are sent along with the requests it
initiates.  No similar attack should be possible against PostgreSQL
because there's no similar kind of privilege separation.  Your PG
driver doesn't have untrusted Javascript running inside of it, we
hope.

In general, I'd expect compressing data to be beneficial for the
security of encryption because it should increase the entropy of the
encrypted bytes, but obviously it's not hard to hypothesize cases
where the opposite is true for one reason or another.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Foreign keys and partitioned tables

2018-04-02 Thread Peter Eisentraut
Comments on the code:

@@ -7226,12 +7235,23 @@ ATAddForeignKeyConstraint(AlteredTableInfo *tab,
Relation rel,
 * numbers)
 */
if (pkrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+   {
+   /* fix recursion in ATExecValidateConstraint to enable this case */
+   if (fkconstraint->skip_validation && !fkconstraint->initially_valid)
+   ereport(ERROR,
+   (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+errmsg("cannot add NOT VALID foreign key to relation \"%s\"",
+   RelationGetRelationName(pkrel))));
+   }

Maybe this error message should be more along the lines of "is not
supported yet".  Also, this restriction should perhaps be mentioned in
the documentation additions that we have been discussing.
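
Something along these lines, perhaps (only a suggestion for the wording,
untested):

	ereport(ERROR,
			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
			 errmsg("cannot add NOT VALID foreign key to relation \"%s\"",
					RelationGetRelationName(pkrel)),
			 errdetail("This feature is not yet supported on partitioned tables.")));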


The first few hunks in ri_triggers.c (the ones that refer to the
pktable) are apparently for the postponed part of the feature, foreign
keys referencing partitioned tables.  So I think those hunks should be
dropped from this patch.

The removal of the ONLY for the foreign key queries also affects
inherited tables, in undesirable ways.  For example, consider this
script:

create table test1 (a int primary key);
insert into test1 values (1), (2), (3);

create table test2 (x int primary key, y int references test1 on delete
cascade);
insert into test2 values (11, 1), (22, 2), (33, 3);

create table test2a () inherits (test2);
insert into test2a values (111, 1), (222, 2);

delete from test1 where a = 1;

select * from test1 order by 1;
select * from test2 order by 1, 2;

With the patch, this will have deleted the (111, 1) record in test2a,
which it did not do before.

I think the trigger functions need to look up whether the table is a
partitioned table and decide whether to use ONLY based on that.
(This will probably also apply to the PK side, when that is
implemented.)
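
That is, roughly this kind of thing in the query-building code, reusing the
variables already present there (sketch only, untested):

	const char *fk_only = fk_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE ?
		"" : "ONLY ";

	appendStringInfo(&querybuf, "DELETE FROM %s%s", fk_only, fkrelname);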


In pg_dump, maybe this can be refined:

+   /*
+* For partitioned tables, it doesn't work to emit constraints
as not
+* inherited.
+*/
+   if (tbinfo->relkind == RELKIND_PARTITIONED_TABLE)
+   only = "";
+   else
+   only = "ONLY ";

We use ONLY for foreign keys on inherited tables, but foreign keys are
not inherited anyway, so this is at best future-proofing.  We could
just drop the ONLY unconditionally.  Or at least explain this more.


Other than that, it looks OK.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Anthony Iliopoulos
On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:
> Craig Ringer  writes:
> > So we should just use the big hammer here.
>
> And bitch, loudly and publicly, about how broken this kernel behavior is.
> If we make enough of a stink maybe it'll get fixed.

It is not likely to be fixed (beyond what has been done already with the
manpage patches and errseq_t fixes on the reporting level). The issue is,
the kernel needs to deal with hard IO errors at that level somehow, and
since those errors typically persist, re-dirtying the pages would not
really solve the problem (unless some filesystem remaps the request to a
different block, assuming the device is alive). Keeping around dirty
pages that cannot possibly be written out is essentially a memory leak,
as those pages would stay around even after the application has exited.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Anthony Iliopoulos
On Sun, Apr 01, 2018 at 12:13:09AM +0800, Craig Ringer wrote:
>On 31 March 2018 at 21:24, Anthony Iliopoulos <[1]ail...@altatus.com>
>wrote:
> 
>  On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:
> 
>  > >> Yeah, I see why you want to PANIC.
>  > >
>  > > Indeed. Even doing that leaves question marks about all the kernel
>  > > versions before v4.13, which at this point is pretty much everything
>  > > out there, not even detecting this reliably. This is messy.
> 
>  There may still be a way to reliably detect this on older kernel
>  versions from userspace, but it will be messy whatsoever. On EIO
>  errors, the kernel will not restore the dirty page flags, but it
>  will flip the error flags on the failed pages. One could mmap()
>  the file in question, obtain the PFNs (via /proc/pid/pagemap)
>  and enumerate those to match the ones with the error flag switched
>  on (via /proc/kpageflags). This could serve at least as a detection
>  mechanism, but one could also further use this info to logically
>  map the pages that failed IO back to the original file offsets,
>  and potentially retry IO just for those file ranges that cover
>  the failed pages. Just an idea, not tested.
> 
>That sounds like a huge amount of complexity, with uncertainty as to how
>it'll behave kernel-to-kernel, for negligible benefit.

Those interfaces have been around since the kernel 2.6 times and are
rather stable, but I was merely responding to your original post comment
regarding having a way of finding out which page(s) failed. I assume
that indeed there would be no benefit, especially since those errors
are usually not transient (typically they come from hard medium faults),
and although a filesystem could theoretically mask the error by allocating
a different logical block, I am not aware of any implementation that
currently does that.

>I was exploring the idea of doing selective recovery of one relfilenode,
>based on the assumption that we know the filenode related to the fd that
>failed to fsync(). We could redo only WAL on that relation. But it fails
>the same test: it's too complex for a niche case that shouldn't happen in
>the first place, so it'll probably have bugs, or grow bugs in bitrot over
>time.

Fully agree, those cases should be sufficiently rare that a complex
and possibly non-maintainable solution is not really warranted.

>Remember, if you're on ext4 with errors=remount-ro, you get shut down even
>harder than a PANIC. So we should just use the big hammer here.

I am not entirely sure what you mean here, does Pg really treat write()
errors as fatal? Also, the kind of errors that ext4 detects with this
option are at the superblock level and govern metadata rather than actual
data writes (recall that those are buffered anyway, no actual device IO
has to take place at the time of write()).

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-02 Thread Anthony Iliopoulos
On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:

> >> Yeah, I see why you want to PANIC.
> >
> > Indeed. Even doing that leaves question marks about all the kernel
> > versions before v4.13, which at this point is pretty much everything
> > out there, not even detecting this reliably. This is messy.

There may still be a way to reliably detect this on older kernel
versions from userspace, but it will be messy whatsoever. On EIO
errors, the kernel will not restore the dirty page flags, but it
will flip the error flags on the failed pages. One could mmap()
the file in question, obtain the PFNs (via /proc/pid/pagemap)
and enumerate those to match the ones with the error flag switched
on (via /proc/kpageflags). This could serve at least as a detection
mechanism, but one could also further use this info to logically
map the pages that failed IO back to the original file offsets,
and potentially retry IO just for those file ranges that cover
the failed pages. Just an idea, not tested.
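
For reference, a rough sketch of that lookup (equally untested, Linux-only,
and it needs root: PFNs in /proc/self/pagemap are hidden from unprivileged
users and /proc/kpageflags is root-readable only):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Return 1 if the page backing "addr" (inside an mmap()ed file) has the
 * kernel's ERROR flag set, 0 if not, -1 on lookup failure.
 */
static int
page_has_error(const void *addr)
{
	long		pagesize = sysconf(_SC_PAGESIZE);
	uint64_t	entry = 0,
				flags = 0;
	off_t		off = (off_t) ((uintptr_t) addr / pagesize) * sizeof(uint64_t);
	int			fd;
	ssize_t		r;

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;
	r = pread(fd, &entry, sizeof(entry), off);
	close(fd);
	if (r != sizeof(entry) || !(entry & (1ULL << 63)))
		return -1;				/* read failed, or page not present */

	fd = open("/proc/kpageflags", O_RDONLY);
	if (fd < 0)
		return -1;
	r = pread(fd, &flags, sizeof(flags),
			  (off_t) ((entry & ((1ULL << 55) - 1)) * sizeof(uint64_t)));
	close(fd);
	if (r != sizeof(flags))
		return -1;

	return (flags & (1ULL << 1)) != 0;	/* bit 1 = KPF_ERROR */
}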

Best regards,
Anthony



Re: [HACKERS] logical decoding of two-phase transactions

2018-04-02 Thread Nikhil Sontakke



> 
> Quick thought: Should be simple to release lock when interacting with network.

I don’t think this will be that simple. The network calls will typically happen 
from inside the plugins and we don’t want to make plugin authors responsible 
for that. 

> Could also have abort signal lockers.

With the decodegroup locking we do have access to all the decoding backend
pids, so we could signal them. But I am not sure signaling will work if the
plugin is in the midst of a network call.

I agree with Petr. With this decodegroup lock implementation we have an
inexpensive but workable implementation for locking around the plugin call.
Sure, the abort will be penalized, but it's bounded by the WAL sender timeout
or a max of one change apply cycle. As he mentioned, if we can optimize this
later we can do so without changing plugin coding semantics.

Regards,
Nikhils

> 
> Andres
> -- 
> Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: hot_standby_feedback vs excludeVacuum and snapshots

2018-04-02 Thread Amit Kapila
On Sun, Apr 1, 2018 at 3:30 PM, Simon Riggs  wrote:
> On 31 March 2018 at 14:21, Amit Kapila  wrote:
>>
>> I think the vacuum assigns xids only if it needs to truncate some of
>> the pages in the relation which happens towards the end of vacuum.
>> So, it shouldn't hold back the xmin horizon for long.
>
> Yes, that's the reason. I recall VACUUMs giving lots of problems
> during development  of Hot Standby.
>
> VACUUM FULL was the thing that needed to be excluded in the past
> because it needed an xid to move rows.
>
> Greg's concern is a good one and his noticing that we hadn't
> specifically excluded VACUUMs is valid, so we should exclude them.
> Well spotted, Greg.
>
> So although this doesn't have the dramatic effect it might have had,
> there is still the possibility of some effect and I think we should
> treat it as a bug.
>

+1.  I think you missed to update the comments on top of the modified
function ("Similar to GetSnapshotData but returns more information. We
include all PGXACTs with an assigned TransactionId, even VACUUM
processes." ).  It seems last part of the sentence should be omitted
after your patch, otherwise, patch looks good to me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [PATCH] Verify Checksums during Basebackups

2018-04-02 Thread Michael Banck
Hi!

On Sun, Apr 01, 2018 at 05:27:21PM +0200, Magnus Hagander wrote:
> It might be a micro-optimisation, but ISTM that we shouldn't do the
> basename(palloc(fn)) in is_heap_file() unless we actually need it -- so not
> at the top of the function. (And surely "." and ".." should not occur here?
> I think that's a result of copy/paste from the online checksum patch?
> 
> We also do exactly the same basename(palloc(fn)) in sendFile().  Can we
> find a way to reuse that duplication? Perhaps we want to split it into the
> two pieces already out in sendFile() and pass it in as separate parameters?

I've done the latter now, although it looks a bit weird that the second
argument data is a subset of the first.  I couldn't find another way to
not have it done twice, though.

> If not then this second one should at least only be done inside the if
> (verify_checksums).

We can't have both, as we need to call the is_heap_file() function in
order to determine whether we should verify the checksums. 
 
> There is a bigger problem next to that though -- I believe  basename() does
> not exist on Win32. I haven't tested it, but there is zero documentation of
> it existing, which usually indicates it doesn't. That's the part that
> definitely needs to get fixed.
> 
> I think you need to look into the functionality in port/path.c, in
> particular last_dir_separator()?

Thanks for the pointer, I've used that now; I mentioned before that
basename() might be a portability hazard, but couldn't find a good
substitute myself.
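
For the archives, the substitution is essentially this (sketch; "filename"
stands for whatever relative path the basebackup code is looking at):

	/* portable replacement for basename(): point past the last separator */
	const char *sep = last_dir_separator(filename);
	const char *basefn = (sep != NULL) ? sep + 1 : filename;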

V6 of the patch is attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 8c488506fa..b94dd4ac65 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2382,7 +2382,7 @@ The commands accepted in replication mode are:
   
 
   
-BASE_BACKUP [ LABEL 'label' ] [ PROGRESS ] [ FAST ] [ WAL ] [ NOWAIT ] [ MAX_RATE rate ] [ TABLESPACE_MAP ]
+BASE_BACKUP [ LABEL 'label' ] [ PROGRESS ] [ FAST ] [ WAL ] [ NOWAIT ] [ MAX_RATE rate ] [ TABLESPACE_MAP ] [ NOVERIFY_CHECKSUMS ]
  BASE_BACKUP
 
 
@@ -2481,6 +2481,17 @@ The commands accepted in replication mode are:
  
 

+
+   
+NOVERIFY_CHECKSUMS
+
+ 
+  By default, checksums are verified during a base backup if they are
+  enabled. Specifying NOVERIFY_CHECKSUMS disables
+  this verification.
+ 
+
+   
   
  
  
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index e8921b1bb4..95045669c9 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -506,6 +506,22 @@ PostgreSQL documentation
   
  
 
+ 
+  -k
+  --no-verify-checksums
+  
+   
+Disables verification of checksums, if they are enabled on the server
+the base backup is taken from. 
+   
+   
+By default, checksums are verified and checksum failures will result in
+a non-zero exit status. However, the base backup will not be removed in
+this case, as if the --no-clean option was used.
+   
+  
+ 
+
  
   -v
   --verbose
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 516eea57f8..3c1fd620ae 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 
+#include "access/xlog.h"
 #include "access/xlog_internal.h"	/* for pg_start/stop_backup */
 #include "catalog/catalog.h"
 #include "catalog/pg_type.h"
@@ -31,6 +32,8 @@
 #include "replication/basebackup.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
 #include "storage/dsm_impl.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -99,6 +102,15 @@ static TimeOffset elapsed_min_unit;
 /* The last check of the transfer rate. */
 static TimestampTz throttled_last;
 
+/* The starting XLOG position of the base backup. */
+static XLogRecPtr startptr;
+
+/* Total number of checksum failures during base backup. */
+static int64 total_checksum_failures;
+
+/* Do not verify checksums. */
+static bool noverify_checksums = false;
+
 /*
  * The contents of these directories are removed or recreated during server
  * start so they are not included in backups.  The directories themselves are
@@ -194,7 +206,6 @@ base_backup_cleanup(int code, Datum arg)
 static void
 perform_base_backup(basebackup_options *opt)
 {
-	XLogRecPtr	startptr;
 	TimeLineID	starttli;
 	XLogRecPtr	endptr;
 	TimeLineID	endtli;
@@ -210,6 +2

Re: Foreign keys and partitioned tables

2018-04-02 Thread Peter Eisentraut
On 3/31/18 18:21, Alvaro Herrera wrote:
> Yeah, I started by putting what I thought was going to be just ALTER
> TABLE in that test, then moved to the other file and added what I
> thought were more complete tests there and failed to move stuff to
> alter_table.  Honestly, I think these should mostly all belong in
> foreign_key,

right

>   
> -  Partitioned tables do not support EXCLUDE or
> -  FOREIGN KEY constraints; however, you can define
> -  these constraints on individual partitions.
> +  Partitioned tables do not support EXCLUDE 
> constraints;
> +  however, you can define these constraints on individual partitions.
> +  Also, while it's possible to define PRIMARY KEY
> +  constraints on partitioned tables, it is not supported to create 
> foreign
> +  keys cannot that reference them.  This restriction will be lifted in a

This doesn't read correctly.

> +  future release.
>   

> -  tables and permanent tables.
> +  tables and permanent tables.  Also note that while it is permitted to
> +  define a foreign key on a partitioned table, declaring a foreign key
> +  that references a partitioned table is not allowed.
>   

Maybe use "possible" or "supported" instead of "allowed" and "permitted"
in this and similar cases.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-04-02 Thread Etsuro Fujita

(2018/04/02 18:49), Amit Langote wrote:

2. If I understand the description you provided in [1] correctly, the
point of having ri_PartitionIsValid and ExecInitExtraPartitionInfo() is to
avoid possibly-redundantly performing following two steps in
ExecSetupPartitionTupleRouting() for *reused* partition ResultRelInfos
that may not be used for tuple routing after all:

   - create the parent_child_tupconv_maps[] entry for it
   - perform FDW tuple routing initialization.


Sorry, my explanation was not enough, but that was just one of the reasons
why I introduced those; another is to do CheckValidResultRel against a
partition after the partition was chosen for tuple routing rather than in
ExecSetupPartitionTupleRouting, to avoid aborting an UPDATE of a partition
key unnecessarily due to non-routable foreign-partitions that may not be
chosen for tuple routing after all.


I see.  So, currently in ExecSetupPartitionTupleRouting, we call
CheckValidResultRel() to check if resultRelInfos reused from those
initialized for UPDATE are valid for insertions (or rather for routing
inserted tuples into it).  But you're proposing to delay that check until
ExecPrepareTupleRouting is called for a given resultRelInfo, at which
point it's clear that we actually want to insert using that resultRelInfo.
  That seems reasonable.


That's exactly what I'm thinking.


Now we have ON CONFLICT for partitioned tables, which requires the
conversion map to be computed in ExecInitPartitionInfo, so I updated the
patch so that we keep the former step in ExecInitPartitionInfo and
ExecSetupPartitionTupleRouting and so that we just init the FDW in
ExecInitExtraPartitionInfo (changed to ExecInitForeignRouting).  And I
added comments to ExecInitForeignRouting and ExecPrepareTupleRouting.


That looks good.


I revisited this.  Please see the reply to Alvaro I sent right now.


I looked at the new patch.  It looks good overall, although I have one
question -- IIUC, BeginForeignInsert() performs actions that are
equivalent of performing PlanForeignModify() + BeginForeignModify() for an
INSERT as if it was directly executed on a given foreign table/partition.
Did you face any problems in doing the latter itself in the core code?
Doing it that way would mean no changes to a FDW itself will be required
(and hence no need for additional APIs), but I may be missing something.


The biggest issue with performing PlanForeignModify() plus
BeginForeignModify() instead of the new FDW API would be: can the core
code create a valid-looking planner state to pass to PlanForeignModify(),
such as the ModifyTable plan node or the query tree stored in PlannerInfo?
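
For reference, the existing planning hook has roughly this shape (as
declared in foreign/fdwapi.h, reproduced from memory, so treat it as a
sketch).  Neither the PlannerInfo nor the ModifyTable node exists when a
tuple is routed to a foreign partition during COPY FROM, which is exactly
the "valid-looking planner state" problem described above:

/*
 * Existing FDW planning and execution hooks (shape as in foreign/fdwapi.h).
 * Both root and plan come from the planner; fabricating them just to be
 * able to call the hook is the hard part.
 */
typedef List *(*PlanForeignModify_function) (PlannerInfo *root,
                                             ModifyTable *plan,
                                             Index resultRelation,
                                             int subplan_index);

typedef void (*BeginForeignModify_function) (ModifyTableState *mtstate,
                                             ResultRelInfo *rinfo,
                                             List *fdw_private,
                                             int subplan_index,
                                             int eflags);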



One thing that seems good about your approach is that newly added support
for COPY FROM on foreign tables/partitions takes minimal changes as
implemented by using the new API, whereas with the alternative approach it
would require duplicated code (same code would have to be present in both
copy.c and execPartition.c, but it could perhaps be avoided).


I agree.

Thanks for the review!

Best regards,
Etsuro Fujita



Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-04-02 Thread Etsuro Fujita

(2018/03/30 22:28), Alvaro Herrera wrote:

Etsuro Fujita wrote:


Now we have ON CONFLICT for partitioned tables, which requires the
conversion map to be computed in ExecInitPartitionInfo, so I updated the
patch so that we keep the former step in ExecInitPartitionInfo and
ExecSetupPartitionTupleRouting and so that we just init the FDW in
ExecInitExtraPartitionInfo (changed to ExecInitForeignRouting).  And I added
comments to ExecInitForeignRouting and ExecPrepareTupleRouting.


Hmm, I may be misreading what you mean here ... but as far as I
understand, ON CONFLICT does not support foreign tables, does it?


It doesn't, except for ON CONFLICT DO NOTHING without an inference 
specification.



If
that is so, I'm not sure why the conversion map would be necessary, if
it is for ON CONFLICT alone.


We wouldn't need that for foreign partitions (for DO NOTHING with an 
inference specification, or DO UPDATE, on a partitioned table containing 
foreign partitions, the planner would throw an error before we get to 
ExecInitPartitionInfo).  The reason I updated the patch that way was 
just to keep the patch simple, but on reflection I don't think that's a 
good idea; I think we should delay the map-creation step as well as the 
FDW-initialization step for UPDATE subplan partitions as before, for 
improved efficiency of UPDATE-of-partition-key.  However, I don't think 
it would be a good idea to delay those steps for partitions created by 
ExecInitPartitionInfo the same way as for the subplan partitions, 
because in that case it would be better to perform CheckValidResultRel 
against a partition right after we do InitResultRelInfo for the 
partition in ExecInitPartitionInfo, as that would save cycles in cases 
where the partition is considered not valid by that check.  So, what I'm 
thinking is: a) for the subplan partitions, we delay those steps as 
before, and b) for the partitions created by ExecInitPartitionInfo, we 
do that check for a partition right after InitResultRelInfo in that 
function (and, if valid, we create a map and initialize the FDW for the 
partition in that function).


Best regards,
Etsuro Fujita



Re: Diagonal storage model

2018-04-02 Thread Andrey Borodin


> 2 Apr 2018, at 16:57, Ashutosh Bapat wrote:
> On Mon, Apr 2, 2018 at 3:49 AM, Marko Tiikkaja  wrote:
>> 
>> I'm a little worried about the fact that even with this model we're still
>> limited to only two dimensions.  That's bound to cause problems sooner or
>> later.
> How about a 3D storage model, whose first dimension gives horizontal
> view, second provides vertical or columnar view and third one provides
> diagonal view. It also provides capability to add extra dimensions to
> provide additional views like double diagonal view. Alas! it all
> collapses since I was late to the party.

BTW, MDX expressions actually provide multidimensional results. They have
COLUMNS, ROWS, PAGES, SECTIONS, CHAPTERS, and AXIS(N) for those who are not
satisfied with named dimensions.

Best regards, Andrey Borodin.


Re: [HACKERS] path toward faster partition pruning

2018-04-02 Thread David Rowley
On 2 April 2018 at 17:18, Amit Langote  wrote:
> On 2018/03/31 0:55, David Rowley wrote:
>> explain (analyze, costs off, summary off, timing off)  execute q1 (1,2,2,0);
>>QUERY PLAN
>> 
>>  Result (actual rows=0 loops=1)
>>One-Time Filter: false
>> (2 rows)
>
> Hmm.  It is the newly added inversion step that's causing this.  When
> creating a generic plan (that is when the planning happens via
> BuildCachedPlan called with boundParams set to NULL), the presence of
> Params will cause an inversion step's source step to produce
> scan-all-partitions sort of result, which the inversion step dutifully
> inverts to a scan-no-partitions result.
>
> I have tried to attack that problem by handling the
> no-values-to-prune-with case using a side-channel to propagate the
> scan-all-partitions result through possibly multiple steps.  That is, a
> base pruning step will set datum_offsets in a PruneStepResult only if
> pruning is carried out by actually comparing values with the partition
> bounds.  If no values were provided (like in the generic plan case), it
> will set a scan_all_nonnull flag instead and return without setting
> datum_offsets.  Combine steps perform their combining duty only if
> datum_offset contains a valid value, that is, if scan_all_nonnulls is not set.

I'm afraid this is still not correct :-(

The following code is not doing the right thing:

+ case COMBINE_UNION:
+     foreach(lc1, cstep->source_stepids)
+     {
+         int step_id = lfirst_int(lc1);
+         PruneStepResult *step_result;
+
+         /*
+          * step_results[step_id] must contain a valid result,
+          * which is confirmed by the fact that cstep's step_id is
+          * greater than step_id and the fact that results of the
+          * individual steps are evaluated in sequence of their
+          * step_ids.
+          */
+         if (step_id >= cstep->step.step_id)
+             elog(ERROR, "invalid pruning combine step argument");
+         step_result = step_results[step_id];
+         Assert(step_result != NULL);
+
+         result->scan_all_nonnull = step_result->scan_all_nonnull;

The last line there is not properly performing a union; it just sets
result->scan_all_nonnull to whatever the last step's value was.

At the very least it should be |= but I don't really like this new code.

Why did you move away from just storing the matching partitions in a
Bitmapset? If you want to store all non-null partitions, then why not
just set the bits for all non-null partitions? That would cut down on
bugs like this since the combining of step results would just be
simple unions or intersects.
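
For illustration, if each step result were just a Bitmapset of matching
partition indexes, a union combine step would collapse to something like
the sketch below (combine_union is a hypothetical helper, not code from
the patch):

/*
 * Sketch only: assumes step_results[] holds one Bitmapset of matching
 * partition indexes per already-evaluated step.
 */
static Bitmapset *
combine_union(Bitmapset **step_results, List *source_stepids)
{
    Bitmapset  *result = NULL;
    ListCell   *lc;

    foreach(lc, source_stepids)
        result = bms_add_members(result, step_results[lfirst_int(lc)]);

    return result;
}

An intersection combine would likewise just be repeated bms_int_members()
calls, leaving no per-flag bookkeeping to get wrong.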

Also, the following code could be made a bit nicer

+ result = (PruneStepResult *) palloc0(sizeof(PruneStepResult));
+
+ switch (context->strategy)
+ {
+     case PARTITION_STRATEGY_HASH:
+         result->bound_offsets = get_matching_hash_bound(context,
+                                                         opstep->opstrategy,
+                                                         values, nvalues,
+                                                         partsupfunc,
+                                                         opstep->nullkeys,
+                                                         &result->scan_all_nonnull);

Why not allocate the PruneStepResult inside the get_matching_*_bound
functions? That way you wouldn't need all those out parameters to set the
bool fields.
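
Something along these lines, just as a sketch (parameter types are copied
loosely from the quoted hunk and may not match the patch exactly):

/*
 * Sketch: the bound-matching function allocates and returns the whole
 * PruneStepResult itself, so no out parameters are needed for the flags.
 */
static PruneStepResult *
get_matching_hash_bound(PartitionPruneContext *context,
                        StrategyNumber opstrategy,
                        Datum *values, int nvalues,
                        FmgrInfo *partsupfunc,
                        Bitmapset *nullkeys)
{
    PruneStepResult *result = palloc0(sizeof(PruneStepResult));

    /* ... fill result->bound_offsets and result->scan_all_nonnull ... */

    return result;
}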

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Diagonal storage model

2018-04-02 Thread Ashutosh Bapat
On Mon, Apr 2, 2018 at 3:49 AM, Marko Tiikkaja  wrote:
> On Sun, Apr 1, 2018 at 3:48 PM, Konstantin Knizhnik
>  wrote:
>>
>> I want to announce new model, "diagonal storage" which combines benefits
>> of both approaches.
>> The idea is very simple: we first store column 1 of first record, then
>> column 2 of second record, ... and so on until we reach the last column.
>> After it we store second column of first record, third column of the
>> second record,...
>
>
> I'm a little worried about the fact that even with this model we're still
> limited to only two dimensions.  That's bound to cause problems sooner or
> later.
>

How about a 3D storage model, whose first dimension gives horizontal
view, second provides vertical or columnar view and third one provides
diagonal view. It also provides capability to add extra dimensions to
provide additional views like double diagonal view. Alas! it all
collapses since I was late to the party.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: json(b)_to_tsvector with numeric values

2018-04-02 Thread Dmitry Dolgov
> On 2 April 2018 at 11:27, Arthur Zakirov  wrote:
> On Mon, Apr 02, 2018 at 11:41:12AM +0300, Oleg Bartunov wrote:
>> On Mon, Apr 2, 2018 at 9:45 AM, Arthur Zakirov  
>> wrote:
>> I found this bug while working on a presentation about FTS, and it looked
>> annoying since it violates the consistency of FTS. I think this is a bug
>> which needs to be fixed; otherwise the inconsistency with the existing
>> full text search will get deeper.
>>
>> The fix looks trivial, but needs a review, of course.
>
> Oh, I understood. The code looks good, tests passed. But maybe it is
> better to use NumericGetDatum() instead of PointerGetDatum()?

Well, technically speaking they're the same, but yes, NumericGetDatum would be
more precise. I've modified it in the attached patch.
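
For anyone curious what the difference amounts to: NumericGetDatum is
currently just a thin wrapper around PointerGetDatum, so the two spellings
produce the same Datum and the numeric-specific macro only documents the
intended type.  A minimal backend-side illustration (the helper function
and the header list are my own, not from the patch):

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"
#include "utils/numeric.h"

/* Build a numeric Datum both ways; the bits are identical. */
static Datum
make_numeric_datum(void)
{
    Numeric     num;

    num = DatumGetNumeric(DirectFunctionCall3(numeric_in,
                                              CStringGetDatum("3.14"),
                                              ObjectIdGetDatum(InvalidOid),
                                              Int32GetDatum(-1)));

    Assert(NumericGetDatum(num) == PointerGetDatum(num));

    return NumericGetDatum(num);    /* clearer about the value's type */
}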


jsonb_to_tsvector_numeric_v2.patch
Description: Binary data


Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-04-02 Thread Amit Langote
Fujita-san,

Thanks for updating the patch.

On 2018/03/30 21:56, Etsuro Fujita wrote:
> I modified the patch to use the existing API ExecForeignInsert instead of
> that API and removed that API including this doc.

OK.

>> 2. If I understand the description you provided in [1] correctly, the
>> point of having ri_PartitionIsValid and ExecInitExtraPartitionInfo() is to
>> avoid possibly-redundantly performing following two steps in
>> ExecSetupPartitionTupleRouting() for *reused* partition ResultRelInfos
>> that may not be used for tuple routing after all:
>>
>>   - create the parent_child_tupconv_maps[] entry for it
>>   - perform FDW tuple routing initialization.
> 
> Sorry, my explanation was not enough, but that was just one of the reasons
> why I introduced those; another is to do CheckValidResultRel against a
> partition after the partition was chosen for tuple routing rather than in
> ExecSetupPartitionTupleRouting, to avoid aborting an UPDATE of a partition
> key unnecessarily due to non-routable foreign-partitions that may not be
> chosen for tuple routing after all.

I see.  So, currently in ExecSetupPartitionTupleRouting, we call
CheckValidResultRel() to check if resultRelInfos reused from those
initialized for UPDATE are valid for insertions (or rather for routing
inserted tuples into it).  But you're proposing to delay that check until
ExecPrepareTupleRouting is called for a given resultRelInfo, at which
point it's clear that we actually want to insert using that resultRelInfo.
 That seems reasonable.

> Now we have ON CONFLICT for partitioned tables, which requires the
> conversion map to be computed in ExecInitPartitionInfo, so I updated the
> patch so that we keep the former step in ExecInitPartitionInfo and
> ExecSetupPartitionTupleRouting and so that we just init the FDW in
> ExecInitExtraPartitionInfo (changed to ExecInitForeignRouting).  And I
> added comments to ExecInitForeignRouting and ExecPrepareTupleRouting.

That looks good.


I looked at the new patch.  It looks good overall, although I have one
question -- IIUC, BeginForeignInsert() performs actions that are
equivalent of performing PlanForeignModify() + BeginForeignModify() for an
INSERT as if it was directly executed on a given foreign table/partition.
Did you face any problems in doing the latter itself in the core code?
Doing it that way would mean no changes to a FDW itself will be required
(and hence no need for additional APIs), but I may be missing something.

One thing that seems good about your approach is that newly added support
for COPY FROM on foreign tables/partitions takes minimal changes as
implemented by using the new API, whereas with the alternative approach it
would require duplicated code (same code would have to be present in both
copy.c and execPartition.c, but it could perhaps be avoided).

Thanks,
Amit




Re: [HACKERS] [PATCH] Lockable views

2018-04-02 Thread Yugo Nagata
On Thu, 29 Mar 2018 17:26:36 -0700
Andres Freund  wrote:

Thank you for your comments. I attach a patch to fix issues
you've pointed out.

> Hi,
> 
> On 2018-03-28 20:26:48 +0900, Yugo Nagata wrote:
> > diff --git a/doc/src/sgml/ref/lock.sgml b/doc/src/sgml/ref/lock.sgml
> > index b2c7203..96d477c 100644
> > --- a/doc/src/sgml/ref/lock.sgml
> > +++ b/doc/src/sgml/ref/lock.sgml
> > @@ -46,6 +46,11 @@ LOCK [ TABLE ] [ ONLY ]  > class="parameter">name [ * ]
> >
> >  
> >
> > +   When a view is specified to be locked, all relations appearing in the 
> > view
> > +   definition query are also locked recursively with the same lock mode. 
> > +  
> 
> Trailing space added. I'd remove "specified to be" from the sentence.

Fixed.

> 
> I think mentioning how this interacts with permissions would be a good
> idea too. Given how operations use the view's owner to recurse, that's
> not obvious. Should also explain what permissions are required to do the
> operation on the view.

Added a description about permissions into the documentation.

> 
> 
> > @@ -86,15 +92,17 @@ RangeVarCallbackForLockTable(const RangeVar *rv, Oid 
> > relid, Oid oldrelid,
> > return; /* woops, concurrently 
> > dropped; no permissions
> >  * check */
> >  
> > -   /* Currently, we only allow plain tables to be locked */
> > -   if (relkind != RELKIND_RELATION && relkind != RELKIND_PARTITIONED_TABLE)
> > +
> 
> This newline looks spurious to me.

Removed.

> 
> 
> >  /*
> > + * Apply LOCK TABLE recursively over a view
> > + *
> > + * All tables and views appearing in the view definition query are locked
> > + * recursively with the same lock mode.
> > + */
> > +
> > +typedef struct
> > +{
> > +   Oid root_reloid;
> > +   LOCKMODE lockmode;
> > +   bool nowait;
> > +   Oid viewowner;
> > +   Oid viewoid;
> > +} LockViewRecurse_context;
> 
> Probably wouldn't hurt to pgindent the larger changes in the patch.
> 
> 
> > +static bool
> > +LockViewRecurse_walker(Node *node, LockViewRecurse_context *context)
> > +{
> > +   if (node == NULL)
> > +   return false;
> > +
> > +   if (IsA(node, Query))
> > +   {
> > +   Query   *query = (Query *) node;
> > +   ListCell*rtable;
> > +
> > +   foreach(rtable, query->rtable)
> > +   {
> > +   RangeTblEntry   *rte = lfirst(rtable);
> > +   AclResultaclresult;
> > +
> > +   Oid relid = rte->relid;
> > +   char relkind = rte->relkind;
> > +   char *relname = get_rel_name(relid);
> > +
> > +   /* The OLD and NEW placeholder entries in the view's 
> > rtable are skipped. */
> > +   if (relid == context->viewoid &&
> > +   (!strcmp(rte->eref->aliasname, "old") || 
> > !strcmp(rte->eref->aliasname, "new")))
> > +   continue;
> > +
> > +   /* Currently, we only allow plain tables or views to be 
> > locked. */
> > +   if (relkind != RELKIND_RELATION && relkind != 
> > RELKIND_PARTITIONED_TABLE &&
> > +   relkind != RELKIND_VIEW)
> > +   continue;
> > +
> > +   /* Check infinite recursion in the view definition. */
> > +   if (relid == context->root_reloid)
> > +   ereport(ERROR,
> > +   
> > (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> > +   errmsg("infinite recursion 
> > detected in rules for relation \"%s\"",
> > +   
> > get_rel_name(context->root_reloid;
> 
> Hm, how can that happen? And if it can happen, why can it only happen
> with the root relation?

For example, the following statements create infinitely recursive view
definitions. This is detected and an error is raised.

 create table t (i int);
 create view v1 as select 1;
 create view v2 as select * from v1;
 create or replace view v1 as select * from v2;
 begin;
 lock v1;
 abort;

However, I found that the previous patch could not handle the following
situation, in which the root relation itself is not part of the recursive
cycle.

 create view v3 as select * from v1;
 begin;
 lock v3;
 abort;

This is fixed in the attached patch.
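
One way to handle that (a sketch, not necessarily what the attached patch
does) is to carry the chain of ancestor view OIDs in the walker context
and test membership before recursing; ancestor_views and the helper below
are hypothetical names:

typedef struct
{
    LOCKMODE    lockmode;
    bool        nowait;
    Oid         viewowner;
    Oid         viewoid;
    List       *ancestor_views;     /* OIDs of views being expanded above us */
} LockViewRecurse_context;

static void
check_infinite_recursion(const LockViewRecurse_context *context, Oid relid)
{
    if (list_member_oid(context->ancestor_views, relid))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
                 errmsg("infinite recursion detected in rules for relation \"%s\"",
                        get_rel_name(relid))));
}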

Regards,

> 
> Greetings,
> 
> Andres Freund


-- 
Yugo Nagata 
diff --git a/doc/src/sgml/ref/lock.sgml b/doc/src/sgml/ref/lock.sgml
index 96d477c..12303fe 100644
--- a/doc/src/sgml/ref/lock.sgml
+++ b/doc/src/sgml/ref/lock.sgml
@@ -46,8 +46,8 @@ LOCK [ TABLE ] [ ONLY ] name [ * ]
   
 
   
-   When a view is specified to be locked, all relations appearing in the view
-   definition query are also locked recursively with the same lock mode. 
+   When a view is locked, all relations appearing in the view definition
+   query are a

Re: json(b)_to_tsvector with numeric values

2018-04-02 Thread Arthur Zakirov
On Mon, Apr 02, 2018 at 11:41:12AM +0300, Oleg Bartunov wrote:
> On Mon, Apr 2, 2018 at 9:45 AM, Arthur Zakirov  
> wrote:
> I found this bug while working on a presentation about FTS, and it looked
> annoying since it violates the consistency of FTS. I think this is a bug
> which needs to be fixed; otherwise the inconsistency with the existing
> full text search will get deeper.
> 
> The fix looks trivial, but needs a review, of course.

Oh, I understood. The code looks good, tests passed. But maybe it is
better to use NumericGetDatum() instead of PointerGetDatum()?

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



make installcheck-world in a clean environment

2018-04-02 Thread Alexander Lakhin

Hello,

I am trying to perform "make installcheck-world" after "make clean" (or
after installing precompiled binaries), but it fails.

The error I get is related to ECPG regression test:
make -C test installcheck
make[2]: Entering directory 
`/home/postgres/postgres/src/interfaces/ecpg/test'
gcc -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv 
-fexcess-precision=standard -O2 '-I../../../../src/port' 
'-I../../../../src/test/regress' '-DHOST_TUPLE="x86_64-pc-linux-gnu"' 
'-DSHELLPROG="/bin/sh"' '-DDLSUFFIX=".so"' -I../../../../src/include  
-D_GNU_SOURCE   -c -o pg_regress_ecpg.o pg_regress_ecpg.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv 
-fexcess-precision=standard -O2 -L../../../../src/port 
-L../../../../src/common -Wl,--as-needed 
-Wl,-rpath,'/usr/local/pgsql/lib',--enable-new-dtags pg_regress_ecpg.o 
../../../../src/test/regress/pg_regress.o -lpgcommon -lpgport -lpthread 
-lz -lreadline -lrt -lcrypt -ldl -lm  -o pg_regress

make -C connect all
make[3]: Entering directory 
`/home/postgres/postgres/src/interfaces/ecpg/test/connect'

make[3]: *** No rule to make target `test1.c', needed by `test1.o'.  Stop.
make[3]: Leaving directory 
`/home/postgres/postgres/src/interfaces/ecpg/test/connect'


Is it a supported scenario to make installcheck-world without performing 
"make" first?
(If I do "make -C src/interfaces/ecpg" and then "make 
installcheck-world", then this error is gone. And when I set up all the 
extensions, all tests passed successfully.)
And even if we need to perform make, I wonder: should the recompiled
ecpg binary be checked instead of the installed one?
I tried to modify the Makefile to target the installed ecpg binary and its
libraries (see the attached patch); it works, but this fix is only suited
for installcheck.

So if this scenario should be supported, a more elaborate fix is needed.

Best regards,
--
Alexander Lakhin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/interfaces/ecpg/test/Makefile.regress b/src/interfaces/ecpg/test/Makefile.regress
index b3d7c1e..5df1a44 100644
--- a/src/interfaces/ecpg/test/Makefile.regress
+++ b/src/interfaces/ecpg/test/Makefile.regress
@@ -5,14 +5,14 @@ override CPPFLAGS := -I../../include -I$(top_srcdir)/src/interfaces/ecpg/include
-I$(libpq_srcdir) $(CPPFLAGS)
 override CFLAGS += $(PTHREAD_CFLAGS)
 
-override LDFLAGS := -L../../ecpglib -L../../pgtypeslib $(filter-out -l%, $(libpq)) $(LDFLAGS)
+override LDFLAGS := -L'$(DESTDIR)$(libdir)/' $(filter-out -l%, $(libpq)) $(LDFLAGS)
 override LIBS := -lecpg -lpgtypes $(filter -l%, $(libpq)) $(LIBS) $(PTHREAD_LIBS)
 
 # Standard way to invoke the ecpg preprocessor
-ECPG = ../../preproc/ecpg --regression -I$(srcdir)/../../include -I$(srcdir)
+ECPG = '$(DESTDIR)$(bindir)/ecpg' --regression -I$(srcdir)/../../include -I$(srcdir)
 
 # Files that most or all ecpg preprocessor test outputs depend on
-ECPG_TEST_DEPENDENCIES = ../../preproc/ecpg$(X) \
+ECPG_TEST_DEPENDENCIES =  \
$(srcdir)/../regression.h \
$(srcdir)/../../include/sqlca.h \
$(srcdir)/../../include/sqlda.h \


Re: json(b)_to_tsvector with numeric values

2018-04-02 Thread Oleg Bartunov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


On Mon, Apr 2, 2018 at 9:45 AM, Arthur Zakirov  wrote:
> Hello Dmitry,
>
> Dmitry Dolgov <9erthali...@gmail.com> wrote:
>>
>> Any opinions about this suggestion? Can it be considered as a bug fix and
>> included into this release?
>
>
> I think there is no chance to include it into v11. You can add the patch to
> the 2018-09 commitfest.

I found this bug while working on a presentation about FTS, and it looked
annoying since it violates the consistency of FTS. I think this is a bug
which needs to be fixed; otherwise the inconsistency with the existing
full text search will get deeper.

The fix looks trivial, but needs a review, of course.

>
>
> --
> Arthur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company



Re: [HACKERS] logical decoding of two-phase transactions

2018-04-02 Thread Simon Riggs
On 29 March 2018 at 23:30, Petr Jelinek  wrote:
> On 29/03/18 23:58, Andres Freund wrote:
>> On 2018-03-29 23:52:18 +0200, Tomas Vondra wrote:
 I have added details about this in src/backend/storage/lmgr/README as
 suggested by you.

>>>
>>> Thanks. I think the README is a good start, but I think we also need to
>>> improve the comments, which is usually more detailed than the README.
>>> For example, it's not quite acceptable that LogicalLockTransaction and
>>> LogicalUnlockTransaction have about no comments, especially when it's
>>> meant to be public API for decoding plugins.
>>
>> FWIW, for me that's ground to not accept the feature. Burdening output
>> plugins with this will make their development painful (because they'll
>> have to adapt regularly) and correctness doubtful (there's nothing
>> checking for the lock being skipped).  Another way needs to be found.
>>
>
> I have to agree with Andres here. It's also visible in the latter
> patches. The pgoutput patch forgets to call these new APIs completely.
> The test_decoding plugin calls them, but it does so even when it's
> processing changes for a committed transaction. I think that should be
> avoided, as it means potentially doing an SLRU lookup for every change.
> So doing it right is indeed not easy.

Yet you spotted these problems easily enough. Similar to finding
missing LWlocks.

> I was wondering how to hide this. The best idea I have had so far would
> be to put it in heap_beginscan (and index_beginscan, given that catalog
> scans use it as well) behind some condition. That would also improve
> performance because locking would not need to happen for syscache hits.
> The problem, however, is how to inform heap_beginscan about the fact that
> we are in 2PC decoding. We definitely don't want to change all the scan
> APIs for this. I wonder if we could add some kind of property to Snapshot
> which would indicate this fact - logical decoding uses its own snapshots,
> so it could inject the information about being inside 2PC decoding.
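
For concreteness, here is a standalone toy model of that snapshot-flag
idea (the field and function names are hypothetical, and this is plain C
rather than PostgreSQL code):

#include <stdbool.h>
#include <stdio.h>

/* Tag the snapshot used while decoding a prepared transaction and let
 * scan start-up take the decoding lock itself, so output plugins cannot
 * forget it. */
typedef struct Snapshot
{
    bool        in_2pc_decoding;    /* injected by the decoder */
} Snapshot;

static bool
logical_lock_transaction(void)
{
    /* stand-in for the patch's LogicalLockTransaction(); assume it can
     * fail if the prepared transaction was aborted concurrently */
    printf("decoding lock taken\n");
    return true;
}

static bool
begin_scan(const Snapshot *snap)
{
    if (snap->in_2pc_decoding && !logical_lock_transaction())
        return false;           /* give up decoding this transaction */

    /* ... normal scan setup would happen here ... */
    return true;
}

int
main(void)
{
    Snapshot    snap = {.in_2pc_decoding = true};

    return begin_scan(&snap) ? 0 : 1;
}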

Perhaps, but how do we know we've covered all the right places? We
don't know what every plugin will require, do we?

The plugin needs to take responsibility for its own correctness,
whether we make it easier or not.

It seems clear that we would need a generalized API (the proposed
locking approach) to cover all requirements.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] logical decoding of two-phase transactions

2018-04-02 Thread Simon Riggs
On 29 March 2018 at 23:24, Andres Freund  wrote:

>> I agree with the former, of course - docs are a must. I disagree with
>> the latter, though - there have been about no proposals how to do it
>> without the locking. If there are, I'd like to hear about it.
>
> I don't care. Either another solution needs to be found, or the locking
> needs to be automatically performed when necessary.

That seems unreasonable.

It's certainly a nice future goal to have it all happen automatically,
but we don't know what the plugin will do.

How can we ever make an unknown task happen automatically? We can't.


We have a reasonable approach here. Locking shared resources before
using them is not a radical new approach, it's just standard
development. If we find a better way in the future, we can use that,
but requiring a better solution when there isn't one is unreasonable.
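
For reference, on the plugin side that locking discipline looks roughly
like the pattern below (the change callback shape matches test_decoding;
the exact LogicalLockTransaction/LogicalUnlockTransaction signatures are
taken from the patch under discussion and may differ):

/* Sketch of an output plugin change callback using the proposed locking
 * API; illustrative only. */
static void
pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                 Relation relation, ReorderBufferChange *change)
{
    /* Only needed while decoding an in-progress (2PC) transaction. */
    if (!LogicalLockTransaction(txn))
        return;                 /* transaction aborted concurrently; skip */

    /* ... catalog lookups and output of the change happen here ... */

    LogicalUnlockTransaction(txn);
}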

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services