Re: Problem while updating a foreign table pointing to a partitioned table on foreign server

2018-06-14 Thread Ashutosh Bapat
On Tue, Jun 12, 2018 at 8:49 AM, Kyotaro HORIGUCHI
 wrote:
> Thanks for the discussion.
>
> At Thu, 7 Jun 2018 19:16:57 +0530, Ashutosh Bapat 
>  wrote in 
> 
>> On Tue, Jun 5, 2018 at 3:40 PM, Kyotaro HORIGUCHI
>>  wrote:
>> > Hello.
>> >
>> > At Mon, 04 Jun 2018 20:58:28 +0900 (Tokyo Standard Time), Kyotaro 
>> > HORIGUCHI  wrote in 
>> > <20180604.205828.208262556.horiguchi.kyot...@lab.ntt.co.jp>
>> >> It fails on some join-pushdown cases since it doesn't add tid
>> >> columns to join tlist.  I suppose that build_tlist_to_deparse
>> >> needs something but I'll consider further tomorrow.
>> >
>> > I made it work with a few exceptions and bumped.  PARAM_EXEC
>> > doesn't work at all in a case where Sort exists between
>> > ForeignUpdate and ForeignScan.
>> >
>> > =
>> > explain (verbose, costs off)
>> > update bar set f2 = f2 + 100
>> > from
>> >   ( select f1 from foo union all select f1+3 from foo ) ss
>> > where bar.f1 = ss.f1;
>> >   QUERY PLAN
>> > -
>> >  Update on public.bar
>> >Update on public.bar
>> >Foreign Update on public.bar2
>> >  Remote SQL: UPDATE public.loct2 SET f2 = $3 WHERE tableoid = $1 AND 
>> > ctid = $2
>> > ...
>> >->  Merge Join
>> >  Output: bar2.f1, (bar2.f2 + 100), bar2.f3, (ROW(foo.f1))
>> >  Merge Cond: (bar2.f1 = foo.f1)
>> >  ->  Sort
>> >Output: bar2.f1, bar2.f2, bar2.f3, bar2.tableoid, bar2.ctid
>> >Sort Key: bar2.f1
>> >->  Foreign Scan on public.bar2
>> >  Output: bar2.f1, bar2.f2, bar2.f3, bar2.tableoid, 
>> > bar2.ctid
>> >  Remote SQL: SELECT f1, f2, f3, ctid, tableoid FROM 
>> > public.loct2 FOR UPDATE
>> > =
>>
>> What's the problem that you faced?
>
> The required condition for PARAM_EXEC to work is that the executor
> ensures the correspondence between the setter and the reader of a
> param, as ExecNestLoop does. The Sort node breaks the
> correspondence between the tuple obtained from the Foreign Scan
> and the one that ForeignUpdate is updating. Specifically, Foreign Update
> updates the first tuple using the tableoid for the last tuple
> from the Foreign Scan.

Ok. Thanks for the explanation.

>
>> > Even if this worked fine, it cannot be back-patched.  We need an
>> > extra storage moves together with tuples or prevent sorts or
>> > something like from being inserted there.
>>
>> I think your approach still has the same problem that it's abusing the
>> tableOid field in the heap tuple to store tableoid from the remote as
>> well as local table. That's what Robert and Tom objected to [1], [2]
>
> That's a misunderstanding. PARAM_EXEC conveys remote tableoids
> outside the tuples, and each tuple stores the correct (= local)
> tableoid.

In the patch I saw that we were setting the tableoid field of a HeapTuple to
the remote table's oid somewhere; hence the comment. I might be wrong.

>
>> > At Fri, 1 Jun 2018 10:21:39 -0400, Ashutosh Bapat 
>> >  wrote in 
>> > 
>> >> I am not suggesting to commit 0003 in my patch set, but just 0001 and
>> >> 0002 which just raise an error when multiple rows get updated when
>> >> only one row is expected to be updated.
>> >
>> > So I agree to commit the two at least in order to prevent doing
>> > wrong silently.
>>
>> I haven't heard any committer's opinion on this one yet.
>>
>> [1] 
>> https://www.postgresql.org/message-id/CA+TgmobUHHZiDR=hcu4n30yi9_pe175ittbfk6t8jxzwkra...@mail.gmail.com
>> [2] https://www.postgresql.org/message-id/7936.1526590932%40sss.pgh.pa.us
>
> Agreed. We need some comment to proceed.
>
> I have now actually demonstrated a problem with the
> PARAM_EXEC case. (It seems a bit silly that I found the
> problem only after it became almost workable, though..)

I think the general idea behind Tom's suggestion is that we have to
use some node other than a Var node when we add the junk columns to
the targetlist. He suggested Param since that gives us some place to
store the remote tableoid. But if that's not working, another idea (that
Tom mentioned during our discussion at PGCon) is to invent a new node
type like ForeignTableOid or something like that, which gets deparsed
to "tableoid" and evaluated to the table oid on the foreign server.
That will not work as-is, since the postgres_fdw code treats a foreign
table almost like a local table in many ways, e.g. it uses attrs_used to
know which attributes are to be requested from the foreign server,
build_tlist_to_deparse() only pulls Var nodes from the targetlist of a
foreign table, and so on. All of those assumptions would need to change
with this approach. The good thing is that, because of join and aggregate
push-down, we already have the ability to push arbitrary kinds of nodes
down to the foreign server through the targetlist, so we should be able
to leverage that capability. Still, it looks like a lot of change, which
again doesn't seem to be back-portable.

-- 
Best 

RE: ON CONFLICT DO NOTHING on pg_dump

2018-06-14 Thread Ideriha, Takeshi
Hi,

>-Original Message-
>From: Surafel Temesgen [mailto:surafel3...@gmail.com]
>thank you for the review
>
>   Do you have any plan to support on-conflict-do-update? Supporting this
> seems complicated to me and would take much time, so I don't mind not
> implementing it.
>
>
>I agree it's complicated and I don't have a plan to implement it.

Sure.

>Thank you for pointing that out; I added a basic test, and it seems to me the
>rest of the testing is covered by the column_inserts test.

Thank you for updating. 
Just one comment for the code.

+   if (dopt.do_nothing && !(dopt.dump_inserts || dopt.column_inserts))
I think you can remove dopt.column_inserts here, because at this point
dopt.dump_inserts has already been set to true if --column-inserts is
specified.

But I'm inclined to think that the use case for this feature is vague.
Could you give a more concrete use case that is practical for many users?
(That would draw more attention to this patch.)
For example, is it useful for taking a backup just before making a big change
to the database, such as deleting tables?

===
Takeshi


Re: WAL prefetch

2018-06-14 Thread Amit Kapila
On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost  wrote:
>
>> I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
>> RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
>> The speed of synchronous replication between two nodes is increased from 56k
>> TPS to 60k TPS (on pgbench with scale 1000).
>
> I'm also surprised that it wasn't a larger improvement.
>
> Seems like it would make sense to implement in core using
> posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
> or nearby..  At least, that's the thinking I had when I was chatting w/
> Sean.
>

Doing it in core certainly has some advantages, such as being able to easily
reuse the existing xlog code rather than trying to make a copy as is currently
done in the patch, but I think it also depends on whether this is really a win
in a number of common cases or just in some limited cases.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: automating perl compile time checking

2018-06-14 Thread Peter Eisentraut
On 6/11/18 16:06, Andrew Dunstan wrote:
> This affects pretty much nothing. In fact some of the other changes I've 
> recently committed were arguably more dangerous. Do you want me to 
> revert the whole lot?

No, but this whole let's-clean-up-the-Perl-code initiative seemed to
have come out of nowhere after feature freeze.  The Perl code was fine
before, so if we're going to be polishing it, that should go into
the next release.  I would actually have liked to have had more time to
discuss some of the changes, since I didn't agree with some of
them at first reading.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Few cosmetic suggestions for commit 16828d5c (Fast Alter Table Add Column...)

2018-06-14 Thread Amit Kapila
On Fri, Jun 15, 2018 at 12:54 AM, Andrew Dunstan
 wrote:
>
> On 06/14/2018 02:01 PM, Alvaro Herrera wrote:
>>
>> On 2018-Jun-14, Amit Kapila wrote:
>>
>>> On Sun, Jun 3, 2018 at 5:08 AM, Amit Kapila 
>>> wrote:

 2.
 +/*
 + * Structure used to represent value to be used when the attribute is not
 + * present at all in a tuple, i.e. when the column was created after the tuple
 + */
 +
 +typedef struct attrMissing
 +{
 +   bool    ammissingPresent;   /* true if non-NULL missing value exists */
 +   Datum   ammissing;          /* value when attribute is missing */
 +} AttrMissing;
 +
..
>>
>> We used to use prefixes for common struct members names to help
>> disambiguate across members that would otherwise have identical names in
>> different structs.  Our convention was to use _ as a separator.  This
>> convention has been partially lost, but seems we can use it to good
>> effect here, by renaming ammissingPresent to am_present and ammissing to
>> am_missing (I would go as far as suggesting am_default or am_substitute
>> or something like that).
>
> am_present and am_value perhaps? I'm not dogmatic about it.
>

+1.  Attached patch changed the names as per suggestion.

>
>
>
>> BTW I think "the result stored" is correct English.
>>
>
> Yes, it certainly is.
>

Okay.

How about attached?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


cosmetic_changes_fast_addcol_v2.patch
Description: Binary data


Re: AtEOXact_ApplyLauncher() and subtransactions

2018-06-14 Thread Peter Eisentraut
On 6/5/18 07:02, Amit Khandekar wrote:
> I haven't written a patch for it, but I think we should have a
> separate on_commit_stop_workers for each subtransaction. At
> subtransaction commit, we replace the on_commit_stop_workers list of
> the parent subtransaction with the one from the committed
> subtransaction; and on abort, discard the list of the current
> subtransaction. So have a stack of the lists.

Is there maybe a more general mechanism we could attach this to?  Maybe
resource owners?

> Subscription mysub is set to synchronise tables tx1 and tx2 :
> CREATE SUBSCRIPTION mysub ... PUBLICATION pubx;
> 
> Now suppose the subscription is altered to synchronise ty1 and ty2,
> and then again altered back to synchronise tx1 and tx2 in the same
> transaction.
> begin;
> ALTER SUBSCRIPTION mysub set publication puby;
> ALTER SUBSCRIPTION mysub set publication pubx;
> commit;
> 
> Here, workers for tx1 and tx2 are added to on_commit_stop_workers
> after the publication is set to puby. And then workers for ty1 and ty2
> are further added to that list after the 2nd ALTER command where
> publication is set to pubx, because the earlier ALTER command has
> already changed the catalogs to denote that ty1 and ty2 are being
> synchronised. Effectively, at commit, all the workers are targeted to
> be stopped, when actually at commit time there is no final change in
> the tables to be synchronised.

I'm not so bothered about this scenario.  When you drop and then
recreate a table in the same transaction, that doesn't mean you keep the
data that was previously in the table.  If you want to *undo* a change,
you need to do rollback, not commit further changes that try to recreate
the previous state.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #15237: I got "ERROR: source for a multiple-column UPDATE item must be a sub-SELECT or ROW() expression"

2018-06-14 Thread Andrew Gierth
> "Justin" == Justin Pryzby  writes:

 >> It then apparently went unnoticed until after the release of pg 10,
 >> at which point it got retroactively documented (in the release notes
 >> and nowhere else), in response to a brief discussion of a user
 >> complaint that happened on -general and not -hackers (or even
 >> -bugs).

 Justin> Actually I noticed and pointed out during beta;
 Justin> 
https://www.postgresql.org/message-id/flat/20170719174507.GA19616%40telsasoft.com#20170719174507.ga19...@telsasoft.com

Interesting, I missed that when searching. So that tells us:

1. Breaking SET (col) = (val) wasn't actually intended or noticed in the
   original patch;

2. The break was then defended based on an incorrect reading of the spec
   (as I pointed out already on -bugs, the syntax _is_ legal in the spec
   with only one column, for just the same reasons that VALUES (1) isn't
   required to be VALUES row(1); the spec adds the ROW( ) construct
   itself in one of the syntax rules, in a way that's easy to overlook)
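
To make the syntax question concrete, a minimal sketch (the table is made up;
the v10 behavior described is the one reported above):

create table t (a int, b int);

-- Multi-column form: the parenthesized list is an implicit row constructor,
-- so it is accepted.
update t set (a, b) = (1, 2);

-- Single-column form: legal per the spec reading above, and accepted before
-- v10, but rejected by v10 with the error in the subject line, because (1)
-- is parsed as a parenthesized scalar rather than a row constructor.
update t set (a) = (1);

-- The spelling v10 does accept:
update t set (a) = row(1);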

-- 
Andrew (irc:RhodiumToad)



Re: ntile() throws ERROR when hashagg is false

2018-06-14 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 >> I wonder if it would be worth adding a run-time check in
 >> window_ntile() that causes an ERROR on first call if there are any
 >> Vars or PARAM_EXEC Params in the function argument. An ERROR might
 >> be better than doing something that the user does not expect.

 Tom> -1, that would break cases that are legal and useful, such as
 Tom> where a PARAM_EXEC Param represents an outer-query-level variable,
 Tom> while still failing to catch some problematic cases (eg. volatile
 Tom> functions).

The only sane run-time check (that I can think of) that could be applied
would be to check that the value is the same on each row of a partition.

 Tom> I think also that there are cases that are not legal per spec but
 Tom> can still be useful, as long as the user knows what they're doing.

Yes, it would make sense for example to allow the value to change
between partitions.
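
As a sketch of that sort of case (names made up, and whether the server should
accept or check it is exactly what is being debated here): an argument that is
constant within each partition but differs across partitions, e.g.

-- buckets_per_dept is assumed to be functionally dependent on dept, so
-- ntile's argument is constant within any one partition:
select dept,
       ntile(buckets_per_dept) over (partition by dept order by salary)
from emp;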

-- 
Andrew (irc:RhodiumToad)



Re: Partitioning with temp tables is broken

2018-06-14 Thread Tom Lane
David Rowley  writes:
> On 15 June 2018 at 02:42, Tom Lane  wrote:
>> I think that if possible, we should still allow a partitioned table
>> in which all the rels are temp tables of the current session.  What we
>> have to disallow is (a) temp/permanent mixes and (b) temp tables from
>> different sessions.

> So, this used to work in v10. Is it fine to just pluck the feature out
> of the v11 release and assume nobody cares?

IIUC, it worked in v10 only for small values of "work".

regards, tom lane



RE: ON CONFLICT DO NOTHING on pg_dump

2018-06-14 Thread Ideriha, Takeshi
>-Original Message-
>From: Nico Williams [mailto:n...@cryptonector.com]
>On Tue, Jun 12, 2018 at 09:05:23AM +, Ideriha, Takeshi wrote:
>> >From: Surafel Temesgen [mailto:surafel3...@gmail.com]
>> >Subject: ON CONFLICT DO NOTHING on pg_dump
>>
>> >Sometimes I have to maintain two similar databases and I have to update one
>> >from the other, and having the option to add an ON CONFLICT DO NOTHING clause
>> >to the INSERT commands in the dump data would allow pg_restore to be done
>> >free of ignorable errors.
>>
>> Hi,
>> I feel like on-conflict-do-nothing support is useful, especially coupled
>> with the --data-only option.
>> Only the difference of data can be restored.
>
>But that's additive-only.  Only missing rows are restored this way, and
>differences are not addressed.
>
>If you want restore to restore data properly and concurrently (as opposed to
>renaming a new database into place or whatever) then you'd want a) MERGE,
>b) dump to generate MERGE statements.  A concurrent data restore operation
>would be rather neat.

I agree with you, though supporting MERGE or ON-CONFLICT-DO-UPDATE seems like
hard work.
The use case for ON-CONFLICT-DO-NOTHING alone may be narrow.
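
For concreteness, a rough sketch of the difference (table, columns and values
are invented; MERGE is shown in its SQL-standard form, since PostgreSQL itself
has no MERGE at the time of writing):

-- What the proposed pg_dump option would emit: rows already present in the
-- target are skipped, so only missing rows are applied.
INSERT INTO public.t (id, val) VALUES (1, 'x') ON CONFLICT DO NOTHING;

-- A MERGE-generating dump, as suggested above, could also reconcile rows
-- whose contents differ:
MERGE INTO public.t AS tgt
USING (VALUES (1, 'x')) AS src (id, val) ON tgt.id = src.id
WHEN MATCHED THEN UPDATE SET val = src.val
WHEN NOT MATCHED THEN INSERT (id, val) VALUES (src.id, src.val);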

--
Takeshi 




Re: Partitioning with temp tables is broken

2018-06-14 Thread David Rowley
On 15 June 2018 at 02:42, Tom Lane  wrote:
> I think that if possible, we should still allow a partitioned table
> in which all the rels are temp tables of the current session.  What we
> have to disallow is (a) temp/permanent mixes and (b) temp tables from
> different sessions.

So, this used to work in v10. Is it fine to just pluck the feature out
of the v11 release and assume nobody cares?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Partitioning with temp tables is broken

2018-06-14 Thread Amit Langote
On 2018/06/14 22:11, Ashutosh Bapat wrote:
> On Thu, Jun 14, 2018 at 9:42 AM, Amit Langote
>> I'm attaching a patch here to forbid adding a temporary table as partition
>> of permanent table.  I didn't however touch the feature that allows *all*
>> members in a partition tree to be temporary tables.
> 
> Similar problems will happen if a temporary partitioned table's
> hierarchy contains partitions from different sessions. Also, what
> should happen to the partitions from the other session after the
> session creating the temporary partitioned table goes away is not
> clear to me. Should they get dropped, and how?
> 
> If I am reading Tom's reply upthread correctly, we should not allow
> creating temporary partitioned tables or temporary partitions at all.
> I thought that would be easily fixable in the grammar itself, but we
> allow creating a table in a temporary schema, so that doesn't work.
> A better fix might be to prohibit those cases in
> DefineRelation() itself.

Sorry, I may not have described my patch sufficiently, but as mentioned in
my reply to Tom, it suffices to prohibit the cases which we don't handle
correctly, such as a mix of temporary and permanent tables and/or
temporary tables of different sessions appearing in a given partition tree.

Thanks,
Amit




Re: BUG #15237: I got "ERROR: source for a multiple-column UPDATE item must be a sub-SELECT or ROW() expression"

2018-06-14 Thread Justin Pryzby
On Wed, Jun 13, 2018 at 05:48:50AM +0100, Andrew Gierth wrote:
> It then apparently went unnoticed until after the release of pg 10, at
> which point it got retroactively documented (in the release notes and
> nowhere else), in response to a brief discussion of a user complaint
> that happened on -general and not -hackers (or even -bugs).

Actually I noticed and pointed out during beta;
https://www.postgresql.org/message-id/flat/20170719174507.GA19616%40telsasoft.com#20170719174507.ga19...@telsasoft.com

Justin



Re: [GSoC] current working status

2018-06-14 Thread Charles Cui
Hi mentors,

   The repo can currently be built on my mac. You can check it out.
Also I am working on the CI config to monitor each commit.


Thanks

2018-06-14 8:29 GMT-07:00 Charles Cui :

> good idea, will keep a feature branch, and set up ci.
>
> On Thu, Jun 14, 2018, 8:03 AM Aleksander Alekseeev <
> a.aleks...@postgrespro.ru> wrote:
>
>> Hello Charles,
>>
>> >I saw the list of errors you posted. That's because I had some new
>> > functions implemented and pushed them without testing (in order not to
>> > forget my changes on my local machine).
>> > So you are saying it is best to keep each commit workable. Right?
>>
>> Ideally, yes. You can always create a separate branch to experiment with
>> new code and merge it to master branch when it becomes stable. This is
>> not a strict GSoC requirement or something like this, but a good
>> practice. If something goes wrong you can just delete an experimental
>> branch and go back to working master branch.
>>
>>
>> --
>> Best regards,
>> Aleksander Alekseev
>>
>


Re: Partitioning with temp tables is broken

2018-06-14 Thread Amit Langote
On 2018/06/14 23:42, Tom Lane wrote:
> Ashutosh Bapat  writes:
>> If I am reading Tom's reply upthread correctly, we should not allow
>> creating temporary partitioned tables or temporary partitions at all.
> 
> I think that if possible, we should still allow a partitioned table
> in which all the rels are temp tables of the current session.  What we
> have to disallow is (a) temp/permanent mixes and (b) temp tables from
> different sessions.

The patch I posted upthread does exactly that I think.  It allows for a
partition tree where all tables are temporary relations of the same
session, whereas it disallows:

1. Temporary-permanent mix

create table perm_p (a int) partition by list (a);
create temp table temp_p partition of perm_p for values in (1);
ERROR:  cannot create a temporary relation as partition of permanent
relation "perm_p"

create temp table temp_p1 (a int);
alter table perm_p attach partition temp_p1 for values in (1);
ERROR:  cannot attach a temporary relation as partition of permanent
relation "perm_p"

create temp table temp_p (a int) partition by list (a);
create table perm_p1 partition of temp_p for values in (1);
ERROR:  cannot create a permanent relation as partition of temporary
relation "temp_p"

create table perm_p1 (a int);
alter table temp_p attach partition perm_p1 for values in (1);
ERROR:  cannot attach a permanent relation as partition of temporary
relation "temp_p"

2. Mixing of temp table from different sessions

create temp table temp_other_p2 partition of temp_p for values in (2);
ERROR:  relation "temp_p" does not exist

create temp table temp_other_p2 partition of pg_temp_2.temp_p for values
in (2);
ERROR:  cannot create as partition of temporary relation of another session

create temp table temp_other_p2 (a int);
alter table temp_p attach partition temp_other_p2 for values in (2);
ERROR:  relation "temp_other_p2" does not exist

alter table temp_p attach partition pg_temp_3.temp_other_p2 for values in (2);
ERROR:  cannot attach temporary relation of another session as partition

Thanks,
Amit




Bogus dependency calculation for expressions involving casts

2018-06-14 Thread Tom Lane
I've identified the underlying cause of the misbehavior reported in
https://www.postgresql.org/message-id/7cb957dd12774a52ac8a680b73910...@imaginesoftware.com
and it's a bit embarrassing: we're doing dependency logging for casts
all wrong.  A cast expression produces a FuncExpr parse node, which
find_expr_references_walker happily logs as a dependency on the function
implementing the cast, rather than the cast per se.  Therefore pg_dump
has no idea that it needs to dump the CREATE CAST before it can dump
expressions relying on the availability of the cast notation.

It's a bit astonishing that this flaw hasn't been identified before now.
I think probably the explanation is that pg_dump's default dump order
(cf. pg_dump_sort.c) puts out casts before most object types that could
involve an expression depending on a cast, so that we accidentally emit
things in a safe order anyway.  But types --- in particular, default
expressions for domain types --- are an exception to that rule, and in
any case dump-order rearrangements made to satisfy other dependencies
could result in a failure.

I'm afraid that this might not be easy to fix.  The straightforward
approach would be to expand FuncExpr (and also RelabelType, CoerceViaIO,
ArrayCoerceExpr, RowExpr, and ConvertRowtypeExpr -- everything with a
CoercionForm field) to include the OID of the cast if the node came from
cast syntax, and then teach find_expr_references_walker to report the
dependency as being a dependency on the cast not the cast's implementation
function(s).  However, quite aside from the invasiveness and
non-back-patch-ability of this approach, I'm not even sure it totally
fixes the problem:

1. There are cases where we intentionally omit a RelabelType so that
the expression will be seen to have its original type.  It's possible
that this is all right because all such cases correspond to, essentially,
hard-wired polymorphism behavior rather than droppable casts.  But I'm not
quite sure that that's true.

2. As noted in the comments for enum CoercionForm, we'd really prefer that
equal() treat semantically-equivalent nodes as equal even if they have
different display output; this would imply ignoring the cast OID fields
in equal().  But I'm unsure that we can get away with that for expressions
that actually have different dependency lists ... it's not hard to imagine
ALTER operations screwing up and leaving an object with a dependency list
that doesn't match the current contents of its expression.

Another approach we could take is to not change the parse representation,
but instead teach find_expr_references_walker to look into pg_cast when
it sees a node that's marked as "display as coercion", and if it finds a
match then report the cast OID.  This'd make find_expr_references_walker
noticeably slower for such cases, and I'm not sure that it could expect
to find the correct match in every case because of cheating in binary
compatibility situations.  Also, if problem #1 or #2 above really is a
problem then it affects this approach just as much.  On the plus side,
this way would lead to an at-least-theoretically back-patchable solution
... not that it would fix existing misreported dependencies, but at least
we could record new ones correctly.

Another idea, which is not a bulletproof fix but might take care of nearly
all practical cases, is to adjust pg_dump's default object ordering rules
so that domain types come out after casts.

Seeing that this has been wrong since we invented dependencies and just
now came to light, I don't feel a need to panic about it; in particular
I think we probably ought not try to squeeze a fix into v11, unless we
go with the relatively non-invasive way of hacking pg_dump's ordering.
But we should think about finding a fix going forward.

BTW, the existence of ALTER DOMAIN SET DEFAULT means that it's probably
possible to set up situations with circular dependencies, by adding a
domain default after-the-fact that has an expression that somehow depends
on the domain's existence.  In principle pg_dump could be taught to break
such a circularity by using SET DEFAULT to dump the default expression
separately --- but it has no such logic now, nor does pg_depend contain
sufficient information to determine where to place the SET DEFAULT,
because we record any dependencies of the expression as direct
dependencies of the domain type.  So fixing that would be quite a bit of
work, which I'm disinclined to do because I can't think of any real-world
use case that would justify it.

Also ... the current situation means that it's possible to store an
expression "foo::mytype" and then drop the cast from foo's type to mytype,
as long as you keep the cast's implementation function.  The expression
will continue to print in that form, but it'll fail to dump-and-reload.
So this is an independent reason why just hacking pg_dump's ordering rules
isn't a full solution.
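
A rough reproduction of that last scenario, with entirely made-up names
(point and integer are used only because no cast between them exists out of
the box; the function body is a dummy, since what it returns doesn't matter):

create function point_to_int(point) returns int
    language sql immutable as $$ select 0 $$;
create cast (point as int) with function point_to_int(point);

create table t (p point);
create index t_idx on t ((p::int));   -- expression relies on the cast notation

-- Succeeds, because the recorded dependency is on point_to_int(), not on
-- the cast itself:
drop cast (point as int);

-- t_idx still displays with (p::int) in its definition, but that definition
-- will no longer dump-and-reload.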

regards, tom lane



Re: [HACKERS] Optional message to user when terminating/cancelling backend

2018-06-14 Thread Daniel Gustafsson
> On 13 Jun 2018, at 21:16, Andres Freund  wrote:

Thanks for the review!

> I'm not sure I really like the appending bit. There's a security
> argument to be made about doing so, but from a user POV that mostly
> seems restrictive.  I wonder if that aspect would be better handled by
> adding an error context that contains information about which process
> did the cancellation (or DETAIL?)?

That doesn’t sound like a bad idea at all; I took a stab at that in the
attached patch to see what it would look like.  Is that along the lines of what
you had in mind?

It does however make testing harder, as the .out file can’t contain the
matching pid (I commented out the test in the attached patch, hoping that
there is a way to test it that I’m failing to see).

>> +/*
>> + * Structure for registering a feedback payload to be sent to a cancelled, or
>> + * terminated backend. Each backend is registered per pid in the array which is
>> + * indexed by Backend ID. Reading and writing the message is protected by a
>> + * per-slot spinlock.
>> + */
>> +typedef struct
>> +{
>> +pid_t   pid;
> 
> This is the pid of the receiving process? If so, why do we need this in
> here? That's just duplicated data, no?

Correct, we need that for finding the right slot in backend_feedback() unless
I’m missing something?

>> +slock_t mutex;
>> +charmessage[MAX_CANCEL_MSG];
>> +int len;
>> +int sqlerrcode;
> 
> I'd vote for including the pid of the process that did the cancelling in
> here.

Did that for the above discussed log message change.

>> [ .. ]
> 
> Why isn't this just one function? Given that you already have a PG_NARGS
> check, I don't quite see the point?

I was trying to retain STRICT on pg_cancel_backend(int4) for no good reason,
and I’m not entirely sure what the reason was anymore.  Fixed in the attached
version.

>> [ .. ]
> 
> I'm not quite seeing the point of these variants.  What advantage are
> they over just doing the same on the caller level?

It seemed a cleaner API to provide these for when the caller just wants to
provide messaging (which is where I started this a long time ago).  I’m not wed
to the idea at all though; they are clearly just for convenience.

>> +MemSet(slot->message, '\0', sizeof(slot->message));
>> +memcpy(slot->message, message, len);
> 
> This seems unnecessarily expensive.

My goal was to safeguard against the risk of fragments from an undelivered
message being delivered to another backend, which might be overly cautious.
Using strlcpy() instead to guarantee termination is perhaps enough?

>> +slot->len = len;
>> +slot->sqlerrcode = sqlerrcode;
>> +SpinLockRelease(&slot->mutex);
>> +
>> +if (len != strlen(message))
>> +ereport(NOTICE,
>> +(errmsg("message is too long and has been truncated")));
>> +return 0;
>> +}
>> +}
> 
> So we made cancellation an O(N) proposition :/. Probably not too bad, but
> still…

When setting a message to be passed to the backend, yes, with an upper bound of
MaxBackends.

>> [ .. ]
> 
> So we now do log spam if processes exit in the wrong moment?

Yeah, that was rather pointless. Removed in the attached version.

> I'm somewhat uncomfortable with acquiring a mutex in an error path.  We
> used to process at least some interrupts in signal handlers in some
> cases - I think I removed all of them, but if not, we'd be in deep
> trouble here.  Wonder if we shouldn't try to get around needing that…

I’m all ears if you have a better idea, this was the only option I could come
up with.

cheers ./daniel



0001-Refactor-backend-signalling-code-v12.patch
Description: Binary data


0002-Support-optional-message-in-backend-cancel-terminate-v12.patch
Description: Binary data


Re: BUG #15237: I got "ERROR: source for a multiple-column UPDATE item must be a sub-SELECT or ROW() expression"

2018-06-14 Thread Robert Haas
On Wed, Jun 13, 2018 at 12:48 AM, Andrew Gierth
 wrote:
> The original patch, a self-described "lazy-sunday project", made
> absolutely no mention of any compatibility issue, simply discussed the
> new options it was allowing, and got no feedback or review. Had it
> mentioned or even hinted that existing queries might be broken, I would
> hope that there would have been more discussion at the time.

I think that is a fair criticism.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Shared access methods?

2018-06-14 Thread Alvaro Herrera
On 2018-Jun-14, Andres Freund wrote:

> But I do think there's a few things that are doable without actually
> needing to invoke any user defined code aside of the AM code
> itself. E.g. heap pruning / aggressively setting hint bits doesn't need
> to invoke operators, and I can think of some ways to implement index
> delete marking that does so without invoking any comparators either.

So what you want to do is have bgwriter/checkpointer able to scan some
catalog and grab a function pointer that can "execute pruning on this
shared buffer", right?  For that maybe we need to split out a part of
AMs that is storage-level and another one that is data-level.  So an
access method would create two catalog entries, one of which is shared
(pg_shared_am? ugh) and the other is the regular one we already have in
pg_am.  The handler function in pg_shared_am gives you functions that
can only do storage-level stuff such as hint bit setting, page pruning,
tuple freezing, CRC, etc., which does not require access to the data
itself.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Portability concerns over pq_sendbyte?

2018-06-14 Thread Andres Freund
Hi,

On 2018-06-14 16:17:28 -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
> > How about not renaming the functions, but just change argument types?

Yea, I'm in favor of this. I don't think the 'u' in there would benefit
us, and the cast from signed to unsigned is well defined, so it's safe
to call the functions with signed input.


> Yeah, I didn't understand why anything else would be on the table.

Because it sounds a bit weird for a function with just 'int' in the name
to take unsigned ints as a parameter? I don't think that's enough
justification to rename everything, but it's not completely crazy either.

> We already changed their arg types for 11, no?

No, not really. We added new functions.

Greetings,

Andres Freund



Re: Portability concerns over pq_sendbyte?

2018-06-14 Thread Tom Lane
Alvaro Herrera  writes:
> How about not renaming the functions, but just change argument types?

Yeah, I didn't understand why anything else would be on the table.
We already changed their arg types for 11, no?  This is just second
thoughts about what to change them to.

regards, tom lane



Re: Microoptimization of Bitmapset usage in postgres_fdw

2018-06-14 Thread Bossart, Nathan
Thanks for the updated patch set.

On 6/14/18, 2:34 PM, "Daniel Gustafsson"  wrote:
>> 2) BuildRelationExtStatistics() in extended_stats.c.
>> 
>>  /* check allowed number of dimensions */
>>  Assert(bms_num_members(stat->columns) >= 2 &&
>> bms_num_members(stat->columns) <= STATS_MAX_DIMENSIONS);
>
> Since this usage is in an assertion I don’t see the value in changing it as 
> the
> current programming is more optimized for readability.

Agreed.  I hesitated to even point this one out.

I'll go ahead and mark this as Ready for Committer.

Nathan



Re: Shared access methods?

2018-06-14 Thread Tom Lane
Andres Freund  writes:
> On 2018-06-14 15:59:22 +0300, Alexander Korotkov wrote:
>> We already have CREATE ACCESS METHOD command.  I think this command
>> should handle that internally.  And I don't understand why "ON
>> CONFLICT DO NOTHING".  If AM with given name already exists in pg_am,
>> why should we ignore the error?

> Well, right now an AM containing extension creates things in each
> database (i.e. same scope as extensions). But with shared AMs that
> wouldn't be the case - you might still want to create the extension in
> another database.  So we'd need to have CREATE ACCESS METHOD check
> whether there already is the same entry, and only delete it on DROP ACCESS
> METHOD if there are no dependencies from other databases...

I'm not really buying this idea at all, at least not for index AMs,
because you also need a pile of other database-local infrastructure
--- opclasses, operators, functions, etc.  Trying to make pieces of
that be shared is not going to end well.

regards, tom lane



Re: Transform for pl/perl

2018-06-14 Thread Alvaro Herrera
On 2018-Jun-08, Tom Lane wrote:

> I wrote:
> > I'm inclined to think that auto-dereference is indeed a good idea,
> > and am tempted to go make that change to make all this consistent.
> > Comments?
> 
> Here's a draft patch for that.  I ended up only changing
> plperl_sv_to_datum.  There is maybe a case for doing something
> similar in plperl_return_next_internal's fn_retistuple code path,
> so that you can return a reference-to-reference-to-hash there;
> but I was unable to muster much enthusiasm for touching that.

ilmari, did you have time to give Tom's patch a spin?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Portability concerns over pq_sendbyte?

2018-06-14 Thread Alvaro Herrera
Hello

How about not renaming the functions, but just change argument types?
Having to change all callsites to cope with some new naming convention
does not strike me as a great idea ... it may collide with any
backpatching in the area, for one.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Microoptimization of Bitmapset usage in postgres_fdw

2018-06-14 Thread Daniel Gustafsson
> On 14 Jun 2018, at 16:56, Nathan Bossart  wrote:

> The v2 patches look good to me.  However, I found a couple other
> places where we might be able to use this micro-optimization.

Thanks a lot for your review!

> 1) dependencies_clauselist_selectivity() in dependencies.c
> 
>   /*
>* If there's not at least two distinct attnums then reject the whole 
> list
>* of clauses. We must return 1.0 so the calling function's selectivity 
> is
>* unaffected.
>*/
>   if (bms_num_members(clauses_attnums) < 2)
>   {
>   pfree(list_attnums);
>   return 1.0;
>   }

I agree with this one; not sure why I missed that when grep’ing around while
writing the patch.  Fixed in the attached v3 patchset (whose files are now
awkwardly named, but I didn’t change that).

> 2) BuildRelationExtStatistics() in extended_stats.c.
> 
>   /* check allowed number of dimensions */
>   Assert(bms_num_members(stat->columns) >= 2 &&
>  bms_num_members(stat->columns) <= STATS_MAX_DIMENSIONS);

Since this usage is in an assertion I don’t see the value in changing it as the
current programming is more optimized for readability.

cheers ./daniel



bms_num_members_comment-v3.patch
Description: Binary data


postgres_fdw_bms_multiple-v3.patch
Description: Binary data


Re: Few cosmetic suggestions for commit 16828d5c (Fast Alter Table Add Column...)

2018-06-14 Thread Andrew Dunstan



Somehow I missed this thread ...



On 06/14/2018 02:01 PM, Alvaro Herrera wrote:

On 2018-Jun-14, Amit Kapila wrote:


On Sun, Jun 3, 2018 at 5:08 AM, Amit Kapila  wrote:

2.
+/*
+ * Structure used to represent value to be used when the attribute is not
+ * present at all in a tuple, i.e. when the column was created after the tuple
+ */
+
+typedef struct attrMissing
+{
+   bool    ammissingPresent;   /* true if non-NULL missing value exists */
+   Datum   ammissing;  /* value when attribute is missing */
+} AttrMissing;
+

As nobody responded, it seems that the variable naming pointed above
is okay, but in any case, I think we should fix cosmetic changes
proposed.  I will commit the patch unless you or someone thinks that
we should change the name of the variable.

We used to use prefixes for common struct members names to help
disambiguate across members that would otherwise have identical names in
different structs.  Our convention was to use _ as a separator.  This
convention has been partially lost, but seems we can use it to good
effect here, by renaming ammissingPresent to am_present and ammissing to
am_missing (I would go as far as suggesting am_default or am_substitute
or something like that).



am_present and am_value perhaps? I'm not dogmatic about it.





BTW I think "the result stored" is correct English.



Yes, it certainly is.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Attempt to fix inheritance limitations: unique and foreign key constraints

2018-06-14 Thread Robert Haas
On Tue, Jun 12, 2018 at 11:28 AM, Raphael Medaer  wrote:
> Hi pg-hackers,
>
> I'm trying to fix some of limitations in table inheritance. My first use
> case concerns "referencing" foreign keys in parent table.
>
> The attached patch propagates foreign keys to inherited tables. I'm
> automatically cloning foreign keys from the parent table into the children
> (with the right dependencies). It's a native implementation of the
> work-around described in documentation section 5.9.1 (Caveats).
>
> To be honest it's my first pg hack. Thus it might contain (serious) issues
> (please be lenient with it).
> Furthermore it's not yet documented (TODO !!)
>
> As far as I have tested it works pretty well; the internal triggers are
> installed on CREATE TABLE .. INHERITS and even after ALTER TABLE .. INHERIT.
>
> If you have a few minutes to challenge this hack and give your advice, be my
> guest!
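
(For readers unfamiliar with the caveat being referenced: foreign key
constraints are not inherited by child tables, so with plain inheritance a
sketch like the following, with made-up tables, leaves child rows unchecked;
the patch proposes to clone the parent's foreign key onto each child.)

create table colors (name text primary key);
create table items  (color text references colors (name));
create table special_items (note text) inherits (items);

-- Not checked against colors today, because special_items did not inherit
-- the foreign key; with the patch it would be rejected:
insert into special_items (color, note) values ('no-such-color', 'oops');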

Sounds very similar to commit 3de241dba86f3dd000434f70aebba725fb928032.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WAL prefetch

2018-06-14 Thread Stephen Frost
Greetings,

* Konstantin Knizhnik (k.knizh...@postgrespro.ru) wrote:
> There was very interesting presentation at pgconf about pg_prefaulter:
> 
> http://www.pgcon.org/2018/schedule/events/1204.en.html

I agree and I've chatted a bit w/ Sean further about it.

> But it is implemented in GO and using pg_waldump.

Yeah, that's not too good if we want it in core.

> I tried to do the same but using built-on Postgres WAL traverse functions.
> I have implemented it as extension for simplicity of integration.
> In principle it can be started as BG worker.

I don't think this needs to be, or should be, an extension..  If this is
worthwhile (and it certainly appears to be) then we should just do it in
core.

> First of all I tried to estimate the effect of preloading data.
> I have implemented a prefetch utility which is also attached to this mail.
> It performs random reads of blocks of some large file and spawns some number
> of prefetch threads:
> 
> Just normal read without prefetch:
> ./prefetch -n 0 SOME_BIG_FILE
> 
> One prefetch thread which uses pread:
> ./prefetch SOME_BIG_FILE
> 
> One prefetch thread which uses posix_fadvise:
> ./prefetch -f SOME_BIG_FILE
> 
> 4 prefetch threads which use posix_fadvise:
> ./prefetch -f -n 4 SOME_BIG_FILE
> 
> Based on this experiments (on my desktop), I made the following conclusions:
> 
> 1. Prefetch on HDD doesn't give any positive effect.
> 2. Using posix_fadvise speeds up random reads on SSD by up to 2
> times.
> 3. posix_fadvise(WILLNEED) is more efficient than performing normal reads.
> 4. Calling posix_fadvise from more than one thread makes no sense.

Ok.

> I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
> RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
> The speed of synchronous replication between two nodes is increased from 56k
> TPS to 60k TPS (on pgbench with scale 1000).

I'm also surprised that it wasn't a larger improvement.

Seems like it would make sense to implement in core using
posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
or nearby..  At least, that's the thinking I had when I was chatting w/
Sean.

Thanks!

Stephen


signature.asc
Description: PGP signature


Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-14 Thread Peter Geoghegan
I've been thinking about using heap TID as a tie-breaker when
comparing B-Tree index tuples for a while now [1]. I'd like to make
all tuples at the leaf level unique, as assumed by L&Y.  This can
enable "retail index tuple deletion", which I think we'll probably end
up implementing in some form or another, possibly as part of the zheap
project. It's also possible that this work will facilitate GIN-style
deduplication based on run length encoding of TIDs, or storing
versioned heap TIDs in an out-of-line nbtree-versioning structure
(unique indexes only). I can see many possibilities, but we have to
start somewhere.

I attach an unfinished prototype of suffix truncation, that also
sometimes *adds* a new attribute in pivot tuples. It adds an extra
heap TID from the leaf level when truncating away non-distinguishing
attributes during a leaf page split, though only when it must. The
patch also has nbtree treat heap TID as a first class part of the key
space of the index. Claudio wrote a patch that did something similar,
though without the suffix truncation part [2] (I haven't studied his
patch, to be honest). My patch is actually a very indirect spin-off of
Anastasia's covering index patch, and I want to show what I have in
mind now, while it's still swapped into my head. I won't do any
serious work on this project unless and until I see a way to implement
retail index tuple deletion, which seems like a multi-year project
that requires the buy-in of multiple senior community members. On its
own, my patch regresses performance unacceptably in some workloads,
probably due to interactions with kill_prior_tuple()/LP_DEAD hint
setting, and interactions with page space management when there are
many "duplicates" (it can still help performance in some pgbench
workloads with non-unique indexes, though).

Note that the approach to suffix truncation that I've taken isn't even
my preferred approach [3] -- it's a medium-term solution that enables
making a heap TID attribute part of the key space, which enables
everything else. Cheap incremental/retail tuple deletion is the real
prize here; don't lose sight of that when looking through my patch. If
we're going to teach nbtree to truncate this new implicit heap TID
attribute, which seems essential, then we might as well teach nbtree
to do suffix truncation of other (user-visible) attributes while we're
at it. This patch isn't a particularly effective implementation of
suffix truncation, because that's not what I'm truly interested in
improving here (plus I haven't even bothered to optimize the logic for
picking a split point in light of suffix truncation).

amcheck
===

This patch adds amcheck coverage, which seems like essential
infrastructure for developing a feature such as this. Extensive
amcheck coverage gave me confidence in my general approach. The basic
idea, invariant-wise, is to treat truncated attributes (often
including a truncated heap TID attribute in internal pages) as "minus
infinity" attributes, which participate in comparisons if and only if
we reach such attributes before the end of the scan key (a smaller
keysz for the index scan could prevent this). I've generalized the
minus infinity concept that _bt_compare() has always considered as a
special case, extending it to individual attributes. It's actually
possible to remove that old hard-coded _bt_compare() logic with this
patch applied without breaking anything, since we can rely on the
comparison of an explicitly 0-attribute tuple working the same way
(pg_upgrade'd databases will break if we do this, however, so I didn't
go that far).

Note that I didn't change the logic that has _bt_binsrch() treat
internal pages in a special way when tuples compare as equal. We still
need that logic for cases where keysz is less than the number of
indexed columns. It's only possible to avoid this _bt_binsrch() thing
for internal pages when all attributes, including heap TID, were
specified and compared (an insertion scan key has to have an entry for
every indexed column, including even heap TID). Doing better there
doesn't seem worth the trouble of teaching _bt_compare() to tell the
_bt_binsrch() caller about this as a special case. That means that we
still move left on equality in some cases where it isn't strictly
necessary, contrary to L&Y.  However, amcheck verifies that the classic
"Ki < v <= Ki+1" invariant holds (as opposed to "Ki <= v <= Ki+1")
when verifying parent/child relationships, which demonstrates that I
have restored the classic invariant (I just don't find it worthwhile
to take advantage of it within _bt_binsrch() just yet).

Most of this work was done while I was an employee of VMware, though I
joined Crunchy data on Monday and cleaned it up a bit more since then.
I'm excited about joining Crunchy, but I should also acknowledge
VMware's strong support of my work.

[1] 

Re: commitfest 2018-07

2018-06-14 Thread Stephen Frost
Greetings,

* Andrew Dunstan (andrew.duns...@2ndquadrant.com) wrote:
> Well, this went quiet. I'm happy to be CFM and assisted by Michael and
> Ashutosh
> 
> Are there any privileges required that I should see about obtaining?

I've set you up with the cf admin privs (which Michael also has).

Please let me know if you need anything further.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Few cosmetic suggestions for commit 16828d5c (Fast Alter Table Add Column...)

2018-06-14 Thread Alvaro Herrera
On 2018-Jun-14, Amit Kapila wrote:

> On Sun, Jun 3, 2018 at 5:08 AM, Amit Kapila  wrote:

> > 2.
> > +/*
> > + * Structure used to represent value to be used when the attribute is not
> > + * present at all in a tuple, i.e. when the column was created after the tuple
> > + */
> > +
> > +typedef struct attrMissing
> > +{
> > +   bool    ammissingPresent;   /* true if non-NULL missing value exists */
> > +   Datum   ammissing;  /* value when attribute is missing */
> > +} AttrMissing;
> > +

> As nobody responded, it seems that the variable naming pointed above
> is okay, but in any case, I think we should fix cosmetic changes
> proposed.  I will commit the patch unless you or someone thinks that
> we should change the name of the variable.

We used to use prefixes for common struct members names to help
disambiguate across members that would otherwise have identical names in
different structs.  Our convention was to use _ as a separator.  This
convention has been partially lost, but seems we can use it to good
effect here, by renaming ammissingPresent to am_present and ammissing to
am_missing (I would go as far as suggesting am_default or am_substitute
or something like that).

BTW I think "the result stored" is correct English.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: commitfest 2018-07

2018-06-14 Thread Andrew Dunstan




On 06/07/2018 09:01 AM, Ashutosh Bapat wrote:

On Thu, Jun 7, 2018 at 8:38 AM, Jonathan S. Katz  wrote:

On Jun 6, 2018, at 8:14 PM, Michael Paquier  wrote:

On Wed, Jun 06, 2018 at 12:40:40PM -0400, Andrew Dunstan wrote:

I'll volunteer for CFM, which seems appropriate since I was one of the
supporters of having an extra CF.

I don't mind helping out either.  There are many patches to handle.

+1 to either both of you working together or one of you taking the charge.

Doing a scan, there are currently 152 patches in there and presumably
more on the way. Scanning through the most recent CFs it appears # patches
ranged from 150-250, and conceivably the lower end of that # is still a lot to
manage for one person.

Count on me if you need more helping hands.




Well, this went quiet. I'm happy to be CFM and assisted by Michael and 
Ashutosh


Are there any privileges required that I should see about obtaining?

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Index maintenance function for BRIN doesn't check RecoveryInProgress()

2018-06-14 Thread Alvaro Herrera
On 2018-Jun-14, Michael Paquier wrote:

> On Thu, Jun 14, 2018 at 02:06:57AM +0900, Masahiko Sawada wrote:
> > On Thu, Jun 14, 2018 at 12:04 AM, Simon Riggs  wrote:
> >> On 13 June 2018 at 15:51, Alvaro Herrera  wrote:
> >>> I guess you could go either way ... we're just changing one unhelpful
> >>> error with a better one: there is no change in behavior.  I would
> >>> backpatch this, myself, and avoid the code divergence.
> >>
> >> WAL control functions all say the same thing, so we can do that here also.
> > 
> > +1
> 
> +1 for a back-patch to v10.  The new error message brings clarity in my
> opinion.

Pushed, backpatched to 9.5.  Thanks!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Shared access methods?

2018-06-14 Thread Andres Freund
On 2018-06-14 15:59:22 +0300, Alexander Korotkov wrote:
> > b) extensions containing AMs would need to do something INSERT ... ON
> >CONFLICT DO NOTHING like.
> 
> We already have CREATE ACCESS METHOD command.  I think this command
> should handle that internally.  And I don't understand why "ON
> CONFLICT DO NOTHING".  If AM with given name already exists in pg_am,
> why should we ignore the error?

Well, right now an AM containing extension creates things in each
database (i.e. same scope as extensions). But with shared AMs that
wouldn't be the case - you might still want to create the extension in
another database.  So we'd need to have CREATE ACCESS METHOD check
whether already is the same entry, and only delete it on DROP ACCESS
METHOD if there's no dependencies from other databases...

Greetings,

Andres Freund



Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-14 Thread Tom Lane
I wrote:
> This appears to be the fault of commit ab7271677, whose authors I've cc'd:
> the stanza starting at about allpaths.c:1672 is bullheadedly creating a
> parallel path whether that's allowed or not.  Fixing it might be as simple
> as adding "rel->consider_parallel" to the conditions there.

Oh, and while I'm bitching: the same commit has created this exceedingly
dangerous coding pattern in create_append_path:

pathnode->subpaths = list_concat(subpaths, partial_subpaths);

foreach(l, subpaths)
{
Path   *subpath = (Path *) lfirst(l);

Because list_concat() modifies its first argument, this will usually
have the effect that the foreach traverses the partial_subpaths as
well as the main subpaths.  But it's very unclear to the reader whether
that's intended or not.  Worse, if we had *only* partial subpaths so
that subpaths was initially NIL, then the loop would fail to traverse
the partial subpaths, resulting in inconsistent behavior.

It looks to me like traversal of the partial subpaths is the right
thing here, in which case we should do

-   foreach(l, subpaths)
+   foreach(l, pathnode->subpaths)

or perhaps better

-   pathnode->subpaths = list_concat(subpaths, partial_subpaths);
+   pathnode->subpaths = subpaths = list_concat(subpaths, partial_subpaths);

to make the behavior clear and consistent.  But not being familiar
with the partial-subpaths morass, I'm not sure.

regards, tom lane



Re: AtEOXact_ApplyLauncher() and subtransactions

2018-06-14 Thread Alvaro Herrera
On 2018-Jun-05, Amit Khandekar wrote:

> When a SUBSCRIPTION is altered, then the currently running
> table-synchronization workers that are no longer needed for the
> altered subscription, are terminated. This is done by the function
> AtEOXact_ApplyLauncher() inside CommitTransaction(). So during each
> ALTER-SUBSCRIPTION command, the on_commit_stop_workers list is
> appended with new set of workers to be stopped. And then at commit,
> AtEOXact_ApplyLauncher() stops the workers in the list.
> 
> But there is no handling for sub-transaction abort.

Peter, any comments here?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-14 Thread Tom Lane
Rajkumar Raghuwanshi  writes:
> I am getting a server crash for below test case.

> postgres=# CREATE TABLE test( c1 int, c2 int, c3 text) partition by
> range(c1);
> CREATE TABLE
> postgres=# create table test_p1 partition of test for values from
> (minvalue) to (0);
> CREATE TABLE
> postgres=# create table test_p2 partition of test for values from (0) to
> (maxvalue);
> CREATE TABLE
> postgres=#
> postgres=# select (select max((select t1.c2 from test t1 where t1.c1 =
> t2.c1))) from test t2;
> server closed the connection unexpectedly

Reproduced here.  The assert seems to be happening because
add_paths_to_append_rel is trying to create a parallel path for
an appendrel that is marked consider_parallel = false.

This appears to be the fault of commit ab7271677, whose authors I've cc'd:
the stanza starting at about allpaths.c:1672 is bullheadedly creating a
parallel path whether that's allowed or not.  Fixing it might be as simple
as adding "rel->consider_parallel" to the conditions there.

regards, tom lane



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-14 Thread Masahiko Sawada
On Wed, Jun 13, 2018 at 10:20 PM, Joe Conway  wrote:
> On 06/11/2018 05:22 AM, Masahiko Sawada wrote:
>> As per discussion at PGCon unconference, I think that firstly we need
>> to discuss what threats we want to defend database data against.
>
> Exactly. While certainly there is demand for encryption for the sake of
> "checking a box", different designs will defend against different
> threats, and we should be clear on which ones we are trying to protect
> against for any particular design.
>
>> Also, if I understand correctly, at unconference session there also
>> were two suggestions about the design other than the suggestion by
>> Alexander: implementing TDE at column level using POLICY, and
>> implementing TDE at table-space level. The former was suggested by Joe
>> but I'm not sure the detail of that suggestion. I'd love to hear the
>> deal of that suggestion.
>
> The idea has not been extensively fleshed out yet, but the thought was
> that we create column level POLICY, which would transparently apply some
> kind of transform on input and/or output. The transforms would
> presumably be expressions, which in turn could use functions (extension
> or builtin) to do their work. That would allow encryption/decryption,
> DLP (data loss prevention) schemes (masking, redacting), etc. to be
> applied based on the policies.

That seems like a good idea. Where does this design encrypt data: only
in the buffers, or in both the buffers and on disk? And does this
design (per-column encryption) aim to satisfy some specific security
compliance requirement?

> This, in and of itself, would not address key management. There is
> probably a separate need for some kind of built in key management --
> perhaps a flexible way to integrate with external systems such as Vault
> for example, or maybe something self contained, or perhaps both.

I agree that we need a flexible way in order to address different
requirements. I had thought that having a GUC parameter storing a shell
command to fetch the encryption key would be enough, but considering
seamless integration with various key management systems, I think we
need to have APIs for key management (fetching a key, storing a key,
generating a key, etc.).
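To illustrate, a purely hypothetical sketch of such an API surface (all
names are invented; nothing here reflects an agreed design):

    typedef struct KeyManagementRoutines
    {
        bool    (*fetch_key) (const char *key_id, uint8 *key_buf, Size *key_len);
        bool    (*store_key) (const char *key_id, const uint8 *key, Size key_len);
        bool    (*generate_key) (const char *key_id, Size key_len);
    } KeyManagementRoutines;

One built-in implementation could then shell out to a configured
command, while another could talk to an external system such as Vault.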

> Or
> maybe key management is really tied into the separately discussed effort
> to create SQL VARIABLEs somehow.
>

Could you elaborate on how key management is tied into SQL VARIABLEs?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Locking B-tree leafs immediately in exclusive mode

2018-06-14 Thread Alexander Korotkov
On Thu, Jun 14, 2018 at 6:32 PM Peter Geoghegan  wrote:
> On Thu, Jun 14, 2018 at 5:43 AM, Alexander Korotkov
>  wrote:
> > However, that doesn't
> > look like inevitable shortcoming, because we could store heap TID in
> > t_tid of pivot index tuples.
>
> But the offset doesn't have enough space for an entire TID.

Sorry, my bad.  It slipped my mind that we use t_tid.ip_blkid of pivot
tuples to store downlinks :)
So, yes, an additional attribute is required to store the heap TID in a
pivot tuple.
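For reference, here is why neither half of t_tid works on its own
(paraphrased from src/include/storage/itemptr.h; a heap TID is six
bytes in total):

    typedef struct ItemPointerData
    {
        BlockIdData  ip_blkid;  /* 4 bytes: block number; pivot tuples already
                                 * use this half for the downlink */
        OffsetNumber ip_posid;  /* 2 bytes: offset number; too small to hold
                                 * a whole 6-byte heap TID */
    } ItemPointerData;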

Anyway, I'm looking forward to seeing your patch posted, even if it's
not yet in perfect shape.


--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-14 Thread Rui DeSousa


> On Jun 14, 2018, at 9:19 AM, Robert Haas  wrote:
> 
>  anyone who wants a BEFORE trigger has a good reason
> for wanting it.

I have used BEFORE triggers to enforce the immutability of a column.

i.e.

  if (new.member_key != old.member_key) then
    raise exception 'Unable to change member_key, column is immutable';
  end if;



Re: Locking B-tree leafs immediately in exclusive mode

2018-06-14 Thread Peter Geoghegan
On Thu, Jun 14, 2018 at 5:43 AM, Alexander Korotkov
 wrote:
> Could you, please, clarify what do you mean by "fan-in"?  As I
> understood, higher "fan-in" means more children on single non-leaf
> page, and in turn "better branching".  Is my understanding correct?

Yes, your understanding is correct.

> If my understanding is correct, then making leaf level truly unique
> without suffix truncation will kill fan-in, because additional heap
> TID attribute will increase pivot tuple size.

You're right that that's theoretically possible. However, in practice
we can almost always truncate away the heap TID attribute from pivot
tuples. We only need the heap TID attribute when there is no other
distinguishing attribute.

> However, that doesn't
> look like inevitable shortcoming, because we could store heap TID in
> t_tid of pivot index tuples.

But the offset doesn't have enough space for an entire TID.

> So, my idea that it's not necessary to couple suffix truncation with
> making leaf tuples unique.

It seems like by far the simplest way. I don't really care about
suffix truncation, though. I care about making the leaf tuples unique.
Suffix truncation can help.

> Regarding just making leaf tuples unique,
> I understand that it wouldn't be always win.  When we have a lot of
> duplicates, current implementation searches among the pages containing
> those duplicates for the first page containing enough of free space.
> While with unique index, we would have to always insert into
> particular page.  Thus, current implementation gives us better filling
> of pages, and (probably) faster insertion.

You're definitely right that that's the weakness of the design.

>  But, unique index would
> give us much better performance of retail tuple delete and update
> (currently we don't support both).  But I believe that future
> pluggable table access methods will use both.

Right.

> Therefore, my idea is that there is a tradeoff between making btree
> index unique or non-unique.  I think we will need an option for that.
> I'm not yet sure if this option should be user-visible or
> auto-decision based on table access method used.

I agree.

-- 
Peter Geoghegan



Re: [GSoC] current working status

2018-06-14 Thread Charles Cui
Good idea; I will keep a feature branch and set up CI.

On Thu, Jun 14, 2018, 8:03 AM Aleksander Alekseeev <
a.aleks...@postgrespro.ru> wrote:

> Hello Charles,
>
> >I saw the list of errors you posted. That's because I have some new
> > function implemented and pushed without testing(in order forget my
> > change in local machine).
> > So you are saying it is best to keep each commit workable. Right?
>
> Ideally, yes. You can always create a separate branch to experiment with
> new code and merge it to master branch when it becomes stable. This is
> not a strict GSoC requirement or something like this, but a good
> practice. If something goes wrong you can just delete an experimental
> branch and go back to working master branch.
>
>
> --
> Best regards,
> Aleksander Alekseev
>


Re: [GSoC] current working status

2018-06-14 Thread Aleksander Alekseev
Hello Charles,

>I saw the list of errors you posted. That's because I have some new
> function implemented and pushed without testing(in order forget my
> change in local machine).
> So you are saying it is best to keep each commit workable. Right?

Ideally, yes. You can always create a separate branch to experiment
with new code and merge it into the master branch when it becomes
stable. This is not a strict GSoC requirement or anything like that,
but it is good practice. If something goes wrong you can just delete
the experimental branch and go back to a working master branch.


-- 
Best regards,
Aleksander Alekseev



Re: ntile() throws ERROR when hashagg is false

2018-06-14 Thread Tom Lane
David Rowley  writes:
> On 14 June 2018 at 18:57, Andrew Gierth  wrote:
>> What I think pg is actually doing is taking the value of the ntile()
>> argument from the first row and using that for the whole partition.

Yes, easily verified by looking at window_ntile(): the argument is only
examined on first call.
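Schematically, it does something like this (field and variable names
here are illustrative, not copied from windowfuncs.c):

    ntile_context *context;
    bool        isnull;

    context = (ntile_context *) WinGetPartitionLocalMemory(winobj,
                                                           sizeof(ntile_context));
    if (!context->bucket_count_set)
    {
        /* first call in this partition: evaluate the argument once and cache it */
        context->nbuckets = DatumGetInt32(WinGetFuncArgCurrent(winobj, 0, &isnull));
        context->bucket_count_set = true;
    }

so later rows in the partition never look at the argument again.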

> I wonder if it would be worth adding a run-time check in
> window_ntile() that causes an ERROR on first call if there are any
> Vars or PARAM_EXEC Params in the function argument. An ERROR might be
> better than doing something that the user does not expect.

-1, that would break cases that are legal and useful, such as where a
PARAM_EXEC Param represents an outer-query-level variable, while still
failing to catch some problematic cases (e.g., volatile functions).
I think also that there are cases that are not legal per spec but can
still be useful, as long as the user knows what they're doing.

It might be worth some documentation changes though.

regards, tom lane



Re: [GSoC] current working status

2018-06-14 Thread Charles Cui
Hi Aleksander,

   I saw the list of errors you posted. That's because I implemented
some new functions and pushed them without testing (so that I would not
lose the changes on my local machine).
So you are saying it is best to keep each commit workable, right?

Thanks Charles.

2018-06-14 1:41 GMT-07:00 Aleksander Alekseeev :

> Hello Charles,
>
> >The first evaluation is coming. Here is my progress so far. During
> > the first stage of work, I have implemented the thrift binary
> > protocol as the format of postgresql plugin. Currently, the main
> > interface is byte. Users who use this plugin need to provide thrift
> > bytes to the plugin, and there are helpers for each data type to
> > parse out the value contained in the bytes. This method has been
> > verified by the use of several unit tests. However, the current
> > interface needs users understand thrift format very well to use this
> > plugin. After discussing with my mentors, it will be more useful to
> > implement the concept of "thrift type", which is a custom type where
> > user provides user understandable format(e.g., json), then the plugin
> > converts into byte. I am currently busy with implementing this
> > feature and still need sometime to complete it. If this part is done,
> > I will go ahead to implement thrift compact protocol.  Let me know if
> > you have comments. You can always track the progress from github
> > ( https://github.com/charles-cui/pg_thrift)
>
> Thanks for keeping us informed!
>
> Unfortunately I'm having difficulties compiling your code on Linux with
> PostgreSQL 10.4 or 11 and GCC 8.1.1. Here is a full list of warnings
> and errors: https://afiskon.ru/s/6e/edbefe818e_pg-thrift-errors.txt
>
> I also tried a VM with Ubuntu 16.04, PostgreSQL 9.6 and GCC 5.4.0. No
> luck either. Please make sure that your code compiles on Linux. Users
> will most likely try to run it on this OS.
>
> It might be also a good idea to add your repository to travic-ci.org
> service.
>
> --
> Best regards,
> Aleksander Alekseev
>


Re: Microoptimization of Bitmapset usage in postgres_fdw

2018-06-14 Thread Nathan Bossart
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Hello,

The v2 patches look good to me.  However, I found a couple other
places where we might be able to use this micro-optimization.

1) dependencies_clauselist_selectivity() in dependencies.c

    /*
     * If there's not at least two distinct attnums then reject the whole list
     * of clauses. We must return 1.0 so the calling function's selectivity is
     * unaffected.
     */
    if (bms_num_members(clauses_attnums) < 2)
    {
        pfree(list_attnums);
        return 1.0;
    }

2) BuildRelationExtStatistics() in extended_stats.c.

/* check allowed number of dimensions */
Assert(bms_num_members(stat->columns) >= 2 &&
   bms_num_members(stat->columns) <= STATS_MAX_DIMENSIONS);
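In both places the substitution would presumably be the same one
proposed for postgres_fdw, i.e. replacing a full member count with the
cheaper membership classification; a sketch for the first spot:

    -   if (bms_num_members(clauses_attnums) < 2)
    +   if (bms_membership(clauses_attnums) != BMS_MULTIPLE)

(The Assert in the second spot could only be half converted, since its
upper-bound check still needs an actual count.)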

Nathan

The new status of this patch is: Waiting on Author


Re: Partitioning with temp tables is broken

2018-06-14 Thread Tom Lane
Ashutosh Bapat  writes:
> If I am reading Tom's reply upthread correctly, we should not allow
> creating a temporary partitioned table as well as temporary partitions
> altogether.

I think that if possible, we should still allow a partitioned table
in which all the rels are temp tables of the current session.  What we
have to disallow is (a) temp/permanent mixes and (b) temp tables from
different sessions.

regards, tom lane



Re: why partition pruning doesn't work?

2018-06-14 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jun 14, 2018 at 7:23 AM, David Rowley
>  wrote:
>> However, I only spent about 10 mins looking into this, there may be
>> some giant holes in the idea.  It would need much more research.

> It kind of flies in the face of the idea that a RangeTblEntry is just
> a node that can be freely copied around, serialized and deserialized,
> etc.

And also the idea that the Plan tree is read-only to the executor,
which is not a good property to give up.

> I think it would be better to keep the pointer in the RelOptInfo in
> the planner and in the EState or PlanState in the executor.  Those are
> things we don't think can be copied, serialized, etc.

Yeah, RelOptInfo seems like the natural place in the planner; we might
need index relcache links in IndexOptInfo, too.

I'm less sure what to do in the executor.  We already do keep open
relation pointers in PlanStates; the problem is just that it's
node-type-specific (ss_currentRelation, iss_RelationDesc, etc).  Perhaps
that's unavoidable and we should just add more such fields as needed.
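For reference, a paraphrased excerpt of what those node-type-specific
fields look like (see execnodes.h for the authoritative definitions;
IndexScanState spells it iss_RelationDesc instead):

    typedef struct ScanState
    {
        PlanState   ps;                   /* its first field is NodeTag */
        Relation    ss_currentRelation;   /* relation being scanned */
        HeapScanDesc ss_currentScanDesc;  /* current scan descriptor */
        TupleTableSlot *ss_ScanTupleSlot; /* slot for current scan tuple */
    } ScanState;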

regards, tom lane



Re: Locking B-tree leafs immediately in exclusive mode

2018-06-14 Thread Alexander Korotkov
On Thu, Jun 14, 2018 at 4:56 PM Claudio Freire  wrote:
> Not at all. Insertion cost in unique indexes with lots of duplicates
> (happens, dead duplicates) grows quadratically on the number of
> duplicates, and that's improved by making the index unique and sorted.

Sorry, I've messed up the terms.  I did actually compare current
non-unique indexes with non-unique indexes keeping duplicate entries
ordered by TID (which makes them somewhat unique).  I didn't really
consider indexes that enforce unique constraints.  For them, insertion
cost grows quadratically (as you mentioned) regardless of whether we
keep duplicates ordered by TID or not.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Locking B-tree leafs immediately in exclusive mode

2018-06-14 Thread Claudio Freire
On Thu, Jun 14, 2018 at 9:44 AM Alexander Korotkov
 wrote:
> > > Our B-tree is currently maintaining duplicates unordered.  So, during 
> > > insertion
> > > we can traverse rightlinks in order to find page, which would fit new
> > > index tuple.
> > > However, in this case we're traversing pages in exclusive lock mode, and
> > > that happens already after re-lock.  _bt_search() doesn't do anything 
> > > with that.
> > > So, I assume there shouldn't be any degradation in the case of many
> > > duplicate entries.
> >
> > BTW, I have a prototype patch that makes the keys at the leaf level
> > unique. It is actually an implementation of nbtree suffix truncation
> > that sometimes *adds* a new attribute in pivot tuples, rather than
> > truncating away non-distinguishing attributes -- it adds a special
> > heap TID attribute when it must. The patch typically increases fan-in,
> > of course, but the real goal was to make all entries at the leaf level
> > truly unique, as L intended (we cannot do it without suffix
> > truncation because that will kill fan-in). My prototype has full
> > amcheck coverage, which really helped me gain confidence in my
> > approach.
>
> Could you, please, clarify what do you mean by "fan-in"?  As I
> understood, higher "fan-in" means more children on single non-leaf
> page, and in turn "better branching".  Is my understanding correct?
>
> If my understanding is correct, then making leaf level truly unique
> without suffix truncation will kill fan-in, because additional heap
> TID attribute will increase pivot tuple size.  However, that doesn't
> look like inevitable shortcoming, because we could store heap TID in
> t_tid of pivot index tuples.  Yes, we've used t_tid offset for storing
> number of attributes in truncated tuples.  But heap TID is going to be
> virtually the last attribute of tuple.  So, pivot tuples containing
> heap TID are not going to be suffix-truncated anyway.  So, for such
> tuples we can just don't set INDEX_ALT_TID_MASK, and they would be
> assumed to have all the attributes.

I had a patch[1] that did pretty much just that. I saw no contention
issues or adverse effects, but my testing of it was rather limited.

The patch has rotted significantly since then, sadly, and it's quite complex.

> So, my idea that it's not necessary to couple suffix truncation with
> making leaf tuples unique.

No, but the work involved for one and the other shares so much that it
lends itself to be done in the same patch.

> Regarding just making leaf tuples unique,
> I understand that it wouldn't be always win.  When we have a lot of
> duplicates, current implementation searches among the pages containing
> those duplicates for the first page containing enough of free space.
> While with unique index, we would have to always insert into
> particular page.  Thus, current implementation gives us better filling
> of pages, and (probably) faster insertion.

Not at all. Insertion cost in unique indexes with lots of duplicates
(happens, dead duplicates) grows quadratically on the number of
duplicates, and that's improved by making the index unique and sorted.

> Therefore, my idea is that there is a tradeoff between making btree
> index unique or non-unique.  I think we will need an option for that.

We will need a flag somewhere to avoid having to require index
rebuilds during pg_upgrade. My old patch maintained backward
compatibility, but when using an older version index, the sort order
of tids could not be assumed, and thus many optimizations had to be
disabled.

But it is totally doable, and necessary unless you accept a very
painful pg_upgrade.


[1] 
https://www.postgresql.org/message-id/flat/CAGTBQpZ-kTRQiAa13xG1GNe461YOwrA-s-ycCQPtyFrpKTaDBQ%40mail.gmail.com#cagtbqpz-ktrqiaa13xg1gne461yowra-s-yccqptyfrpkta...@mail.gmail.com



Re: Fix some error handling for read() and errno

2018-06-14 Thread Robert Haas
On Mon, Jun 11, 2018 at 6:11 PM, Alvaro Herrera
 wrote:
> I would go as far as suggesting to remove qualifiers that indicate what
> the file is for (such as "relation mapping file"); relying on the path
> as indicator of what's going wrong should be sufficient, since it's an
> error that affects internals anyway, not anything that users can do much
> about.  Keep variations to a minimum, to ease translator's work;
> sometimes it's hard enough to come up with good translations for things
> like "relation mapping file" in the first place, and they don't help the
> end user.

+1.  I think those things are often hard to phrase even in English.
It makes the messages long, awkward, and almost invariably the style
differs from one message to the next depending on the author and how
easy it is to describe the type of file involved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-14 Thread Ashutosh Bapat
On Thu, Jun 14, 2018 at 6:49 PM, Robert Haas  wrote:
> On Wed, Jun 6, 2018 at 7:51 AM, Ashutosh Bapat
>  wrote:
>> On Tue, Jun 5, 2018 at 10:38 PM, Alvaro Herrera
>>  wrote:
>>
>>>  - BEFORE row triggers are not supported
>>
>> I think this is fine. The existing wording suggests that the user
>> creates the triggers on the partitioned table, and that will be
>> supported always, which can lead to problems. With this description,
>> if the user finds out that BEFORE triggers are supported on partitions
>> and creates those, and we implement something to support those on
>> partitioned table, s/he will have to change the implementation
>> accordingly.
>
> It sounds like you want to try to hide from users the fact that they
> can create triggers on the individual partitions.

No. I never said that in my mails (see [1], [2]). I object to the
explicit suggestion that users can/should create BEFORE ROW triggers on
partitions, since those are not supported on the partitioned table.

>  I think that's the
> wrong approach.  First of all, it's not going to work, because people
> are going to find out about it and do it anyway.  Secondly, the
> documentation's job is to tell you about the system's capabilities,
> not conceal them from you.

Neither is it the documentation's job to "suggest" something that can
lead to problems in the future.

[1] 
https://www.postgresql.org/message-id/cafjfprfzcvnul0l3_qafqr5xfn-eynbmow4gf-2moo7gnaz...@mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAFjFpRfzCVnuL0L3_qAFqr5xFN-EynbMoW4gf-2moO7gNazADA%40mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: WAL prefetch

2018-06-14 Thread Konstantin Knizhnik




On 14.06.2018 16:25, Robert Haas wrote:

On Thu, Jun 14, 2018 at 9:23 AM, Konstantin Knizhnik
 wrote:

Speed of random HDD access is limited by speed of disk head movement.
By running several IO requests in parallel we just increase probability of
head movement, so actually parallel access to HDD may even decrease IO speed
rather than increase it.
In theory, given several concurrent IO requests, driver can execute them in
optimal order, trying to minimize head movement. But if there are really
access to random pages,
then probability that we can win something by such optimization is very
small.

You might be right, but I feel like I've heard previous reports of
significant speedups from prefetching on HDDs.  Perhaps I am
mis-remembering.



It is true for RAIDs of HDDs, which really can win by issuing parallel
IO operations.

But there are so many different factors involved that I will not be
surprised by any result :)


The last problem I observed with an NVME device on one of the
customer's systems was a huge performance degradation (> 10 times: from
500Mb/sec to 50Mb/sec write speed) after space exhaustion on the
device. There is a 3Tb NVME RAID device with a 1.5Gb database; ext4 was
mounted without the "discard" option.
After an incorrect execution of rsync, space was exhausted. Then I
removed all the data and copied the database from the master node.
Then I observed huge lags in async replication between master and
replica: wal_receiver was saving received data too slowly, with a write
speed of about ~50Mb/sec vs. 0.5Gb/sec at the master.
All my attempts to use fstrim or e4defrag didn't help. The problem was
solved only after deleting all database files, performing fstrim and
copying the database once again.
After that wal_receiver writes data at the normal speed of ~0.5Gb/sec
and there is no lag between master and replica.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-06-14 Thread Ashutosh Bapat
On Mon, Jun 11, 2018 at 4:05 PM, Etsuro Fujita
 wrote:
> (2018/06/07 19:42), Ashutosh Bapat wrote:
>>
>> On Wed, Jun 6, 2018 at 5:00 PM, Etsuro Fujita
>>   wrote:
>>>
>>> Since I'm not 100% sure that that is the right way to go, I've been
>>> rethinking how to fix this issue.  Yet another idea I came up with to fix
>>> this is to redesign the handling of the tlists for children in the
>>> partitioning case.  Currently, we build the reltarget for a child by
>>> applying adjust_appendrel_attrs to the reltarget for its parent in
>>> set_append_rel_size, which maps a wholerow Var referring to the parent
>>> rel
>>> to a ConvertRowtypeExpr that translates a wholerow of the child rel into
>>> a
>>> wholerow of the parent rel's rowtype.  This works well for the
>>> non-partitioned inheritance case, but makes complicated the code for
>>> handling the partitioning case especially when planning
>>> partitionwise-joins.
>>> And I think this would be the root cause of this issue.
>>
>>
>> Although true, this is not only about targetlist. Even the whole-row
>> expressions in the conditions, equivalence classes and other
>> planner/optimizer data structures are translated to
>> ConvertRowtypeExpr.
>
>
> Yeah, but I mean we modify set_append_rel_size so that we only map a parent
> wholerow Var in the parent tlist to a child wholerow Var in the child's
> tlist; parent wholerow Vars in the parent's other expressions such as
> conditions are transformed into CREs as-is.

What happens to a PlaceHolderVar containing a whole-row reference when
it appears in a condition and/or targetlist?

>
>
>>>   I don't think the
>>> tlists for the children need to have their wholerows transformed into the
>>> corresponding ConvertRowtypeExprs *at this point*, so what I'd like to
>>> propose is to 1) map a parent wholerow Var simply to a child wholerow
>>> Var,
>>> instead (otherwise, the same as adjust_appendrel_attrs), when building
>>> the
>>> reltarget for a child in set_append_rel_size, 2) build the tlists for
>>> child
>>> joins using those children's wholerow Vars at path creation time, and 3)
>>> adjust those wholerow Vars in the tlists for subpaths in the chosen
>>> AppendPath so that those are transformed into the corresponding
>>> ConvertRowtypeExprs, at plan creation time (ie, in
>>> create_append_plan/create_merge_append_plan).  IIUC, this would not
>>> require
>>> any changes to pull_var_clause as proposed by the patch.  This would not
>>> require any changes to postgres_fdw as proposed by the patch, either.  In
>>> addition, this would make unnecessary the code added to setrefs.c by the
>>> partitionwise-join patch.  Maybe I'm missing something though.
>>
>>
>> Not translating whole-row expressions to ConvertRowtypeExpr before
>> creating paths can lead to a number of anomalies. For example,
>>
>> 1. if there's an index on the whole-row expression of child,
>> translating parent's whole-row expression to child's whole-row
>> expression as is will lead to using that index, when in reality the it
>> can not be used since the condition/ORDER BY clause (which originally
>> referred the parent's whole-row expression) require the child's
>> whole-row reassembled as parent's whole-row.
>> 2. Constraints on child'd whole-row expression, will be used to prune
>> a child when they can not be used since the original condition was on
>> parent' whole-row expression and not that of the child.
>> 3. Equivalence classes will think that a child whole-row expression
>> (without ConvertRowtypeExpr) is equivalent to an expression which is
>> part of the same equivalence class as the parent' whole-row
>> expression.
>
>
> Is that still possible when we only map a parent wholerow Var in the
> parent's tlist to a child wholerow Var in the child's tlist?

Yes. 1 will affect the choice of index-only scans. We use pathkey
comparisons in a number of places, so a tlist going out of sync with
the equivalence classes can affect a number of places.

build_tlist_to_deparse() is used to deparse the targetlist at the time
of creating paths as well as during planning. According to your
proposal we would build different tlists during path creation and plan
creation. That doesn't look good. Apart from that, your proposal of
changing a child's targetlist at plan creation time to use a CRE
doesn't help, since the original problem happens at the time of
creating plans, as described in [1].

The parent's whole-row reference has a different data type than the
child's whole-row reference. If we do not cover the child's whole-row
reference with a CRE, the parent and child targetlists will go out of
sync as far as the types of individual columns are concerned. That too
doesn't look good to me.

[1] 
https://www.postgresql.org/message-id/CAFjFpRfD5=lsoqg1euf-vmpm9koezbbgjo68d9wkubjxop_...@mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: WAL prefetch

2018-06-14 Thread Robert Haas
On Thu, Jun 14, 2018 at 9:23 AM, Konstantin Knizhnik
 wrote:
> Speed of random HDD access is limited by speed of disk head movement.
> By running several IO requests in parallel we just increase probability of
> head movement, so actually parallel access to HDD may even decrease IO speed
> rather than increase it.
> In theory, given several concurrent IO requests, driver can execute them in
> optimal order, trying to minimize head movement. But if there are really
> access to random pages,
> then probability that we can win something by such optimization is very
> small.

You might be right, but I feel like I've heard previous reports of
significant speedups from prefetching on HDDs.  Perhaps I am
mis-remembering.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WAL prefetch

2018-06-14 Thread Konstantin Knizhnik




On 14.06.2018 15:44, Robert Haas wrote:

On Wed, Jun 13, 2018 at 11:45 PM, Amit Kapila  wrote:

I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from 56k
TPS to 60k TPS (on pgbench with scale 1000).

That's a reasonable improvement.

Somehow I would have expected more.  That's only a 7% speedup.

I am also surprised that HDD didn't show any improvement.


Maybe pgbench is not the best use case for prefetch. It updates more or
less random pages, and if the database is large enough and
full_page_writes is true (the default value), then most pages will be
updated only once since the last checkpoint, and most updates will be
represented in the WAL by full-page records.

And such records do not require reading any data from disk.


  Since HDD's
are bad at random I/O, I would have expected prefetching to help more
in that case.


The speed of random HDD access is limited by the speed of disk head
movement. By running several IO requests in parallel we just increase
the probability of head movement, so parallel access to an HDD may
actually decrease IO speed rather than increase it.
In theory, given several concurrent IO requests, the driver can execute
them in an optimal order, trying to minimize head movement. But if the
accesses really are to random pages, then the probability that we can
win something by such optimization is very small.
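For context, "prefetch" here means handing the kernel an asynchronous
read-ahead hint for blocks the WAL references before they are actually
read; a minimal sketch of such a hint (the existing prefetch path does
something similar via posix_fadvise where available):

    #include <fcntl.h>

    /* Ask the kernel to start reading this range; the call itself returns
     * immediately and does not wait for the IO. */
    static void
    prefetch_range(int fd, off_t offset, off_t len)
    {
    #ifdef POSIX_FADV_WILLNEED
        (void) posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
    #else
        (void) fd;
        (void) offset;
        (void) len;
    #endif
    }

On a single HDD such hints mostly just add more competing head
movement, which is why the win there is doubtful.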



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-14 Thread Robert Haas
On Wed, Jun 6, 2018 at 7:51 AM, Ashutosh Bapat
 wrote:
> On Tue, Jun 5, 2018 at 10:38 PM, Alvaro Herrera
>  wrote:
>
>>  - BEFORE row triggers are not supported
>
> I think this is fine. The existing wording suggests that the user
> creates the triggers on the partitioned table, and that will be
> supported always, which can lead to problems. With this description,
> if the user finds out that BEFORE triggers are supported on partitions
> and creates those, and we implement something to support those on
> partitioned table, s/he will have to change the implementation
> accordingly.

It sounds like you want to try to hide from users the fact that they
can create triggers on the individual partitions.  I think that's the
wrong approach.  First of all, it's not going to work, because people
are going to find out about it and do it anyway.  Secondly, the
documentation's job is to tell you about the system's capabilities,
not conceal them from you.  Third, any speculation about what upgrade
problems people might have in the future is just that: speculation.
As the saying goes, it's tough to make predictions, especially about
the future.

To put that another way, if we really think nobody should do this, we
should prohibit it, not leave it out of the documentation.  But I
think prohibiting it would be a terrible idea: it would break upgrades
from existing releases and inconvenience users to no good purpose.
Very few, if any, users say, well, I don't really need a trigger on
this table, but PostgreSQL is really quite a bit faster than I need it
to be, so I think I'll add one anyway just to slow things down.  We
should assume that anyone who wants a BEFORE trigger has a good reason
for wanting it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-14 Thread Robert Haas
On Mon, Jun 4, 2018 at 5:40 PM, Tom Lane  wrote:
>> I think, in general, that we should try to pick semantics that make a
>> partitioned table behave like an unpartitioned table, provided that
>> all triggers are defined on the partitioned table itself.
>
> Well, then we lose the property Alvaro wanted, namely that if an
> application chooses to insert directly into a partition, that's
> just an optimization that changes no behavior (as long as it picked
> the right partition).  Maybe this can be dodged by propagating
> parent trigger definitions to the children, but it's going to be
> complicated I'm afraid.

Isn't this basically what I already proposed in
http://postgr.es/m/CA+TgmoYQD1xSM7=xry6rv2a-w43gkpcth76f3nsp5o2sgwe...@mail.gmail.com
?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partitioning with temp tables is broken

2018-06-14 Thread Ashutosh Bapat
On Thu, Jun 14, 2018 at 9:42 AM, Amit Langote
 wrote:
> On 2018/06/14 11:09, Michael Paquier wrote:
>> On Wed, Jun 13, 2018 at 10:25:23PM +0530, amul sul wrote:
>>> On Wed, Jun 13, 2018, 8:34 PM Tom Lane  wrote:
 Even if you want to argue that there's a use case for these situations,
 it seems far too late in the release cycle to be trying to fix all these
 issues.  I think we need to forbid the problematic cases for now, and
 leave relaxing the prohibition to be treated as a future new feature.
>>>
>>> +1, forbid the problematic case.
>>
>> +1 on that.  And I can see that nobody on this thread has tested with
>> REL_10_STABLE, right?  In that case you don't get a crash, but other
>> sessions can see the temporary partition of another's session, say with
>> \d+.  So it seems to me that we should forbid the use of temporary
>> relations in a partition tree and back-patch it as well to v10 for
>> consistency.  And if trying to insert into the temporary relation you
>> can a nice error:
>>
>> =# insert into listp values (2);
>> ERROR:  0A000: cannot access temporary tables of other sessions
>> LOCATION:  ReadBufferExtended, bufmgr.c:657
>>
>> This is also a bit uninstinctive as I would think of it as an authorized
>> operation, still my guts tell me that the code complications are not
>> really worth the use-cases:
>>
>> =# create temp table listp2 partition of listp for values in (2);
>> ERROR:  42P17: partition "listp2" would overlap partition "listp2"
>> LOCATION:  check_new_partition_bound, partition.c:81
>
> I have to agree to reverting this "feature".  As Michael points out, even
> PG 10 fails to give a satisfactory error message when using a declarative
> feature like tuple routing.  It should've been "partition not found for
> tuple" rather than "cannot access temporary tables of other sessions".
> Further as David and Ashutosh point out, PG 11 added even more code around
> declarative partitioning that failed to consider the possibility that some
> partitions may not be accessible in a given session even though visible
> through relcache.
>
> I'm attaching a patch here to forbid adding a temporary table as partition
> of permanent table.  I didn't however touch the feature that allows *all*
> members in a partition tree to be temporary tables.

Similar problems will happen if a temporary partitioned table's
hierarchy contains partitions from different sessions. Also, what
should happen to the partitions from the other sessions after the
session that created the temporary partitioned table goes away is not
clear to me. Should they get dropped, and how?

If I am reading Tom's reply upthread correctly, we should not allow
creating temporary partitioned tables or temporary partitions at all.
I thought that would be easily fixable in the grammar itself, but we
allow creating a table in a temporary schema, so that doesn't work. A
better fix might be to prohibit those cases in DefineRelation() itself.
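As a rough illustration only (this is not the patch Amit attached), the
kind of prohibition in DefineRelation() might look something like the
following, assuming the parent relation is already open as "parent":

    /* Hypothetical sketch: reject a temp/permanent mix in a partition tree. */
    if (stmt->partbound != NULL)
    {
        if (parent->rd_rel->relpersistence == RELPERSISTENCE_TEMP &&
            stmt->relation->relpersistence != RELPERSISTENCE_TEMP)
            ereport(ERROR,
                    (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                     errmsg("cannot create a permanent relation as partition of temporary relation \"%s\"",
                            RelationGetRelationName(parent))));

        if (parent->rd_rel->relpersistence != RELPERSISTENCE_TEMP &&
            stmt->relation->relpersistence == RELPERSISTENCE_TEMP)
            ereport(ERROR,
                    (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                     errmsg("cannot create a temporary relation as partition of permanent relation \"%s\"",
                            RelationGetRelationName(parent))));
    }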

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Few cosmetic suggestions for commit 16828d5c (Fast Alter Table Add Column...)

2018-06-14 Thread Amit Kapila
On Sun, Jun 3, 2018 at 5:08 AM, Amit Kapila  wrote:
> Some assorted comments:
>
> 2.
> +/*
> + * Structure used to represent value to be used when the attribute is not
> + * present at all in a tuple, i.e. when the column was created after the
> tuple
> + */
> +
> +typedef struct attrMissing
> +{
> +   boolammissingPresent;   /* true if non-NULL missing value exists
> */
> +   Datum   ammissing;  /* value when attribute is missing */
> +} AttrMissing;
> +
>
> a. Extra space (empty line) between structure and comments seems
> unnecessary.
> b. The names of structure members seem little difficult to understand if you
> and or others also think so, can we rename them to something like
> (missingPresent, missingVal) or (attmissingPresent, attmissingVal)  or
> something similar.
>
> Patch to address 1 and 2a is attached.  What do you think about 2b?
>

As nobody responded, it seems that the variable naming pointed out
above is okay, but in any case, I think we should apply the proposed
cosmetic fixes.  I will commit the patch unless you or someone else
thinks that we should change the names of the variables.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WAL prefetch

2018-06-14 Thread Amit Kapila
On Thu, Jun 14, 2018 at 6:14 PM, Robert Haas  wrote:
> On Wed, Jun 13, 2018 at 11:45 PM, Amit Kapila  wrote:
>>> I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
>>> RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
>>> The speed of synchronous replication between two nodes is increased from 56k
>>> TPS to 60k TPS (on pgbench with scale 1000).
>>
>> That's a reasonable improvement.
>
> Somehow I would have expected more.  That's only a 7% speedup.
>

It might be that the overhead of synchronous replication is already so
big that it didn't show a large speedup.  We might want to try recovery
(PITR) or maybe async replication to see whether we get better numbers.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-14 Thread Ashutosh Bapat
On Thu, Jun 14, 2018 at 1:54 PM, Amit Langote
 wrote:
> On 2018/06/12 22:22, Ashutosh Bapat wrote:
>> -- create triggers, user may create different trigger functions one
>> for each partition, unless s/he understands that the tables can share
>> trigger functions
>> create function trig_t1p1() returns trigger as $$ begin return new;
>> end;$$ language plpgsql;
>> create trigger trig_t1p1_1 before update on t1p1 execute procedure 
>> trig_t1p1();
>> create trigger trig_t1p1_2 before update on t1p1 execute procedure 
>> trig_t1p1();
>>
>> create function trig_t1p2() returns trigger as $$ begin return new;
>> end;$$ language plpgsql;
>> create trigger trig_t1p2 before update on t1p2 execute procedure trig_t1p2();
>>
>> When this schema gets upgraded to v12 or whatever version supports
>> BEFORE ROW triggers on a partitioned table and the user tries to
>> create a trigger on a partitioned table
>> 1. somehow the code has to detect that there are already triggers on
>> the partitions so no need to create new triggers on the partitions. I
>> don't see how this can be done, unless the user is smart enough to use
>> the same trigger function for all partitions.
>>
>> 2. user has to drop those triggers and trigger functions and create
>> trigger on the partitioned table.
>>
>> May be I am underestimating users but I am sure there will be some
>> users who will end up with scenario.
>
> Hmm, if a user *knowingly* makes triggers on different partitions call
> different functions, then they wouldn't want to use the feature to create
> the trigger on partitioned tables, because that feature implies same
> trigger definition for all partitions.

How many users would do that *knowingly*?

>
> Maybe, just as a reminder to users, we could mention in the docs that it
> is in fact possible to call the same function from different triggers
> (that is, triggers defined on different partitions) and that's what they
> should do if they intend to later use the feature to define that same
> trigger on the partitioned table.

+1 for something like this, but wording is important.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: late binding of shared libs for C functions

2018-06-14 Thread Robert Haas
On Tue, Jun 12, 2018 at 8:41 AM, Andrew Dunstan
 wrote:
> UNBOUNDED would be terrible. It does not mean the same thing as UNBOUND.
>
> Perhaps something like NO CHECK would meet the case, i.e. we're not checking
> the link at function creation time.
>
> I haven't thought through the other implications yet.

It seems like it might be better to control this through a GUC than
dedicated syntax, because you probably want it for purposes of
restoring an otherwise-unrestorable dump, and you want to make the
decision at restore time, not dump time.  If it's a GUC, that's a lot
easier than if you have to edit the dump.
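For illustration, that could be a boolean GUC consulted where CREATE
FUNCTION currently tries to resolve the library symbol, loosely
analogous to the existing check_function_bodies setting; the name below
is invented:

    /* Hypothetical entry for guc.c's ConfigureNamesBool[]; the underlying
     * "check_function_libraries" variable would be declared alongside the
     * other GUC globals. */
    {
        {"check_function_libraries", PGC_USERSET, CLIENT_CONN_STATEMENT,
            gettext_noop("Checks that C functions' shared libraries can be loaded at creation time."),
            NULL
        },
        &check_function_libraries,
        true,
        NULL, NULL, NULL
    },

A dump could then be restored with the GUC turned off, without editing
the dump at all.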

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Shared access methods?

2018-06-14 Thread Alexander Korotkov
Hi!

On Thu, Jun 14, 2018 at 5:37 AM Andres Freund  wrote:
> Several features in various discussed access methods would benefit from
> being able to perform actions when writing out a buffer. As an example,
> because it doesn't require understanding any of the new proposed storage
> formats, it'd be good for performance if we could eagerly set hint bits
> / perform page level pruning when cleaning a dirty buffer either during
> checkpoint writeout or bgwriter / backend reclaim.  That'd allow to
> avoid the write amplification issues in several of current and proposed
> cleanup schemes.

Yes, that could be useful.

> Unfortunately that's currently not really easy to do. Buffers don't
> currently know which AM they belong to, therefore we can't know how to
> treat it at writeout time.  It's not that hard to make space in the
> buffer descriptor to additionally store the oid of the associated AM,
> e.g. we could just replace buf_id with a small bit of pointer math.
>
> But even if we had a AM oid, it'd be unclear what to do with it as it'd
> be specific to a database. Which makes it pretty much useless for tasks
> happening on writeout of victim buffers / checkpoint.
>
> Thus I think it'd be better design to have pg_am be a shared
> relation. That'd imply a two things:
> a) amhandler couldn't be regproc but would need to be two fields, one
>pointing to internal or a shlib, the other to the function name.

Makes sense to me.

> b) extensions containing AMs would need to do something INSERT ... ON
>CONFLICT DO NOTHING like.

We already have the CREATE ACCESS METHOD command.  I think this command
should handle that internally.  And I don't understand why "ON
CONFLICT DO NOTHING".  If an AM with the given name already exists in
pg_am, why should we ignore the error?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: WAL prefetch

2018-06-14 Thread Robert Haas
On Wed, Jun 13, 2018 at 11:45 PM, Amit Kapila  wrote:
>> I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
>> RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
>> The speed of synchronous replication between two nodes is increased from 56k
>> TPS to 60k TPS (on pgbench with scale 1000).
>
> That's a reasonable improvement.

Somehow I would have expected more.  That's only a 7% speedup.

I am also surprised that HDD didn't show any improvement.  Since HDD's
are bad at random I/O, I would have expected prefetching to help more
in that case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Locking B-tree leafs immediately in exclusive mode

2018-06-14 Thread Alexander Korotkov
On Thu, Jun 14, 2018 at 1:01 AM Peter Geoghegan  wrote:
> On Mon, Jun 11, 2018 at 9:30 AM, Alexander Korotkov
>  wrote:
> > On Mon, Jun 11, 2018 at 1:06 PM Simon Riggs  wrote:
> >> It's a good idea. How does it perform with many duplicate entries?
>
> I agree that this is a good idea. It independently occurred to me to
> do this. The L algorithm doesn't take a position on this at all,
> supposing that *no* locks are needed at all (that's why I recommend
> people skip L and just read the Lanin & Shasha paper -- L is
> unnecessarily confusing). This idea seems relatively low risk.

Great.  Unfortunately, we don't have access to the 72-core machine
anymore, but I'll do benchmarks on a smaller 40-core machine, which we
do have access to.  I hope there will be some win from this patch too.

> > Our B-tree is currently maintaining duplicates unordered.  So, during 
> > insertion
> > we can traverse rightlinks in order to find page, which would fit new
> > index tuple.
> > However, in this case we're traversing pages in exclusive lock mode, and
> > that happens already after re-lock.  _bt_search() doesn't do anything with 
> > that.
> > So, I assume there shouldn't be any degradation in the case of many
> > duplicate entries.
>
> BTW, I have a prototype patch that makes the keys at the leaf level
> unique. It is actually an implementation of nbtree suffix truncation
> that sometimes *adds* a new attribute in pivot tuples, rather than
> truncating away non-distinguishing attributes -- it adds a special
> heap TID attribute when it must. The patch typically increases fan-in,
> of course, but the real goal was to make all entries at the leaf level
> truly unique, as L intended (we cannot do it without suffix
> truncation because that will kill fan-in). My prototype has full
> amcheck coverage, which really helped me gain confidence in my
> approach.

Could you please clarify what you mean by "fan-in"?  As I understand
it, higher "fan-in" means more children on a single non-leaf page, and
in turn "better branching".  Is my understanding correct?

If my understanding is correct, then making the leaf level truly unique
without suffix truncation will kill fan-in, because the additional heap
TID attribute will increase the pivot tuple size.  However, that
doesn't look like an inevitable shortcoming, because we could store the
heap TID in the t_tid of pivot index tuples.  Yes, we've used the t_tid
offset for storing the number of attributes in truncated tuples.  But
the heap TID is going to be virtually the last attribute of the tuple,
so pivot tuples containing a heap TID are not going to be
suffix-truncated anyway.  So, for such tuples we can simply not set
INDEX_ALT_TID_MASK, and they would be assumed to have all the
attributes.

So, my idea is that it's not necessary to couple suffix truncation with
making leaf tuples unique.  Regarding just making leaf tuples unique, I
understand that it wouldn't always be a win.  When we have a lot of
duplicates, the current implementation searches among the pages
containing those duplicates for the first page with enough free space,
while with a unique index we would always have to insert into one
particular page.  Thus, the current implementation gives us better
filling of pages, and (probably) faster insertion.  But a unique index
would give us much better performance for retail tuple delete and
update (currently we support neither), and I believe that future
pluggable table access methods will use both.

Therefore, my idea is that there is a tradeoff between making a btree
index unique or non-unique.  I think we will need an option for that.
I'm not yet sure whether this option should be user-visible or an
automatic decision based on the table access method used.

> There are still big problems with my patch, though. It seems to hurt
> performance more often than it helps when there is a lot of
> contention, so as things stand I don't see a path to making it
> commitable. I am still going to clean it up some more and post it to
> -hackers, though. I still think it's quite interesting. I am not
> pursuing it much further now because it seems like my timing is bad --
> not because it seems like a bad idea. I can imagine something like
> zheap making this patch important again. I can also imagine someone
> seeing something that I missed.

I think that joint community efforts are necessary when a single person
doesn't see a path to making a patch committable.  So, I'm looking
forward to seeing your patch posted.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: why partition pruning doesn't work?

2018-06-14 Thread Robert Haas
On Thu, Jun 14, 2018 at 7:23 AM, David Rowley
 wrote:
> However, I only spent about 10 mins looking into this, there may be
> some giant holes in the idea.  It would need much more research.

It kind of flies in the face of the idea that a RangeTblEntry is just
a node that can be freely copied around, serialized and deserialized,
etc.

I think it would be better to keep the pointer in the RelOptInfo in
the planner and in the EState or PlanState in the executor.  Those are
things we don't think can be copied, serialized, etc.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2018-06-14 Thread Marina Polyakova

On 13-06-2018 22:44, Fabien COELHO wrote:

Hello Marina,

I suppose that this is related; because of my patch there may be a lot 
of such code (see v7 in [1]):


-   fprintf(stderr,
-           "malformed variable \"%s\" value: \"%s\"\n",
-           var->name, var->svalue);
+   if (debug_level >= DEBUG_FAILS)
+   {
+       fprintf(stderr,
+               "malformed variable \"%s\" value: \"%s\"\n",
+               var->name, var->svalue);
+   }

-   if (debug)
+   if (debug_level >= DEBUG_ALL)
        fprintf(stderr, "client %d sending %s\n", st->id, sql);


I'm not sure that debug messages needs to be kept after debug, if it
is about debugging pgbench itself. That is debatable.


AFAICS it is not about debugging pgbench itself, but about more
detailed information that can be used to understand what exactly
happened during the run. In the case of errors this helps to
distinguish failures and errors by type (including which retry limit
was violated and by how much it was exceeded for serialization/deadlock
errors).



The code adapts/duplicates existing server-side "ereport" stuff and
brings it to the frontend, where the logging needs are somehow quite
different.

I'd prefer to avoid duplication and/or have some code sharing.


I was recommended to use the same interface in [3]:

On elog/errstart: we already have a convention for what ereport() 
calls look like; I suggest to use that instead of inventing your 
own.


The "elog" interface already exists, it is not an invention. "ereport"
is a hack which is somehow necessary in some cases. I prefer a simple
function call if possible for the purpose, and ISTM that this is the
case.



That is a lot of complication which are justified server side
where logging requirements are special, but in this case I see it as
overkill.


I think we need ereport() if we want to produce detailed error messages
(see the examples in [1]).


If it really needs to be duplicated, I'd suggest to put all this 
stuff in separated files. If we want to do that, I think that it 
would belong to fe_utils, and where it could/should be used by all 
front-end programs.


I'll try to do that.


Dunno. If you only need one "elog" function which prints a message to
stderr and decides whether to abort/exit/whatevrer, maybe it can just
be kept in pgbench. If there are are several complicated functions and
macros, better with a file. So I'd say it depends.



So my current view is that if you only need an "elog" function, it is
simpler to add it to "pgbench.c".
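For what it's worth, that minimal option could be as small as this (a
hypothetical sketch; the level names and the threshold variable are
invented):

    static void
    pgbench_elog(int elevel, const char *fmt,...)
    {
        va_list     ap;

        if (elevel < log_min_level)     /* assumed global verbosity threshold */
            return;

        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);

        if (elevel >= PGBENCH_FATAL)    /* hypothetical "abort" level */
            exit(1);
    }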


Thank you!


For logging purposes, ISTM that the "elog" macro interface is nicer,
closer to the existing "fprintf(stderr", as it would not introduce 
the

additional parentheses hack for "rest".


I was also recommended to use ereport() instead of elog() in [3]:


Probably. Are you hoping that advice from different reviewers will be
consistent? That seems optimistic :-)


To make the patch committable, there should be no objections to it.

[1] 
https://www.postgresql.org/message-id/c89fcc380a19380260b5ea463efc1416%40postgrespro.ru


--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2018-06-14 Thread Marina Polyakova

On 13-06-2018 22:59, Alvaro Herrera wrote:

For context: in the backend, elog() is only used for internal messages
(i.e. "can't-happen" conditions), and ereport() is used for user-facing
messages.  There are many things ereport() has that elog() doesn't, 
such
as additional message fields (HINT, DETAIL, etc) that I think could 
have
some use in pgbench as well.  If you use elog() then you can't have 
that.


AFAIU it was not originally intended that pgbench error messages have
these fields, so would it be good to change the output that finally
goes to stderr like this? For example:


-   fprintf(stderr, "%s", PQerrorMessage(con));
-   fprintf(stderr, "(ignoring this error and continuing anyway)\n");
+   ereport(LOG,
+           (errmsg("Ignoring the server error and continuing anyway"),
+            errdetail("%s", PQerrorMessage(con))));

-   fprintf(stderr, "%s", PQerrorMessage(con));
-   if (sqlState && strcmp(sqlState, ERRCODE_UNDEFINED_TABLE) == 0)
-   {
-       fprintf(stderr, "Perhaps you need to do initialization (\"pgbench -i\") in database \"%s\"\n", PQdb(con));
-   }
-
-   exit(1);
+   ereport(ERROR,
+           (errmsg("Server error"),
+            errdetail("%s", PQerrorMessage(con)),
+            sqlState && strcmp(sqlState, ERRCODE_UNDEFINED_TABLE) == 0 ?
+            errhint("Perhaps you need to do initialization (\"pgbench -i\") in database \"%s\"\n",
+                    PQdb(con)) : 0));

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-14 Thread Rajkumar Raghuwanshi
Hi,

I am getting a server crash for below test case.

postgres=# CREATE TABLE test( c1 int, c2 int, c3 text) partition by
range(c1);
CREATE TABLE
postgres=# create table test_p1 partition of test for values from
(minvalue) to (0);
CREATE TABLE
postgres=# create table test_p2 partition of test for values from (0) to
(maxvalue);
CREATE TABLE
postgres=#
postgres=# select (select max((select t1.c2 from test t1 where t1.c1 =
t2.c1))) from test t2;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

--*logfile*
TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)",
File: "pathnode.c", Line: 1288)
2018-06-13 12:48:41.577 IST [52050] LOG:  server process (PID 52061) was
terminated by signal 6: Aborted
2018-06-13 12:48:41.577 IST [52050] DETAIL:  Failed process was running:
select (select max((select t1.c2 from test t1 where t1.c1 = t2.c1))) from
test t2;

*--core file *
Core was generated by `postgres: edb postgres [local]
SELECT   '.
Program terminated with signal 6, Aborted.
#0  0x003dd2632495 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
64  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install
keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
libcom_err-1.41.12-23.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x003dd2632495 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x003dd2633c75 in abort () at abort.c:92
#2  0x00a32782 in ExceptionalCondition (conditionName=0xc23718
"!(!parallel_aware || pathnode->path.parallel_safe)", errorType=0xc23411
"FailedAssertion",
fileName=0xc2357d "pathnode.c", lineNumber=1288) at assert.c:54
#3  0x007f1a3d in create_append_path (root=0x27a5930,
rel=0x27c25f0, subpaths=0x27c4e60, partial_subpaths=0x0,
required_outer=0x0, parallel_workers=2,
parallel_aware=true, partitioned_rels=0x27c36f0, rows=-1) at
pathnode.c:1288
#4  0x00797d40 in add_paths_to_append_rel (root=0x27a5930,
rel=0x27c25f0, live_childrels=0x27c4958) at allpaths.c:1700
#5  0x007d3392 in apply_scanjoin_target_to_paths (root=0x27a5930,
rel=0x27c25f0, scanjoin_targets=0x27c4558,
scanjoin_targets_contain_srfs=0x0,
scanjoin_target_parallel_safe=false, tlist_same_exprs=true) at
planner.c:6962
#6  0x007cb218 in grouping_planner (root=0x27a5930,
inheritance_update=false, tuple_fraction=0) at planner.c:2014
#7  0x007c943a in subquery_planner (glob=0x27855e0,
parse=0x26c7250, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at
planner.c:966
#8  0x007c7ff7 in standard_planner (parse=0x26c7250,
cursorOptions=256, boundParams=0x0) at planner.c:405
#9  0x007c7d1f in planner (parse=0x26c7250, cursorOptions=256,
boundParams=0x0) at planner.c:263
#10 0x008c461e in pg_plan_query (querytree=0x26c7250,
cursorOptions=256, boundParams=0x0) at postgres.c:809
#11 0x008c4751 in pg_plan_queries (querytrees=0x27a58f8,
cursorOptions=256, boundParams=0x0) at postgres.c:875
#12 0x008c4a26 in exec_simple_query (query_string=0x26c5798 "select
(select max((select t1.c2 from test t1 where t1.c1 = t2.c1))) from test
t2;") at postgres.c:1050
#13 0x008c8e67 in PostgresMain (argc=1, argv=0x26ef2a0,
dbname=0x26ef100 "postgres", username=0x26c2298 "edb") at postgres.c:4153
#14 0x00826657 in BackendRun (port=0x26e7060) at postmaster.c:4361
#15 0x00825dc5 in BackendStartup (port=0x26e7060) at
postmaster.c:4033
#16 0x008221a7 in ServerLoop () at postmaster.c:1706
#17 0x00821ad9 in PostmasterMain (argc=3, argv=0x26c01f0) at
postmaster.c:1379
#18 0x00748cb8 in main (argc=3, argv=0x26c01f0) at main.c:228

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: SCRAM with channel binding downgrade attack

2018-06-14 Thread Magnus Hagander
On Tue, Jun 12, 2018 at 6:49 AM, Michael Paquier 
wrote:

> On Mon, Jun 11, 2018 at 04:54:45PM +0200, Magnus Hagander wrote:
> > I'm wondering if that means we should then also not do it specifically
> for
> > scram in this version. Otherwise we're likely to end up with a parameter
> > that only has a "lifetime" of one version, and that seems like a bad
> idea.
> > If nothing else we should clearly think out what the path is to make sure
> > that doesn't happen. (e.g. we don't want a
> > scram_channel_binding_mode=require in this version, if the next one is
> > going to replace it with something like Heikki's suggested
> > allowed_authentication_methods=SCRAM-SHA-256-PLUS or whatever we end up
> > coming up with there).
>
> Conceptually, it depends on whether we consider SCRAM and
> SCRAM+channel_binding as two separate authentication protocols.  However,
> it seems to me that both are the same thing, as they use the same
> protocol, so it would be confusing for the user to be able to define both
> SCRAM-SHA-256 and SCRAM-SHA-256-PLUS within the same list.  So I would
> tend to think that this future "allowed_authentication_methods" should
> have only scram-sha-256 listed as an option, which counts for both SCRAM
> with and without channel binding.
>

One could argue they are equally the same protocol if we have
SCRAM-SHA-512 or whatever in the future. So how would those be handled?


> Thinking harder about this thread, it could well be possible in the
> future that we add support for channel binding for some other SASL
> mechanism, in which case I would tend to rename
> scram_channel_binding_type and scram_channel_binding_mode to simply
> channel_binding_type and channel_binding_mode, without any concept of
> SCRAM attached to them.  So in short, I'd like to keep both enforcement
> mechanisms as separate concepts.  One future compatibility issue is how
> to deal with, for example, cases like allowed_authentication_methods="md5"
> and channel_binding_mode=require, where an allowed authentication method
> does not support channel binding.  It seems to me that this should result
> in an error.
>

Yeah, not embedding scram in the name seems smarter. But you might still
need to define which one, so channel_binding_mode=require wouldn't be
enough in that case -- you'd need to have
channel_binding_mode=require-scram-sha-256-plus, wouldn't you?

And yes, in your suggested example, it should absolutely fail early, as
there is no way to actually connect with that setting. Arguably we should
also fail on e.g. sslmode=require over a unix socket, though, which we
don't. But that is probably not something to copy :)

I still think that the fact that we are still discussing what is basically
the *basic concepts* of how this would be set up after we have released
beta1 is a clear sign that this should not go into 11.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Logging transaction IDs for DDL.

2018-06-14 Thread Magnus Hagander
On Thu, Jun 14, 2018 at 12:34 AM, Vik Fearing 
wrote:

> I just noticed a problem with log_statement = 'ddl' and log_line_prefix
> containing '%x'.  If the statement is the first in the transaction, it
> will be logged before it is executed, and more importantly, before a
> transaction ID is assigned.  That means that %x will be 0.
>
> If the administrator has configured postgres like this in order to ease
> PITR on bad DDL, it's actually of no use whatsoever and they'll have to
> use pg_waldump to try to figure things out.
>
> PFA a simple patch that I hope addresses the issue in a clean way.  It
> also handles the same problem for 'mod'.
>
> I'm undecided whether this is a bugfix or an improvement.  I'm leaning
> towards bugfix.
>

FWIW, I'm with Andres on the general points -- especially about having to
look at overheads etc...

For your specific use case, and until the actual logging can be rethought,
I suggest you take a look at event triggers. E.g.:

https://blog.hagander.net/logging-transactions-that-dropped-tables-236/
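
As a minimal sketch of that approach (not the exact code from the post;
function and trigger names here are made up), an event trigger can force an
xid to be assigned and log it for every DDL command:

-- Hypothetical example: record the transaction ID of each DDL command.
-- txid_current() assigns an xid if one has not been assigned yet.
CREATE FUNCTION log_ddl_xid() RETURNS event_trigger AS $$
BEGIN
    RAISE LOG 'DDL command % ran in transaction %', tg_tag, txid_current();
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER log_ddl_xid_trg ON ddl_command_end
    EXECUTE PROCEDURE log_ddl_xid();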

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: why partition pruning doesn't work?

2018-06-14 Thread David Rowley
On 14 June 2018 at 19:17, Amit Langote  wrote:
> I had sent a patch to try to get rid of the open(NoLock) there a couple of
> months ago [1].  The idea was to both lock and open the relation in
> ExecNonLeafAppendTables, which is the first time all partitioned tables in
> a given Append node are locked for execution.  Also, the patch makes it a
> responsibility of ExecEndAppend to release the relcache pins, so the
> recently added ExecDestroyPartitionPruneState would not be needed.

Robert and I briefly discussed something more complete a while back.
Not much detail was talked about, but the basic idea was to store the
Relation somewhere in the planner and executor that we could look up by
rel index rather than having to relation_open all the time.

I just had a very quick look around and thought maybe RangeTblEntry
might be a good spot to store the Relation and current lock level that
we hold on that relation. This means that we can use
PlannerInfo->simple_rte_array in the planner and
EState->es_range_table in the executor. The latter of those would be
much nicer if we built an array instead of keeping the list (can
probably build that during InitPlan()).  We could then get hold of a
Relation object when needed without having to do relation_open(...,
NoLock).

Code which currently looks like:

reloid = getrelid(scanrelid, estate->es_range_table);
rel = heap_open(reloid, lockmode);

In places like ExecOpenScanRelation, could be replaced with some
wrapper function call like: rte_open_rel(estate, scanrelid, lockmode);

This could also be used to ERROR out if rte_open_rel() was done
initially with NoLock. Right now there are so many places that we do
relation_open(..., NoLock) and write a comment /* Lock already taken
by ... */, which we hope is always true.

If the Relation is already populated in the RangeTblEntry then the
function would just return that; otherwise, we'd just look inside the
RangeTblEntry for the relation Oid, do a heap_open on it, and store
the lockmode that's held. We could then also consider getting rid of a
bunch of fields like boundinfo and nparts from RelOptInfo.

We could also perhaps do a WARNING about lock upgrade hazards in
there, at least maybe in debug builds.

However, I only spent about 10 mins looking into this, there may be
some giant holes in the idea.  It would need much more research.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pgsql: Store 2PC GID in commit/abort WAL recs for logical decoding

2018-06-14 Thread Nikhil Sontakke
Hi Heikki,


>
> Pushed, thanks for the review!
>

There was a slight oversight in the twophase_gid length calculation in
the XactLogCommitRecord() code path in the cf5a1890592 commit. The
corresponding XactLogAbortRecord() code path was ok. PFA, a small
patch to fix the oversight.

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


gid_length.patch
Description: Binary data


Re: ON CONFLICT DO NOTHING on pg_dump

2018-06-14 Thread Surafel Temesgen
On Tue, Jun 12, 2018 at 12:05 PM, Ideriha, Takeshi <
ideriha.take...@jp.fujitsu.com> wrote:
Thank you for the review.

> Hi,
> I feel like that on-conflict-do-nothing support is useful especially
> coupled with --data-only option.
> Only the difference of data can be restored.
>
> >The attached patch adds the --on-conflect-do-nothing option to pg_dump in order
> to do the above.
>
> The followings are some comments.
>
> +  --on-conflect-do-nothing
> Here's a typo: conflect -> conflict. This typo also applies to pg_dump.c
>
> printf(_("  --insertsdump data as INSERT
> commands, rather than COPY\n"));
> +   printf(_("  --on-conflect-do-nothing dump data as INSERT
> commands with on conflect do nothing\n"));
> printf(_("  --no-commentsdo not dump comments\n"));
>
> The output of help should be in alphabetical order according to the
> convention. So changing the order seems logical.
> Please apply my review to the documentation as well.
> By the way, 4d6a854 breaks the patch on this point.
>
> +This option is not valid unless --inserts is
> also specified.
> +   
>
> +   if (dopt.do_nothing && !dopt.dump_inserts)
> +   exit_horribly(NULL, "option --on-conflect-do-nothing
> requires option --inserts\n");
>
> How about mentioning --column-inserts? --on-conflict-do-nothing with
> --column-inserts should work.
>
Fixed.

>
> Do you have any plan to support on-conlict-do-update? Supporting this
> seems to me complicated and take much time so I don't mind not implementing
> this.
>
I agree it's complicated, and I don't have a plan to implement it.

> What do you think about adding some test cases?
> command_fails_like() at 001_basic.pl checks the command failure pattern with
> an invalid combination of options.
> And 002_pg_dump.pl checks the feature itself.
>
Thank you for pointing that out; I added a basic test, and it seems to me the
rest of the testing is covered by the column_inserts test.
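
For reference, the intent of the option is that rows dumped with --inserts
(or --column-inserts) come out roughly like this (an illustrative sketch;
the table and values are made up):

INSERT INTO public.t1 VALUES (1, 'one') ON CONFLICT DO NOTHING;
INSERT INTO public.t1 VALUES (2, 'two') ON CONFLICT DO NOTHING;

so restoring such a dump into a database that already holds some of the rows
only adds the missing ones.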

> Regards,
> Takeshi Ideriha
>


pg_dump_onConflect_v2.pach
Description: Binary data


Re: ntile() throws ERROR when hashagg is false

2018-06-14 Thread David Rowley
On 14 June 2018 at 18:57, Andrew Gierth  wrote:
> What I think pg is actually doing is taking the value of the ntile()
> argument from the first row and using that for the whole partition.
> In your example, enabling or disabling hashagg changes the order of the
> input rows for the window function (since you've specified no ordering
> in the window definition), and with hashagg off, you get the smallest
> value of a first (which is 0 and thus an error).

Seems that's the case. I'd guess it was written that way so we could
allow PARAM_EXTERN Params rather than requiring the arg to be a Const.

I wonder if it would be worth adding a run-time check in
window_ntile() that causes an ERROR on first call if there are any
Vars or PARAM_EXEC Params in the function argument. An ERROR might be
better than doing something that the user does not expect.

Ideally, something would alert the user much sooner than the executor,
but I think doing it that way would be quite a bit more work.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: why partition pruning doesn't work?

2018-06-14 Thread David Rowley
On 14 June 2018 at 04:10, Tom Lane  wrote:
> There's still one thing I'm a bit confused about here.  I noticed that
> we weren't actually using the partopfamily and partopcintype fields in
> PartitionPruneContext, so I removed them.  But that still leaves both
> partsupfunc and partcollation as pointers into the relcache that were
> subject to this hazard.  My testing agrees with lousyjack's results
> that both of those were, in fact, being improperly accessed.  The OID
> comparison effect I mentioned upthread explains why the buildfarm's
> cache-clobbering members failed to notice any problem with garbage
> partsupfunc data ... but why did we not see any complaints about invalid
> collation OIDs?  Tis strange.

FWIW It's not working for me before e23bae82cf3 with
CLOBBER_FREED_MEMORY, CLOBBER_CACHE_ALWAYS and RELCACHE_FORCE_RELEASE,
and:

create table listp (a text) partition by list(a);
create table listp1 partition of listp for values in ('1');
select * from listp where a = (select '1');

I get:

ERROR:  cache lookup failed for collation 2139062143

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



question on streaming replication

2018-06-14 Thread Atul Kumar
Hi,

I have EDB Postgres 9.6, and I have the following issue to sort out.

I have configured streaming replication with the master and slave nodes on
the same server, just to test it.

All worked fine, but when I stopped the slave service, created some test
databases on the master, and then started the slave service again, the slave
didn't pick up the changes.

The replication was in async state.

Then, after some searching on Google, I tried to switch it to sync state, but
even after making changes in the postgresql.conf file I am neither getting
sync state nor seeing any changes on the slave server.
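
For reference, getting a standby into sync state generally comes down to
listing its application_name in synchronous_standby_names on the master and
then checking pg_stat_replication; a minimal sketch (the name 'standby1' is
an assumption and must match the standby's primary_conninfo):

-- on the master
ALTER SYSTEM SET synchronous_standby_names = 'standby1';
SELECT pg_reload_conf();
-- then verify what the master sees
SELECT application_name, state, sync_state FROM pg_stat_replication;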

Please suggest what I need to do.


Regards,
Atul



Re: [GSoC] current working status

2018-06-14 Thread Aleksander Alekseev
Hello Charles,

> The first evaluation is coming. Here is my progress so far. During
> the first stage of work, I have implemented the thrift binary
> protocol as the format of the postgresql plugin. Currently, the main
> interface is byte. Users of this plugin need to provide thrift
> bytes to the plugin, and there are helpers for each data type to
> parse out the value contained in the bytes. This method has been
> verified by several unit tests. However, the current interface
> requires users to understand the thrift format very well to use this
> plugin. After discussing with my mentors, it will be more useful to
> implement the concept of a "thrift type", which is a custom type where
> the user provides a user-understandable format (e.g., json) and the
> plugin converts it into bytes. I am currently busy implementing this
> feature and still need some time to complete it. Once this part is done,
> I will go ahead and implement the thrift compact protocol.  Let me know if
> you have comments. You can always track the progress on github
> ( https://github.com/charles-cui/pg_thrift)

Thanks for keeping us informed!

Unfortunately I'm having difficulties compiling your code on Linux with
PostgreSQL 10.4 or 11 and GCC 8.1.1. Here is a full list of warnings
and errors: https://afiskon.ru/s/6e/edbefe818e_pg-thrift-errors.txt

I also tried a VM with Ubuntu 16.04, PostgreSQL 9.6 and GCC 5.4.0. No
luck either. Please make sure that your code compiles on Linux. Users
will most likely try to run it on this OS.

It might also be a good idea to add your repository to the travis-ci.org
service.

-- 
Best regards,
Aleksander Alekseev



Re: [PROPOSAL] Shared Ispell dictionaries

2018-06-14 Thread Arthur Zakirov
On Wed, May 16, 2018 at 02:36:33PM +0300, Arthur Zakirov wrote:
> ... I attached the rebased patch.

I attached a new version of the patch.

I found a bug when the CompoundAffix, SuffixNodes, PrefixNodes, and
DictNodes parts of the IspellDictData structure are empty. Now they have a
terminating entry and therefore always contain at least one node entry.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c
index 6f5b635413..09297e384c 100644
--- a/src/backend/tsearch/spell.c
+++ b/src/backend/tsearch/spell.c
@@ -78,6 +78,8 @@
 #define tmpalloc(sz)  MemoryContextAlloc(Conf->buildCxt, (sz))
 #define tmpalloc0(sz)  MemoryContextAllocZero(Conf->buildCxt, (sz))
 
+#define tmpstrdup(str) MemoryContextStrdup(Conf->buildCxt, (str))
+
 /*
  * Prepare for constructing an ISpell dictionary.
  *
@@ -498,7 +500,7 @@ NIAddSpell(IspellDict *Conf, const char *word, const char 
*flag)
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + 
strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
Conf->Spell[Conf->nspell]->p.flag = (*flag != '\0')
-   ? cpstrdup(Conf, flag) : VoidString;
+   ? tmpstrdup(flag) : VoidString;
Conf->nspell++;
 }
 
@@ -1040,7 +1042,7 @@ setCompoundAffixFlagValue(IspellDict *Conf, 
CompoundAffixFlag *entry,
entry->flag.i = i;
}
else
-   entry->flag.s = cpstrdup(Conf, s);
+   entry->flag.s = tmpstrdup(s);
 
entry->flagMode = Conf->flagMode;
entry->value = val;
@@ -1541,6 +1543,9 @@ nextline:
return;
 
 isnewformat:
+   pfree(recoded);
+   pfree(pstr);
+
if (oldformat)
ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
diff --git a/contrib/dict_int/dict_int.c b/contrib/dict_int/dict_int.c
index 56ede37089..8dd4959028 100644
--- a/contrib/dict_int/dict_int.c
+++ b/contrib/dict_int/dict_int.c
@@ -30,7 +30,7 @@ PG_FUNCTION_INFO_V1(dintdict_lexize);
 Datum
 dintdict_init(PG_FUNCTION_ARGS)
 {
-   List   *dictoptions = (List *) PG_GETARG_POINTER(0);
+   DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0);
DictInt*d;
ListCell   *l;
 
@@ -38,7 +38,7 @@ dintdict_init(PG_FUNCTION_ARGS)
d->maxlen = 6;
d->rejectlong = false;
 
-   foreach(l, dictoptions)
+   foreach(l, init_data->dict_options)
{
DefElem*defel = (DefElem *) lfirst(l);
 
diff --git a/contrib/dict_xsyn/dict_xsyn.c b/contrib/dict_xsyn/dict_xsyn.c
index a79ece240c..0b8a32d459 100644
--- a/contrib/dict_xsyn/dict_xsyn.c
+++ b/contrib/dict_xsyn/dict_xsyn.c
@@ -140,7 +140,7 @@ read_dictionary(DictSyn *d, const char *filename)
 Datum
 dxsyn_init(PG_FUNCTION_ARGS)
 {
-   List   *dictoptions = (List *) PG_GETARG_POINTER(0);
+   DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0);
DictSyn*d;
ListCell   *l;
char   *filename = NULL;
@@ -153,7 +153,7 @@ dxsyn_init(PG_FUNCTION_ARGS)
d->matchsynonyms = false;
d->keepsynonyms = true;
 
-   foreach(l, dictoptions)
+   foreach(l, init_data->dict_options)
{
DefElem*defel = (DefElem *) lfirst(l);
 
diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c
index 247c202755..2a2fbee5fa 100644
--- a/contrib/unaccent/unaccent.c
+++ b/contrib/unaccent/unaccent.c
@@ -267,12 +267,12 @@ PG_FUNCTION_INFO_V1(unaccent_init);
 Datum
 unaccent_init(PG_FUNCTION_ARGS)
 {
-   List   *dictoptions = (List *) PG_GETARG_POINTER(0);
+   DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0);
TrieChar   *rootTrie = NULL;
boolfileloaded = false;
ListCell   *l;
 
-   foreach(l, dictoptions)
+   foreach(l, init_data->dict_options)
{
DefElem*defel = (DefElem *) lfirst(l);
 
diff --git a/src/backend/commands/tsearchcmds.c 
b/src/backend/commands/tsearchcmds.c
index 3a843512d1..3753e32b2c 100644
--- a/src/backend/commands/tsearchcmds.c
+++ b/src/backend/commands/tsearchcmds.c
@@ -386,17 +386,25 @@ verify_dictoptions(Oid tmplId, List *dictoptions)
}
else
{
+   DictInitData init_data;
+
/*
 * Copy the options just in case init method thinks it can 
scribble on
 * them ...
 */
dictoptions = copyObject(dictoptions);
 
+   init_data.dict_options = dictoptions;
+   init_data.dict.id = InvalidOid;
+   init_data.dict.xmin = InvalidTransactionId;
+   init_data.dict.xmax = InvalidTransactionId;
+   ItemPointerSetInvalid(&init_data.dict.tid);
+
/*
 * Call the init method and see if it complains.  We don't 
worry about
 * it leaking memory, 

Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-14 Thread Amit Langote
On 2018/06/12 22:22, Ashutosh Bapat wrote:
> -- create triggers, user may create different trigger functions one
> for each partition, unless s/he understands that the tables can share
> trigger functions
> create function trig_t1p1() returns trigger as $$ begin return new;
> end;$$ language plpgsql;
> create trigger trig_t1p1_1 before update on t1p1 execute procedure 
> trig_t1p1();
> create trigger trig_t1p1_2 before update on t1p1 execute procedure 
> trig_t1p1();
> 
> create function trig_t1p2() returns trigger as $$ begin return new;
> end;$$ language plpgsql;
> create trigger trig_t1p2 before update on t1p2 execute procedure trig_t1p2();
> 
> When this schema gets upgraded to v12 or whatever version supports
> BEFORE ROW triggers on a partitioned table and the user tries to
> create a trigger on a partitioned table
> 1. somehow the code has to detect that there are already triggers on
> the partitions so no need to create new triggers on the partitions. I
> don't see how this can be done, unless the user is smart enough to use
> the same trigger function for all partitions.
> 
> 2. user has to drop those triggers and trigger functions and create
> trigger on the partitioned table.
> 
> May be I am underestimating users but I am sure there will be some
> users who will end up with scenario.

Hmm, if a user *knowingly* makes triggers on different partitions call
different functions, then they wouldn't want to use the feature to create
the trigger on partitioned tables, because that feature implies same
trigger definition for all partitions.

Maybe, just as a reminder to users, we could mention in the docs that it
is in fact possible to call the same function from different triggers
(that is, triggers defined on different partitions) and that's what they
should do if they intend to later use the feature to define that same
trigger on the partitioned table.
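
Concretely, the pattern that would later map cleanly onto a single trigger on
the partitioned table might look like this (a sketch reusing the table names
from Ashutosh's example above; the function and trigger names are made up):

-- one trigger function shared by the trigger on every partition
create function trig_t1() returns trigger as $$ begin return new; end;$$
language plpgsql;
create trigger trig_t1 before update on t1p1 for each row execute procedure trig_t1();
create trigger trig_t1 before update on t1p2 for each row execute procedure trig_t1();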

Thanks,
Amit




Re: Needless additional partition check in INSERT?

2018-06-14 Thread Amit Langote
On 2018/06/09 3:41, Alvaro Herrera wrote:
> BTW, while working on this, I was a bit disturbed by the
> execReplication.c changes (namely: if the partitioning is not identical
> on both sides, things are likely to blow up pretty badly), but that's a
> separate topic.

Hmm, yes.  If the partition of a given name on the subscription side doesn't
have the same partition constraint as on the publication side, the
subscription worker simply fails, which is the right thing to do anyway.

ERROR:  new row for relation "p1" violates partition constraint
DETAIL:  Failing row contains (1, 1).
LOG:  background worker "logical replication worker" (PID 25739) exited
with exit code 1

Maybe, it's a user error to set up logical replication that way.

> I didn't see any tests of logical replication with
> partitioned tables ... Is the current understanding that we don't
> promise those things to work terribly well together?

As far as the documentation is concerned, there is this note on

31.4. Restrictions (Chapter 31. Logical Replication)
https://www.postgresql.org/docs/devel/static/logical-replication-restrictions.html

  Replication is only possible from base tables to base tables. That is,
  the tables on the publication and on the subscription side must be
  normal tables, not views, materialized views, partition root tables, or
  foreign tables. In the case of partitions, you can therefore replicate a
  partition hierarchy one-to-one, but you cannot currently replicate to a
  differently partitioned setup. Attempts to replicate tables other than
  base tables will result in an error.
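
To make that concrete, a "one-to-one" setup per the above publishes the leaf
partitions themselves; a rough sketch (the partition and connection names are
illustrative, assuming p1 and p2 are leaf partitions defined identically on
both sides):

-- on the publisher: publish the leaf partitions, not the partitioned parent
CREATE PUBLICATION part_pub FOR TABLE p1, p2;

-- on the subscriber, which must have matching partitions p1 and p2
CREATE SUBSCRIPTION part_sub
    CONNECTION 'host=pubhost dbname=pubdb'
    PUBLICATION part_pub;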

Thanks,
Amit




Re: why partition pruning doesn't work?

2018-06-14 Thread Amit Langote
On 2018/06/13 23:39, Tom Lane wrote:
> Robert Haas  writes:
>> Seems reasonable.  Really, I think we should look for a way to hang
>> onto the relation at the point where it's originally opened and locked
>> instead of reopening it here.  But that's probably more invasive than
>> we can really justify right at the moment, and I think this is a step
>> in a good direction.
> 
> The existing coding there makes me itch a bit, because there's only a
> rather fragile line of reasoning justifying the assumption that there is a
> pre-existing lock at all.  So I'd be in favor of what you suggest just to
> get rid of the "open(NoLock)" hazard.  But I agree that it'd be rather
> invasive and right now is probably not the time for it.

I had sent a patch to try to get rid of the open(NoLock) there a couple of
months ago [1].  The idea was to both lock and open the relation in
ExecNonLeafAppendTables, which is the first time all partitioned tables in
a given Append node are locked for execution.  Also, the patch makes it a
responsibility of ExecEndAppend to release the relcache pins, so the
recently added ExecDestroyPartitionPruneState would not be needed.

Attached is a rebased version of that patch if there is interest in it.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/0b361a22-f995-e15c-a385-6d1b72dd0d13%40lab.ntt.co.jp
diff --git a/src/backend/executor/execPartition.c 
b/src/backend/executor/execPartition.c
index 7a4665cc4e..ea6b05934b 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1401,7 +1401,8 @@ adjust_partition_tlist(List *tlist, TupleConversionMap 
*map)
  * in each PartitionPruneInfo.
  */
 PartitionPruneState *
-ExecCreatePartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
+ExecCreatePartitionPruneState(PlanState *planstate, List *partitionpruneinfo,
+ Relation 
*partitioned_rels)
 {
PartitionPruneState *prunestate;
PartitionPruningData *prunedata;
@@ -1440,6 +1441,7 @@ ExecCreatePartitionPruneState(PlanState *planstate, List 
*partitionpruneinfo)
PartitionPruneInfo *pinfo = castNode(PartitionPruneInfo, 
lfirst(lc));
PartitionPruningData *pprune = &prunedata[i];
PartitionPruneContext *context = &pprune->context;
+   Relationrel;
PartitionDesc partdesc;
PartitionKey partkey;
int partnatts;
@@ -1460,17 +1462,11 @@ ExecCreatePartitionPruneState(PlanState *planstate, 
List *partitionpruneinfo)
/* present_parts is also subject to later modification */
pprune->present_parts = bms_copy(pinfo->present_parts);
 
-   /*
-* We need to hold a pin on the partitioned table's relcache 
entry so
-* that we can rely on its copies of the table's partition key 
and
-* partition descriptor.  We need not get a lock though; one 
should
-* have been acquired already by InitPlan or
-* ExecLockNonLeafAppendTables.
-*/
-   context->partrel = relation_open(pinfo->reloid, NoLock);
-
-   partkey = RelationGetPartitionKey(context->partrel);
-   partdesc = RelationGetPartitionDesc(context->partrel);
+   Assert(partitioned_rels[i] != NULL);
+   rel = partitioned_rels[i];
+   Assert(RelationGetRelid(rel) == pinfo->reloid);
+   partkey = RelationGetPartitionKey(rel);
+   partdesc = RelationGetPartitionDesc(rel);
n_steps = list_length(pinfo->pruning_steps);
 
context->strategy = partkey->strategy;
@@ -1546,26 +1542,6 @@ ExecCreatePartitionPruneState(PlanState *planstate, List 
*partitionpruneinfo)
 }
 
 /*
- * ExecDestroyPartitionPruneState
- * Release resources at plan shutdown.
- *
- * We don't bother to free any memory here, since the whole executor context
- * will be going away shortly.  We do need to release our relcache pins.
- */
-void
-ExecDestroyPartitionPruneState(PartitionPruneState *prunestate)
-{
-   int i;
-
-   for (i = 0; i < prunestate->num_partprunedata; i++)
-   {
-   PartitionPruningData *pprune = >partprunedata[i];
-
-   relation_close(pprune->context.partrel, NoLock);
-   }
-}
-
-/*
  * ExecFindInitialMatchingSubPlans
  * Identify the set of subplans that cannot be eliminated by 
initial
  * pruning (disregarding any pruning constraints involving 
PARAM_EXEC
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index b963cae730..87809054ed 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -858,22 +858,52 @@ ShutdownExprContext(ExprContext *econtext, bool isCommit)
 /*
  * ExecLockNonLeafAppendTables
  *
- * Locks, if necessary, the 

Re: WAL prefetch

2018-06-14 Thread Konstantin Knizhnik




On 14.06.2018 09:52, Thomas Munro wrote:

On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
 wrote:

pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.

Hi Konstantin,

Why stop at the page cache...  what about shared buffers?



It is a good question. I thought a lot about prefetching directly into
shared buffers. But the current reality with Postgres is that allocating
too much memory for shared buffers is not recommended, for many different
reasons: degradation of the clock replacement algorithm, "write storms", ...


If your system has 1Tb of memory, almost no PostgreSQL administrator will
recommend using all of that 1Tb for shared buffers. Moreover, there are
recommendations to choose the shared buffers size based on the size of the
internal cache of the persistent storage device (so that changes can be
flushed without writes to physical media). So on such a system with 1Tb of
RAM, the shared buffers size will most likely be set to a few hundred
gigabytes.


Also, PostgreSQL does not currently support dynamically changing the shared
buffers size. Without that, the only way to use Postgres in clouds and
other multiuser systems, where the system load is not fully controlled by
the user, is to choose a relatively small shared buffers size and rely on
OS caching.


Yes, access to shared buffers is about two times faster than reading data
from the file system cache. But relying on the OS cache is still better than
the situation where shared buffers are swapped out and the effect of large
shared buffers becomes negative.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: ntile() throws ERROR when hashagg is false

2018-06-14 Thread Andrew Gierth
> "Rajkumar" == Rajkumar Raghuwanshi 
>  writes:

 Rajkumar> Hi
 
 Rajkumar> ntile() throws ERROR when hashagg is false, test case given
 Rajkumar> below.

 Rajkumar> postgres=# create table foo (a int, b int, c text);
 Rajkumar> CREATE TABLE
 Rajkumar> postgres=# insert into foo select i%20, i%30, to_char(i%12, 
'FM') from
 Rajkumar> generate_series(0, 36) i;
 Rajkumar> INSERT 0 37
 Rajkumar> postgres=# explain select ntile(a) OVER () from foo GROUP BY a;

This query isn't actually legal per the spec; the argument of ntile is
restricted to being a constant or parameter, so it can't change from row
to row. PG is more flexible, but that doesn't make the query any more
meaningful.
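
For comparison, a query that stays within that restriction passes a constant
(or parameter) and gives the window an explicit ordering; a minimal sketch
against the reported table:

-- four buckets, assigned in a well-defined order
select a, ntile(4) over (order by a) from foo group by a;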

What I think pg is actually doing is taking the value of the ntile()
argument from the first row and using that for the whole partition.
In your example, enabling or disabling hashagg changes the order of the
input rows for the window function (since you've specified no ordering
in the window definition), and with hashagg off, you get the smallest
value of a first (which is 0 and thus an error).

-- 
Andrew (irc:RhodiumToad)



Re: WAL prefetch

2018-06-14 Thread Thomas Munro
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
 wrote:
> pg_wal_prefetch function will infinitely traverse WAL and prefetch block
> references in WAL records
> using posix_fadvise(WILLNEED) system call.

Hi Konstantin,

Why stop at the page cache...  what about shared buffers?

-- 
Thomas Munro
http://www.enterprisedb.com



ntile() throws ERROR when hashagg is false

2018-06-14 Thread Rajkumar Raghuwanshi
Hi

ntile() throws ERROR when hashagg is false, test case given below.

postgres=# create table foo (a int, b int, c text);
CREATE TABLE
postgres=# insert into foo select i%20, i%30, to_char(i%12, 'FM') from
generate_series(0, 36) i;
INSERT 0 37
postgres=# explain select ntile(a) OVER () from foo GROUP BY a;
QUERY PLAN
---
 WindowAgg  (cost=25.00..29.50 rows=200 width=8)
   ->  HashAggregate  (cost=25.00..27.00 rows=200 width=4)
 Group Key: a
 ->  Seq Scan on foo  (cost=0.00..22.00 rows=1200 width=4)
(4 rows)

postgres=# select ntile(a) OVER () from foo GROUP BY a;
 ntile
---
 1
 1
 2
 2
 3
 3
 4
 4
 5
 5
 6
 6
 7
 7
 8
 8
 9
 9
10
11
(20 rows)

postgres=# set enable_hashagg to false;
SET
postgres=# explain select ntile(a) OVER () from foo GROUP BY a;
   QUERY PLAN
-
 WindowAgg  (cost=83.37..91.87 rows=200 width=8)
   ->  Group  (cost=83.37..89.37 rows=200 width=4)
 Group Key: a
 ->  Sort  (cost=83.37..86.37 rows=1200 width=4)
   Sort Key: a
   ->  Seq Scan on foo  (cost=0.00..22.00 rows=1200 width=4)
(6 rows)
postgres=# select ntile(a) OVER () from foo GROUP BY a;
ERROR:  argument of ntile must be greater than zero

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation