Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-06-22 Thread Ashutosh Sharma
On Sat, Jun 23, 2018 at 7:36 AM, Haribabu Kommi
 wrote:
> On Sat, Jun 23, 2018 at 5:45 AM Euler Taveira  wrote:
>>
>> 2018-06-22 12:06 GMT-03:00 Robert Haas :
>> > On Wed, Jun 20, 2018 at 10:19 AM, Euler Taveira 
>> > wrote:
>> >> 2018-06-20 4:30 GMT-03:00 Haribabu Kommi :
>> >>> Attached is a simple patch with implementation. Comments?
>> >>>
>> >> Why don't you extend the existing function pg_stat_statements_reset()?
>> >
>> > Well, the existing function doesn't take any arguments.  We could add
>> > an additional version of it that takes an argument, or we could
>> > replace the existing version with one that has an optional argument.
>> > But are either of those things any better than just adding a new
>> > function with a different name, like
>> > pg_stat_statements_reset_statement()?
>> >
>> From the user's POV, overloading is a good goal. It is better to
>> remember one function name than 3 different function names to do the
>> same task (reset statement statistics).
>
>
> I agree that function overloading is beneficial as long as it doesn't
> introduce confusion and difficulty in using it.
>
>
>> > I have not had such good experiences with function overloading, either
>> > in PostgreSQL or elsewhere, that I'm ready to say reusing the same
>> > name is definitely the right approach.  For example, suppose we
>> > eventually end up with a function that resets all the statements, a
>> > function that resets just one statement, a function that resets all
>> > statements for one user, and a function that resets all statements
>> > where the statement text matches a certain regexp.  If those all have
>> > separate names, everything is fine.  If they all have the same name,
>> > there's no way that's not confusing.
>> >
>> I think part of your frustration with overloading is because there are
>> so many arguments and/or argument type combinations to remember.
>> Frankly, I never remember the order / type of arguments when a
>> function has more than 2 arguments (unless I use that function a lot).
>> If we want to consider overloading, I think we should put all
>> possibilities for that function on the table and decide whether to
>> overload it or not. Bad overloading decisions tend to be very
>> frustrating for the user. In this case, all possibilities (queryid,
>> user, regex, database, statement regex) can be summarized as (i) 0
>> arguments, which means all statements, and (ii) 1 argument (queryid is
>> unique for all statements) -- because queryid can be obtained using
>> SELECT pg_stat_statements_reset(queryid) FROM pg_stat_statements WHERE
>> my-condition-here.
>
>
> In pg_stat_statements the uniqueness of the query is decided by (userid,
> dbid and queryid).
> I feel the reset function should take a maximum of 3 input arguments, and
> the rest of the conditions can be filtered using the WHERE condition.
>

I'm slightly confused about the function overloading part here (not
sure whether we should reuse the same function name or introduce a
function with a new name). However, I think passing userid, dbid and
queryid as input arguments to the user-facing function (be it
pg_stat_statements_reset_statement or pg_stat_statements_reset) looks
like a good option, as it would allow users to remove an entry for a
particular query across databases, not just in the database that the
user is currently connected to. The current patch definitely lacks
this flexibility.
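
For concreteness, here is a minimal, hypothetical sketch of what the C
entry point for such a three-argument variant could look like; the
function name, argument order, and the elided hash-table lookup are
assumptions for illustration, not the actual patch:

#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(pg_stat_statements_reset_statement);

/*
 * Hypothetical sketch only: reset the entry identified by
 * (userid, dbid, queryid).  The shared hash-table lookup and locking
 * that the real extension would need are elided here.
 */
Datum
pg_stat_statements_reset_statement(PG_FUNCTION_ARGS)
{
    Oid         userid = PG_GETARG_OID(0);
    Oid         dbid = PG_GETARG_OID(1);
    int64       queryid = PG_GETARG_INT64(2);

    elog(DEBUG1, "reset requested for userid %u, dbid %u, queryid " INT64_FORMAT,
         userid, dbid, queryid);

    /* ... locate the matching entry and remove it ... */

    PG_RETURN_VOID();
}

Anything more selective than (userid, dbid, queryid) could then be
expressed with a WHERE clause over the pg_stat_statements view, as
suggested above.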

> If we are going to support a single function that can reset all the
> statement statistics of a "user", of a specific "database", or of
> "specific queries", based on the other input parameter values, I am not
> sure whether the overloaded function would cause confusion in using it.
> And also, if the specified input query data isn't found, is it better
> to ignore that or to raise an error?
>
> Regards,
> Haribabu Kommi
> Fujitsu Australia



Re: I'd like to discuss scaleout at PGCon

2018-06-22 Thread Bruce Momjian
On Sat, Jun 23, 2018 at 12:41:00PM +1000, Haribabu Kommi wrote:
> 
> On Sat, Jun 23, 2018 at 12:17 PM Bruce Momjian  wrote:
> 
> On Fri, Jun 22, 2018 at 01:28:58PM -0500, Merlin Moncure wrote:
> > On Fri, Jun 22, 2018 at 12:34 PM Bruce Momjian  wrote:
> > >
> > > What we don't want to do is to add a bunch of sharding-specific code
> > > without knowing which workloads it benefits, and how many of our users
> > > will actually use sharding.  Some projects have done that, and it
> > > didn't end well since they then had a lot of product complexity with
> > > little user value.
> >
> > Key features from my perspective:
> > *) fdw in parallel.  how do i do it today? ghetto implemented parallel
> > queries with asynchronous dblink
> 
> Andres has outlined what needs to be done here:
> 
>         https://www.postgresql.org/message-id/20180525033538.6ypfwcqcxce6zkjj%40alap3.anarazel.de
> 
> 
> Sorry if this has already been discussed up-thread.
>
> I would just like to bring up the idea of scaling out by adding many
> instances that share the lock manager and buffer pool across all the
> instances, with the help of Remote Direct Memory Access (RDMA).
>
> By adding a pluggable buffer pool and lock manager, how about adding
> many instances that all share the buffers using RDMA, to provide
> better scaling with a shared-everything architecture?
>
> Currently I don't have any idea whether it is possible or not, or what
> the problems in using RDMA would be.
>
> Just want to check whether it is an idea worth considering for
> supporting scale-out?

Yes, Robert Haas did mention this.  It might be something we consider
much later.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: SCRAM with channel binding downgrade attack

2018-06-22 Thread Bruce Momjian
On Sun, Jun 17, 2018 at 10:21:27PM +0900, Michael Paquier wrote:
> On Fri, Jun 15, 2018 at 05:23:27PM -0400, Robert Haas wrote:
> > On Thu, Jun 14, 2018 at 7:43 AM, Magnus Hagander  
> > wrote:
> >> I still think that the fact that we are still discussing what is basically
> >> the *basic concepts* of how this would be set up after we have released
> >> beta1 is a clear sign that this should not go into 11.
> > 
> > +1.
> 
> Yes, that sounds right.

Uh, as I am understanding it, if we don't allow clients to force channel
binding, then channel binding is useless because it cannot prevent
man-in-the-middle attacks.  I am sure some users will try to use it, and
not understand that it serves no purpose.  If we then allow clients to
force channel binding in PG 12, they will then need to fix their
clients.

I suggest that if we don't allow users to use channel binding
effectively that we should remove all documentation about this feature.

This is different from downgrade attacks like SCRAM to MD5 or MD5 to
'password' because the way the password is transmitted is not integral
to preventing man-in-the-middle attacks.  Channel binding's sole value
is to prevent such attacks, so if it cannot prevent them, it has no use
and will just confuse people until we make it useful in a later release.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: I'd like to discuss scaleout at PGCon

2018-06-22 Thread Haribabu Kommi
On Sat, Jun 23, 2018 at 12:17 PM Bruce Momjian  wrote:

> On Fri, Jun 22, 2018 at 01:28:58PM -0500, Merlin Moncure wrote:
> > On Fri, Jun 22, 2018 at 12:34 PM Bruce Momjian  wrote:
> > >
> > > What we don't want to do is to add a bunch of sharding-specific code
> > > without knowing which workloads it benefits, and how many of our users
> > will actually use sharding.  Some projects have done that, and it
> > > didn't end well since they then had a lot of product complexity with
> > > little user value.
> >
> > Key features from my perspective:
> > *) fdw in parallel.  how do i do it today? ghetto implemented parallel
> > queries with asynchronous dblink
>
> Andres has outlined what needs to be done here:
>
>
> https://www.postgresql.org/message-id/20180525033538.6ypfwcqcxce6zkjj%40alap3.anarazel.de
>


Sorry if this has already been discussed up-thread.

I would just like to bring up the idea of scaling out by adding many
instances that share the lock manager and buffer pool across all the
instances, with the help of Remote Direct Memory Access (RDMA).

By adding a pluggable buffer pool and lock manager, how about adding
many instances that all share the buffers using RDMA, to provide
better scaling with a shared-everything architecture?

Currently I don't have any idea whether it is possible or not, or what
the problems in using RDMA would be.

Just want to check whether it is an idea worth considering for
supporting scale-out?

Regards,
Haribabu Kommi
Fujitsu Australia


Re: I'd like to discuss scaleout at PGCon

2018-06-22 Thread Bruce Momjian
On Fri, Jun 22, 2018 at 01:28:58PM -0500, Merlin Moncure wrote:
> On Fri, Jun 22, 2018 at 12:34 PM Bruce Momjian  wrote:
> >
> > What we don't want to do is to add a bunch of sharding-specific code
> > without knowing which workloads it benefits, and how many of our users
> > will actually use sharding.  Some projects have done that, and it
> > didn't end well since they then had a lot of product complexity with
> > little user value.
> 
> Key features from my perspective:
> *) fdw in parallel.  how do i do it today? ghetto implemented parallel
> queries with asynchronous dblink

Andres has outlined what needs to be done here:


https://www.postgresql.org/message-id/20180525033538.6ypfwcqcxce6zkjj%40alap3.anarazel.de

> *) column store

This could be part of the pluggable storage engine.

> *) automatic partition management through shards

Yes, but I am afraid we need to know where we are going before we can
implement management.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-06-22 Thread Haribabu Kommi
On Sat, Jun 23, 2018 at 5:45 AM Euler Taveira  wrote:

> 2018-06-22 12:06 GMT-03:00 Robert Haas :
> > On Wed, Jun 20, 2018 at 10:19 AM, Euler Taveira 
> wrote:
> >> 2018-06-20 4:30 GMT-03:00 Haribabu Kommi :
> >>> Attached is a simple patch with implementation. Comments?
> >>>
> >> Why don't you extend the existing function pg_stat_statements_reset()?
> >
> > Well, the existing function doesn't take any arguments.  We could add
> > an additional version of it that takes an argument, or we could
> > replace the existing version with one that has an optional argument.
> > But are either of those things any better than just adding a new
> > function with a different name, like
> > pg_stat_statements_reset_statement()?
> >
> From the user's POV, overloading is a good goal. It is better to
> remember one function name than 3 different function names to do the
> same task (reset statement statistics).
>

I agree that function overloading is beneficial as long as it doesn't
introduce confusion and difficulty in using it.


> > I have not had such good experiences with function overloading, either
> > in PostgreSQL or elsewhere, that I'm ready to say reusing the same
> > name is definitely the right approach.  For example, suppose we
> > eventually end up with a function that resets all the statements, a
> > function that resets just one statement, a function that resets all
> > statements for one user, and a function that resets all statements
> > where the statement text matches a certain regexp.  If those all have
> > separate names, everything is fine.  If they all have the same name,
> > there's no way that's not confusing.
> >
> I think part of your frustration with overloading is because there are
> so many arguments and/or argument type combinations to remember.
> Frankly, I never remember the order / type of arguments when a
> function has more than 2 arguments (unless I use that function a lot).
> If we want to consider overloading, I think we should put all
> possibilities for that function on the table and decide whether to
> overload it or not. Bad overloading decisions tend to be very
> frustrating for the user. In this case, all possibilities (queryid,
> user, regex, database, statement regex) can be summarized as (i) 0
> arguments, which means all statements, and (ii) 1 argument (queryid is
> unique for all statements) -- because queryid can be obtained using
> SELECT pg_stat_statements_reset(queryid) FROM pg_stat_statements WHERE
> my-condition-here.
>

In pg_stat_statements the uniqueness of the query is decided by (userid,
dbid and queryid).
I feel the reset function should take a maximum of 3 input arguments, and
the rest of the conditions can be filtered using the WHERE condition.

If we are going to support a single function that can reset all the
statement statistics of a "user", of a specific "database", or of
"specific queries", based on the other input parameter values, I am not
sure whether the overloaded function would cause confusion in using it.
And also, if the specified input query data isn't found, is it better
to ignore that or to raise an error?

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Fix some error handling for read() and errno

2018-06-22 Thread Michael Paquier
On Fri, Jun 22, 2018 at 09:51:30AM -0400, Alvaro Herrera wrote:
> On 2018-Jun-22, Robert Haas wrote:
>> I think this should be split into two patches.  Fixing failure to save
>> and restore errno is a different issue from cleaning up short write
>> messages.  I agree that the former should be back-patched and the
>> latter not.
> 
> Hmm, yeah, good thought, +1.

That's exactly why I have started this thread, so that both problems are
addressed separately:
https://www.postgresql.org/message-id/20180622061535.gd5...@paquier.xyz
And back-patching the errno patch while only bothering about the messages on
HEAD also matches what I had in mind.  I'll come back to this thread
once the errno issues are all addressed.
--
Michael




Re: Using JSONB directly from application

2018-06-22 Thread Nico Williams
On Mon, Feb 26, 2018 at 09:36:52PM +0800, Craig Ringer wrote:
> On 26 February 2018 at 04:05, Anthony Communier  > wrote:
> > It would be nice if applications connected to a PostgreSQL database could
> > send and receive JSONB in binary. It could save some useless text
> > conversion. All the work could be done on the application side, which is
> > often more scalable than the database itself.
> 
> To support this, you'd need to extract PostgreSQL's jsonb support into a C
> library that could be used independently of backend server infrastructure
> like 'palloc' and memory contexts, ereport(), etc. Or write a parallel
> implementation.
> 
> It's one of the reasons some people questioned the wisdom of doing jsonb as
> a bespoke format not using protobufs or whatever. I'm not one of them,
> since I wasn't the one doing the work, and I also know how hard it can be
> to neatly fit general purpose library code into the DB server where we
> expect it to do little things like not abort() on malloc() failure.
> 
> If you're interested in writing such a library, I suggest proposing a broad
> design for how you intend to do it here. I suspect that duplicating enough
> server backend infrastructure to make the existing jsonb implementation
> friendly to frontend code would be frustrating, but maybe it's not as bad
> as I think. Certainly if you did such a thing, many people would thank you,
> because the inability to use ereport() and elog(), PG_TRY, the List API,
> etc, in FRONTEND code is a constant if minor source of irritation in
> PostgreSQL development.

A good starting point might be to take the description of the JSONB
on-disk encoding from src/include/utils/jsonb.h, fill in some of the
blanks (the encoding of scalar types is not discussed in sufficient
detail in that file, IIRC), and put it in an XML document, perhaps with
some ASCII art.  Eventually this could become yet another binary
encoding of JSON that could be standardized at the IETF, say.

The JSONB format is actually really neat, and deserves wider use outside
PG, both as an interchange format, and as a storage format.

Nico
-- 



Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-06-22 Thread David Rowley
On 22 June 2018 at 18:28, David Rowley  wrote:
> I've written fixes for items 1-6 above.
>
> I did:
>
> 1. Use an array instead of a List.
> 2. Don't do this loop. palloc0() the partitions array instead. Let
> UPDATE add whatever subplans exist to the zeroed array.
> 3. Track what we initialize in a gapless array and cleanup just those
> ones. Make this array small and increase it only when we need more
> space.
> 4. Only allocate the map array when we need to store a map.
> 5. Work that out in relcache beforehand.
> 6. ditto

I've added this to the July 'fest:

https://commitfest.postgresql.org/18/1690/

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Using JSONB directly from application

2018-06-22 Thread Christian Ohler
(continuing an old thread from
https://www.postgresql.org/message-id/CAMsr%2BYEtamQYZ5EocsuthQCvyvmRnQrucDP6GZynPtf0gsMbuw%40mail.gmail.com
)

Craig Ringer , 2018-02-26:

> On 26 February 2018 at 04:05, Anthony Communier
 wrote:
>
> > It would be nice if applications connected to a PostgreSQL database could
> > send and receive JSONB in binary. It could save some useless text
> > conversion. All the work could be done on the application side, which is
> > often more scalable than the database itself.
>
> To support this, you'd need to extract PostgreSQL's jsonb support into a C
> library that could be used independently of backend server infrastructure
> like 'palloc' and memory contexts, ereport(), etc. Or write a parallel
> implementation.

At Shift, we also have use cases that would likely be sped up quite a bit
if we could avoid the conversion from JSONB to JSON, and instead pass
binary JSONB to the application side and parse it there (in Go).  I doubt
we'd want to reuse any of Postgres's C code, and would instead go with your
"parallel implementation" idea; I can't imagine it being particularly
difficult to implement a JSONB parser from scratch.

All we need, I think, is a Postgres function raw_jsonb(jsonb) that returns
bytea but is the identity function at the byte level.  (Or allow a cast
from jsonb to bytea.)
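
To illustrate why it could indeed be close to trivial, here is a
minimal, hypothetical sketch of such a function as a C extension
function. It is an assumption for illustration only (not an existing
API or an actual patch), and it copies the detoasted jsonb payload once
into a new bytea rather than being zero-copy:

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(raw_jsonb);

/*
 * Hypothetical sketch: return the binary JSONB payload verbatim as bytea.
 * The payload is whatever follows the varlena header of the detoasted
 * jsonb datum; no text conversion is performed.
 */
Datum
raw_jsonb(PG_FUNCTION_ARGS)
{
    struct varlena *jb = PG_DETOAST_DATUM(PG_GETARG_DATUM(0));
    Size        len = VARSIZE_ANY_EXHDR(jb);
    bytea      *result = (bytea *) palloc(VARHDRSZ + len);

    SET_VARSIZE(result, VARHDRSZ + len);
    memcpy(VARDATA(result), VARDATA_ANY(jb), len);
    PG_RETURN_BYTEA_P(result);
}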

Our Go code would then send queries like SELECT col1, col2, raw_jsonb(col3)
FROM table1 WHERE ...; I haven't thought in depth about how we'd parse the
JSONB in Go, but perhaps we can synthesize a stream of JSON tokens from the
binary JSONB (one token at a time, to avoid copies and allocations) and
adapt the streaming parser https://github.com/json-iterator/go to turn it
into Go values.

Sending raw JSONB to Postgres might also be interesting, but I'd start with
receiving.

Would implementing raw_jsonb be as trivial as it sounds?  What about usages
like SELECT raw_jsonb(col3->'foo'); does the subobject returned by '->'
share structure with the containing object, making the conversion to a
self-contained JSONB value less direct?

Can these conversions be implemented without copying the bytes?


An open question about the API contract would be how raw_jsonb would be
affected if Postgres introduces a version 2 of JSONB encoding.  My
intuition is to punt that problem to the application, and define that
raw_jsonb returns whatever version of JSONB is most convenient for Postgres
for that particular datum; this minimizes conversion work on the Postgres
side, which is the purpose of the mechanism.  Applications that want a
stable format can use the conventional textual JSON format.  But I could
see a case for making the function raw_jsonb(int, jsonb) and allowing the
caller to specify what (maximum?) version of JSONB they want.


Re: bug with expression index on partition

2018-06-22 Thread Alvaro Herrera
On 2018-Jun-21, Amit Langote wrote:

> explain (costs off) select p from p order by p;
>   QUERY PLAN
> ---
>  Merge Append
>Sort Key: ((p1.*)::p)
>->  Index Scan using p1_p_idx on p1
>->  Index Scan using p2_p_idx on p2
>->  Index Scan using p3_p_idx on p3
> (5 rows)

Nice, but try adding a row > operator in the where clause.

I think it's clearly desirable to allow this row-based search to use indexes;
as I recall, we mostly enable pagination of results via this kind of
construct.  However, we're apparently lacking planner or executor features,
because a query using a row > operator does not use indexes:

create table partp (a int, b int) partition by range (a);
create table partp1 partition of partp for values from (0) to (35);
create table partp2 partition of partp for values from (35) to (100);
create index on partp1 ((partp1.*));
create index on partp2 ((partp2.*));
explain select * from partp where partp > row(0,0) order by partp limit 25 ;
QUERY PLAN
──
 Limit  (cost=6.69..6.75 rows=25 width=40)
   ->  Sort  (cost=6.69..6.86 rows=66 width=40)
 Sort Key: ((partp1.*)::partp)
 ->  Append  (cost=0.00..4.83 rows=66 width=40)
   ->  Seq Scan on partp1  (cost=0.00..1.88 rows=23 width=40)
 Filter: ((partp1.*)::partp > '(0,0)'::record)
   ->  Seq Scan on partp2  (cost=0.00..2.62 rows=43 width=40)
 Filter: ((partp2.*)::partp > '(0,0)'::record)
(8 rows)

Note the indexes are ignored, as opposed to what it does in a non-partitioned
table:

create table p (a int, b int);
create index on p((p.*));
explain select * from p where p > row(0,0) order by p limit 25 ;
QUERY PLAN 
───
 Limit  (cost=0.15..2.05 rows=25 width=40)
   ->  Index Scan using p_p_idx on p  (cost=0.15..57.33 rows=753 width=40)
 Index Cond: (p.* > '(0,0)'::record)
(3 rows)

So it would be good to fix this, but there are more pieces missing.  Or
there is some trick to enable the indexes to be used in that example --
if so I'm all ears.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: bug with expression index on partition

2018-06-22 Thread Alvaro Herrera
On 2018-Jun-21, Amit Langote wrote:

> On 2018/06/21 15:35, Amit Langote wrote:
> > So, CompareIndexInfo and generateClonedIndexStmt are both doing the right
> > thing, but DefineIndex is not.  Attached is a patch to fix DefineIndex so
> > that it converts indexParams before recursing to create the index on a
> > partition.
> 
> I noticed that while CompareIndexInfo and generateClonedIndexStmt would
> reject the case where index expressions contain a whole-row Var, my patch
> didn't teach to do the same to DefineIndex, causing asymmetric behavior.
> So, whereas ATTACH PARTITION would error out when trying to clone a
> parent's index that contains a whole-row Var, recursively creating an
> index on partition won't.
> 
> I updated the patch so that even DefineIndex will check if any whole-row
> Vars were encountered during conversion and error out if so.

Thanks.  I pushed this version, although I tweaked the test so that this
condition is verified in the test that is supposed to do so, rather than
creating a whole new set of tables for this purpose.  The way it would
fail with unpatched code is perhaps less noisy than what you proposed,
but I think it's quite enough.

I think the whole-row vars issue is worthy of some more thinking.  I'm
letting that one go for now.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-22 Thread Peter Geoghegan
On Fri, Jun 22, 2018 at 12:43 PM, Peter Geoghegan  wrote:
> On Fri, Jun 22, 2018 at 4:24 AM, Andrey V. Lepikhov
>  wrote:
>> According to your feedback, I developed a second version of the patch.
>> In this version:
>> 1. The high-level functions index_beginscan() and index_rescan() are not
>> used. Tree descent is done by _bt_search(); _bt_binsrch() is used for
>> positioning on the page.
>> 2. A TID list was introduced in the amtargetdelete() interface. Now only
>> one tree descent is needed to delete all TIDs from the list with an equal
>> scan key value -- the logical-duplicates deletion problem.
>> 3. Only one WAL record for index tuple deletion per leaf page per
>> amtargetdelete() call.
>
> Cool.
>
> What is this "race" code about?

I noticed another bug in your patch, when running a
"wal_consistency_checking=all" smoke test. I do this simple, generic
test for anything that touches WAL-logging, actually -- it's a good
practice to adopt.

I enable "wal_consistency_checking=all" on the installation, create a
streaming replica with pg_basebackup (which also has
"wal_consistency_checking=all"), and then run "make installcheck"
against the primary. Here is what I see on the standby when I do this
with v2 of your patch applied:

9524/2018-06-22 13:03:12 PDT LOG:  entering standby mode
9524/2018-06-22 13:03:12 PDT LOG:  consistent recovery state reached
at 0/3D0
9524/2018-06-22 13:03:12 PDT LOG:  invalid record length at 0/3D0:
wanted 24, got 0
9523/2018-06-22 13:03:12 PDT LOG:  database system is ready to accept
read only connections
9528/2018-06-22 13:03:12 PDT LOG:  started streaming WAL from primary
at 0/300 on timeline 1
9524/2018-06-22 13:03:12 PDT LOG:  redo starts at 0/3D0
9524/2018-06-22 13:03:32 PDT FATAL:  inconsistent page found, rel
1663/16384/1259, forknum 0, blkno 0
9524/2018-06-22 13:03:32 PDT CONTEXT:  WAL redo at 0/3360B00 for
Heap2/CLEAN: remxid 599
9523/2018-06-22 13:03:32 PDT LOG:  startup process (PID 9524) exited
with exit code 1
9523/2018-06-22 13:03:32 PDT LOG:  terminating any other active server processes
9523/2018-06-22 13:03:32 PDT LOG:  database system is shut down

I haven't investigated this at all, but I assume that the problem is a
simple oversight. The new ItemIdSetDeadRedirect() concept that you've
introduced probably necessitates changes in both the WAL logging
routines and the redo/recovery routines. You need to go make those
changes. (By the way, I don't think you should be using the constant
"3" with the ItemIdIsDeadRedirection() macro definition.)

Let me know if you get stuck on this, or need more direction.

-- 
Peter Geoghegan



Re: [PATCH] Include application_name in "connection authorized" log message

2018-06-22 Thread Alvaro Herrera
On 2018-Jun-22, Andres Freund wrote:

> On 2018-06-22 15:50:06 -0400, Alvaro Herrera wrote:
> > On 2018-Jun-22, Andres Freund wrote:
> > 
> > > On 2018-06-22 15:26:06 -0400, Alvaro Herrera wrote:
> > > > On 2018-Jun-22, Andres Freund wrote:
> > > > > I think a fair argument could be made that you'd want to have
> > > > > application_name logged exactly once, not in every line. Just to cope
> > > > > with log volume. With decent log analysis tools once is enough.
> > > > 
> > > > Seems harder than it sounds ... because if the user turns off
> > > > log_connections then it's no longer in the log.
> > > 
> > > That's superuser only, so I really don't quite buy that argument.
> > 
> > I meant if the DBA disables it in postgresql.conf then the info is
> > nowhere.
> 
> I'm not following - application_name isn't logged anywhere by default
> currently either.

Right.  I +1'd the proposal to have it in the log_connections line.

My mind was wandering about doing more than that (or doing something
different), but let's not derail the thread.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Include application_name in "connection authorized" log message

2018-06-22 Thread Andres Freund
On 2018-06-22 15:50:06 -0400, Alvaro Herrera wrote:
> On 2018-Jun-22, Andres Freund wrote:
> 
> > On 2018-06-22 15:26:06 -0400, Alvaro Herrera wrote:
> > > On 2018-Jun-22, Andres Freund wrote:
> > > > I think a fair argument could be made that you'd want to have
> > > > application_name logged exactly once, not in every line. Just to cope
> > > > with log volume. With decent log analysis tools once is enough.
> > > 
> > > Seems harder than it sounds ... because if the user turns off
> > > log_connections then it's no longer in the log.
> > 
> > That's superuser only, so I really don't quite buy that argument.
> 
> I meant if the DBA disables it in postgresql.conf then the info is
> nowhere.

I'm not following - application_name isn't logged anywhere by default
currently either.


Greetings,

Andres Freund



Re: [PATCH] Include application_name in "connection authorized" log message

2018-06-22 Thread Alvaro Herrera
On 2018-Jun-22, Andres Freund wrote:

> On 2018-06-22 15:26:06 -0400, Alvaro Herrera wrote:
> > On 2018-Jun-22, Andres Freund wrote:
> > 
> > > On 2018-06-22 12:16:18 -0400, Robert Haas wrote:
> > 
> > > > OK, that makes more sense, but I'm still skeptical of adding a special
> > > > case particularly for application_name.
> > > 
> > > I think a fair argument could be made that you'd want to have
> > > application_name logged exactly once, not in every line. Just to cope
> > > with log volume. With decent log analysis tools once is enough.
> > 
> > Seems harder than it sounds ... because if the user turns off
> > log_connections then it's no longer in the log.
> 
> That's superuser only, so I really don't quite buy that argument.

I meant if the DBA disables it in postgresql.conf then the info is
nowhere.

> > One idea would be to have a log line designed specifically to be
> > printed once at connection start (if not log_connections) and then
> > once immediately after it changes.  Am I the only one for whom this
> > sounds like overengineering?
> 
> Yea. I think on balance, I don't buy that it's worth the cost. But I
> don't think it's a clear cut "you don't need this".

Yeah, that's true, particularly for the case of the connection pooler ...
(I'm anxious to see where the Odyssey thing goes, because pgbouncer at
this point doesn't seem to be cutting it anymore.  If odyssey takes off,
we could start listening more from them on what they need.)

For the time being, I think adding it to the log_connections line is a
good change, so +1.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-06-22 Thread Euler Taveira
2018-06-22 12:06 GMT-03:00 Robert Haas :
> On Wed, Jun 20, 2018 at 10:19 AM, Euler Taveira  wrote:
>> 2018-06-20 4:30 GMT-03:00 Haribabu Kommi :
>>> Attached is a simple patch with implementation. Comments?
>>>
>> Why don't you extend the existing function pg_stat_statements_reset()?
>
> Well, the existing function doesn't take any arguments.  We could add
> an additional version of it that takes an argument, or we could
> replace the existing version with one that has an optional argument.
> But are either of those things any better than just adding a new
> function with a different name, like
> pg_stat_statements_reset_statement()?
>
From the user's POV, overloading is a good goal. It is better to
remember one function name than 3 different function names to do the
same task (reset statement statistics).

> I have not had such good experiences with function overloading, either
> in PostgreSQL or elsewhere, that I'm ready to say reusing the same
> name is definitely the right approach.  For example, suppose we
> eventually end up with a function that resets all the statements, a
> function that resets just one statement, a function that resets all
> statements for one user, and a function that resets all statements
> where the statement text matches a certain regexp.  If those all have
> separate names, everything is fine.  If they all have the same name,
> there's no way that's not confusing.
>
I think part of your frustration with overloading is because there are
so many arguments and/or argument type combinations to remember.
Frankly, I never remember the order / type of arguments when a
function has more than 2 arguments (unless I use that function a lot).
If we want to consider overloading, I think we should put all
possibilities for that function on the table and decide whether to
overload it or not. Bad overloading decisions tend to be very
frustrating for the user. In this case, all possibilities (queryid,
user, regex, database, statement regex) can be summarized as (i) 0
arguments, which means all statements, and (ii) 1 argument (queryid is
unique for all statements) -- because queryid can be obtained using
SELECT pg_stat_statements_reset(queryid) FROM pg_stat_statements WHERE
my-condition-here.


-- 
   Euler Taveira   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-22 Thread Peter Geoghegan
On Fri, Jun 22, 2018 at 4:24 AM, Andrey V. Lepikhov
 wrote:
> According to your feedback, I developed a second version of the patch.
> In this version:
> 1. The high-level functions index_beginscan() and index_rescan() are not
> used. Tree descent is done by _bt_search(); _bt_binsrch() is used for
> positioning on the page.
> 2. A TID list was introduced in the amtargetdelete() interface. Now only
> one tree descent is needed to delete all TIDs from the list with an equal
> scan key value -- the logical-duplicates deletion problem.
> 3. Only one WAL record for index tuple deletion per leaf page per
> amtargetdelete() call.

Cool.

What is this "race" code about?

> +   buffer = ReadBufferExtended(rel, MAIN_FORKNUM, 
> ItemPointerGetBlockNumber(tid), RBM_NORMAL, NULL);
> +   LockBuffer(buffer, BUFFER_LOCK_SHARE);
> +
> +   page = (Page) BufferGetPage(buffer);
> +   offnum = ItemPointerGetOffsetNumber(tid);
> +   lp = PageGetItemId(page, offnum);
> +
> +   /*
> +* VACUUM Races: someone already remove the tuple from HEAP. Ignore it.
> +*/
> +   if (!ItemIdIsUsed(lp))
> +   return NULL;

It looks wrong -- why should another process have set the item as
unused? And if we assume that that's possible at all, what's to stop a
third process from actually reusing the item pointer before we arrive
(at get_tuple_by_tid()), leaving us to find a tuple that is totally
unrelated to the original tuple to be deleted?

(Also, you're not releasing the buffer lock before you return.)

> 4. VACUUM can sort the TID list beforehand for a quicker search of duplicates.

This version of the patch prevents my own "force unique keys" patch
from working, since you're not using my proposed new
_bt_search()/_bt_binsrch()/_bt_compare() interface (you're not passing
them a heap TID). It is essential that your patch be able to quickly
reach any tuple that it needs to kill. Otherwise, the worst case
performance is totally unacceptable; it will never be okay to go
through 10%+ of the index to kill a single tuple hundreds or even
thousands of times per VACUUM. It seems to me that doing this
tid_list_search() binary search is pointless -- you should instead be
relying on logical duplicates using their heap TID as a tie-breaker.
Rather than doing a binary search within tid_list_search(), you should
instead match the presorted heap TIDs at the leaf level against the
sorted in-memory TID list. You know, a bit like a merge join.
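
For illustration, here is a minimal sketch of the merge-style matching
being described, using simplified stand-in types rather than the real
nbtree structures (so this is just the shape of the idea, not the
patch's code; mark_dead() is a hypothetical callback):

#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an index tuple's heap TID (block, offset). */
typedef struct SimpleTid
{
    uint32_t    block;
    uint16_t    offset;
} SimpleTid;

static int
tid_cmp(const SimpleTid *a, const SimpleTid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Walk two sorted TID arrays in lockstep, merge-join style: "leaf" holds
 * the heap TIDs of the index tuples on a leaf page in heap-TID order,
 * "targets" is the sorted to-delete list.  mark_dead() would mark the
 * matching leaf item for deletion.
 */
static void
merge_match_tids(const SimpleTid *leaf, size_t nleaf,
                 const SimpleTid *targets, size_t ntargets,
                 void (*mark_dead)(size_t leaf_index))
{
    size_t      i = 0,
                j = 0;

    while (i < nleaf && j < ntargets)
    {
        int         c = tid_cmp(&leaf[i], &targets[j]);

        if (c == 0)
        {
            mark_dead(i);       /* found a victim on this leaf page */
            i++;
            j++;
        }
        else if (c < 0)
            i++;                /* leaf tuple is not in the delete list */
        else
            j++;                /* target is not on this page; try the next */
    }
}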

I suggest that you go even further than this: I think that you should
just start distributing my patch as part of your patch series. You can
change my code if you need to. I also suggest using "git format-patch"
with simple, short commit messages to produce patches. This makes it a
lot easier to track the version of the patch, changes over time, etc.

I understand why you'd hesitate to take ownership of my code (it has
big problems!), but the reality is that all the problems that my patch
has are also problems for your patch. One patch cannot get committed
without the other, so they are already the same project. As a bonus,
my patch will probably improve the best case performance for your
patch, since multi-deletions will now have much better locality of
access.

-- 
Peter Geoghegan



Re: [PATCH] Include application_name in "connection authorized" log message

2018-06-22 Thread Andres Freund
On 2018-06-22 15:26:06 -0400, Alvaro Herrera wrote:
> On 2018-Jun-22, Andres Freund wrote:
> 
> > On 2018-06-22 12:16:18 -0400, Robert Haas wrote:
> 
> > > OK, that makes more sense, but I'm still skeptical of adding a special
> > > case particularly for application_name.
> > 
> > I think a fair argument could be made that you'd want to have
> > application_name logged exactly once, not in every line. Just to cope
> > with log volume. With decent log analysis tools once is enough.
> 
> Seems harder than it sounds ... because if the user turns off
> log_connections then it's no longer in the log.

That's superuser only, so I really don't quite buy that argument.


> And what about the application changing it after the fact?

I don't think that matters in a lot of scenarios. Poolers are the big
exception to that, obviously.


> One idea would be to have a log line designed specifically to be
> printed once at connection start (if not log_connections) and then
> once immediately after it changes.  Am I the only one for whom this
> sounds like overengineering?

Yea. I think on balance, I don't buy that it's worth the cost. But I
don't think it's a clear cut "you don't need this".

Greetings,

Andres Freund



Re: [PATCH] Include application_name in "connection authorized" log message

2018-06-22 Thread Alvaro Herrera
On 2018-Jun-22, Andres Freund wrote:

> On 2018-06-22 12:16:18 -0400, Robert Haas wrote:

> > OK, that makes more sense, but I'm still skeptical of adding a special
> > case particularly for application_name.
> 
> I think a fair argument could be made that you'd want to have
> application_name logged exactly once, not in every line. Just to cope
> with log volume. With decent log analysis tools once is enough.

Seems harder than it sounds ... because if the user turns off
log_connections then it's no longer in the log.  And what about the
application changing it after the fact?  One idea would be to have a log
line designed specifically to be printed once at connection start (if
not log_connections) and then once immediately after it changes.  Am I
the only one for whom this sounds like overengineering?

I think the idea is nice, but I'm not sure about feasibility.

I further think that the idea in the OP is sound enough.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Incorrect errno used with %m for backend code

2018-06-22 Thread Andres Freund
Hi,

On 2018-06-22 11:15:53 -0400, Alvaro Herrera wrote:
> I don't understand the logic in RestoreSlotFromDisk that fsyncs the file
> prior to reading it.  What are we protecting against?

we could previously have crashed before fsyncing. But we can only rely
on slot contents being durable if we fsync them. Therefore we fsync
before reading, after a crash.


> Anyway, while I think it would be nice to have CloseTransientFile
> restore the original errno if there was one and close goes well, it
> probably unduly complicates its API contract.

Yea, agreed.

Greetings,

Andres Freund



Re: [PATCH] Include application_name in "connection authorized" log message

2018-06-22 Thread Andres Freund
On 2018-06-22 12:16:18 -0400, Robert Haas wrote:
> On Fri, Jun 22, 2018 at 11:17 AM, Stephen Frost  wrote:
> > The issue here is exactly that at the point where we emit the
> > 'connection authorized' message, we haven't processed generic GUCs from
> > the startup packet yet and therefore application_name isn't set as a
> > GUC and, as a result, isn't included in the 'connection authorized'
> > message, even if it's specified in log_line_prefix.
> >
> > There's no way, today, to get the application name included in the
> > 'connection authorized' message, which certainly seems unfortunate and a
> > bit surprising, hence this patch to fix that.
> 
> OK, that makes more sense, but I'm still skeptical of adding a special
> case particularly for application_name.

I think a fair argument could be made that you'd want to have
application_name logged exactly once, not in every line. Just to cope
with log volume. With decent log analysis tools once is enough.

Greetings,

Andres Freund



Re: I'd like to discuss scaleout at PGCon

2018-06-22 Thread Joshua D. Drake

On 06/22/2018 11:28 AM, Merlin Moncure wrote:


Key features from my perspective:
*) fdw in parallel.  how do i do it today? ghetto implemented parallel
queries with asynchronous dblink

*) column store


Although not in core, we do have this as an extension through Citus,
don't we?


JD

--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
* Unless otherwise stated, opinions are my own.   *




Re: I'd like to discuss scaleout at PGCon

2018-06-22 Thread Merlin Moncure
On Fri, Jun 22, 2018 at 12:34 PM Bruce Momjian  wrote:
>
> On Fri, Jun  1, 2018 at 11:29:43AM -0500, Merlin Moncure wrote:
> > FWIW, Distributed analytical queries is the right market to be in.
> > This is the field in which I work, and this is where the action is at.
> > I am very, very, sure about this.  My view is that many of the
> > existing solutions to this problem (in particular hadoop class
> > soltuions) have major architectural downsides that make them
> > inappropriate in use cases that postgres really shines at; direct
> > hookups to low latency applications for example.  postgres is
> > fundamentally a more capable 'node' with its multiple man-millennia of
> > engineering behind it.  Unlimited vertical scaling (RAC etc) is
> > interesting too, but this is not the way the market is moving as
> > hardware advancements have reduced or eliminated the need for that in
> > many spheres.
> >
> > The direction of the project is sound and we are on the cusp of the
> > point where multiple independent coalescing features (FDW, logical
> > replication, parallel query, executor enhancements) will open new
> > scaling avenues that will not require trading off the many other
> > benefits of SQL that competing contemporary solutions might.  The
> > broader development market is starting to realize this and that is a
> > major driver of the recent upswing in popularity.  This is benefiting
> > me tremendously personally due to having gone 'all-in' with postgres
> > almost 20 years ago :-D. (Time sure flies)These are truly
> > wonderful times for the community.
>
> While I am glad people know a lot about how other projects handle
> sharding, these can be only guides to how Postgres will handle such
> workloads.  I think we need to get to a point where we have all of the
> minimal sharding-specific code features done, at least as
> proof-of-concept, and then test Postgres with various workloads like
> OLTP/OLAP and read-write/read-only.  This will tell us where
> sharding-specific code will have the greatest impact.
>
> What we don't want to do is to add a bunch of sharding-specific code
> without knowing which workloads it benefits, and how many of our users
> will actually use sharding.  Some projects have done that, and it
> didn't end well since they then had a lot of product complexity with
> little user value.

Key features from my perspective:
*) fdw in parallel.  how do i do it today? ghetto implemented parallel
queries with asynchronous dblink

*) column store

*) automatic partition management through shards

probably some more, gotta run :-)

merlin



Re: libpq compression

2018-06-22 Thread Robbie Harwood
Konstantin Knizhnik  writes:

> On 22.06.2018 18:59, Robbie Harwood wrote:
>> Konstantin Knizhnik  writes:
>>> On 21.06.2018 20:14, Robbie Harwood wrote:
 Konstantin Knizhnik  writes:
> On 21.06.2018 17:56, Robbie Harwood wrote:
>> Konstantin Knizhnik  writes:
>>> On 20.06.2018 23:34, Robbie Harwood wrote:
 Konstantin Knizhnik  writes:

> My idea was the following: the client wants to use compression, but
> the server may reject this attempt (for any reason: it doesn't
> support it, has no proper compression library, doesn't want to
> spend CPU on decompression, ...). Right now the compression
> algorithm is hardcoded, but in the future the client and server may
> negotiate to choose a proper compression protocol.  This is why
> I prefer to perform negotiation between client and server to
> enable compression.

 Well, for negotiation you could put the name of the algorithm
 you want in the startup.  It doesn't have to be a boolean for
 compression, and then you don't need an additional round-trip.
>>>
>>> Sorry, I can only repeat the arguments I already mentioned:
>>>
>>> - in the future it may be possible to specify the compression algorithm
>>>
>>> - even with a boolean compression option, the server may have some
>>> reasons to reject the client's request to use compression
>>>
>>> Extra flexibility is always a good thing if it doesn't cost too
>>> much, and an extra round of negotiation when enabling
>>> compression seems to me not too high a price for it.
>>
>> You already have this flexibility even without negotiation.  I
>> don't want you to lose your flexibility.  Protocol looks like
>> this:
>>
>> - Client sends connection option "compression" with list of
>> algorithms it wants to use (comma-separated, or something).
>>
>> - The first packet the server sends back is compressed with one of those
>> algorithms (or none, if it doesn't want to turn on compression).
>>
>> No additional round-trips needed.
>
> This is exactly how it works now...  The client includes the compression
> option in the connection string and the server replies with a special
> message ('Z') if it accepts the request to compress traffic between
> this client and server.

 No, it's not.  You don't need this message.  If the server receives
 a compression request, it should just turn compression on (or not),
 and then have the client figure out whether it got compression
 back.
>>>
>>> How will it manage to do that? It receives some reply, and first of
>>> all it should know whether it has to be decompressed or not.
>>
>> You can tell whether a message is compressed by looking at it.  The
>> way the protocol works, every message has a type associated with it:
>> a single byte, like 'R', that says what kind of message it is.
>
> A compressed message can contain any sequence of bytes, including 'R' :)

Then tag your messages with a type byte.  Or do it the other way around
- look for the zstd framing within a message.
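
For illustration, here is a minimal sketch of the kind of framing being
suggested -- a one-byte tag plus a 4-byte network-order length in front
of each compressed chunk. The function name and the 'Z' tag are
placeholders for illustration, not the actual libpq protocol or the
patch:

#include <arpa/inet.h>      /* htonl */
#include <stdint.h>
#include <string.h>

/*
 * Illustrative only: wrap one compressed chunk in a frame of
 * [1-byte tag][4-byte network-order payload length][payload].
 * The caller must supply an output buffer of at least 5 + payload_len bytes.
 */
static size_t
frame_compressed_chunk(uint8_t *out, const uint8_t *payload, uint32_t payload_len)
{
    uint32_t    netlen = htonl(payload_len);

    out[0] = 'Z';                               /* placeholder tag: compressed chunk */
    memcpy(out + 1, &netlen, sizeof(netlen));   /* exact payload length */
    memcpy(out + 5, payload, payload_len);      /* compressed bytes */
    return 5 + (size_t) payload_len;
}

With a frame like this, the receiver knows exactly how many bytes it
must have buffered before handing anything to the decompressor, so it
never has to guess whether a partial read is decompressible.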

Please, try to work with me on this instead of fighting every design
change.

Thanks,
--Robbie




Re: I'd like to discuss scaleout at PGCon

2018-06-22 Thread Bruce Momjian
On Fri, Jun  1, 2018 at 11:29:43AM -0500, Merlin Moncure wrote:
> FWIW, Distributed analytical queries is the right market to be in.
> This is the field in which I work, and this is where the action is at.
> I am very, very, sure about this.  My view is that many of the
> existing solutions to this problem (in particular hadoop class
> soltuions) have major architectural downsides that make them
> inappropriate in use cases that postgres really shines at; direct
> hookups to low latency applications for example.  postgres is
> fundamentally a more capable 'node' with its multiple man-millennia of
> engineering behind it.  Unlimited vertical scaling (RAC etc) is
> interesting too, but this is not the way the market is moving as
> hardware advancements have reduced or eliminated the need for that in
> many spheres.
> 
> The direction of the project is sound and we are on the cusp of the
> point where multiple independent coalescing features (FDW, logical
> replication, parallel query, executor enhancements) will open new
> scaling avenues that will not require trading off the many other
> benefits of SQL that competing contemporary solutions might.  The
> broader development market is starting to realize this and that is a
> major driver of the recent upswing in popularity.  This is benefiting
> me tremendously personally due to having gone 'all-in' with postgres
> almost 20 years ago :-D. (Time sure flies)These are truly
> wonderful times for the community.

I am coming in late, but I am glad we are having this conversation.  We
have made great strides toward sharding while adding minimal
sharding-specific code.  We can now see a time when we will complete the
the minimal sharding-specific code tasks.  Once we reach that point, we
will need to decide what sharding-specific code to add, and to do that,
we need to understand which direction to go in, and to do that, we need
to know the trade-offs.

While I am glad people know a lot about how other projects handle
sharding, these can be only guides to how Postgres will handle such
workloads.  I think we need to get to a point where we have all of the
minimal sharding-specific code features done, at least as
proof-of-concept, and then test Postgres with various workloads like
OLTP/OLAP and read-write/read-only.  This will tell us where
sharding-specific code will have the greatest impact.

What we don't want to do is to add a bunch of sharding-specific code
without knowing which workloads it benefits, and how many of our users
will actually use sharding.  Some projects have done that, and it
didn't end well since they then had a lot of product complexity with
little user value.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Concurrency bug in UPDATE of partition-key

2018-06-22 Thread Amit Khandekar
On 20 June 2018 at 18:54, Amit Kapila  wrote:
> On Tue, Jun 19, 2018 at 9:33 PM, Amit Khandekar  
> wrote:
>>
>> Could not add RAISE statement, because the isolation test does not
>> seem to print those statements in the output. Instead, I have dumped
>> some rows in a new table through the trigger function, and did a
>> select from that table. Attached is v3 patch.
>>
>
> Comments
> -
> 1.
> + /*
> + * If the tuple was concurrently updated and the caller of this
> + * function requested for the updated tuple, skip the trigger
> + * execution.
> + */
> + if (newSlot != NULL && epqslot != NULL)
> + return false;
>
> There is a possibility of memory leak due to above change.  Basically,
> GetTupleForTrigger returns a newly allocated tuple.  We should free
> the triggertuple by calling heap_freetuple(trigtuple).

Right, done.

>
> 2.
> ExecBRDeleteTriggers(..)
> {
> ..
> + /* If requested, pass back the concurrently updated tuple, if any. */
> + if (epqslot != NULL)
> + *epqslot = newSlot;
> +
>   if (trigtuple == NULL)
>   return false;
> +
> + /*
> + * If the tuple was concurrently updated and the caller of this
> + * function requested for the updated tuple, skip the trigger
> + * execution.
> + */
> + if (newSlot != NULL && epqslot != NULL)
> + return false;
> ..
> }
>
> Can't we merge the first change into second, something like:
>
> if (newSlot != NULL && epqslot != NULL)
> {
> *epqslot = newSlot;
> return false;
> }

We want to update *epqslot with whatever the value of newSlot is. So
if newSlot is NULL, *epqslot should also be set to NULL. So we can't do
the assignment inside the "if (newSlot != NULL && epqslot != NULL)"
condition.

>
> 3.
> ExecBRDeleteTriggers(..)
> {
> - TupleTableSlot *newSlot;
>   int i;
>
>   Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
>   if (fdw_trigtuple == NULL)
>   {
> + TupleTableSlot *newSlot;
> +
> ..
> }
>
> This change is okay on its own, but I think it has nothing to do with
> this patch. If you don't mind, can we remove it?

Done.

>
> 4.
> +/* --
> + * ExecBRDeleteTriggers()
> + *
> + * Called to execute BEFORE ROW DELETE triggers.
> + *
> + * Returns false in following scenarios :
> + * 1. Trigger function returned NULL.
> + * 2. The tuple was concurrently deleted OR it was concurrently updated and 
> the
> + * new tuple didn't pass EvalPlanQual() test.
> + * 3. The tuple was concurrently updated and the new tuple passed the
> + * EvalPlanQual() test, BUT epqslot parameter is not NULL. In such case, this
> + * function skips the trigger function execution, because the caller has
> + * indicated that it wants to further process the updated tuple. The updated
> + * tuple slot is passed back through epqslot.
> + *
> + * In all other cases, returns true.
> + * --
> + */
>  bool
>  ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
>
> The above comments appear unrelated to this function; for example, this
> function is not at all aware of concurrent updates.

But with the patch, this function does become aware of concurrent
updates, because it makes use of the epqslot returned by
GetTupleForTrigger. This is similar to ExecBRUpdateTriggers() being
aware of concurrent updates.

The reason I made the return-value-related comment is that earlier
it was simple: if the trigger function returned NULL, this function
returned false. But now that has changed, so it's better if the change
is noted in the function header. And on HEAD, there is no function
header, so we need to explain when exactly this function returns true
or false.

> I think if you want to add comments to this function, we can add them as a 
> separate
> patch or if you want to add them as part of this patch at least make
> them succinct.

If it looks complicated, can you please suggest something to make it a
bit simpler.

>
> 5.
> + /*
> + * If this is part of an UPDATE of partition-key, the
> + * epq tuple will contain the changes from this
> + * transaction over and above the updates done by the
> + * other transaction. The caller should now use this
> + * tuple as its NEW tuple, rather than the earlier NEW
> + * tuple.
> + */
>
> I think we can simplify this comment as "If requested, pass back the
> updated tuple if any.", something similar to what you have at some
> other place in the patch.

Ok, Done. Also changed the subsequent comment to :
/* Normal DELETE: So delete the updated row. */


>
> 6.
> +# Session A is moving a row into another partition, but is waiting for
> +# another session B that is updating the original row. The row that ends up
> +# in the new partition should contain the changes made by session B.
>
> You have named sessions as s1 and s2, but description uses A and B, it
> will be better to use s1 and s2 respectively.

Done.

Attached is v4 version of the patch. Thanks !

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


fix_concurrency_bug_v4.patch
Description: Binary data


Re: [PATCH] Include application_name in "connection authorized" log message

2018-06-22 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Fri, Jun 22, 2018 at 11:17 AM, Stephen Frost  wrote:
> > The issue here is exactly that at the point where we emit the
> > 'connection authorized' message, we haven't processed generic GUCs from
> > the startup packet yet and therefore application_name isn't set as a
> > GUC and, as a result, isn't included in the 'connection authorized'
> > message, even if it's specified in log_line_prefix.
> >
> > There's no way, today, to get the application name included in the
> > 'connection authorized' message, which certainly seems unfortunate and a
> > bit surprising, hence this patch to fix that.
> 
> OK, that makes more sense, but I'm still skeptical of adding a special
> case particularly for application_name.

I'd argue that application_name, when available, makes as much sense to
have in the connection authorization message as the other hard-coded
values (user, database), and those actually do get set and included in
the log_line_prefix when connection authorized is logged, if they're
asked for.  There certainly wasn't too much concern raised over adding
the SSL information to that same message, nor complaints from what I
recall.

I did also look at the other items available through log_line_prefix and
didn't see anything else that really felt like it should be included.

Thanks!

Stephen




Re: libpq compression

2018-06-22 Thread Konstantin Knizhnik




On 22.06.2018 19:05, Nico Williams wrote:

On Fri, Jun 22, 2018 at 10:18:12AM +0300, Konstantin Knizhnik wrote:

On 22.06.2018 00:34, Nico Williams wrote:

So I think you just have to have lengths.

Now, this being about compression, I understand that you might now want
to have 4-byte lengths, especially given that most messages will be
under 8KB.  So use a varint encoding for the lengths.

No explicit framing and lengths are needed when using streaming
compression.
There can certainly be some kind of framing inside the compression format
itself, but that is intrinsic to the compression algorithm.

I don't think that's generally true.  It may be true of the compression
algorithm you're working with.  This is fine, of course, but plugging in
other compression algorithms will require the authors to add framing.


If the compression algorithm supports streaming mode (and most of them 
do), then you should not worry about frames.
And if a compression algorithm doesn't support streaming mode, then it 
should not be used for libpq traffic compression.
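
For reference, a minimal sketch of the explicit-framing approach described
above (not code from either patch; the helper name is made up): each chunk
is prefixed with a 4-byte network-order length, so the receiver always
knows how many compressed bytes to expect before it tries to decompress.
The varint refinement is omitted for brevity.

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper: frame one compressed chunk as
 * [4-byte network-order length][payload].  'out' must have room for
 * len + 4 bytes; the return value is the total number of bytes written.
 */
static size_t
frame_chunk(const char *payload, uint32_t len, char *out)
{
    uint32_t    net_len = htonl(len);

    memcpy(out, &net_len, sizeof(net_len));
    memcpy(out + sizeof(net_len), payload, len);
    return sizeof(net_len) + len;
}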


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: libpq compression

2018-06-22 Thread Konstantin Knizhnik




On 22.06.2018 18:59, Robbie Harwood wrote:

Konstantin Knizhnik  writes:


On 21.06.2018 20:14, Robbie Harwood wrote:

Konstantin Knizhnik  writes:

On 21.06.2018 17:56, Robbie Harwood wrote:

Konstantin Knizhnik  writes:

On 20.06.2018 23:34, Robbie Harwood wrote:

Konstantin Knizhnik  writes:

Well, that's a design decision you've made.  You could put
lengths on chunks that are sent out - then you'd know exactly how
much is needed.  (For instance, 4 bytes of network-order length
followed by a complete payload.)  Then you'd absolutely know
whether you have enough to decompress or not.

Do you really suggest sending an extra header for each chunk of data?
Please note that a chunk can be as small as one message, a dozen bytes or
so, because libpq is used for client-server communication with a
request-reply pattern.

I want you to think critically about your design.  I *really* don't
want to design it for you - I have enough stuff to be doing.  But
again, the design I gave you doesn't necessarily need that - you
just need to properly buffer incomplete data.

Right now secure_read may return any number of available bytes. But
when streaming compression is used, it can happen that the available
number of bytes is not enough to perform decompression. This is why
we may need to try to fetch an additional portion of data. This is how
zpq_stream works now.

No, you need to buffer and wait until you're called again.  Which is
to say: decompress() shouldn't call secure_read().  secure_read()
should call decompress().

In this case I will have to implement this code twice: both for backend
and frontend, i.e. for secure_read/secure_write and
pqsecure_read/pqsecure_write.

Likely, yes.  You can see that this is how TLS does it (which you should
be using as a model, architecture-wise).


Frankly speaking, I was very upset by the design of the libpq communication
layer in Postgres: there are two different implementations of almost
the same stuff for backend and frontend.

Changing the codebases so that more could be shared is not necessarily a
bad idea; however, it is a separate change from compression.


I do not understand how it could be implemented differently, or what is
wrong with the current implementation.

The most succinct thing I can say is: absolutely don't modify
pq_recvbuf().  I gave you pseudocode for how to do that.  All of your
logic should be *below* the secure_read/secure_write functions.

I cannot approve code that modifies pq_recvbuf() in the manner you
currently do.
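
To make that layering concrete, here is a rough sketch (an illustration
only, not code from either patch; decompress_some() stands in for a real
streaming decompressor): the transport-level read owns the staging buffer
and calls into the decompressor, never the other way around.

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Placeholder for a streaming decompressor: here it just copies bytes,
 * but a real one would consume a partial compressed frame and may
 * produce zero output bytes until more input arrives.
 */
static size_t
decompress_some(char *raw, size_t *raw_len, char *out, size_t out_cap)
{
    size_t      n = (*raw_len < out_cap) ? *raw_len : out_cap;

    memcpy(out, raw, n);
    memmove(raw, raw + n, *raw_len - n);        /* keep unconsumed input */
    *raw_len -= n;
    return n;
}

/*
 * "secure_read() calls decompress()": the transport layer owns the
 * buffering; the decompressor never touches the socket.
 */
static ssize_t
secure_read_sketch(int sock, char *plain_out, size_t plain_cap,
                   char *raw_buf, size_t *raw_len, size_t raw_cap)
{
    ssize_t     n = recv(sock, raw_buf + *raw_len, raw_cap - *raw_len, 0);

    if (n <= 0)
        return n;               /* error or EOF: caller handles it */
    *raw_len += (size_t) n;
    return (ssize_t) decompress_some(raw_buf, raw_len, plain_out, plain_cap);
}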

Well, I understand your arguments.  But please also consider my
argument above (about avoiding code duplication).

In any case, the secure_read function is called only from pq_recvbuf(),
just as pqsecure_read is called only from pqReadData.  So I do not see a
fundamental difference between handling compression in secure_read and in
pq_recvbuf, and I do not understand why it is "destroying the existing
model".

Frankly speaking, I would really like to destroy the existing model,
moving all system-dependent stuff in Postgres to a SAL and avoiding this
awful mix of code sharing and duplication between backend and frontend.
But that is another story and I do not want to discuss it here.

I understand you want to avoid code duplication.  I will absolutely
agree that the current setup makes it difficult to share code between
postmaster and libpq clients.  But the way I see it, you have two
choices:

1. Modify the code to make code sharing easier.  Once this has been
done, *then* build a compression patch on top, with the nice new
architecture.

2. Leave the architecture as-is and add compression support.
(Optionally, you can make code sharing easier at a later point.)

Fundamentally, I think you value code sharing more than I do.  So while
I might advocate for (2), you might personally prefer (1).

But right now you're not doing either of those.


If we are speaking about the "right design", then neither your
suggestion nor my implementation is good, and I do not see fundamental
differences between them.

The right approach is to use the "decorator pattern": this is how streams
are designed in .NET/Java.  You can construct a pipeline of "secure",
"compressed", and whatever other streams.

I *strongly* disagree, but I don't think you're seriously suggesting
this.
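
For readers unfamiliar with the pattern being referred to, a minimal
sketch (purely illustrative, not proposed code): a stream is a read
callback plus opaque state, and a decompressing stream simply wraps
whatever stream sits underneath it.

#include <stddef.h>
#include <sys/types.h>

/* A stream is just a read callback plus opaque state. */
typedef struct Stream Stream;
struct Stream
{
    ssize_t     (*read) (Stream *self, char *buf, size_t len);
    void       *state;          /* socket fd, TLS context, zlib state, ... */
};

/* A decompressing stream decorates ("wraps") any underlying stream. */
typedef struct
{
    Stream      base;           /* exposes the same read interface */
    Stream     *inner;          /* e.g. a TLS stream over a socket stream */
} DecompressStream;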


My idea was the following: the client wants to use compression, but the
server may reject this attempt (for any reason: it doesn't support it,
has no proper compression library, does not want to spend CPU on
decompression, ...).  Right now the compression algorithm is hardcoded,
but in the future client and server may negotiate to choose a proper
compression protocol.  This is why I prefer to perform negotiation
between client and server to enable compression.  Well, for negotiation
you could put the name of the algorithm you want in the startup packet.
It doesn't have to be a boolean for compression, and then you don't need
an additional round-trip.

Sorry, I can only repeat arguments I already mentioned:
- in future it may be 

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-06-22 Thread Matheus de Oliveira
Hello again...

On Tue, Jun 19, 2018 at 12:53 PM, Andres Freund  wrote:

> ...
>
> After that, yes, deleting the
> global/pg_internal.init file is the way to go, and I can't think of a
> case where it's problematic, even without stopping the server.
>
>
Just to let you know: I applied the workaround on the affected server and
it seems to have worked; so far no errors.

Thank you a lot for all the information and help.

Best regards,
-- 
Matheus de Oliveira


Re: Threat models for DB cryptography (Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key) Management Service (KMS)

2018-06-22 Thread Nico Williams
On Fri, Jun 22, 2018 at 05:31:44AM +, Tsunakawa, Takayuki wrote:
> From: Nico Williams [mailto:n...@cryptonector.com]
> > Let's start with a set of threat models then.  I'll go first:
> 
> Thank you so much for summarizing the current situation.  I'd
> appreciate it if you could write this on the PostgreSQL wiki, when the
> discussion has settled somehow.

Sure, that's a good idea.

> >  - local adversaries (addressed by standard OS user process isolation)
> 
> Does this also mean that we don't have to worry about the following?
> 
> * unencrypted data in the server process memory and core files
> * passwords in .pgpass and recovery.conf (someone familiar with PCI
>   DSS audit said this is a problem)
> * user data in server logs

Short of using things like Intel SGX or homomorphic encryption, I don't
think we can do anything about plaintext in memory -- at some point it
has to be there, therefore it is as vulnerable as the host OS makes it.

Users can always run only the one postgres instance and nothing else
(and no other non-admin users) on the host to reduce local attack
surface to zero.

So, yes, I think this flavor of local vulnerability should be out of
scope for PG.

> > One shortcoming of relying on OS functionality for protection against
> > malicious storage is that not all OSes may provide such functionality.
> > This could be an argument for implementing full, transparent encryption
> > for an entire DB in the postgres server.  Not a very compelling
> > argument, but that's just my opinion -- reasonable people could differ
> > on this.
> 
> Yes, this is one reason I developed TDE in our product.  And
> in-database encryption allows optimization by encrypting only user
> data.

I understand this motivation.  I wouldn't reject this out of hand, even
though I'm not exactly interested either.

Can you keep the impact on the codebase isolated and limited, and the
performance impact when disabled to zero?

> > So I think for (3) the best answer is to just not have that problem:
> > just reduce and audit admin access.
> > 
> > Still, if anyone wants to cover (3), I argue that PG gives you
> > everything you need right now: FDW and pgcrypto.  Just build a
> > solution where you have a PG server proxy that acts as a smart
> > client to untrusted servers:
> 
> Does sepgsql help?

Any functionality in PG that allows DBAs to manage storage, sessions,
..., without having to see table data will help.  It doesn't have to be
tied to trusted OS functionality.

I've not looked at SEP [0] so I don't know if it helps.  I would prefer
that PG simply have native functionality to allow this sort of
separation -- as I'm not a DBA, I don't really know if PG has this.

[0] https://wiki.postgresql.org/wiki/SEPostgreSQL_SELinux_Overview
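
As a concrete sketch of the FDW-plus-pgcrypto proxy idea quoted above
(illustrative only; the server, table, and key names are invented), the
trusted instance keeps the key and ships only ciphertext to the untrusted
server through postgres_fdw:

CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER untrusted_srv FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'untrusted.example', dbname 'store');
CREATE USER MAPPING FOR CURRENT_USER SERVER untrusted_srv
    OPTIONS (user 'app', password 'remote-password');

CREATE FOREIGN TABLE remote_notes (id int, body_enc bytea)
    SERVER untrusted_srv OPTIONS (table_name 'notes');

-- Writes encrypt locally; the remote side only ever sees bytea.
INSERT INTO remote_notes
    SELECT 1, pgp_sym_encrypt('secret text', 'key-held-only-on-proxy');

-- Reads decrypt locally on the trusted proxy.
SELECT id, pgp_sym_decrypt(body_enc, 'key-held-only-on-proxy')
  FROM remote_notes;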

> Should a malfunctioning or buggy application be considered as a
> threat?  That's what sql_firewall extension addresses.

I suppose so, yes.

Nico
-- 



Re: [PATCH] Include application_name in "connection authorized" log message

2018-06-22 Thread Robert Haas
On Fri, Jun 22, 2018 at 11:17 AM, Stephen Frost  wrote:
> The issue here is exactly that at the point where we emit the
> 'connection authorized' message, we haven't processed generic GUCs from
> the startup packet yet and therefore application_name isn't set as a
> GUC and, as a result, isn't included in the 'connection authorized'
> message, even if it's specified in log_line_prefix.
>
> There's no way, today, to get the application name included in the
> 'connection authorized' message, which certainly seems unfortunate and a
> bit surprising, hence this patch to fix that.

OK, that makes more sense, but I'm still skeptical of adding a special
case particularly for application_name.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: libpq compression

2018-06-22 Thread Nico Williams
On Fri, Jun 22, 2018 at 10:18:12AM +0300, Konstantin Knizhnik wrote:
> On 22.06.2018 00:34, Nico Williams wrote:
> >So I think you just have to have lengths.
> >
> >Now, this being about compression, I understand that you might now want
> >to have 4-byte lengths, especially given that most messages will be
> >under 8KB.  So use a varint encoding for the lengths.
>
> No explicit framing and lengths are needed in case of using streaming
> compression.
> There can be certainly some kind of frames inside compression protocol
> itself, but it is intrinsic of compression algorithm.

I don't think that's generally true.  It may be true of the compression
algorithm you're working with.  This is fine, of course, but plugging in
other compression algorithms will require the authors to add framing.



Re: libpq compression

2018-06-22 Thread Robbie Harwood
Konstantin Knizhnik  writes:

> On 21.06.2018 20:14, Robbie Harwood wrote:
>> Konstantin Knizhnik  writes:
>>> On 21.06.2018 17:56, Robbie Harwood wrote:
 Konstantin Knizhnik  writes:
> On 20.06.2018 23:34, Robbie Harwood wrote:
>> Konstantin Knizhnik  writes:
>>
>> Well, that's a design decision you've made.  You could put
>> lengths on chunks that are sent out - then you'd know exactly how
>> much is needed.  (For instance, 4 bytes of network-order length
>> followed by a complete payload.)  Then you'd absolutely know
>> whether you have enough to decompress or not.
>
> Do you really suggest to send extra header for each chunk of data?
> Please notice that chunk can be as small as one message: dozen of
> bytes because libpq is used for client-server communication with
> request-reply pattern.

 I want you to think critically about your design.  I *really* don't
 want to design it for you - I have enough stuff to be doing.  But
 again, the design I gave you doesn't necessarily need that - you
 just need to properly buffer incomplete data.
>>>
>>> Right now secure_read may return any number of available bytes. But
>>> in case of using streaming compression, it can happen that available
>>> number of bytes is not enough to perform decompression. This is why
>>> we may need to try to fetch additional portion of data. This is how
>>> zpq_stream is working now.
>>
>> No, you need to buffer and wait until you're called again.  Which is
>> to say: decompress() shouldn't call secure_read().  secure_read()
>> should call decompress().
>
> In this case I will have to implement this code twice: both for backend
> and frontend, i.e. for secure_read/secure_write and
> pqsecure_read/pqsecure_write.

Likely, yes.  You can see that this is how TLS does it (which you should
be using as a model, architecture-wise).

> Frankly speaking, I was very upset by the design of the libpq communication
> layer in Postgres: there are two different implementations of almost
> the same stuff for backend and frontend.

Changing the codebases so that more could be shared is not necessarily a
bad idea; however, it is a separate change from compression.

>>> I do not understand how it is possible to implement in different way
>>> and what is wrong with current implementation.
>>
>> The most succinct thing I can say is: absolutely don't modify
>> pq_recvbuf().  I gave you pseudocode for how to do that.  All of your
>> logic should be *below* the secure_read/secure_write functions.
>>
>> I cannot approve code that modifies pq_recvbuf() in the manner you
>> currently do.
>
> Well. I understand you arguments.  But please also consider my
> argument above (about avoiding code duplication).
>
> In any case, secure_read function is called only from pq_recvbuf() as
> well as pqsecure_read is called only from pqReadData.  So I do not
> think principle difference in handling compression in secure_read or
> pq_recvbuf functions and do not understand why it is "destroying the
> existing model".
>
> Frankly speaking, I would really like to destroy the existing model, moving
> all system dependent stuff in Postgres to SAL and avoid this awful mix
> of code sharing and duplication between backend and frontend. But it
> is a another story and I do not want to discuss it here.

I understand you want to avoid code duplication.  I will absolutely
agree that the current setup makes it difficult to share code between
postmaster and libpq clients.  But the way I see it, you have two
choices:

1. Modify the code to make code sharing easier.  Once this has been
   done, *then* build a compression patch on top, with the nice new
   architecture.

2. Leave the architecture as-is and add compression support.
   (Optionally, you can make code sharing easier at a later point.)

Fundamentally, I think you value code sharing more than I do.  So while
I might advocate for (2), you might personally prefer (1).

But right now you're not doing either of those.

> If we are speaking about the "right design", then neither your
> suggestion, neither my implementation are good and I do not see
> principle differences between them.
>
> The right approach is using "decorator pattern": this is how streams
> are designed in .Net/Java. You can construct pipe of "secure",
> "compressed" and whatever else streams.

I *strongly* disagree, but I don't think you're seriously suggesting
this.

>> My idea was the following: client want to use compression. But
>> server may reject this attempt (for any reasons: it doesn't support
>> it, has no proper compression library, do not want to spend CPU for
>> decompression,...) Right now compression algorithm is
>> hardcoded. But in future client and server may negotiate to choose
>> proper compression protocol.  This is why I prefer to perform
>> negotiation between client and server to enable compression.  Well,
>> for negotiation you could put the name 

Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-06-22 Thread Rui DeSousa
Why not just parameterize it with the three key fields: userid, dbid, and 
queryid?

i.e., it would then allow the reset to be limited to only the records 
associated with a given user and/or database as well.

pg_stat_statements_reset(dbid oid, userid oid, queryid bigint)

pg_stat_statements_reset(null, null, 3569076157)— all for a given query
pg_stat_statements_reset(6384, null, 3569076157)  
pg_stat_statements_reset(null, 16429, 3569076157)
pg_stat_statements_reset(6384, 6384, 3569076157)
pg_stat_statements_reset(6384, null, null) — all for a given database.
.
.
.  
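
Spelled out, that could look roughly like the following (an illustrative
sketch of an extension-script definition, not a committed one); passing
NULL, or omitting a trailing argument, would mean "no filter on this key":

CREATE FUNCTION pg_stat_statements_reset(
    dbid    oid    DEFAULT NULL,
    userid  oid    DEFAULT NULL,
    queryid bigint DEFAULT NULL)
RETURNS void
AS 'MODULE_PATHNAME', 'pg_stat_statements_reset'
LANGUAGE C;

-- e.g. reset everything recorded for one query, regardless of user/database:
SELECT pg_stat_statements_reset(NULL, NULL, 3569076157);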

>> pg_stat_statements_reset()

> On Jun 22, 2018, at 11:06 AM, Robert Haas  wrote:
> 
> On Wed, Jun 20, 2018 at 10:19 AM, Euler Taveira  wrote:
>> 2018-06-20 4:30 GMT-03:00 Haribabu Kommi :
>>> Attached is a simple patch with implementation. Comments?
>>> 
>> Why don't you extend the existing function pg_stat_statements_reset()?
> 
> Well, the existing function doesn't take any arguments.  We could add
> an additional version of it that takes an argument, or we could
> replace the existing version with one that has an optional argument.
> But are either of those things any better than just adding a new
> function with a different name, like
> pg_stat_statements_reset_statement()?
> 
> I have not had such good experiences with function overloading, either
> in PostgreSQL or elsewhere, that I'm ready to say reusing the same
> name is definitely the right approach.  For example, suppose we
> eventually end up with a function that resets all the statements, a
> function that resets just one statement, a function that resets all
> statements for one user, and a function that resets all statements
> where the statement text matches a certain regexp.  If those all have
> separate names, everything is fine.  If they all have the same name,
> there's no way that's not confusing.
> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 




Re: PATCH: backtraces for error messages

2018-06-22 Thread Korry Douglas
A few misc comments:

In general, +1.

It might be nice to move the stack trace code into a separate function (not
static to elog.c) - I have often found it useful to attach backtraces to
objects so I can debug complex allocation/deallocation problems.

The major expense in capturing a stack trace is in the symbolization of the
stack addresses, not the stack walk itself.  You typically want to
symbolize the addresses at the time you generate the trace, but that's not
always the case.  For example, if you want to record the stack trace of
every call to AllocSetAlloc() (and attach the trace to the AllocChunk)
performance gets unbearable if you symbolize every trace.  So a flag that
specifies whether to symbolize would be nice too.

If you don't symbolize, you need a way to capture the address range of each
dynamically loaded shared object (so that you can later symbolize using
something like addr2line).

(The above suggestions relate to server debugging, not application
debugging).

End of wish list...
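
As a rough illustration of that split (a generic glibc sketch using the
standard execinfo.h calls, not PostgreSQL code): the stack walk itself is
cheap, and symbolization can be deferred or skipped entirely, leaving raw
addresses for addr2line later.

#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define MAX_FRAMES 64

/*
 * Capture the current stack.  The walk is cheap; symbolization
 * (backtrace_symbols) is the expensive part, so it is optional here.
 */
static void
log_stack(bool symbolize)
{
    void       *frames[MAX_FRAMES];
    int         nframes = backtrace(frames, MAX_FRAMES);

    if (!symbolize)
    {
        /* Raw addresses only; resolve later with addr2line if needed. */
        for (int i = 0; i < nframes; i++)
            fprintf(stderr, "frame %d: %p\n", i, frames[i]);
        return;
    }

    char      **syms = backtrace_symbols(frames, nframes);

    if (syms != NULL)
    {
        for (int i = 0; i < nframes; i++)
            fprintf(stderr, "frame %d: %s\n", i, syms[i]);
        free(syms);
    }
}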

 -- Korry

On Fri, Jun 22, 2018 at 3:07 AM, Craig Ringer  wrote:

> On 22 June 2018 at 08:48, Andres Freund  wrote:
>
>> On 2018-06-22 08:24:45 +0800, Craig Ringer wrote:
>> > On Thu., 21 Jun. 2018, 19:26 Pavan Deolasee, 
>> > wrote:
>> >
>> > >
>> > >
>> > > On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier <
>> mich...@paquier.xyz>
>> > > wrote:
>> > >
>> > >> On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
>> > >> > I wrote it because I got sick of Assert(false) debugging, and I was
>> > >> chasing
>> > >> > down some "ERROR:  08P01: insufficient data left in message"
>> errors.
>> > >> Then I
>> > >> > got kind of caught up in it... you know how it is.
>> > >>
>> > >> Yes, I know that feeling!  I have been using as well the
>> Assert(false)
>> > >> and the upgrade-to-PANIC tricks a couple of times, so being able to
>> get
>> > >> more easily backtraces would be really nice.
>> > >>
>> > >>
>> > > Sometime back I'd suggested an idea to be able to dynamically manage
>> log
>> > > levels for elog messages [1].
>> > >
>> >
>> >
>> > Huge +1 from me on being able to selectively manage logging on a
>> > module/subsystem, file, or line level.
>> >
>> > I don't think I saw the post.
>> >
>> > Such a thing would obviously make built in backtrace support much more
>> > useful.
>>
>> I strongly suggest keeping these as separate as possible. Either is
>> useful without the other, and both are nontrivial. Tackling them
>> together imo makes it much more likely to get nowhere.
>>
>>
> Totally agree, and it's why I raised this as its own thing.
>
> Thanks.
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [PATCH] Include application_name in "connection authorized" log message

2018-06-22 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Jun 20, 2018 at 3:45 PM, Stephen Frost  wrote:
> > * Don Seiler (d...@seiler.us) wrote:
> >> In trying to troubleshoot the source of a recent connection storm, I was
> >> frustrated to find that the app name was not included in the connection
> >> messages. It is there in the log_line_prefix on the disconnection messages
> >> but I would prefer it be directly visible with the connection itself. With
> >> some guidance from Stephen Frost I've put together this patch which does
> >> that.
> >
> > Yeah, I tend to agree that it'd be extremely useful to have this
> > included in the 'connection authorized' message.
> 
> I don't get it.  It seems like a bad idea to me to copy information
> that can already be logged using log_line_prefix into the message
> itself.  If we start doing that, we'll end up with duplicated
> information all over the place, and even worse, it won't be
> consistent, because different people will want different things
> duplicated into different messages.
> 
> Am I missing something?

The issue here is exactly that at the point where we emit the
'connection authorized' message, we haven't processed generic GUCs from
the startup packet yet and therefore application_name isn't set as a
GUC and, as a result, isn't included in the 'connection authorized'
message, even if it's specified in log_line_prefix.

There's no way, today, to get the application name included in the
'connection authorized' message, which certainly seems unfortunate and a
bit surprising, hence this patch to fix that.

Thanks!

Stephen




Re: Incorrect errno used with %m for backend code

2018-06-22 Thread Alvaro Herrera
On 2018-Jun-22, Michael Paquier wrote:

> A couple of places use CloseTransientFile without bothering much that
> this can overwrite errno.  I was tempted in keeping errno saved and kept
> if set to a non-zero value, but refrained from doing so as some callers
> may rely on the existing behavior, and the attached shows better the
> intention. 

I wondered for a bit if it would be a better idea to have
CloseTransientFile restore the existing errno if there is any and close
works fine -- but then I noticed that the function itself says that the
caller should check errno for close() errors.  Most callers seem to do it
correctly, but a few fail to check the return value.

Quite a few places open files O_RDONLY, so the lack of error checking after
closing seems OK.  (Unless there's some funny interaction with the fsync
fiasco, of course.)

A bunch of other places use CloseTransientFile while closing shop after
some other syscall failure, which is what your patch fixes.  I didn't
review those; checking for close failure seems pointless.

In some cases, we fsync the file and check that return value, then close
the file and do *not* check CloseTransientFile's return value --
examples are CheckPointLogicalRewriteHeap, heap_xlog_logical_rewrite,
SnapBuildSerialize, SaveSlotToPath, durable_rename.  I don't know if
it's reasonable for close() to fail after successfully fsyncing a file;
maybe this is not a problem.  I would patch those nonetheless.

be_lo_export() is certainly missing a check: it writes and closes,
without intervening fsync.

I don't understand the logic in RestoreSlotFromDisk that fsyncs the file
prior to reading it.  What are we protecting against?


Anyway, while I think it would be nice to have CloseTransientFile
restore the original errno if there was one and close goes well, it
probably unduly complicates its API contract.
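
For reference, the pattern under discussion looks roughly like this (an
illustrative backend-style fragment, not a quote from the patch; it assumes
fd, buf, len, and path are in scope):

    if (write(fd, buf, len) != len)
    {
        int     save_errno = errno;

        /* Don't let the cleanup close() clobber the errno we want to report. */
        CloseTransientFile(fd);

        /* if write didn't set errno, assume the problem is no disk space */
        errno = save_errno ? save_errno : ENOSPC;
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write file \"%s\": %m", path)));
    }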

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Include application_name in "connection authorized" log message

2018-06-22 Thread Robert Haas
On Wed, Jun 20, 2018 at 3:45 PM, Stephen Frost  wrote:
> * Don Seiler (d...@seiler.us) wrote:
>> In trying to troubleshoot the source of a recent connection storm, I was
>> frustrated to find that the app name was not included in the connection
>> messages. It is there in the log_line_prefix on the disconnection messages
>> but I would prefer it be directly visible with the connection itself. With
>> some guidance from Stephen Frost I've put together this patch which does
>> that.
>
> Yeah, I tend to agree that it'd be extremely useful to have this
> included in the 'connection authorized' message.

I don't get it.  It seems like a bad idea to me to copy information
that can already be logged using log_line_prefix into the message
itself.  If we start doing that, we'll end up with duplicated
information all over the place, and even worse, it won't be
consistent, because different people will want different things
duplicated into different messages.

Am I missing something?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-06-22 Thread Robert Haas
On Tue, Jun 19, 2018 at 8:46 AM, Etsuro Fujita
 wrote:
> Here is a patch for that.
>
> * As I said upthread, the patch makes code much more simple; I removed all
> the changes to setrefs.c added by the partitionwise-join patch.  I also
> simplified the logic for building a tlist for a child-join rel. The original
> PWJ computes attr_needed data even for child rels, and build the tlist for a
> child-join by passing to build_joinrel_tlist that data for input child rels
> for the child-join.  But I think that's redundant, and it's more
> straightforward to apply adjust_appendrel_attrs to the parent-join's tlist
> to get the child-join's tlist.  So, I changed that way, which made
> unnecessary all the changes to build_joinrel_tlist and placeholder.c added
> by the PWJ patch, so I removed those as well.
>
> * The patch contains all of the regression tests in the original patch
> proposed by Ashutosh.

I think this approach is going to run into trouble if the level at
which we have to apply the ConvertRowTypeExpr happens not to be a
projection-capable node.  And, in general, it seems to me that we want
to produce the right outputs at the lowest possible level of the plan
tree.  For instance, suppose that one of the relations being scanned
is not parallel-safe but the others are.  Then, we could end up with a
plan like this:

Append
-> Seq Scan on temp_rela
-> Gather
  -> Parallel Seq Scan on thing1
-> Gather
  -> Parallel Seq Scan on thing2

If a projection is required to convert the row type expression, we
certainly want that to get pushed below the Gather, not to happen at
the Gather level itself.  We certainly don't want it to happen at the
Append level, which can't even handle it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Fix some error handling for read() and errno

2018-06-22 Thread Alvaro Herrera
On 2018-Jun-22, Robert Haas wrote:

> I think this should be split into two patches.  Fixing failure to save
> and restore errno is a different issue from cleaning up short write
> messages.  I agree that the former should be back-patched and the
> latter not.

Hmm, yeah, good thought, +1.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Psql patch to show access methods info

2018-06-22 Thread Sergey Cherkashin
Hello!

There is a command in psql to list access methods, but there is no fast
way to look at detailed info about them. So here is a patch with new
commands:

\dAp [PATTERN]   list access methods with properties (Table
pg_am)
\dAf[+]  [AMPTRN [OPFPTRN]]  list operator families of access method. +
prints owner of operator family. (Table pg_opfamily) 
\dAfp[AMPTRN [OPFPTRN]]  list procedures of operator family related
to access method (Table pg_amproc)
\dAfo[AMPTRN [OPFPTRN]]  list operators of family related to access
method (Table pg_amop)
\dAoc[+] [AMPTRN [OPCPTRN]]  list operator classes of index access
methods. + prints owner of operator class. (Table pg_opclass)
\dip[S]  [PATTERN]   list indexes with properties (Table
pg_class)
\dicp[S] [IDXNAME [COLNAME]] show index column properties (Table
pg_class)

You can display information only for the access methods specified by a
pattern. You can also filter by operator class, operator family, or the
name of the indexed column.

I also have a question about testing the commands \dAf+ and \dAoc+: is it
a good idea to test them by changing the owner of one operator family or
class to a newly created one, checking the output, and restoring the owner
back? Or should we create a new opclass or opfamily with the proper owner?
Or maybe it is not necessary to test these commands at all?

Best regards,
Sergey Cherkashin
s.cherkas...@postgrespro.ru

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 3ed9021..b699548 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -675,7 +675,7 @@
search and ordering purposes.)
   
 
-  
+  
pg_amop Columns
 

@@ -818,7 +818,7 @@
is one row for each support procedure belonging to an operator family.
   
 
-  
+  
pg_amproc Columns
 

@@ -4458,7 +4458,7 @@ SCRAM-SHA-256$iteration count:
Operator classes are described at length in .
   
 
-  
+  
pg_opclass Columns
 

@@ -4720,7 +4720,7 @@ SCRAM-SHA-256$iteration count:
Operator families are described at length in .
   
 
-  
+  
pg_opfamily Columns
 

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 10b9795..b5d2095 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1203,6 +1203,95 @@ testdb=
 
 
   
+  
+  
+
+  \dAf
+  [ access-method-pattern 
+[operator-family-pattern]]
+  
+
+
+
+
+Lists operator families (). If access-method-pattern is specified, only
+families whose access method name matches the pattern are shown.
+If operator-family-pattern is specified, only
+operator families whose name matches the pattern are shown.
+If + is appended to the command name, each operator
+family is listed with its owner.
+
+
+  
+
+  
+
+  \dAfo
+[access-method-pattern
+  [operator-family-pattern]]
+  
+
+
+
+
+Lists operators () associated with access method operator families.
+If access-method-pattern is specified,
+only operators associated with access methods whose name matches the pattern are shown.
+If operator-family-pattern is specified, only
+operators associated with families whose name matches the pattern are shown.
+
+
+  
+
+  
+
+  \dAfp
+[access-method-pattern
+  [operator-family-pattern]]
+  
+
+
+
Lists procedures () associated with access method operator families.
+If access-method-pattern is specified,
+only procedures associated with access methods whose name matches the pattern are shown.
+If operator-family-pattern is specified, only
+procedures associated with families whose name matches the pattern are shown.
+
+
+  
+
+  
+
+  \dAop
+[access-method-pattern
+  [operator-class-pattern]]
+  
+
+
+
+Shows index access method operator classes listed in .
+If access-method-pattern is specified,
+only operator classes associated with access methods whose name matches the pattern are shown.
+If operator-class-pattern is specified, only
+operator classes whose name matches the pattern are shown.
+
+
+  
+
+
+  
+\dAp [ pattern ]
+
+
+
+Shows access method properties listed in . If pattern is specified, only access
+methods whose names match the pattern are shown.
+
+
+  
 
   
 \db[+] [ pattern ]
@@ -1351,6 +1440,35 @@ testdb=
 
   
 
+  
+\dip [ pattern ]
+
+
+
+Shows index properties listed in . If pattern 

Re: PSA: --enable-coverage interferes with parallel query scheduling

2018-06-22 Thread Robert Haas
On Thu, Jun 21, 2018 at 4:00 PM, Tom Lane  wrote:
> We could probably fix it by using a significantly larger test case,
> but that's not very attractive to put into the regression tests.
> Anybody have a better idea about how to improve this?  Or even a
> clear explanation for what's causing it?  (I'd expect coverage
> instrumentation to impose costs at process exit, not startup.)

I don't know what's causing this to happen, but what jumps out at me
is that worker 3 is the one that eats all of the rows, rather than,
say, worker 0, or the leader.  Normally what happens in parallel query
-- pretty much by design -- is that the processes that are started
earlier get going before the ones that are started later, and they
finish gobbling up all the input before the others finish
initializing.  But here the last process that started was the only one
that got to do any work.  That seems mighty odd.  Why should the
leader get descheduled like that?  And all the workers, too?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-22 Thread Robert Haas
On Fri, Jun 22, 2018 at 2:26 AM, Rajkumar Raghuwanshi
 wrote:
> I have applied patch and checked reported issue. Patch applied cleanly and
> issues not reproducible any more.

Committed, with a few changes:

- Renamed found_partially_grouped_child to partial_grouping_valid.
The old name seemed to me to be a poor choice, because it sounds from
the name like it gets set to true whenever we've found at least one
partially grouped child, whereas really it gets set to false whenever
we've failed to find at least one partially grouped child.  The new
name is also more like the names in add_paths_to_append_rel.

- Modified the wording of the comment.

- Omitted the new assertion in add_paths_to_append_rel.  I doubt
whether that's correct.  I don't see any obvious reason why
live_childrels can't be NIL there, and further down I see this:

/*
 * If we found unparameterized paths for all children, build
an unordered,
 * unparameterized Append path for the rel.  (Note: this is correct even
 * if we have zero or one live subpath due to constraint exclusion.)
 */

If it's not possible to have no live_childrels, then that comment is
highly suspect.

Also, even if this assertion *is* correct, I think it needs a better
comment explaining why it's correct, because there isn't anything
obvious in set_append_rel_pathlist that keeps IS_DUMMY_REL() from
being true for every child.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Fast default stuff versus pg_upgrade

2018-06-22 Thread Andrew Dunstan




On 06/21/2018 04:41 PM, Tom Lane wrote:

Andrew Dunstan  writes:

I left that for a separate exercise. I think this does things the way
you want. I must admit to being a bit surprised, I was expecting you to
have more to say about the upgrade function than the pg_dump changes :-)

Well, the upgrade function is certainly a bit ugly (I'm generally allergic
to patches that result in a large increase in the number of #includes in
a file --- it suggests that the code was put in an inappropriate place).
But I don't see a good way to make it better, at least not without heavy
refactoring of StoreAttrDefault so that some code could be shared.

I think this is probably OK now, except for one thing: you're likely
to have issues if a dropped column has an attmissingval, because often
the value won't agree with the bogus "integer" type we use for those;
worse, the array might refer to a type that no longer exists.
One way to improve that is to change

   "CASE WHEN a.atthasmissing THEN a.attmissingval "
   "ELSE null END AS attmissingval "

to

   "CASE WHEN a.atthasmissing AND NOT a.attisdropped 
THEN a.attmissingval "
   "ELSE null END AS attmissingval "

However, this might be another reason to reconsider the decision to use
anyarray: it could easily lead to situations where harmless-looking
queries like "select * from pg_attribute" fail.  Or maybe we should tweak
DROP COLUMN so that it forcibly resets atthasmissing/attmissingval along
with setting atthasdropped, so that the case can't arise.

Like Andres, I haven't actually tested, just eyeballed it.





I moved the bulk of the code into a function in heap.c where it seemed 
right at home.


I added removal of missing values from dropped columns, as well as 
qualifying the test as above.


The indent issue raised by Alvaro is also fixed. I included your patch 
fixing the regression tests.


cheers

andrew

--
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: "wal receiver" process hang in syslog() while exiting after receiving SIGTERM while the postgres has been promoted.

2018-06-22 Thread Robert Haas
On Thu, Jun 21, 2018 at 1:11 AM, Chen, Yan-Jack (NSB - CN/Hangzhou)
 wrote:
> Hi Hackers,
>   We encountered a problem while trying to promote a standby
> postgres (version 9.6.9) server to active. From the trace (we triggered a
> process abort) we can see the process was hung in syslog() while handling
> SIGTERM. According to the article below, it looks risky to write to syslog
> in a signal handler. Any idea how to avoid it?

Huh.  I thought that Andres had removed all of this kind of stuff back
in 675f55e1d9bcb9da4323556b456583624a07,
4f85fde8eb860f263384fffdca660e16e77c7f76, and
387da18874afa17156ee3af63766f17efb53c4b9, and related commits, but
rereading the commit message I see that it wasn't that ambitious.
Probably a similar approach would make sense here, though.
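
The usual shape of such a fix (a generic sketch, not the actual commits
referenced above): keep the handler async-signal-safe by only setting a
flag, and do the syslog()/logging work from the normal code path.

#include <signal.h>
#include <syslog.h>

static volatile sig_atomic_t got_sigterm = 0;

static void
handle_sigterm(int signo)
{
    (void) signo;
    got_sigterm = 1;            /* no syslog(), malloc(), etc. in here */
}

/* ... later, in the main loop ... */
    if (got_sigterm)
    {
        syslog(LOG_INFO, "terminating walreceiver");   /* safe outside the handler */
        /* proceed with a normal, orderly shutdown */
    }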

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Fix some error handling for read() and errno

2018-06-22 Thread Robert Haas
On Thu, Jun 21, 2018 at 2:32 AM, Michael Paquier  wrote:
> Sure.  I have looked at that and I am finishing with the updated version
> attached.
>
> I removed the portion about slru in the code, but I would like to add at
> least a mention about it in the commit message.  The simplifications are
> also applied for the control file messages you changed as well in
> cfb758b.  Also in ReadControlFile(), we use "could not open control
> file", so for consistency this should be replaced with a more simple
> message.  I noticed as well that the 2PC file handling loses errno a
> couple of times, and that we had better keep the messages also
> consistent if we do the simplification move...  relmapper.c also gains
> simplifications.
>
> All the incorrect errno handling I found should be back-patched in my
> opinion as we did so in the past with 2a11b188 for example.  I am not
> really willing to bother about the error message improvements on
> anything else than HEAD (not v11, but v12), but...  Opinions about all
> that?

I think this should be split into two patches.  Fixing failure to save
and restore errno is a different issue from cleaning up short write
messages.  I agree that the former should be back-patched and the
latter not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Keeping temporary tables in shared buffers

2018-06-22 Thread Robert Haas
On Mon, May 28, 2018 at 4:25 AM, Amit Kapila  wrote:
> On Fri, May 25, 2018 at 6:33 AM, Asim Praveen  wrote:
>> We are evaluating the use of shared buffers for temporary tables.  The
>> advantage being queries involving temporary tables can make use of parallel
>> workers.
> This is one way, but I think there are other choices as well.  We can
> identify and flush all the dirty (local) buffers for the relation
> being accessed parallelly.  Now, once the parallel operation is
> started, we won't allow performing any write operation on them.  It
> could be expensive if we have a lot of dirty local buffers for a
> particular relation.  I think if we are worried about the cost of
> writes, then we can try some different way to parallelize temporary
> table scan.  At the beginning of the scan, leader backend will
> remember the dirty blocks present in local buffers, it can then share
> the list with parallel workers which will skip scanning those blocks
> and in the end leader ensures that all those blocks will be scanned by
> the leader.  This shouldn't incur a much additional cost as the
> skipped blocks should be present in local buffers of backend.

This sounds awkward and limiting.  How about using DSA to allocate
space for the backend's temporary buffers and a dshash for lookups?
Then all backends can share, but we don't have to go through
shared_buffers.

Honestly, I don't see how pushing this through the main shared_buffers
arena is ever going to work.  Widening the buffer tags by another 4
bytes is a non-starter, I think.  Maybe with some large rejiggering we
could get to a point where every relation, temporary or permanent, can
be identified by a single OID, regardless of tablespace, etc.  But if
you wait for that to happen this might not happen for a long time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-22 Thread Andrey V. Lepikhov

Hi,
According to your feedback, I have developed a second version of the patch.
In this version:
1. The high-level functions index_beginscan() and index_rescan() are not 
used. Tree descent is done by _bt_search(); _bt_binsrch() is used for 
positioning on the page.
2. A TID list was introduced in the amtargetdelete() interface. Now only 
one tree descent is needed to delete all TIDs from the list with an equal 
scan key value (the logical-duplicates deletion problem).
3. Only one WAL record is written for index tuple deletion per leaf page 
per amtargetdelete() call.

4. VACUUM can sort the TID list beforehand to find duplicates more quickly.

The background worker will come later.

On 19.06.2018 22:38, Peter Geoghegan wrote:

On Tue, Jun 19, 2018 at 2:33 AM, Masahiko Sawada  wrote:

I think that we do the partial lazy vacuum using visibility map even
now. That does heap pruning, index tuple killing but doesn't advance
relfrozenxid.


Right, that's what I was thinking. Opportunistic HOT pruning isn't
like vacuuming because it doesn't touch indexes. This patch adds an
alternative strategy for conventional lazy vacuum that is also able to
run a page at a time if needed. Perhaps page-at-a-time operation could
later be used for doing something that is opportunistic in the same
way that pruning is opportunistic, but it's too early to worry about
that.


Since this patch adds an ability to delete small amount
of index tuples quickly, what I'd like to do with this patch is to
invoke autovacuum more frequently, and do the target index deletion or
the index bulk-deletion depending on amount of garbage and index size
etc. That is, it might be better if lazy vacuum scans heap in ordinary
way and then plans and decides a method of index deletion based on
costs similar to what query planning does.


That seems to be what Andrey wants to do, though right now the
prototype patch actually just always uses its alternative strategy
while doing any kind of lazy vacuuming (some simple costing code is
commented out right now). It shouldn't be too hard to add some costing
to it. Once we do that, and once we polish the patch some more, we can
do performance testing. Maybe that alone will be enough to make the
patch worth committing; "opportunistic microvacuuming" can come later,
if at all.



--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6b2b9e3..96f1d47 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -126,6 +126,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e95fbbc..a0e06bd 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -103,6 +103,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = brinbuild;
 	amroutine->ambuildempty = brinbuildempty;
 	amroutine->aminsert = brininsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 0a32182..acf14e7 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -58,6 +58,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = ginbuild;
 	amroutine->ambuildempty = ginbuildempty;
 	amroutine->aminsert = gininsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = ginbulkdelete;
 	amroutine->amvacuumcleanup = ginvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42eff..d7a53d2 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -80,6 +80,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = gistbuild;
 	amroutine->ambuildempty = gistbuildempty;
 	amroutine->aminsert = gistinsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = gistbulkdelete;
 	amroutine->amvacuumcleanup = gistvacuumcleanup;
 	amroutine->amcanreturn = gistcanreturn;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0002df3..5fb32d6 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -76,6 +76,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = hashbuild;
 	amroutine->ambuildempty = hashbuildempty;
 	amroutine->aminsert = hashinsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = hashbulkdelete;
 	amroutine->amvacuumcleanup = hashvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 22b5cc9..9ebeb78 100644
--- 

Re: Wrong cost estimation for foreign tables join with use_remote_estimate disabled

2018-06-22 Thread Konstantin Knizhnik




On 22.06.2018 13:30, Ashutosh Bapat wrote:

On Fri, Jun 22, 2018 at 11:56 AM, Konstantin Knizhnik
 wrote:


On 21.06.2018 20:08, Tom Lane wrote:

Konstantin Knizhnik  writes:

The following very simple test reproduces the problem with wrong cost
estimation:
create foreign table t1_fdw(x integer, y integer) server pg_fdw options
(table_name 't1', use_remote_estimate 'false');
create foreign table t2_fdw(x integer) server pg_fdw options (table_name
't2', use_remote_estimate 'false');
It is possible to force Postgres to use correct plan by setting
"fdw_startup_cost" to some very large value (1 for example).
...
Also correct plan is used when use_remote_estimate is true.

If you are unhappy about the results with use_remote_estimate off, don't
run it that way.  The optimizer does not have a crystal ball.


As I wrote, use_remote_estimate can not be used because in this case query
compilation time is unacceptable (10 seconds, while time of query execution
itself is ~200msec).
So the problem can be addressed in two ways:

1. Try to reduce time of remote estimation. I wonder why postgres_fdw sends
so much queries to remote server. For join of two tables there are 7
queries.
I suspect that for ~20 joined tables in the original query number of calls
is more than hundred,  so on wonder that it takes so much time.
2. Try to make optimizer make better estimation of join cost based on local
statistic (please notice that ANALYZE is explicitly called for all foreign
tables and number of rows in the result was correctly calculated).


I think estimate_path_cost_size() is too pessimistic about how many
times the join conditions are evaluated (Sorry, I have written that
code when I was worked on join pushdown for postgres_fdw.)

 /* Estimate of number of rows in cross product */
 nrows = fpinfo_i->rows * fpinfo_o->rows;

and somewhere down in the code
run_cost += nrows * join_cost.per_tuple;

It assumes that the join conditions are run on the cross-product of
the joining tables. In reality that never happens for large tables. In
such cases the optimizer will choose either hash or merge join, which
will apply join conditions only on a small portion of cross-product.
But the reason it was written that way was the local server can not
estimate the fraction of cross product on which the join conditions
will be applied. May be we could assume that the join conditions will
be applied to only 1% of the cross product, i.e. run_cost +=
clamp_rows(nrows/100) * join_cost.per_tuple. With this change I think
the cost of remote plan will be less than local plan.

Here's a preview of blog, I am planning to publish soon, about this
issue at [1]. It has a bit more details.

[1] 
https://www.blogger.com/blogger.g?blogID=5253679863234367862#editor/target=post;postID=4019325618679658571;onPublishedMenu=allposts;onClosedMenu=allposts;postNum=0;src=postname

Yes, postgres_fdw is very conservative, or even pessimistic, regarding 
the cost of joins.
It really assumes that there will be a cross join with a filter applied, 
which is not true in most cases.
It's a pity, especially taking into account that, based on local statistics, 
it is able to correctly predict the number of rows in the result of the join.
So if we take this estimated number of retrieved rows into account in the 
calculation of the join cost, then the estimation is more correct and the 
right plan (with remote joins) is chosen:


diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 78b0f43..84b30ce 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2812,9 +2812,8 @@ estimate_path_cost_size(PlannerInfo *root,
 			 * 4. Run time cost of applying nonpushable other clauses locally
 			 * on the result fetched from the foreign server.
 			 */
-			run_cost = fpinfo_i->rel_total_cost - fpinfo_i->rel_startup_cost;
-			run_cost += fpinfo_o->rel_total_cost - fpinfo_o->rel_startup_cost;
-			run_cost += nrows * join_cost.per_tuple;
+			run_cost = (fpinfo_i->rel_total_cost - fpinfo_i->rel_startup_cost) * retrieved_rows / fpinfo_i->rows;
+			run_cost += (fpinfo_o->rel_total_cost - fpinfo_o->rel_startup_cost) * retrieved_rows / fpinfo_o->rows;
 			nrows = clamp_row_est(nrows * fpinfo->joinclause_sel);
 			run_cost += nrows * remote_conds_cost.per_tuple;
 			run_cost += fpinfo->local_conds_cost.per_tuple * retrieved_rows;



I also tried to do something with the first approach: speeding up remote 
estimation. My idea was to use local estimation whenever possible and to 
use remote estimation only for joins.
In the case of joining two tables this causes only one EXPLAIN request to 
be sent to the remote server. But for a larger number of joined tables the 
number of considered paths, and so the number of remote 

Re: Wrong cost estimation for foreign tables join with use_remote_estimate disabled

2018-06-22 Thread Ashutosh Bapat
Sorry here's preview link [1]

[1] 
https://ashutoshpg.blogspot.com/b/post-preview?token=TCTIKGQBAAA.2iKpIUItkwZLkXiujvs0zad-DtDdKbwIdRFCGbac9--XbqcA-xnCdz4wmbD4hIaEHuyg5Xrz8eZq8ZNmw83yfQ.HXi__guM-7SzdIWi27QkjA=4019325618679658571=POST

On Fri, Jun 22, 2018 at 4:00 PM, Ashutosh Bapat
 wrote:
> On Fri, Jun 22, 2018 at 11:56 AM, Konstantin Knizhnik
>  wrote:
>>
>>
>> On 21.06.2018 20:08, Tom Lane wrote:
>>>
>>> Konstantin Knizhnik  writes:

 The following very simple test reduce the problem with wrong cost
 estimation:
 create foreign table t1_fdw(x integer, y integer) server pg_fdw options
 (table_name 't1', use_remote_estimate 'false');
 create foreign table t2_fdw(x integer) server pg_fdw options (table_name
 't2', use_remote_estimate 'false');
 It is possible to force Postgres to use correct plan by setting
 "fdw_startup_cost" to some very large value (1 for example).
 ...
 Also correct plan is used when use_remote_estimate is true.
>>>
>>> If you are unhappy about the results with use_remote_estimate off, don't
>>> run it that way.  The optimizer does not have a crystal ball.
>>
>>
>> As I wrote, use_remote_estimate can not be used because in this case query
>> compilation time is unacceptable (10 seconds, while time of query execution
>> itself is ~200msec).
>> So the problem can be addressed in two ways:
>>
>> 1. Try to reduce time of remote estimation. I wonder why postgres_fdw sends
>> so much queries to remote server. For join of two tables there are 7
>> queries.
>> I suspect that for ~20 joined tables in the original query number of calls
>> is more than hundred,  so on wonder that it takes so much time.
>> 2. Try to make optimizer make better estimation of join cost based on local
>> statistic (please notice that ANALYZE is explicitly called for all foreign
>> tables and number of rows in the result was correctly calculated).
>>
>
> I think estimate_path_cost_size() is too pessimistic about how many
> times the join conditions are evaluated (Sorry, I have written that
> code when I was worked on join pushdown for postgres_fdw.)
>
> /* Estimate of number of rows in cross product */
> nrows = fpinfo_i->rows * fpinfo_o->rows;
>
> and somewhere down in the code
>run_cost += nrows * join_cost.per_tuple;
>
> It assumes that the join conditions are run on the cross-product of
> the joining tables. In reality that never happens for large tables. In
> such cases the optimizer will choose either hash or merge join, which
> will apply join conditions only on a small portion of cross-product.
> But the reason it was written that way was the local server can not
> estimate the fraction of cross product on which the join conditions
> will be applied. May be we could assume that the join conditions will
> be applied to only 1% of the cross product, i.e. run_cost +=
> clamp_rows(nrows/100) * join_cost.per_tuple. With this change I think
> the cost of remote plan will be less than local plan.
>
> Here's a preview of blog, I am planning to publish soon, about this
> issue at [1]. It has a bit more details.
>
> [1] 
> https://www.blogger.com/blogger.g?blogID=5253679863234367862#editor/target=post;postID=4019325618679658571;onPublishedMenu=allposts;onClosedMenu=allposts;postNum=0;src=postname
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Wrong cost estimation for foreign tables join with use_remote_estimate disabled

2018-06-22 Thread Ashutosh Bapat
On Fri, Jun 22, 2018 at 11:56 AM, Konstantin Knizhnik
 wrote:
>
>
> On 21.06.2018 20:08, Tom Lane wrote:
>>
>> Konstantin Knizhnik  writes:
>>>
>>> The following very simple test reduce the problem with wrong cost
>>> estimation:
>>> create foreign table t1_fdw(x integer, y integer) server pg_fdw options
>>> (table_name 't1', use_remote_estimate 'false');
>>> create foreign table t2_fdw(x integer) server pg_fdw options (table_name
>>> 't2', use_remote_estimate 'false');
>>> It is possible to force Postgres to use correct plan by setting
>>> "fdw_startup_cost" to some very large value (1 for example).
>>> ...
>>> Also correct plan is used when use_remote_estimate is true.
>>
>> If you are unhappy about the results with use_remote_estimate off, don't
>> run it that way.  The optimizer does not have a crystal ball.
>
>
> As I wrote, use_remote_estimate cannot be used because in this case the query
> compilation time is unacceptable (10 seconds, while the time of query execution
> itself is ~200msec).
> So the problem can be addressed in two ways:
>
> 1. Try to reduce the time of remote estimation. I wonder why postgres_fdw sends
> so many queries to the remote server. For a join of two tables there are 7
> queries.
> I suspect that for ~20 joined tables in the original query the number of calls
> is more than a hundred, so no wonder that it takes so much time.
> 2. Try to make the optimizer produce a better estimate of the join cost based on local
> statistics (please notice that ANALYZE is explicitly called for all foreign
> tables and the number of rows in the result was correctly calculated).
>

I think estimate_path_cost_size() is too pessimistic about how many
times the join conditions are evaluated (sorry, I wrote that
code when I worked on join pushdown for postgres_fdw.)

/* Estimate of number of rows in cross product */
nrows = fpinfo_i->rows * fpinfo_o->rows;

and somewhere down in the code
   run_cost += nrows * join_cost.per_tuple;

It assumes that the join conditions are run on the cross product of
the joining tables. In reality that never happens for large tables. In
such cases the optimizer will choose either a hash or a merge join, which
will apply the join conditions to only a small portion of the cross product.
But the reason it was written that way was that the local server cannot
estimate the fraction of the cross product to which the join conditions
will be applied. Maybe we could assume that the join conditions will
be applied to only 1% of the cross product, i.e. run_cost +=
clamp_rows(nrows/100) * join_cost.per_tuple. With this change I think
the cost of the remote plan will be less than that of the local plan.
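
For illustration, here is a self-contained toy program (not postgres_fdw code;
the row counts and the per-tuple cost are invented, and the clamp is a local
stand-in for clamp_row_est() in costsize.c) that shows how much the 1%
assumption changes the charged clause-evaluation cost:

#include <math.h>
#include <stdio.h>

static double
clamp_row_est(double nrows)
{
	/* same idea as costsize.c: at least one row, rounded to an integer */
	if (nrows <= 1.0)
		nrows = 1.0;
	else
		nrows = rint(nrows);
	return nrows;
}

int
main(void)
{
	double		rows_outer = 100000.0;	/* assumed outer-side row estimate */
	double		rows_inner = 100000.0;	/* assumed inner-side row estimate */
	double		per_tuple = 0.005;		/* assumed join_cost.per_tuple */
	double		nrows = rows_outer * rows_inner;	/* cross product */

	printf("clauses charged on full cross product: %.0f\n",
		   nrows * per_tuple);
	printf("clauses charged on 1%% of cross product: %.0f\n",
		   clamp_row_est(nrows / 100.0) * per_tuple);
	return 0;
}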

Here's a preview of the blog post about this issue that I am planning to
publish soon at [1]. It has a bit more detail.

[1] 
https://www.blogger.com/blogger.g?blogID=5253679863234367862#editor/target=post;postID=4019325618679658571;onPublishedMenu=allposts;onClosedMenu=allposts;postNum=0;src=postname

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Wrong cost estimation for foreign tables join with use_remote_estimate disabled

2018-06-22 Thread Etsuro Fujita

(2018/06/22 18:49), Amit Langote wrote:

On 2018/06/22 18:15, Etsuro Fujita wrote:

I'd vote for #2.  One idea for that is to introduce CREATE FOREIGN INDEX
to have information on remote indexes on the local side, which I proposed
before.  I have been putting it on hold since then, though.


Sorry to hijack this thread, but I'd like to say that CREATE FOREIGN INDEX
would be nice, as that would also let us lift certain restrictions on
partitioned table indexes [1].


Agreed.  I think that would be useful for supporting INSERT ... ON CONFLICT 
fully, not only on single foreign tables but also on partitioned tables 
containing foreign partitions.


Best regards,
Etsuro Fujita



Re: Incorrect errno used with %m for backend code

2018-06-22 Thread Ashutosh Sharma
On Fri, Jun 22, 2018 at 2:44 PM, Michael Paquier  wrote:
> On Fri, Jun 22, 2018 at 01:00:45PM +0530, Ashutosh Sharma wrote:
>> It seems like in the case of a few system calls, e.g. the write system call,
>> errno is not set even if the number of bytes written is smaller than
>> the number of bytes requested, and for such cases we explicitly set errno to
>> ENOSPC. Something like this,
>>
>> /*
>>  * if write didn't set errno, assume problem is no disk space
>>  */
>> errno = save_errno ? save_errno : ENOSPC;
>>
>> Shouldn't we do similar handling in your patch as well? Or please let
>> me know if I am missing something here.
>
> Yeah, I can see at least three of them in twophase.c.  Let's fix those
> as well at the same time.

Okay, thanks for the confirmation. A few of them are also there in the
origin.c and snapbuild.c files.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: [HACKERS] GUC for cleanup indexes threshold.

2018-06-22 Thread Alexander Korotkov
On Wed, Jun 20, 2018 at 12:00 PM Alexander Korotkov
 wrote:
> On Wed, Jun 20, 2018 at 8:32 AM Masahiko Sawada  wrote:
> > On Wed, Jun 20, 2018 at 1:00 AM, Alexander Korotkov
> > > Ok.  I've rephrased the comment a bit.  Also, you created an "index vacuum"
> > > subsection in the "resource usage" section.  I think it's not
> > > appropriate for this option to be in "resource usage".  Ideally it
> > > should be grouped with other vacuum options, but we don't have a single
> > > section for that.  The "Autovacuum" section is also not appropriate,
> > > because this GUC works not only for autovacuum.  So, the most semantically
> > > close options, which affect vacuum in general, are
> > > vacuum_freeze_min_age, vacuum_freeze_table_age,
> > > vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age,
> > > which are located in the "client connection defaults" section.  So, I
> > > decided to move this GUC into that section.  I also changed the section
> > > in the GUC definition (guc.c) from AUTOVACUUM to CLIENT_CONN_STATEMENT.
> >
> > Agreed. So should we move it to 19.11 Client Connection Defaults in the
> > docs as well? And I think it's better if this option is placed next to the
> > other vacuum options so it is easier to find. The attached patch changes them
> > accordingly. Please review it.
>
> Right, thank you.  Looks good to me.
> I'm going to commit this if no objections.

Pushed.

Regarding the maximum value for vacuum_cleanup_index_scale_factor: I
think introducing a special value with a "never cleanup" meaning would be
overkill for a post-feature-freeze enhancement.  So, I propose to just
increase the maximum value for both the GUC and the reloption.  See the attached
patch.  It also changes the calculations in _bt_vacuum_needs_cleanup() for
better handling of large values (just some kind of overflow paranoia).

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


vacuum_cleanup_index_scale_factor-max-2.patch
Description: Binary data


Re: Wrong cost estimation for foreign tables join with use_remote_estimate disabled

2018-06-22 Thread Amit Langote
On 2018/06/22 18:15, Etsuro Fujita wrote:
> I'd vote for #2.  One idea for that is to introduce CREATE FOREIGN INDEX
> to have information on remote indexes on the local side, which I proposed
> before.  I have been putting it on hold since then, though.

Sorry to hijack this thread, but I'd like to say that CREATE FOREIGN INDEX
would be nice, as that would also let us lift certain restrictions on
partitioned table indexes [1].

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4eaa5372754




Re: Possible bug in logical replication.

2018-06-22 Thread Michael Paquier
On Fri, Jun 22, 2018 at 04:33:12PM +0900, Kyotaro HORIGUCHI wrote:
> pg_advance_replication_slots can advance uninitialized physical
> slots and that might not be good. (Logical slots always have
> initial invalid values in the two lsn columns.)

The current logic is careful that users willing to move to a new
position cannot move into the future, but it was visibly intended to
accept past values.  Petr, what do you think?  KeepLogSeg() won't return
negative values so some applications may take advantage of that.  Or
should advancing be simply disabled for non-initialized slots?
--
Michael


signature.asc
Description: PGP signature


Re: add default parallel query to v10 release notes? (Re: [PERFORM] performance drop after upgrade (9.6 > 10))

2018-06-22 Thread Amit Kapila
On Wed, Jun 20, 2018 at 8:43 PM, Bruce Momjian  wrote:
> On Thu, May 24, 2018 at 08:00:25PM -0500, Justin Pryzby wrote:
>
> So I did some research on this, particularly to find out how it was
> missed in the PG 10 release notes.  It turns out that
> max_parallel_workers_per_gather has always defaulted to 2 in head, and
> this was changed to default to 0 in the 9.6 branch:
>
> commit f85b1a84152f7bf019fd7a2c5eede97867dcddbb
> Author: Robert Haas 
> Date:   Tue Aug 16 08:09:15 2016 -0400
>
> Disable parallel query by default.
>
> Per discussion, set the default value of 
> max_parallel_workers_per_gather
> to 0 in 9.6 only.  We'll leave it enabled in master so that it 
> gets
> more testing and in the hope that it can be enable by default in 
> v10.
>
> Therefore, there was no commit to find in the PG 10 commit logs.  :-O
> Not sure how we can avoid this kind of problem in the future.
>
> The attached patch adds a PG 10.0 release note item about this change.
>

Your proposed text looks good to me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Incorrect errno used with %m for backend code

2018-06-22 Thread Michael Paquier
On Fri, Jun 22, 2018 at 01:00:45PM +0530, Ashutosh Sharma wrote:
> It seems like in the case of a few system calls, e.g. the write system call,
> errno is not set even if the number of bytes written is smaller than
> the number of bytes requested, and for such cases we explicitly set errno to
> ENOSPC. Something like this,
> 
> /*
>  * if write didn't set errno, assume problem is no disk space
>  */
> errno = save_errno ? save_errno : ENOSPC;
> 
> Shouldn't we do similar handling in your patch as well? Or please let
> me know if I am missing something here.

Yeah, I can see at least three of them in twophase.c.  Let's fix those
as well at the same time.
--
Michael


signature.asc
Description: PGP signature


Re: Wrong cost estimation for foreign tables join with use_remote_estimate disabled

2018-06-22 Thread Etsuro Fujita

Hi Konstantin,

(2018/06/22 15:26), Konstantin Knizhnik wrote:

On 21.06.2018 20:08, Tom Lane wrote:

Konstantin Knizhnik  writes:

The following very simple test reproduces the problem with wrong cost
estimation:
create foreign table t1_fdw(x integer, y integer) server pg_fdw options
(table_name 't1', use_remote_estimate 'false');
create foreign table t2_fdw(x integer) server pg_fdw options (table_name
't2', use_remote_estimate 'false');
It is possible to force Postgres to use correct plan by setting
"fdw_startup_cost" to some very large value (1 for example).
...
Also correct plan is used when use_remote_estimate is true.

If you are unhappy about the results with use_remote_estimate off, don't
run it that way. The optimizer does not have a crystal ball.


As I wrote, use_remote_estimate cannot be used because in this case
the query compilation time is unacceptable (10 seconds, while the time of query
execution itself is ~200msec).
So the problem can be addressed in two ways:

1. Try to reduce the time of remote estimation. I wonder why postgres_fdw
sends so many queries to the remote server. For a join of two tables there are
7 queries.
I suspect that for ~20 joined tables in the original query the number of
calls is more than a hundred, so no wonder that it takes so much time.
2. Try to make the optimizer produce a better estimate of the join cost based on
local statistics (please notice that ANALYZE is explicitly called for all
foreign tables and the number of rows in the result was correctly calculated).


To make local estimates more accurate, I think we need other information 
on remote tables such as remote indexes.



What do you think: which of these two directions is more promising? Or
is it better to address both of them?


I'd vote for #2.  One idea for that is to introduce CREATE FOREIGN INDEX 
to have information on remote indexes on the local side, which I 
proposed before.  I have been putting it on hold since then, though.


Best regards,
Etsuro Fujita



Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-22 Thread Amit Kapila
On Thu, Jun 21, 2018 at 7:21 PM, Tom Lane  wrote:
> Amit Kapila  writes:
>> On Thu, Jun 21, 2018 at 11:51 AM, Amit Khandekar  
>> wrote:
>>> After list_concat, the subpaths no longer has only non-partial paths,
>>> which it is supposed to have. So it anyways should not be used in the
>>> subsequent code in that function. So I think the following change
>>> should be good.
>>> -   foreach(l, subpaths)
>>> +   foreach(l, pathnode->subpaths)
>
> Patch LGTM.
>

Okay, pushed and updated Open Items page as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



utilities to rebuild commit logs from wal

2018-06-22 Thread Chris Travers
Before we reinvent the wheel here, does anyone know of utilities to build
commit logs from wal segments?  Or even to just fill with commits?

I figure it is worth asking before I write one.


Re: WAL prefetch

2018-06-22 Thread Konstantin Knizhnik



On 21.06.2018 19:57, Tomas Vondra wrote:



On 06/21/2018 04:01 PM, Konstantin Knizhnik wrote:

I continue my experiments with WAL prefetch.
I have embedded prefetch in Postgres: now the walprefetcher is started 
together with the startup process and is able to help it speed up 
recovery.

The patch is attached.

Unfortunately the result is negative (at least on my desktop: SSD, 16Gb 
RAM). Recovery with prefetch is 3 times slower than without it.

What I am doing:

Configuration:
 max_wal_size=min_wal_size=10Gb,
 shared_buffers = 1Gb
Database:
  pgbench -i -s 1000
Test:
  pgbench -c 10 -M prepared -N -T 100 -P 1
  pkill postgres
  echo 3 > /proc/sys/vm/drop_caches
  time pg_ctl -t 1000 -D pgsql -l logfile start

Without prefetch it is 19 seconds (recovered about 4Gb of WAL), with 
prefetch it is about one minute. About 400k blocks are prefetched.

CPU usage is small (<20%); both processes are in "Ds" state.



Based on a quick test, my guess is that the patch is broken in several 
ways. Firstly, with the patch attached (and wal_prefetch_enabled=on, 
which I think is needed to enable the prefetch) I can't even restart 
the server, because pg_ctl restart just hangs (the walprefetcher 
process gets stuck in WaitForWAL, IIRC).


I have added an elog(LOG,...) to walprefetcher.c, right before the 
FilePrefetch call, and (a) I don't see any actual prefetch calls 
during recovery but (b) I do see the prefetch happening during the 
pgbench. That seems a bit ... wrong?


Furthermore, you've added an extra

    signal_child(BgWriterPID, SIGHUP);

to SIGHUP_handler, which seems like a bug too. I don't have time to 
investigate/debug this further.


regards


Sorry, an updated version of the patch is attached.
Please also notice that you can check the number of prefetched pages using 
pg_stat_activity() - it is reported for the walprefetcher process.
Concerning the fact that you did not see prefetches at recovery time: 
please check that min_wal_size and max_wal_size are large enough and that 
pgbench (or whatever else) committed large enough changes so that recovery 
will take some time.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 1b000a2..9730b42 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -879,13 +879,6 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	return true;
 }
 
-#ifdef FRONTEND
-/*
- * Functions that are currently not needed in the backend, but are better
- * implemented inside xlogreader.c because of the internal facilities available
- * here.
- */
-
 /*
  * Find the first record with an lsn >= RecPtr.
  *
@@ -1004,9 +997,6 @@ out:
 	return found;
 }
 
-#endif			/* FRONTEND */
-
-
 /* 
  * Functions for decoding the data and block references in a record.
  * 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 52fe55e..7847311 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -652,7 +652,7 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  * in walsender.c but for small differences (such as lack of elog() in
  * frontend).  Probably these should be merged at some point.
  */
-static void
+void
 XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
 		 Size count)
 {
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 7e34bee..e492715 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -32,6 +32,7 @@
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
+#include "postmaster/walprefetcher.h"
 #include "replication/walreceiver.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
@@ -335,6 +336,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			case WalReceiverProcess:
 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
 break;
+			case WalPrefetcherProcess:
+statmsg = pgstat_get_backend_desc(B_WAL_PREFETCHER);
+break;
 			default:
 statmsg = "??? process";
 break;
@@ -462,6 +466,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			WalReceiverMain();
 			proc_exit(1);		/* should never return */
 
+		case WalPrefetcherProcess:
+			/* don't set signals, walprefetcher has its own agenda */
+			WalPrefetcherMain();
+			proc_exit(1);		/* should never return */
+
 		default:
 			elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
 			proc_exit(1);
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..13e5066 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include 

Lifetime of commit timestamps

2018-06-22 Thread Kyotaro HORIGUCHI
Hello.

I don't find any description in the documentation about the
guaranteed lifetime of commit timestamps. I think they are
preserved until the corresponding xid goes beyond the freeze horizon,
even though they are actually preserved longer for several
reasons.

If it is not, I think such a description is required in
pg_xact_commit_timestamp().

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5dce8ef178..633e488cec 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18029,7 +18029,8 @@ SELECT collation for ('foo' COLLATE "de_DE");
 These functions mainly provide information about when the transactions
 were committed. They only provide useful data when
  configuration option is enabled
-and only for transactions that were committed after it was enabled.
+and only for transactions that were committed after it was enabled. Commit
+timestamps for frozen tuples are removed at vacuum time.

 



Re: Incorrect comments in commit_ts.c

2018-06-22 Thread shao bret
Thanks, I have seen the code change.

Br.
Bret.


HighGo Software
Website:http://www.highgo.com/?l=en


Sent from Mail for Windows 10

From: Michael Paquier
Sent: June 22, 2018 12:32
To: shao bret
Cc: 
pgsql-hackers@lists.postgresql.org
Subject: Re: Incorrect comments in commit_ts.c

On Fri, Jun 22, 2018 at 03:45:01AM +, shao bret wrote:
> I think there is a mistake in this comment.  It should be
> ‘commitTS page’, not ‘CLOG page’.

You are right, so pushed.
--
Michael



Re: Possible bug in logical replication.

2018-06-22 Thread Kyotaro HORIGUCHI
Hello.

At Thu, 14 Jun 2018 16:06:43 -0400, Alvaro Herrera  
wrote in <20180614200643.3my362zmfiwitrni@alvherre.pgsql>
> Can somebody (Arseny, Konstantin, horiguti, Sawada) please confirm that
> Michaël's commit fixes the reported bug?

pg_advance_replication_slots can advance uninitialized physical
slots and that might not be good. (Logical slots always have
initial invalid values in the two lsn columns.)

About scanning from restart_lsn in the advancing function, I
think I confirmed that the value always comes from
XLogRecordBuffer.origptr, which comes from ReadRecPtr, not
EndRecPtr, and which cannot be on a page boundary.

FWIW, as a result, it also looks fine to me regarding the
issue on this thread.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Incorrect errno used with %m for backend code

2018-06-22 Thread Ashutosh Sharma
Hi,

It seems like in the case of a few system calls, e.g. the write system call,
errno is not set even if the number of bytes written is smaller than
the number of bytes requested, and for such cases we explicitly set errno to
ENOSPC. Something like this,

/*
 * if write didn't set errno, assume problem is no disk space
 */
errno = save_errno ? save_errno : ENOSPC;

Shouldn't we do similar handling in your patch as well? Or please let
me know if I am missing something here.
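
For illustration, here is a small self-contained program (not PostgreSQL code;
the file-writing helper and its name are invented) showing the convention being
described: a short write() may not set errno, so it is treated as out of disk
space before reporting with the saved errno.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write len bytes; on failure report an error using the errno convention above. */
static int
write_all_or_report(int fd, const char *buf, size_t len)
{
	ssize_t		written;

	errno = 0;
	written = write(fd, buf, len);
	if (written != (ssize_t) len)
	{
		int			save_errno = errno;

		/* if write didn't set errno, assume problem is no disk space */
		errno = save_errno ? save_errno : ENOSPC;
		fprintf(stderr, "could not write file: %s\n", strerror(errno));
		return -1;
	}
	return 0;
}

int
main(void)
{
	const char	msg[] = "hello\n";

	return write_all_or_report(1, msg, sizeof(msg) - 1) ? 1 : 0;
}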

On Fri, Jun 22, 2018 at 11:45 AM, Michael Paquier  wrote:
> Hi all,
>
> I have been reviewing the use of errno in the backend code when %m is
> used in the logs, and found more places than when I looked at improving
> the error messages for read() calls whose errno value gets clobbered by
> intermediate system calls.  As the problem is separate and should be
> back-patched, I have preferred beginning a new thread.
>
> A couple of places use CloseTransientFile without bothering much that
> this can overwrite errno.  I was tempted to have errno saved and kept
> if set to a non-zero value, but refrained from doing so as some callers
> may rely on the existing behavior, and the attached better shows the
> intention.
>
> Attached is a patch with everything I have spotted.  Any opinions or
> thoughts on getting this back-patched?
>
> Thanks,
> --
> Michael



Re: Continue work on changes to recovery.conf API

2018-06-22 Thread Craig Ringer
On 22 June 2018 at 02:48, Sergei Kornilov  wrote:

> I completely forgot attach patch, sorry. Attached now



Sergei,

It's great to see you picking this up.  I'll try to find time to review
this for the next CF cycle. Please poke me if you don't hear from me, I do
get distracted. It's long past time to get these changes in for good.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: libpq compression

2018-06-22 Thread Konstantin Knizhnik




On 22.06.2018 00:34, Nico Williams wrote:

On Thu, Jun 21, 2018 at 10:12:17AM +0300, Konstantin Knizhnik wrote:

On 20.06.2018 23:34, Robbie Harwood wrote:

Konstantin Knizhnik  writes:
Well, that's a design decision you've made.  You could put lengths on
chunks that are sent out - then you'd know exactly how much is needed.
(For instance, 4 bytes of network-order length followed by a complete
payload.)  Then you'd absolutely know whether you have enough to
decompress or not.

Do you really suggest sending an extra header for each chunk of data?
Please notice that a chunk can be as small as one message: a dozen bytes,
because libpq is used for client-server communication with a request-reply
pattern.

You must have lengths, yes, otherwise you're saying that the chosen
compression mechanism must itself provide framing.

I'm not that familiar with compression APIs and formats, but looking at
RFC1950 (zlib) for example I see no framing.

So I think you just have to have lengths.

Now, this being about compression, I understand that you might now want
to have 4-byte lengths, especially given that most messages will be
under 8KB.  So use a varint encoding for the lengths.

Nico
No explicit framing and lengths are needed when using streaming 
compression.
There can certainly be some kind of framing inside the compression protocol 
itself, but that is intrinsic to the compression algorithm.
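
For illustration, here is a self-contained zlib sketch of that point (assuming
zlib as the codec; this is not the zpq_stream code, and error handling is
minimal): inflate() buffers incomplete input internally, so the receiver can
feed it whatever chunk of bytes the socket happens to return and simply call
it again when more data arrives, without any explicit frame lengths.

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int
main(void)
{
	const char *msg = "hello hello hello hello";
	unsigned char comp[256];
	unsigned char out[256];
	uLongf		clen = sizeof(comp);
	z_stream	zs;

	/* compress the message (including its terminating NUL) in one go */
	if (compress(comp, &clen, (const Bytef *) msg, strlen(msg) + 1) != Z_OK)
		return 1;

	memset(&zs, 0, sizeof(zs));
	if (inflateInit(&zs) != Z_OK)
		return 1;
	zs.next_out = out;
	zs.avail_out = sizeof(out);

	/* feed the compressed bytes 5 at a time, as a socket might deliver them */
	for (uLongf off = 0; off < clen; off += 5)
	{
		int			rc;

		zs.next_in = comp + off;
		zs.avail_in = (clen - off < 5) ? (uInt) (clen - off) : 5;
		rc = inflate(&zs, Z_NO_FLUSH);
		if (rc == Z_STREAM_END)
			break;
		if (rc != Z_OK)
			return 1;
	}
	printf("decoded: %s\n", out);
	inflateEnd(&zs);
	return 0;
}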


--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Push down Aggregates below joins

2018-06-22 Thread Antonin Houska
Heikki Linnakangas  wrote:

> Ah, cool! I missed that thread earlier. Yes, seems like we've been hacking on
> the same feature. Let's compare!
> 
> I've been using this paper as a guide:
> 
> "Including Group-By in Query Optimization", by Surajit Chaudhuri and Kyuseok
> Shim:
> https://pdfs.semanticscholar.org/3079/5447cec18753254edbbd7839f0afa58b2a39.pdf

> Using the terms from that paper, my patch does only "Invariant Grouping
> transormation", while yours can do "Simple Coalescing Grouping", which is more
> general. In layman terms, my patch can push the Aggregate below a join, while
> your patch can also split an Aggregate so that you do a partial aggregate
> below the join, and a final stage above it.

Thanks for the link. I've just checked the two approaches briefly so far.

> My thinking was to start with the simpler Invariant Grouping transformation
> first, and do the more advanced splitting into partial aggregates later, as
> a separate patch.

I think for this you need to make sure that no join duplicates already
processed groups. I tried to implement PoC of "unique keys" in v5 of my patch
[1], see 09_avoid_agg_finalization.diff. The point is that the final
aggregation can be replaced by a call to the aggregate's final function (pg_aggregate.aggfinalfn) if
the final join generates only unique values of the GROUP BY expression.

Eventually I considered this an additional optimization of my approach and
postponed work on this to later, but I think something like this would be
necessary for your approach. As soon as you find out that a grouped relation
is joined to another relation in a way that duplicates the grouping
expression, you cannot proceed with creating a grouped path.

> There's some difference in the way we're dealing with the grouped
> RelOptInfos. You're storing them completely separately, in PlannerInfo's new
> 'simple_grouped_rel_array' array and 'join_grouped_rel_list/hash'. I'm
> attaching each grouped RelOptInfo to its parent.

In the first version of my patch I added several fields to RelOptInfo
(reltarget for the grouped paths, separate pathlist, etc.) but some people
didn't like it. In the later versions I introduced a separate RelOptInfo for
grouped relations, but stored it in a separate list. Your approach might make
the patch a bit less invasive, i.e. we don't have to add those RelOptInfo
lists / arrays to standard_join_search() and subroutines.

> I got away with much less code churn in allpaths.c, indxpath.c,
> joinpath.c. You're adding new 'do_aggregate' flags to many functions. I'm not
> sure if you needed that because you do partial aggregation and I don't, but it
> would be nice to avoid it.

IIRC, the do_aggregate argument determines how the grouped join should be
created. If it's false, the planner joins a grouped relation (if it exists) to a
non-grouped one. If it's true, it joins two non-grouped relations and applies
(partial) aggregation to the result.

> You're introducing a new GroupedVar expression to represent Aggrefs between
> the partial and final aggregates, while I'm just using Aggref directly, above
> the aggregate node. I'm not thrilled about introducing an new Var-like
> expression. We already have PlaceHolderVars, and I'm always confused on how
> those work. But maybe that's necessary to supprot partial aggregation?

The similarity of GroupedVar and PlaceHolderVar is that they are evaluated at
some place in the join tree and the result is only passed to the joins above
and eventually to the query target, w/o being evaluated again. In contrast,
generic expressions are evaluated in the query target (only the input Vars get
propagated from lower nodes), but that's not what we want for 2-stage
aggregation.

In my patch GroupedVar represents either the result of partial aggregation
(the value of the transient state) or a grouping expression which is more
complex than a plain column reference (Var expression).

> In the other thread, Robert Haas wrote:
> > Concretely, in your test query "SELECT p.i, avg(c1.v) FROM
> > agg_pushdown_parent AS p JOIN agg_pushdown_child1 AS c1 ON c1.parent =
> > p.i GROUP BY p.i" you assume that it's OK to do a Partial
> > HashAggregate over c1.parent rather than p.i.  This will be false if,
> > say, c1.parent is of type citext and p.i is of type text; this will
> > get grouped together that shouldn't.  It will also be false if the
> > grouping expression is something like GROUP BY length(p.i::text),
> > because one value could be '0'::numeric and the other '0.00'::numeric.
> > I can't think of a reason why it would be false if the grouping
> > expressions are both simple Vars of the same underlying data type, but
> > I'm a little nervous that I might be wrong even about that case.
> > Maybe you've handled all of this somehow, but it's not obvious to me
> > that it has been considered.
> 
> Ah, I made the same mistake. I did consider the "GROUP BY length(o.i::text)",
> but not the cross-datatype case. I think we should punt on that for now, and
> only do 

Re: libpq compression

2018-06-22 Thread Konstantin Knizhnik




On 21.06.2018 20:14, Robbie Harwood wrote:

Konstantin Knizhnik  writes:


On 21.06.2018 17:56, Robbie Harwood wrote:

Konstantin Knizhnik  writes:

On 20.06.2018 23:34, Robbie Harwood wrote:

Konstantin Knizhnik  writes:

Well, that's a design decision you've made.  You could put lengths
on chunks that are sent out - then you'd know exactly how much is
needed.  (For instance, 4 bytes of network-order length followed by
a complete payload.)  Then you'd absolutely know whether you have
enough to decompress or not.

Do you really suggest sending an extra header for each chunk of data?
Please notice that a chunk can be as small as one message: a dozen
bytes, because libpq is used for client-server communication with a
request-reply pattern.

I want you to think critically about your design.  I *really* don't
want to design it for you - I have enough stuff to be doing.  But
again, the design I gave you doesn't necessarily need that - you just
need to properly buffer incomplete data.

Right now secure_read may return any number of available bytes. But
when using streaming compression, it can happen that the available
number of bytes is not enough to perform decompression. This is why we
may need to try to fetch an additional portion of data. This is how
zpq_stream works now.

No, you need to buffer and wait until you're called again.  Which is to
say: decompress() shouldn't call secure_read().  secure_read() should
call decompress().
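
For illustration, here is a tiny self-contained sketch of that buffering model
(toy length-prefixed frames rather than real compression, and invented names;
it is not the zpq_stream or libpq code): the transport layer pushes whatever
raw bytes it got into the decoder, and the decoder answers "need more" until a
complete frame has accumulated, never calling back down into the transport.

#include <stdio.h>
#include <string.h>

#define BUFSZ 256

typedef struct
{
	unsigned char buf[BUFSZ];	/* raw bytes received so far */
	size_t		used;
} Decoder;

/* The transport (secure_read analogue) hands raw bytes to the decoder. */
static void
decoder_feed(Decoder *d, const unsigned char *data, size_t len)
{
	memcpy(d->buf + d->used, data, len);	/* toy code: no overflow check */
	d->used += len;
}

/*
 * Pop one complete frame (1-byte length + payload) into out.  Returns the
 * payload length, or -1 if the input is incomplete; in that case the caller
 * just waits to be fed again.
 */
static int
decoder_next(Decoder *d, unsigned char *out)
{
	int			len;

	if (d->used < 1 || d->used < 1 + (size_t) d->buf[0])
		return -1;				/* buffer and wait for more input */
	len = d->buf[0];
	memcpy(out, d->buf + 1, len);
	memmove(d->buf, d->buf + 1 + len, d->used - 1 - len);
	d->used -= 1 + len;
	return len;
}

int
main(void)
{
	Decoder		d = {{0}, 0};
	unsigned char out[BUFSZ];
	int			n;

	/* the frame "hello" arrives split across two reads from the transport */
	decoder_feed(&d, (const unsigned char *) "\x05" "he", 3);
	printf("after first read: %d\n", decoder_next(&d, out));	/* -1: need more */
	decoder_feed(&d, (const unsigned char *) "llo", 3);
	n = decoder_next(&d, out);
	printf("after second read: %.*s\n", n, (char *) out);		/* hello */
	return 0;
}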



In this case I will have to implement this code twice: both for backend 
and frontend, i.e. for secure_read/secure_write and 
pqsecure_read/pqsecure_write.
Frankly speaking, I was very upset by the design of the libpq communication layer 
in Postgres: there are two different implementations of almost the same 
stuff for backend and frontend.
I do not see any meaningful argument for it except "historical reasons". 
A better decision would have been to encapsulate the socket communication layer (and 
some other system-dependent stuff) in a SAL (system abstraction layer) and 
use it in both backend and frontend.


By passing the secure_read/pqsecure_read functions to zpq_stream I managed 
to avoid such code duplication, at least for compression.



I do not understand how it is possible to implement in different way
and what is wrong with current implementation.

The most succinct thing I can say is: absolutely don't modify
pq_recvbuf().  I gave you pseudocode for how to do that.  All of your
logic should be *below* the secure_read/secure_write functions.

I cannot approve code that modifies pq_recvbuf() in the manner you
currently do.


Well, I understand your arguments.
But please also consider my argument above (about avoiding code 
duplication).


In any case, the secure_read function is called only from pq_recvbuf(), just 
as pqsecure_read is called only from pqReadData.
So I do not see a principal difference between handling compression in 
secure_read or in pq_recvbuf, and I do not understand why it is 
"destroying the existing model".

Frankly speaking, I would really like to destroy the existing model, moving 
all system-dependent stuff in Postgres to a SAL and avoiding this awful mix 
of code sharing and duplication between backend and frontend. But that is 
another story and I do not want to discuss it here.


If we are speaking about the "right design", then neither your 
suggestion nor my implementation is good, and I do not see 
principal differences between them.
The right approach is to use the "decorator pattern": this is how streams are 
designed in .Net/Java. You can construct a pipe of "secure", "compressed" 
and whatever other streams.
Yes, it is first of all a pattern for the object-oriented approach, and 
Postgres is implemented in C. But it is actually possible to use an OO 
approach in pure C (X-Windows!).
But once again, this discussion may lead us too far away from the 
topic of libpq compression.


As I already wrote, the main points of my design were:
1. Minimize changes in the Postgres code
2. Avoid code duplication
3. Provide an abstraction (compression stream) which can be used somewhere 
else besides libpq itself.






My idea was the following: the client wants to use compression. But the
server may reject this attempt (for any reason: it doesn't support
it, has no proper compression library, does not want to spend CPU on
decompression, ...). Right now the compression algorithm is
hardcoded. But in the future the client and server may negotiate to choose
a proper compression protocol.  This is why I prefer to perform
negotiation between client and server to enable compression.  Well,
for negotiation you could put the name of the algorithm you want in
the startup.  It doesn't have to be a boolean for compression, and
then you don't need an additional round-trip.

Sorry, I can only repeat the arguments I already mentioned:
- in the future it may be possible to specify the compression algorithm
- even with a boolean compression option the server may have some reasons to
reject the client's request to use compression

Extra flexibility is always good 

Re: PATCH: backtraces for error messages

2018-06-22 Thread Craig Ringer
On 22 June 2018 at 08:48, Andres Freund  wrote:

> On 2018-06-22 08:24:45 +0800, Craig Ringer wrote:
> > On Thu., 21 Jun. 2018, 19:26 Pavan Deolasee, 
> > wrote:
> >
> > >
> > >
> > > On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier  >
> > > wrote:
> > >
> > >> On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
> > >> > I wrote it because I got sick of Assert(false) debugging, and I was
> > >> chasing
> > >> > down some "ERROR:  08P01: insufficient data left in message" errors.
> > >> Then I
> > >> > got kind of caught up in it... you know how it is.
> > >>
> > >> Yes, I know that feeling!  I have been using as well the Assert(false)
> > >> and the upgrade-to-PANIC tricks a couple of times, so being able to
> get
> > >> more easily backtraces would be really nice.
> > >>
> > >>
> > > Sometime back I'd suggested an idea to be able to dynamically manage
> log
> > > levels for elog messages [1].
> > >
> >
> >
> > Huge +1 from me on being able to selectively manage logging on a
> > module/subsystem, file, or line level.
> >
> > I don't think I saw the post.
> >
> > Such a thing would obviously make built in backtrace support much more
> > useful.
>
> I strongly suggest keeping these as separate as possible. Either is
> useful without the other, and both are nontrivial. Tackling them
> together imo makes it much more likely to get nowhere.
>
>
Totally agree, and it's why I raised this as its own thing.

Thanks.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: partition table and stddev() /variance() behaviour

2018-06-22 Thread Rajkumar Raghuwanshi
Thanks for the commit. I have verified the reported case; it is fixed now.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

On Fri, Jun 22, 2018 at 8:38 AM, David Rowley 
wrote:

> On 22 June 2018 at 03:30, Tom Lane  wrote:
> >> I think some coverage of the numerical aggregates is a good idea, so
> >> I've added some in the attached. I managed to get a parallel plan
> >> going with a query to onek, which is pretty cheap to execute. I didn't
> >> touch the bool aggregates. Maybe I should have done that too..?
> >
> > This sort of blunderbuss testing was exactly what I *didn't* want to do.
> > Not only is this adding about 20x as many cycles as we need (at least for
> > this specific numeric_poly_combine issue), but I'm quite afraid that the
> > float4 and/or float8 cases will show low-order-digit irreproducibility
> > in the buildfarm.
>
> okay. My sniper rifle was locked away for the evening. I decided it
> was best to sleep before any careful aiming was required.
>
> I see you've done the deed already. Thanks.
>
> --
>  David Rowley   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: ERROR: ORDER/GROUP BY expression not found in targetlist

2018-06-22 Thread Rajkumar Raghuwanshi
Thanks for the commit. I have verified the reported case; it is fixed now.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

On Thu, Jun 21, 2018 at 1:54 AM, Tom Lane  wrote:

> I wrote:
> > Thanks for the report.  I traced through this, and the problem seems to
> > be that split_pathtarget_at_srfs() is neglecting to attach sortgroupref
> > labeling to the extra PathTargets it constructs.  I don't remember
> > whether that code is my fault or Andres', but I'll take a look at
> > fixing it.
>
> Here's a proposed patch for that.
>
> regards, tom lane
>
>


Speeding up INSERTs and UPDATEs to partitioned tables

2018-06-22 Thread David Rowley
Hi,

As part of my efforts to make partitioning scale better for larger
numbers of partitions, I've been looking at primarily INSERT VALUES
performance.  Here the overheads are almost completely in the
executor. Planning of this type of statement is very simple since
there is no FROM clause to process.

My benchmarks have been around a RANGE partitioned table with 10k leaf
partitions and no sub-partitioned tables. The partition key is a
timestamp column.

I've found that ExecSetupPartitionTupleRouting() is very slow indeed
and there are a number of things slow about it.  The biggest culprit
for the slowness is the locking of each partition inside of
find_all_inheritors().  For now, this needs to remain as we must hold
locks on each partition while performing RelationBuildPartitionDesc(),
otherwise, one of the partitions may get dropped out from under us.
There might be other valid reasons too, but please see my note at the
bottom of this email.

The locking is not the only slow thing. I found the following to also be slow:

1. RelationGetPartitionDispatchInfo uses a List and lappend() must
perform a palloc() each time a partition is added to the list.
2. A foreach loop is performed over leaf_parts to search for subplans
belonging to this partition. This seems pointless to do for INSERTs as
there's never any to find.
3. ExecCleanupTupleRouting() loops through the entire partitions
array. If a single tuple was inserted then all but one of the elements
will be NULL.
4. Tuple conversion map allocates an empty array thinking there might
be something to put into it. This is costly when the array is large
and pointless when there are no maps to store.
5. During get_partition_dispatch_recurse(), get_rel_relkind() is
called to determine if the partition is a partitioned table or a leaf
partition. This results in a slow relcache hashtable lookup.
6. get_partition_dispatch_recurse() also ends up just building the
indexes array with a sequence of numbers from 0 to nparts - 1 if there
are no sub-partitioned tables. Doing this is slow when there are many
partitions.

Besides the locking, the only thing that remains slow now is the
palloc0() for the 'partitions' array. In my test, it takes 0.6% of
execution time. I don't see any pretty ways to fix that.

I've written fixes for items 1-6 above.

I did:

1. Use an array instead of a List.
2. Don't do this loop. palloc0() the partitions array instead. Let
UPDATE add whatever subplans exist to the zeroed array.
3. Track what we initialize in a gapless array and clean up just those
ones (see the sketch after the note on item 6 below). Make this array small
and increase it only when we need more space.
4. Only allocate the map array when we need to store a map.
5. Work that out in relcache beforehand.
6. ditto

The only questionable thing I see is what I did for 6.  In partcache.c
I'm basically building an array of nparts containing 0 to nparts -1.
It seems a bit pointless, so perhaps there's a better way. I was also
a bit too tight to memcpy() that out of relcache, and just pointed
directly to it. That might be a no-go area.
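
For illustration, here is a small self-contained sketch of the idea in item 3
(names and sizes are invented; this is plain C with malloc rather than the
backend's memory contexts, and it is not the patch itself): remember only the
partition indexes that actually got initialized, in a tiny array that doubles
on demand, so cleanup does not have to walk a mostly-NULL array of 10k entries.

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	int		   *touched;		/* indexes of partitions we initialized */
	int			num_touched;
	int			capacity;
} TouchedParts;

static void
remember_touched(TouchedParts *t, int partidx)
{
	if (t->num_touched == t->capacity)
	{
		/* start small and double as needed; error handling omitted */
		t->capacity = t->capacity ? t->capacity * 2 : 8;
		t->touched = realloc(t->touched, t->capacity * sizeof(int));
	}
	t->touched[t->num_touched++] = partidx;
}

int
main(void)
{
	TouchedParts t = {NULL, 0, 0};
	int			i;

	/* an INSERT routed tuples to only two of, say, 10000 partitions */
	remember_touched(&t, 4221);
	remember_touched(&t, 87);

	/* cleanup visits 2 entries instead of scanning all 10000 */
	for (i = 0; i < t.num_touched; i++)
		printf("cleaning up partition %d\n", t.touched[i]);

	free(t.touched);
	return 0;
}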

I've attached 2 patches:

0001: implements items 1-6
0002: Is not intended for commit. It just a demo of where we could get
the performance if we were smarter about locking partitions. I've just
included this to show 0001's worth.

Performance

AWS: m5d.large fsync=off

Unpatched:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 2836
latency average = 21.162 ms
tps = 47.254409 (including connections establishing)
tps = 47.255756 (excluding connections establishing)

(yes, it's bad)

0001:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 3235
latency average = 18.548 ms
tps = 53.913121 (including connections establishing)
tps = 53.914629 (excluding connections establishing)

(a small improvement from 0001)

0001+0002:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 660079
latency average = 0.091 ms
tps = 11001.303764 (including connections establishing)
tps = 11001.602377 (excluding connections establishing)

(something to aspire towards)

0002 (only):

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 27682
latency average = 2.168 ms
tps = 461.350885 (including connections establishing)
tps = 461.363327 (excluding connections establishing)

(shows that doing 0002 alone does not fix all our 

Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-22 Thread Rajkumar Raghuwanshi
Thanks for the commit. I have verified the reported case; it is fixed now.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

On Thu, Jun 21, 2018 at 7:21 PM, Tom Lane  wrote:

> Amit Kapila  writes:
> > On Thu, Jun 21, 2018 at 11:51 AM, Amit Khandekar 
> wrote:
> >> After list_concat, the subpaths no longer has only non-partial paths,
> >> which it is supposed to have. So it anyways should not be used in the
> >> subsequent code in that function. So I think the following change
> >> should be good.
> >> -   foreach(l, subpaths)
> >> +   foreach(l, pathnode->subpaths)
>
> Patch LGTM.
>
> > Thanks, Tom, would you like to go-ahead and commit this change as this
> > is suggested by you or would you like me to take care of this or do
> > you want to wait for Robert's input?
>
> Please push --- I'm busy getting ready to leave for vacation ...
>
> regards, tom lane
>


Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-22 Thread Rajkumar Raghuwanshi
On Fri, Jun 22, 2018 at 11:15 AM, Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

> Hi,
>
> Off-list, Ashutosh Bapat has suggested using a flag instead of counting
> the number of
> dummy rels and then manipulating it. That will be simpler and smoother.
>
> I agree with his suggestion and updated my patch accordingly.
>
I have applied the patch and checked the reported issue. The patch applied cleanly and
the issue is not reproducible any more.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: Wrong cost estimation for foreign tables join with use_remote_estimate disabled

2018-06-22 Thread Konstantin Knizhnik




On 21.06.2018 20:08, Tom Lane wrote:

Konstantin Knizhnik  writes:

The following very simple test reproduces the problem with wrong cost
estimation:
create foreign table t1_fdw(x integer, y integer) server pg_fdw options
(table_name 't1', use_remote_estimate 'false');
create foreign table t2_fdw(x integer) server pg_fdw options (table_name
't2', use_remote_estimate 'false');
It is possible to force Postgres to use correct plan by setting
"fdw_startup_cost" to some very large value (1 for example).
...
Also correct plan is used when use_remote_estimate is true.

If you are unhappy about the results with use_remote_estimate off, don't
run it that way.  The optimizer does not have a crystal ball.


As I wrote, use_remote_estimate cannot be used because in this case 
the query compilation time is unacceptable (10 seconds, while the time of query 
execution itself is ~200msec).

So the problem can be addressed in two ways:

1. Try to reduce the time of remote estimation. I wonder why postgres_fdw 
sends so many queries to the remote server. For a join of two tables there are 
7 queries.
I suspect that for ~20 joined tables in the original query the number of 
calls is more than a hundred, so no wonder that it takes so much time.
2. Try to make the optimizer produce a better estimate of the join cost based on 
local statistics (please notice that ANALYZE is explicitly called for all 
foreign tables and the number of rows in the result was correctly calculated).


What do you think: which of these two directions is more promising? Or 
is it better to address both of them?


By the way, below is the list of remote EXPLAIN statements performed by 
postgres_fdw for the above-mentioned query when use_remote_estimate is on:


Breakpoint 1, get_remote_estimate (
    sql=0x1940008 "EXPLAIN SELECT x, y FROM public.t1 WHERE ((y = ANY 
('{1234567,1234577,1234667,1235567,1244567,1334567}'::integer[])))", 
conn=0x190e0d0,
    rows=0x7ffdd9e93388, width=0x7ffdd9e9337c, 
startup_cost=0x7ffdd9e93390, total_cost=0x7ffdd9e93398) at 
postgres_fdw.c:2984

2984    {
(gdb) cont
Continuing.

Breakpoint 1, get_remote_estimate (sql=0x196fa68 "EXPLAIN SELECT x FROM 
public.t2", conn=0x190e0d0, rows=0x7ffdd9e93388, width=0x7ffdd9e9337c,
    startup_cost=0x7ffdd9e93390, total_cost=0x7ffdd9e93398) at 
postgres_fdw.c:2984

2984    {
(gdb)
Continuing.

Breakpoint 1, get_remote_estimate (
    sql=0x19208f8 "EXPLAIN SELECT x, y FROM public.t1 WHERE ((y = ANY 
('{1234567,1234577,1234667,1235567,1244567,1334567}'::integer[]))) ORDER 
BY x ASC NULLS LAST", conn=0x190e0d0, rows=0x7ffdd9e932c8, 
width=0x7ffdd9e932bc, startup_cost=0x7ffdd9e932d0, 
total_cost=0x7ffdd9e932d8) at postgres_fdw.c:2984

2984    {
(gdb)
Continuing.

Breakpoint 1, get_remote_estimate (
    sql=0x19227b0 "EXPLAIN SELECT x, y FROM public.t1 WHERE SELECT 
null::integer)::integer) = x)) AND ((y = ANY 
('{1234567,1234577,1234667,1235567,1244567,1334567}'::integer[])))", 
conn=0x190e0d0, rows=0x7ffdd9e93348, width=0x7ffdd9e9333c, 
startup_cost=0x7ffdd9e93350, total_cost=0x7ffdd9e93358)

    at postgres_fdw.c:2984
2984    {
(gdb)
Continuing.

Breakpoint 1, get_remote_estimate (sql=0x19236c0 "EXPLAIN SELECT x FROM 
public.t2 ORDER BY x ASC NULLS LAST", conn=0x190e0d0, rows=0x7ffdd9e932c8,
    width=0x7ffdd9e932bc, startup_cost=0x7ffdd9e932d0, 
total_cost=0x7ffdd9e932d8) at postgres_fdw.c:2984

2984    {
(gdb)
Continuing.

Breakpoint 1, get_remote_estimate (sql=0x19247c0 "EXPLAIN SELECT x FROM 
public.t2 WHERE SELECT null::integer)::integer) = x))", conn=0x190e0d0,
    rows=0x7ffdd9e93348, width=0x7ffdd9e9333c, 
startup_cost=0x7ffdd9e93350, total_cost=0x7ffdd9e93358) at 
postgres_fdw.c:2984

2984    {
(gdb)
Continuing.

Breakpoint 1, get_remote_estimate (
    sql=0x19267d0 "EXPLAIN SELECT r1.x, r1.y, r2.x FROM (public.t1 r1 
INNER JOIN public.t2 r2 ON (((r1.x = r2.x)) AND ((r1.y = ANY 
('{1234567,1234577,1234667,1235567,1244567,1334567}'::integer[])", 
conn=0x190e0d0, rows=0x7ffdd9e93108, width=0x7ffdd9e930fc, 
startup_cost=0x7ffdd9e93110, total_cost=0x7ffdd9e93118)

    at postgres_fdw.c:2984

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: PANIC during crash recovery of a recently promoted standby

2018-06-22 Thread Michael Paquier
On Fri, Jun 22, 2018 at 02:34:02PM +0900, Kyotaro HORIGUCHI wrote:
> Hello, sorry for the absense and I looked the second patch.

Thanks for the review!

> At Fri, 22 Jun 2018 13:45:21 +0900, Michael Paquier
>  wrote in <20180622044521.gc5...@paquier.xyz>
>> long as crash recovery runs.  And XLogNeedsFlush() also has a similar
>> problem.
> 
> Here, on the other hand, this patch turns off
> updateMinRecoveryPoint if minRecoveryPoint is invalid when
> RecoveryInProgress() == true. However, RecoveryInProgress() == true
> means archive recovery has already started and
> minRecoveryPoint *should* be updated for that
> condition. Actually minRecoveryPoint is updated just below. If
> this is really the right thing, I think that some explanation of the
> reason is required here.

LocalRecoveryInProgress is just a local copy of SharedRecoveryInProgress
so RecoveryInProgress also returns true if crash recovery is running.
But perhaps I am missing what you mean?  The point here is that redo can
call XLogNeedsFlush, no?

> In xlog_redo there is still a "minRecoveryPoint != 0", which ought
> to be written as "!XLogRecPtrIsInvalid(minRecoveryPoint)". (It
> seems to be the only one. Double negation is a bit uneasy, but there are
> many instances of this kind of coding.)

It is possible to use a direct comparison with InvalidXLogRecPtr
instead of a double negation.

> # I'll go all-out next week.

Enjoy your vacations!
--
Michael


signature.asc
Description: PGP signature


Incorrect errno used with %m for backend code

2018-06-22 Thread Michael Paquier
Hi all,

I have been reviewing the use of errno in the backend code when %m is
used in the logs, and found more places than when I looked at improving
the error messages for read() calls whose errno value gets clobbered by
intermediate system calls.  As the problem is separate and should be
back-patched, I have preferred beginning a new thread.

A couple of places use CloseTransientFile without bothering much that
this can overwrite errno.  I was tempted to have errno saved and kept
if set to a non-zero value, but refrained from doing so as some callers
may rely on the existing behavior, and the attached better shows the
intention.

Attached is a patch with everything I have spotted.  Any opinions or
thoughts on getting this back-patched?

Thanks,
--
Michael
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 65194db70e..e5571a67d6 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1241,12 +1241,17 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 	 */
	if (fstat(fd, &stat))
 	{
+		int			save_errno = errno;
+
 		CloseTransientFile(fd);
 		if (give_warnings)
+		{
+			errno = save_errno;
 			ereport(WARNING,
 	(errcode_for_file_access(),
 	 errmsg("could not stat two-phase state file \"%s\": %m",
 			path)));
+		}
 		return NULL;
 	}
 
@@ -1274,13 +1279,18 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 	pgstat_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_READ);
 	if (read(fd, buf, stat.st_size) != stat.st_size)
 	{
+		int			save_errno = errno;
+
 		pgstat_report_wait_end();
 		CloseTransientFile(fd);
 		if (give_warnings)
+		{
+			errno = save_errno;
 			ereport(WARNING,
 	(errcode_for_file_access(),
 	 errmsg("could not read two-phase state file \"%s\": %m",
 			path)));
+		}
 		pfree(buf);
 		return NULL;
 	}
@@ -1663,16 +1673,22 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	pgstat_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_WRITE);
 	if (write(fd, content, len) != len)
 	{
+		int			save_errno = errno;
+
 		pgstat_report_wait_end();
 		CloseTransientFile(fd);
+		errno = save_errno;
 		ereport(ERROR,
 (errcode_for_file_access(),
  errmsg("could not write two-phase state file: %m")));
 	}
	if (write(fd, &statefile_crc, sizeof(pg_crc32c)) != sizeof(pg_crc32c))
 	{
+		int			save_errno = errno;
+
 		pgstat_report_wait_end();
 		CloseTransientFile(fd);
+		errno = save_errno;
 		ereport(ERROR,
 (errcode_for_file_access(),
  errmsg("could not write two-phase state file: %m")));
@@ -1686,7 +1702,10 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	pgstat_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
+		int			save_errno = errno;
+
 		CloseTransientFile(fd);
+		errno = save_errno;
 		ereport(ERROR,
 (errcode_for_file_access(),
  errmsg("could not fsync two-phase state file: %m")));
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adbd6a2126..1a419aa49b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3268,7 +3268,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
+		int			save_errno = errno;
+
 		close(fd);
+		errno = save_errno;
 		ereport(ERROR,
 (errcode_for_file_access(),
  errmsg("could not fsync file \"%s\": %m", tmppath)));
@@ -11675,8 +11678,10 @@ retry:
 	if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
 	{
 		char		fname[MAXFNAMELEN];
+		int			save_errno = errno;
 
 		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+		errno = save_errno;
 		ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
 (errcode_for_file_access(),
  errmsg("could not seek in log segment %s to offset %u: %m",
@@ -11688,9 +11693,11 @@ retry:
 	if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
 	{
 		char		fname[MAXFNAMELEN];
+		int			save_errno = errno;
 
 		pgstat_report_wait_end();
 		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+		errno = save_errno;
 		ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
 (errcode_for_file_access(),
  errmsg("could not read from log segment %s, offset %u: %m",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 52fe55e2af..4ecdc9220f 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -718,9 +718,10 @@ XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
 			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
 			{
 char		path[MAXPGPATH];
+int			save_errno = errno;
 
 XLogFilePath(path, tli, sendSegNo, segsize);
-
+errno = save_errno;
 ereport(ERROR,
 		(errcode_for_file_access(),
 		 errmsg("could not seek in log segment %s to offset %u: %m",
@@ -741,9 +742,10