Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

2016-12-21 Thread Fabien COELHO


Hello Kyotaro-san,

[...] Agreed that copying the statement string would be too much. But the
new *Stmt node should somehow also carry the end location of the
statement, since the user of a parse tree cannot look into another one.


Yes. I was thinking of adding a "length" field next to "location", where 
appropriate.
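
To illustrate the idea, here is a minimal sketch of what such a
statement-level node might carry (the node and field names are purely
hypothetical, not a committed design):

typedef struct StmtWithLocation
{
	NodeTag		type;
	Node	   *stmt;		/* the parsed statement itself */
	int			location;	/* start offset in the source string, or -1 */
	int			length;		/* length of the statement text, or 0 if unknown */
} StmtWithLocation;

With something along those lines, a consumer such as pg_stat_statements
could recover the exact text of one statement out of a multi-statement
string.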



So I'm rather still in favor of my initial proposal, that is extend
the existing location information to statements, not only some tokens.


I thought that it was too much to let all types of parse nodes have
a location, but grepping gram.y for "makeNode" persuaded me to
agree with you.


Indeed...

After changing all *Stmt nodes, only a few types of nodes seem to be
missing it.


Yes. I'll try to put together a patch and submit it to the next CF.

--
Fabien.




Re: [HACKERS] [bug fix] Trivial ecpg bug which can cause memory overrun

2016-12-21 Thread Michael Meskes
Hi,

> While investigating some other issue, we found a trivial bug in
> ecpg.  The attached is a fix for that.
> ...

Thanks for spotting and fixing. Committed.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL




[HACKERS] Illegal type cast in _hash_doinsert()

2016-12-21 Thread Mithun Cy
There seems to be a bug in _hash_doinsert().
+    * XXX this is useless code if we are only storing hash keys.
+    */
+   if (itemsz > HashMaxItemSize((Page) metap))
+       ereport(ERROR,
+               (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+                errmsg("index row size %zu exceeds hash maximum %zu",
+                       itemsz, HashMaxItemSize((Page) metap)),
+                errhint("Values larger than a buffer page cannot be indexed.")));

"metap" (HashMetaPage) is not a Page (they are entirely different
structures), so the cast above is not legal. I have attached a patch to
correct this by passing the actual Page which stores the HashMetaPage.
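
The gist of the fix is roughly as follows -- a sketch only, the attached
patch is authoritative -- assuming metabuf is the meta page buffer already
pinned in _hash_doinsert():

	Page		metapage = BufferGetPage(metabuf);
	HashMetaPage metap = HashPageGetMeta(metapage);

	if (itemsz > HashMaxItemSize(metapage))
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("index row size %zu exceeds hash maximum %zu",
						itemsz, HashMaxItemSize(metapage)),
				 errhint("Values larger than a buffer page cannot be indexed.")));

i.e. HashMaxItemSize() is given the Page of the meta buffer rather than a
cast of the HashMetaPage struct pointer.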

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


hash_doinsert_illegal_cast_01.patch
Description: Binary data



Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries

2016-12-21 Thread Kyotaro HORIGUCHI
At Wed, 21 Dec 2016 09:28:58 +0100 (CET), Fabien COELHO  
wrote in 
> 
> Hello Robert & Kyotaro,
> 
> >> I'm a little doubtful about your proposed fix.  However, I haven't
> >> studied the code, so I don't know what other approach might be better.
> 
> That is indeed the question!
> 
> The key point is that the parser parses several statements from a
> string, but currently there is no clue about where the queries were
> found in the string. Only the parser may know about what is being parsed
> and so can generate the location information. Now the possible solutions
> I see are:
> 
>  - the string is split per query before parsing, but this requires at
>    least a lexer... it looks pretty ineffective to have another lexing
>    phase involved, even if an existing lexer is reused.

I don't see this as promising. It is apparently a waste of CPU cycles.

>  - the parser processes one query at a time and keeps the "remaining
>    unparsed string" in some state that can be queried to check up to
>    where it proceeded, but then the result must be stored somewhere,
>    and it would make sense for it to be in the statement anyway,
>    just the management of the location information would be outside
>    the parser. Also that would add the cost of relaunching the parser;
>    not sure how bad or insignificant this is. This is (2) below.

I raised this as a spoiler; I see it as somewhat too invasive
for the problem being solved.

>  - the parser still generates a List but keeps track of the location
>    of statements while doing so, somehow... probably as I outlined.

Yeah, this seems the most reasonable so far. It seems to me to be
better that the statement location is conveyed as a part of the
parse tree, so as not to need a side channel for the location.

I'd like to rename debug_query_string to a more reasonable name if
we are allowed to, but perhaps not.

> The query string information can be some way of pointing into the
> initial string, or the substring itself, which could be extracted at
> some point. I initially suggested the former because this is already
> what the parser does for some nodes, and because it seems simpler to
> do so.
> 
> Extracting the string instead would suggest that the locations of
> tokens within this statement are relative to this string rather than
> the initial one, but that involves a lot more changes and it is easier
> to miss something doing so.

Agreed that copying the statement string would be too much. But the
new *Stmt node should somehow also carry the end location of the
statement, since the user of a parse tree cannot look into another
one.


> > That will work and is doable, but the more fundamental issue here
> > seems to be that post_parse_analyze_hook or other similar hooks
> > are called with a Query inconsistent with debug_query_string.
> 
> Sure.
> 
> > It is very conveniently used but the name seems to me to be suggesting
> > that such usage is out of purpose. I'm not sure, though.
> >
> > A similar behavior is shown in error messages but this harms far
> > less than the case of pg_stat_statements.
> >
> > ERROR:  column "b" does not exist
> > LINE 1: ...elect * from t where a = 1; select * from t where b = 2;
> > com...
> > ^
> 
> Ah, I wrote this piece of code a long time ago:-) The location is
> relative to the full string, see the comment above; changing that would
> involve many more changes, and I'm not sure whether it is desirable.
> 
> Also, think of:
> 
> SELECT * FROM foo; DROP TABLE foo; SELECT * FROM foo;
> 
> Maybe the context to know which "SELECT * FROM foo" generates an error
> should be kept.
> 
> > 1. Let all of the parse node have location in
> >  debug_query_string. (The original proposal)
> 
> Yep.
> 
> > 2. Let the parser return once for each statement (like psql
> >   parser)
> 
> I'm not sure it does... "\;" generates ";" in the output and the psql
> lexer keeps on lexing.
> 
> > and the corresponding query string is stored in a
> >   variable other than debug_query_string.
> 
> I think that would involve many changes because of the way postgres is
> written, the list is expected and returned by quite a few
> functions. Moreover query rewriting may generate several queries out
> of one anyway, so the list would be kept.
> 
> So I'm rather still in favor of my initial proposal, that is extend
> the existing location information to statements, not only some tokens.

I thought that it was too much to let all types of parse nodes have
a location, but grepping gram.y for "makeNode" persuaded me to
agree with you. After changing all *Stmt nodes, only a few
types of nodes seem to be missing it.
 
regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] Cache Hash Index meta page.

2016-12-21 Thread Amit Kapila
On Wed, Dec 21, 2016 at 9:26 PM, Robert Haas  wrote:
> On Tue, Dec 20, 2016 at 2:25 PM, Mithun Cy  wrote:
>> -- I think, if it is okay, I can document for each member of
>> HashMetaPageData whether to read it from the cached meta page or directly from
>> the current meta page. Below I have briefly commented on each member. If you
>> suggest I go with that approach, I will produce a neat patch for it.
>
> Plain text emails are preferred on this list.
>
> I don't have any confidence in this approach.  I'm not sure exactly
> what needs to be changed here, but what you're doing right now is just
> too error-prone.  There's a cached metapage available, and you've got
> code accessing it directly, and that's OK except when it's not, and maybe
> we can add some comments to explain, but I don't think that's going to
> be good enough to really make it clear and maintainable.  We need some
> kind of more substantive safeguard to prevent the cached metapage data
> from being used in unsafe ways -- and while we're at it, we should try
> to use it in as many of the places where it *is* safe as possible.  My
> suggestion for a separate structure was one idea; another might be
> providing some kind of API that's always used to access the metapage
> cache.  Or maybe there's a third option.
>

This metapage cache can be validated only when we have a bucket in
which we have stored the maxbucket value.  I think what we can do to
localize the use of the metapage cache is to write a new API which will
return a bucket page locked in the specified mode based on the hashkey.
Something like Buffer _hash_get_buc_buffer_from_hashkey(hashkey,
lockmode).  I think this will make metapage cache access somewhat
similar to what we have in btree, where we use a cache to access the
root page.  Will something like that address your concern?
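
To be a bit more concrete, a rough sketch of the signature I have in mind
(the name and exact arguments are open for discussion; it would presumably
also need the Relation):

/*
 * Rough sketch only.  Compute the target bucket from the hash key, using
 * the cached metapage when it can be trusted and refreshing it from the
 * real metapage when it cannot, then return the bucket's primary page
 * buffer locked in the requested mode.  All knowledge of the metapage
 * cache would live inside this function.
 */
extern Buffer _hash_get_buc_buffer_from_hashkey(Relation rel, uint32 hashkey,
												int lockmode);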



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Logical decoding on standby

2016-12-21 Thread Craig Ringer
On 22 December 2016 at 13:43, Michael Paquier  wrote:

> So, for 0001:
> --- a/src/test/perl/PostgresNode.pm
> +++ b/src/test/perl/PostgresNode.pm
> @@ -93,6 +93,7 @@ use RecursiveCopy;
>  use Socket;
>  use Test::More;
>  use TestLib ();
> +use pg_lsn qw(parse_lsn);
>  use Scalar::Util qw(blessed);
> This depends on 0002, so the order should be reversed.

Will do. That was silly.

I think I should probably also move the standby tests earlier, then
add a patch to update them when the results change.

> +sub lsn
> +{
> +   my $self = shift;
> +   return $self->safe_psql('postgres', 'select case when
> pg_is_in_recovery() then pg_last_xlog_replay_location() else
> pg_current_xlog_insert_location() end as lsn;');
> +}
> The design of the test should be in charge of choosing which value it
> wants to get, and the routine should just blindly do the work. More
> flexibility is more useful to design tests. So it would be nice to
> have one routine able to switch at will between 'flush', 'insert',
> 'write', 'receive' and 'replay' modes to get values from the
> corresponding xlog functions.

Fair enough. I can amend that.

> -   die "error running SQL: '$$stderr'\nwhile running '@psql_params'"
> +   die "error running SQL: '$$stderr'\nwhile running
> '@psql_params' with sql '$sql'"
>   if $ret == 3;
> That's separate from this patch, and definitely useful.

Yeah. Slipped through. I don't think it really merits a separate patch
though tbh.

> +   if (!($mode eq 'restart' || $mode eq 'confirmed_flush')) {
> +   die "valid modes are restart, confirmed_flush";
> +   }
> +   if (!defined($target_lsn)) {
> +   $target_lsn = $self->lsn;
> +   }
> That's not really the style followed by the perl scripts present in
> the code regarding the use of the brackets. Do we really need to care
> about the object type checks by the way?

Brackets: will look / fix.

Type checks (not in quoted snippet above): that's a convenience to let
you pass a PostgresNode instance or a string name. Maybe there's a
more idiomatic Perl-y way to write it. My Perl is pretty dire.

> Regarding wait_for_catchup, there are two ways to do things. Either
> query the standby like in the way 004_timeline_switch.pl does it or
> the way this routine does. The way of this routine looks more
> straight-forward IMO, and other tests should be refactored to use it.
> In short I would make the target LSN a mandatory argument, and have
> the caller send a standby's application_name instead of a PostgresNode
> object, the current way to enforce the value of $standby_name being
> really confusing.

Hm, ok. I'll take a look. Making LSN mandatory so you have to pass
$self->lsn is ok with me.

> +   my %valid_modes = ( 'sent' => 1, 'write' => 1, 'flush' => 1,
> 'replay' => 1 );
> What's actually the point of 'sent'?

Pretty useless, but we expose it in Pg, so we might as well in the tests.

> +   my @fields = ('plugin', 'slot_type', 'datoid', 'database',
> 'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn');
> +   my $result = $self->safe_psql('postgres', 'SELECT ' . join(', ',
> @fields) . " FROM pg_catalog.pg_replication_slots WHERE slot_name =
> '$slot_name'");
> +   $result = undef if $result eq '';
> +   # hash slice, see http://stackoverflow.com/a/16755894/398670 .
> Couldn't this portion be made more generic? Other queries could
> benefit from that by having a routine that accepts as argument an
> array of column names for example.

Yeah, probably. I'm not sure where it should live though - TestLib.pm ?

Not sure if there's an idiomatic way to pass a string (in this case a
query) in Perl with a placeholder for interpolation of values (in this
case columns). In Python you'd pass it with pre-defined
%(placeholder)s markers for % formatting.

> Now looking at 0002
> One whitespace:
> $ git diff HEAD~1 --check
> src/test/perl/pg_lsn.pm:139: trailing whitespace.
> +=cut

Will fix.

> pg_lsn sounds like a fine name, now we are more used to camel case for
> module names. And routines are written as lower case separated by an
> underscore.

Unsure what the intent of this is.

> +++ b/src/test/perl/t/002_pg_lsn.pl
> @@ -0,0 +1,68 @@
> +use strict;
> +use warnings;
> +use Test::More tests => 42;
> +use Scalar::Util qw(blessed);
> Most of the tests added don't have a description. This makes things
> harder to debug when tracking an issue.
>
> It may be good to begin using this module within the other tests in
> this patch as well. Now do we actually need it? Most of the existing
> tests I recall rely on the backend's operators for the pg_lsn data
> type, so this is actually duplicating an existing facility. And all the
> values are just passed as-is.

I added it mainly for ordered tests of whether some expected lsn had
passed/increased. But maybe it makes sense to just call into the
server and let it evaluate such tests.

> +++ b/src/test/perl/t/001_load.pl
> @@ -0,0 +1,9 @@
> +use strict;
> +use warnings;
> +use Test::More tests => 5;
> I can guess the mea

Re: [HACKERS] WAL consistency check facility

2016-12-21 Thread Kuntal Ghosh
On Wed, Dec 21, 2016 at 10:53 PM, Robert Haas  wrote:
> On Mon, Nov 28, 2016 at 11:31 PM, Michael Paquier
>  wrote:
>> Moved to CF 2017-01, as no committers have showed up yet :(
>
> Seeing no other volunteers, here I am.
>
Thanks Robert for looking into the patch.

> On a first read-through of this patch -- I have not studied it in
> detail yet -- this looks pretty good to me.  One concern is that this
> patch adds a bit of code to XLogInsert(), which is a very hot piece of
> code.  Conceivably, that might produce a regression even when this is
> disabled; if so, we'd probably need to make it a build-time option.  I
> hope that's not necessary, because I think it would be great to
> compile this into the server by default, but we better make sure it's
> not a problem.  A bulk load into an existing table might be a good
> test case.
>
I'll do this test and post the results.

> +if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
> +{
> +/* Caller specified a bogus block_id. Do nothing. */
> +continue;
> +}
>
> Why would the caller do something so dastardly?
>
Sorry, it's my bad. I've copied the code from somewhere else, but forgot
to modify the comment. It should be something like
/* block_id is not used. */

I'll modify the above along with other suggested changes.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




[HACKERS] [bug fix] Trivial ecpg bug which can cause memory overrun

2016-12-21 Thread Tsunakawa, Takayuki
Hello,

While investigating some other issue, we found a trivial bug in ecpg.  The 
attached is a fix for that.

If you specify an input file which ends with "." (e.g. run "ecpg file."), ecpg 
writes one byte past the end of the allocated memory.
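
To illustrate the general failure mode (this is a simplified illustration
only, not the actual ecpg code; the helper and the buffer sizing below are
hypothetical, and it assumes the name contains a '.'):

#include <stdlib.h>
#include <string.h>

static char *
make_output_name(const char *input_filename)
{
	/* buffer sized for the usual case of an extension of at least one char */
	char	   *output = malloc(strlen(input_filename) + 1);
	char	   *dot;

	strcpy(output, input_filename);
	dot = strrchr(output, '.');
	/* for "file." this writes "c\0" one byte past the end of the allocation */
	strcpy(dot + 1, "c");
	return output;
}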

In addition, the following statement is misleading.  Some people may think that 
file.ec will be converted to file.ec.c.  But the actual behavior is that it is 
converted to file.c.  So I clarified the paragraph a bit.

"If the extension of the input file is not .pgc, then the output file name is 
computed by appending .c to the full file name."

Regards
Takayuki Tsunakawa



ecpg_outfile.patch
Description: ecpg_outfile.patch



Re: [HACKERS] Getting rid of "unknown error" in dblink and postgres_fdw

2016-12-21 Thread Joe Conway
On 12/21/2016 09:20 PM, Tom Lane wrote:
> Joe Conway  writes:
>> On 12/21/2016 04:22 PM, Tom Lane wrote:
>>> In short, yes, please copy that bit into dblink.
> 
>> The attached should do the trick I think.
> 
> I see that you need to pass the PGconn into dblink_res_error() now, but
> what's the point of the new "bool fail" parameter?

That part isn't new -- we added it sometime prior to 9.2:

8<--
if (fail)
level = ERROR;
else
level = NOTICE;
8<--

It allows dblink to throw a NOTICE on remote errors rather than an
actual ERROR, e.g. for an autonomous transaction.

From the docs (9.2 in this case)
8<--
fail_on_error

If true (the default when omitted) then an error thrown on the
remote side of the connection causes an error to also be thrown locally.
If false, the remote error is locally reported as a NOTICE, and the
function returns no rows.
8<--

>> Do you think it is reasonable to backpatch this part too?
> 
> Yes, definitely.

Ok, will do.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Logical decoding on standby

2016-12-21 Thread Michael Paquier
On Tue, Dec 20, 2016 at 4:03 PM, Petr Jelinek
 wrote:
> That's about approach, but since there are prerequisite patches in the
> patchset that don't really depend on the approach I will comment about
> them as well.
>
> 0001 and 0002 add testing infrastructure and look fine to me, possibly
> committable.
>
> But in 0003 I don't understand following code:
>> + if (endpos != InvalidXLogRecPtr && !do_start_slot)
>> + {
>> + fprintf(stderr,
>> + _("%s: cannot use --create-slot or --drop-slot 
>> together with --endpos\n"),
>> + progname);
>> + fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>> + progname);
>> + exit(1);
>> + }
>
> Why is --create-slot and --endpos not allowed together?
>
> 0004 again looks good but depends on 0003.
>
> 0005 is timeline following which is IMHO ready for committer, as is 0006
> and 0008 and I still maintain the opinion that these should go in soon.
>
> 0007 is unfinished as you said in your mail (missing option to specify
> behavior). And the last one 0009 is the implementation discussed above,
> which I think needs rework. IMHO 0007 and 0009 should be ultimately merged.
>
> I think parts of this could be committed separately and are ready for
> committer IMHO, but there is no way in CF application to mark only part
> of patch-set for committer to attract the attention.

Craig has pinged me about looking at 0001, 0002, 0004 and 0006 as
those involve the TAP infrastructure.

So, for 0001:
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -93,6 +93,7 @@ use RecursiveCopy;
 use Socket;
 use Test::More;
 use TestLib ();
+use pg_lsn qw(parse_lsn);
 use Scalar::Util qw(blessed);
This depends on 0002, so the order should be reversed.

+sub lsn
+{
+   my $self = shift;
+   return $self->safe_psql('postgres', 'select case when
pg_is_in_recovery() then pg_last_xlog_replay_location() else
pg_current_xlog_insert_location() end as lsn;');
+}
The design of the test should be in charge of choosing which value it
wants to get, and the routine should just blindly do the work. More
flexibility is more useful to design tests. So it would be nice to
have one routine able to switch at will between 'flush', 'insert',
'write', 'receive' and 'replay' modes to get values from the
corresponding xlog functions.

-   die "error running SQL: '$$stderr'\nwhile running '@psql_params'"
+   die "error running SQL: '$$stderr'\nwhile running
'@psql_params' with sql '$sql'"
  if $ret == 3;
That's separate from this patch, and definitely useful.

+   if (!($mode eq 'restart' || $mode eq 'confirmed_flush')) {
+   die "valid modes are restart, confirmed_flush";
+   }
+   if (!defined($target_lsn)) {
+   $target_lsn = $self->lsn;
+   }
That's not really the style followed by the perl scripts present in
the code regarding the use of the brackets. Do we really need to care
about the object type checks by the way?

Regarding wait_for_catchup, there are two ways to do things. Either
query the standby like in the way 004_timeline_switch.pl does it or
the way this routine does. The way of this routine looks more
straight-forward IMO, and other tests should be refactored to use it.
In short I would make the target LSN a mandatory argument, and have
the caller send a standby's application_name instead of a PostgresNode
object, the current way to enforce the value of $standby_name being
really confusing.

+   my %valid_modes = ( 'sent' => 1, 'write' => 1, 'flush' => 1,
'replay' => 1 );
What's actually the point of 'sent'?

+   my @fields = ('plugin', 'slot_type', 'datoid', 'database',
'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn');
+   my $result = $self->safe_psql('postgres', 'SELECT ' . join(', ',
@fields) . " FROM pg_catalog.pg_replication_slots WHERE slot_name =
'$slot_name'");
+   $result = undef if $result eq '';
+   # hash slice, see http://stackoverflow.com/a/16755894/398670 .
Couldn't this portion be made more generic? Other queries could
benefit from that by having a routine that accepts as argument an
array of column names for example.

Now looking at 0002
One whitespace:
$ git diff HEAD~1 --check
src/test/perl/pg_lsn.pm:139: trailing whitespace.
+=cut

pg_lsn sounds like a fine name, now we are more used to camel case for
module names. And routines are written as lower case separated by an
underscore.

+++ b/src/test/perl/t/002_pg_lsn.pl
@@ -0,0 +1,68 @@
+use strict;
+use warnings;
+use Test::More tests => 42;
+use Scalar::Util qw(blessed);
Most of the tests added don't have a description. This makes things
harder to debug when tracking an issue.

It may be good to begin using this module within the other tests in
this patch as well. Now do we actually need it? Most of the existing
tests I recall rely on the backend's operators for the pg_lsn data
type, so this is actually duplicating an existing facilit

Re: [HACKERS] Getting rid of "unknown error" in dblink and postgres_fdw

2016-12-21 Thread Tom Lane
Joe Conway  writes:
> On 12/21/2016 04:22 PM, Tom Lane wrote:
>> In short, yes, please copy that bit into dblink.

> The attached should do the trick I think.

I see that you need to pass the PGconn into dblink_res_error() now, but
what's the point of the new "bool fail" parameter?

> Do you think it is reasonable to backpatch this part too?

Yes, definitely.

regards, tom lane




Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2016-12-21 Thread Craig Ringer
On 22 December 2016 at 09:55, Craig Ringer  wrote:

> Updated.
>
> If you think it's better to just take XidGenLock again, let me know.

Here's a further update that merges in Robert's changes from the patch
posted upthread.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From bc0fea43200ae861e2b8562f5e9f81fa73d207f9 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 22 Dec 2016 13:07:00 +0800
Subject: [PATCH] Introduce txid_status(bigint) to get status of an xact

If an application loses its connection while a COMMIT request is in
flight, the backend crashes mid-commit, etc, then an application may
not be sure whether or not a commit completed successfully or was
rolled back. While two-phase commit solves this it does so at a
considerable overhead, so introduce a lighter alternative.

txid_status(bigint) lets an application determine the status of a
commit based on an xid-with-epoch as returned by txid_current() or
similar. Status may be committed, aborted, in-progress (including
prepared xacts) or null if the xact is too old for its commit status
to still be retained because it has passed the wrap-around epoch
boundary.

Applications must call txid_current() in their transactions to make
much use of this since PostgreSQL does not automatically report an xid
to the client when one is assigned.

Introduces TransactionIdInRecentPast(...) for the use of other
functions that need similar logic in future.

There was previously no way to look up an arbitrary xid without
running the risk of having clog truncated out from under you. This
hasn't been a problem because anything looking up xids in clog knows
they're protected by datminxid, but that's not the case for arbitrary
user-supplied XIDs. clog is truncated before we advance oldestXid so
taking XidGenLock is insufficient, and there's no way to look up a
SLRU with soft-failure. So we introduce a new pendingOldestXid member
of ShmemVariableCache and advance it under the new ClogTruncationLock
LWLock before beginning clog truncation.

Craig Ringer, Robert Haas
---
 doc/src/sgml/func.sgml   |  27 +
 src/backend/access/transam/varsup.c  |  24 
 src/backend/commands/vacuum.c|   3 +
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/backend/utils/adt/txid.c | 197 +++
 src/include/access/transam.h |   6 +
 src/include/catalog/pg_proc.h|   2 +
 src/include/utils/builtins.h |   1 +
 src/test/regress/expected/txid.out   |  68 +++
 src/test/regress/sql/txid.sql|  38 ++
 10 files changed, 367 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 47fcb30..c51dca5 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17143,6 +17143,10 @@ SELECT collation for ('foo' COLLATE "de_DE");
 txid_visible_in_snapshot

 
+   
+txid_status
+   
+

 The functions shown in 
 provide server transaction information in an exportable form.  The main
@@ -17193,6 +17197,11 @@ SELECT collation for ('foo' COLLATE "de_DE");
boolean
is transaction ID visible in snapshot? (do not use with subtransaction ids)
   
+  
+   txid_status(bigint)
+   txid_status
+   report the status of the given xact - committed, aborted, in progress, or NULL if the xid is too old
+  
  
 

@@ -17263,6 +17272,24 @@ SELECT collation for ('foo' COLLATE "de_DE");

 

+txid_status(bigint) reports the commit status of a recent
+transaction.  Applications may use it to determine whether a transaction
+committed or aborted when the application and database server become
+disconnected while a COMMIT is in progress.
+The status of a transaction will be reported as either
+in progress,
+committed, or aborted, provided that the
+transaction is recent enough that the system retains the commit status
+of that transaction.  If it is old enough that no references to that
+transaction survive in the system and the commit status information has
+been discarded, this function will return NULL.  Note that prepared
+transactions are reported as in progress; applications must
+check pg_prepared_xacts if they
+need to determine whether the xid is a prepared transaction.
+   
+
+   
 The functions shown in 
 provide information about transactions that have been already committed.
 These functions mainly provide information about when the transactions
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 2f7e645..120e1ce 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -258,6 +258,30 @@ ReadNewTransactionId(void)
 	return xid;
 }
 
+
+/*
+ * Because vac_truncate_clog calls SetTransactionIdLimit to advance oldestXid
+ * only after truncating the c

Re: [HACKERS] Parallel Index Scans

2016-12-21 Thread Amit Kapila
On Thu, Dec 22, 2016 at 9:49 AM, Amit Kapila  wrote:
> On Wed, Dec 21, 2016 at 8:46 PM, Anastasia Lubennikova
>  wrote:
>> The following review has been posted through the commitfest application:
>> make installcheck-world:  tested, passed
>> Implements feature:   tested, passed
>> Spec compliant:   tested, passed
>> Documentation:tested, passed
>>
>> Hi, thank you for the patch.
>> Results are very promising. Do you see any drawbacks of this feature or 
>> something that requires more testing?
>>
>
> I think you can focus on the handling of array scan keys for testing.
> In general, one of my colleagues has shown interest in testing this
> patch and I think he has tested as well but never posted his findings.
> I will request him to share his findings and what kind of tests he has
> done, if any.
>
>> I'm willing to do a review.
>
> Thanks, that will be helpful.
>
>
>> I saw the discussion about parameters in the thread above. And I agree that 
>> we'd better concentrate
>> on the patch itself and add them later if necessary.
>>
>> 1. Can't we simply use "if (scan->parallel_scan != NULL)" instead of 
>> xs_temp_snap flag?
>>
>> +   if (scan->xs_temp_snap)
>> +   UnregisterSnapshot(scan->xs_snapshot);
>>
>
> I agree with what Rober has told in his reply.
>

Typo.
/Rober/Robert Haas

Thanks to Michael Paquier for noticing it and informing me offline.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] multi-level partitions and partition-wise joins

2016-12-21 Thread Ashutosh Bapat
On Wed, Dec 21, 2016 at 10:25 PM, Robert Haas  wrote:
> On Wed, Dec 21, 2016 at 6:36 AM, Ashutosh Bapat
>  wrote:
>> I am starting this as a separate thread for this since the declarative
>> partitioning thread has many issues reported and it's better to keep
>> this discussion separate from the issues reported on that thread.
>>
>> While expanding inheritance, any inheritance hierarchy is flattened
>> out including partition hierarchy. Partition-wise joins can be
>> employed if the joining tables have the same partitioning scheme and
>> have equi-join clauses on the partition keys. If two multi-level
>> partitioned tables are joined, the partition-wise join can be
>> percolated down to the levels up to which the partition schemes match
>> and suitable clauses are available. E.g. if two multi-level
>> partitioned table have matching partitioning schemes at the top-most
>> level, but not below that, we may join the topmost level partitions
>> pair-wise, but not partitions on the lower levels. In general, we may
>> use partition-wise join for the matching parts of partition hierarchy
>> and in the parts that do not match, use join between append relations.
>> It will not always be efficient to execute partition-wise joins up to
>> the last levels of the partition hierarchy, even if partition-wise join
>> can be employed. It might be possible that executing partition-wise
>> joins for only certain parts of partition hierarchy is efficient and
>> join of appends is efficient in the rest of the parts.
>>
>> In order to decide whether partition-wise join is efficient for a join
>> between given partitioned partition, we need to identify its
>> subpartitions. Similarly when a join between partitioned partition can
>> not use partition-wise join but some other partitions can, we need to
>> identify the subpartitions of that partition, so that they can be
>> appended together before joining. That information is lost while
>> expanding RTE. It looks like we need to retain partitioning hierarchy
>> in order to implement partition-wise joins between multi-level
>> partitioned tables.
>>
>> An earlier version of Amit's partition support patches had code to
>> retain partitioning hierarchy but it seems it was removed per
>> discussion at [1]. I agree with that decision.
>
> I can't quite figure out what the point of this email is.  What did
> you want to discuss?
>

Sorry for sending mail before adding points to discuss.

Given the scenario described above, it looks like we have to retain
the partition hierarchy in the form of an inheritance hierarchy in order
to implement partition-wise joins for multi-level partitioned tables. Is
that the right thing to do? PFA a patch retained by Amit Langote to
translate the partition hierarchy into an inheritance hierarchy. Is this
something in the right direction?

Any other option I can think of, like maintaining a tree of
partitioning schemes, either means that we cannot plan partition-wise
joins for part of the partition hierarchy (e.g. we can only match the
whole partitioning scheme tree), OR it means that we have to add append
plans to partition relations corresponding to partitioned partitions,
which is not correct since leaf child relations cannot have append
paths. Any suggestions?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


decl-part-inh-refactor-inh-plan.patch
Description: invalid/octet-stream



Re: [HACKERS] Parallel Index Scans

2016-12-21 Thread Amit Kapila
On Wed, Dec 21, 2016 at 8:46 PM, Anastasia Lubennikova
 wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:tested, passed
>
> Hi, thank you for the patch.
> Results are very promising. Do you see any drawbacks of this feature or 
> something that requires more testing?
>

I think you can focus on the handling of array scan keys for testing.
In general, one of my colleagues has shown interest in testing this
patch and I think he has tested as well but never posted his findings.
I will request him to share his findings and what kind of tests he has
done, if any.

> I'm willing to do a review.

Thanks, that will be helpful.


> I saw the discussion about parameters in the thread above. And I agree that 
> we'd better concentrate
> on the patch itself and add them later if necessary.
>
> 1. Can't we simply use "if (scan->parallel_scan != NULL)" instead of 
> xs_temp_snap flag?
>
> +   if (scan->xs_temp_snap)
> +   UnregisterSnapshot(scan->xs_snapshot);
>

I agree with what Rober has told in his reply.  We do it the same way for
heap; refer to heap_endscan().

> I must say that I'm quite new to all this parallel stuff. If you give me a
> link on where to read about snapshots for parallel workers, my review will
> be more helpful.
>

You can read transam/README.parallel.  Refer to the "State Sharing" portion
of the README to learn more about it.

> Anyway, it would be great to have more comments about it in the code.
>

We share the snapshot to ensure that reads in both the master backend
and the worker backends use the same snapshot.  There is no harm in
adding comments, but I think it is better to be consistent with the
similar heapam code.  After reading README.parallel, if you still feel
that we should add more comments to the code, then we can definitely
do that.

> 2. Would you mind renaming 'amestimateparallelscan' to, let's say,
> 'amparallelscan_spacerequired' or something like this?

Sure, I am open to other names, but IMHO let's keep "estimate" in the
name to keep it consistent with other parallel stuff. Refer to
execParallel.c to see how widely this word is used.

> As far as I understand there is nothing to estimate, we know this size
> for sure. I guess that you've chosen this name because of 
> 'heap_parallelscan_estimate'.
> But now it looks similar to 'amestimate' which refers to indexscan cost for 
> optimizer.
> That leads to the next question.
>

Do you mean 'amcostestimate'?  If you want, we can rename it to
amparallelscanestimate to be consistent with amcostestimate.

> 3. Are there any changes in cost estimation?
>

Yes.

> I didn't find related changes in the patch.
> Parallel scan is expected to be faster and optimizer definitely should know 
> that.
>

You can find the relevant changes in
parallel_index_opt_exec_support_v2.patch; refer to cost_index().

> 4. +uint8   ps_pageStatus;  /* state of scan, see below */
> There is no description below. I'd make the comment more helpful:
> /* state of scan. See possible flags values in nbtree.h */

makes sense. Will change.

> And why do you call it pageStatus? What does it have to do with page?
>

During the scan this tells us whether the next page is available for scan.
Another option could be to name it scanStatus, but I am not sure if that
is better.  Do you think that if we add a comment like "indicates whether
the next page is available for scan" for this variable then it will be
clear?

> 5. Comment for _bt_parallel_seize() says:
> "False indicates that we have reached the end of scan for
>  current scankeys and for that we return block number as P_NONE."
>
>  What is the reason to check (blkno == P_NONE) after checking (status == 
> false)
>  in _bt_first() (see code below)? If comment is correct
>  we'll never reach _bt_parallel_done()
>
> +   blkno = _bt_parallel_seize(scan, &status);
> +   if (status == false)
> +   {
> +   BTScanPosInvalidate(so->currPos);
> +   return false;
> +   }
> +   else if (blkno == P_NONE)
> +   {
> +   _bt_parallel_done(scan);
> +   BTScanPosInvalidate(so->currPos);
> +   return false;
> +   }
>

The first time the master backend or a worker hits the last page (calls
this API), it will return P_NONE; after that, when any worker tries to
fetch the next page, it will return status as false.  I think we can
expand the comment to explain this clearly.  Let me know if you need
more clarification and I can explain it in detail.
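
To make the protocol a bit more concrete, here is a rough sketch of the
idea (not the actual nbtree code; ParallelSharedState is a hypothetical
stand-in for the shared parallel scan state):

static BlockNumber
sketch_parallel_seize(ParallelSharedState *shared, bool *status)
{
	BlockNumber blkno;

	SpinLockAcquire(&shared->mutex);
	if (shared->scan_done)
	{
		/* later callers: the scan has already finished for these scan keys */
		*status = false;
		blkno = P_NONE;
	}
	else
	{
		*status = true;
		blkno = shared->next_page;
		if (blkno == P_NONE)
			shared->scan_done = true;	/* first caller to reach the end */
	}
	SpinLockRelease(&shared->mutex);

	return blkno;
}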

> 6. To avoid code duplication, I would wrap this into the function
>
> +   /* initialize moreLeft/moreRight appropriately for scan direction */
> +   if (ScanDirectionIsForward(dir))
> +   {
> +   so->currPos.moreLeft = false;
> +  

Re: [HACKERS] Hang in pldebugger after git commit : 98a64d0

2016-12-21 Thread Amit Kapila
On Wed, Dec 21, 2016 at 9:46 PM, Robert Haas  wrote:
> On Tue, Dec 20, 2016 at 11:32 PM, Amit Kapila  wrote:
>> Ashutosh Sharma has helped to test that pldebugger issue is fixed with
>> attached version.
>
> Committed.
>

Thanks!

> Patch by Amit Kapial.  Reported and tested by Ashutosh Sharma, who
> also provided some analysis.  Further analysis by Michael Paquier.

In the commit note, my name is not spelled correctly, but I think there is
nothing we can do about it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-12-21 Thread Michael Paquier
On Thu, Dec 22, 2016 at 6:41 AM, David Steele  wrote:
> On 12/21/16 4:28 PM, Andres Freund wrote:
>
>> Working on committing this (tomorrow morning, not tonight).  There's
>> some relatively minor things I want to change:

Thanks for looking at this patch.

>> - I don't like the name XLogSetFlags() - it's completely unclear what
>>   those flags refer to - it could just as well be replay
>>   related. XLogSetRecordFlags()?
>
> That sounds a bit more clear.

Fine for me.

>> - Similarly I don't like the name "progress LSN" much. What does
>>   "progress" really mean in that? Maybe "consistency LSN"?
>
> Yes, please.  I think that really cuts to the core of what the patch is
> about.  Progress made perfect sense to me, but consistency is always the
> goal, and what we are saying here is that this is the last xlog record that
> is required to achieve consistency.  Anything that happens to be after it is
> informational only.

Fine as well.

>> - It's currently required to avoid archive timeouts and
>>   checkpoints triggering each other, but I'm nervous about marking all xlog
>>   switches as unimportant. I think it'd be better to only mark
>>   timeout-triggered switches as such.
>
> That seems fine to me.  If the system is truly idle that might trigger one
> more xlog switch than is needed, but it seems like a reasonable compromise.

On a long-running embedded system the difference won't matter much. So
I guess I'm fine with this bit as well.
-- 
Michael




Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-12-21 Thread Michael Paquier
On Thu, Dec 22, 2016 at 6:28 AM, Andres Freund  wrote:
> A mime-type of invalid/octet-stream? That's an, uh, odd choice.

Indeed. I am not sure what kind of accident happened here.
-- 
Michael




Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2016-12-21 Thread Craig Ringer
On 22 December 2016 at 07:49, Craig Ringer  wrote:
> On 22 December 2016 at 00:30, Robert Haas  wrote:
>
>> That makes everything that happens between when we acquire that lock
>> and when we release it non-interruptible, which seems undesirable.  I
>> think that extra copy of oldestXid is a nicer approach.
>
> That's a side-effect I didn't realise. Given that, yes, I agree.
>
> Since we don't truncate clog much, do you think it's reasonable to
> just take XidGenLock again before we proceed? I'm reluctant to add
> another acquisition of a frequently contested lock for something 99.9%
> of the codebase won't care about, so I think it's probably better to
> add a new LWLock, and I'll resubmit on that basis, but figure it's
> worth asking.

Updated.

If you think it's better to just take XidGenLock again, let me know.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From b10ee9a6abb2cc6867b4b78e91b819fe33ae0119 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Wed, 21 Dec 2016 15:37:29 +0800
Subject: [PATCH] Introduce txid_status(bigint) to get status of an xact

If an application loses its connection while a COMMIT request is in
flight, the backend crashes mid-commit, etc, then an application may
not be sure whether or not a commit completed successfully or was
rolled back. While two-phase commit solves this it does so at a
considerable overhead, so introduce a lighter alternative.

txid_status(bigint) lets an application determine the status of a
commit based on an xid-with-epoch as returned by txid_current() or
similar. Status may be committed, aborted, in-progress (including
prepared xacts) or null if the xact is too old for its commit status
to still be retained because it has passed the wrap-around epoch
boundary.

Applications must call txid_current() in their transactions to make
much use of this since PostgreSQL does not automatically report an xid
to the client when one is assigned.

Introduces TransactionIdInRecentPast(...) for the use of other
functions that need similar logic in future.

There was previously no way to look up an arbitrary xid without
running the risk of having clog truncated out from under you. This
hasn't been a problem because anything looking up xids in clog knows
they're protected by datminxid, but that's not the case for arbitrary
user-supplied XIDs. clog is truncated before we advance oldestXid so
taking XidGenLock is insufficient, and there's no way to look up a
SLRU with soft-failure. So we introduce ClogTruncationLock to guard
against concurrent clog truncation.
---
 doc/src/sgml/func.sgml   |  31 +++
 src/backend/access/transam/varsup.c  |  24 ++
 src/backend/commands/vacuum.c|   3 +
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/backend/utils/adt/txid.c | 141 +++
 src/include/access/transam.h |   6 ++
 src/include/catalog/pg_proc.h|   2 +
 src/include/utils/builtins.h |   1 +
 src/test/regress/expected/txid.out   |  68 +++
 src/test/regress/sql/txid.sql|  38 +
 10 files changed, 315 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 47fcb30..3123232 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17143,6 +17143,10 @@ SELECT collation for ('foo' COLLATE "de_DE");
 txid_visible_in_snapshot

 
+   
+txid_status
+   
+

 The functions shown in 
 provide server transaction information in an exportable form.  The main
@@ -17193,6 +17197,11 @@ SELECT collation for ('foo' COLLATE "de_DE");
boolean
is transaction ID visible in snapshot? (do not use with subtransaction ids)
   
+  
+   txid_status(bigint)
+   txid_status
+   report the status of the given xact - committed, aborted, in progress, or NULL if the xid is too old
+  
  
 

@@ -17263,6 +17272,28 @@ SELECT collation for ('foo' COLLATE "de_DE");

 

+txid_status(bigint) reports the commit status of a recent
+transaction.  Applications may use it to determine whether a transaction
+committed or aborted when the application and/or database server crashed or
+lost connection while a COMMIT command was in progress.
+The status of a transaction will be reported as one of:
+
+ 'in progress'
+ 'committed'
+ 'aborted'
+ NULL if xid too old
+
+PostgreSQL discards the commit status of transactions after no references to
+the transaction survive in other active transactions, tables, replication
+slots, etc. This means that the status of older transactions cannot be
+determined.  txid_status(bigint) returns NULL if a
+transaction is too old to look up.  Prepared transactions are reported as
+in progress; applications must check pg_prepared_xacts if they
+need to determine if the xid is a prepared t

[HACKERS] Potential data loss of 2PC files

2016-12-21 Thread Michael Paquier
Hi all,

2PC files are created using RecreateTwoPhaseFile() in two places currently:
- at replay on a XLOG_XACT_PREPARE record.
- At checkpoint with CheckPointTwoPhase().

Now RecreateTwoPhaseFile() is careful to call pg_fsync() to be sure
that the 2PC files find their way to disk. But one piece is missing:
the parent directory pg_twophase is not fsync'd. At replay this is
more sensitive if there is a PREPARE record followed by a checkpoint
record. If there is a power failure after the checkpoint completes,
there is a risk of losing 2PC status files here.

It seems to me that we really should have CheckPointTwoPhase() call
fsync() on pg_twophase to be sure that no files are lost here. There
is no point in doing this operation in RecreateTwoPhaseFile(): if there
are many 2PC transactions to replay, performance would be impacted, and
we don't care about the durability of those files until a checkpoint
moves the redo pointer. I have drafted the attached patch to address
this issue.
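
To be concrete, the core of the draft is roughly the following, at the end
of CheckPointTwoPhase() (a sketch only; the attached patch is
authoritative, and TWOPHASE_DIR is the existing "pg_twophase" path
constant in twophase.c):

	/*
	 * Flush the pg_twophase directory entry itself, so that 2PC status
	 * files created since the last checkpoint survive a power failure
	 * once the checkpoint completes.
	 */
	fsync_fname(TWOPHASE_DIR, true);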

I am adding that as well to the next CF for consideration.

Thoughts?
--
Michael


2pc-loss.patch
Description: application/download



Re: [HACKERS] Getting rid of "unknown error" in dblink and postgres_fdw

2016-12-21 Thread Joe Conway
On 12/21/2016 04:22 PM, Tom Lane wrote:
> Joe Conway  writes:
>> I did notice that postgres_fdw has the following stanza that I don't see
>> in dblink:
> 
>> 8<--
>> /*
>>  * If we don't get a message from the PGresult, try the PGconn.  This
>>  * is needed because for connection-level failures, PQexec may just
>>  * return NULL, not a PGresult at all.
>>  */
>> if (message_primary == NULL)
>>  message_primary = PQerrorMessage(conn);
>> 8<--
> 
>> I wonder if the original issue on pgsql-bugs was a connection-level
>> failure rather than OOM? Seems like dblink ought to do the same.
> 
> Oooh ... I had thought that code was in both, which was why I was having
> a hard time explaining the OP's failure.  But I see you're right,
> which provides a very straightforward explanation for the report.
> I believe that if libpq is unable to malloc a PGresult, it will return
> NULL but put an "out of memory" message into the PGconn's error buffer.
> I had supposed that we'd capture and report the latter, but as the
> dblink code stands, it won't.
> 
> In short, yes, please copy that bit into dblink.

The attached should do the trick I think. Do you think it is reasonable to
backpatch this part too?

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index ee45cd2..db3085a 100644
*** a/contrib/dblink/dblink.c
--- b/contrib/dblink/dblink.c
*** static Relation get_rel_from_relname(tex
*** 112,118 
  static char *generate_relation_name(Relation rel);
  static void dblink_connstr_check(const char *connstr);
  static void dblink_security_check(PGconn *conn, remoteConn *rconn);
! static void dblink_res_error(const char *conname, PGresult *res, const char *dblink_context_msg, bool fail);
  static char *get_connect_string(const char *servername);
  static char *escape_param_str(const char *from);
  static void validate_pkattnums(Relation rel,
--- 112,119 
  static char *generate_relation_name(Relation rel);
  static void dblink_connstr_check(const char *connstr);
  static void dblink_security_check(PGconn *conn, remoteConn *rconn);
! static void dblink_res_error(PGconn *conn, const char *conname, PGresult *res,
! 			 const char *dblink_context_msg, bool fail);
  static char *get_connect_string(const char *servername);
  static char *escape_param_str(const char *from);
  static void validate_pkattnums(Relation rel,
*** dblink_open(PG_FUNCTION_ARGS)
*** 427,433 
  	res = PQexec(conn, buf.data);
  	if (!res || PQresultStatus(res) != PGRES_COMMAND_OK)
  	{
! 		dblink_res_error(conname, res, "could not open cursor", fail);
  		PG_RETURN_TEXT_P(cstring_to_text("ERROR"));
  	}
  
--- 428,434 
  	res = PQexec(conn, buf.data);
  	if (!res || PQresultStatus(res) != PGRES_COMMAND_OK)
  	{
! 		dblink_res_error(conn, conname, res, "could not open cursor", fail);
  		PG_RETURN_TEXT_P(cstring_to_text("ERROR"));
  	}
  
*** dblink_close(PG_FUNCTION_ARGS)
*** 496,502 
  	res = PQexec(conn, buf.data);
  	if (!res || PQresultStatus(res) != PGRES_COMMAND_OK)
  	{
! 		dblink_res_error(conname, res, "could not close cursor", fail);
  		PG_RETURN_TEXT_P(cstring_to_text("ERROR"));
  	}
  
--- 497,503 
  	res = PQexec(conn, buf.data);
  	if (!res || PQresultStatus(res) != PGRES_COMMAND_OK)
  	{
! 		dblink_res_error(conn, conname, res, "could not close cursor", fail);
  		PG_RETURN_TEXT_P(cstring_to_text("ERROR"));
  	}
  
*** dblink_fetch(PG_FUNCTION_ARGS)
*** 599,605 
  		(PQresultStatus(res) != PGRES_COMMAND_OK &&
  		 PQresultStatus(res) != PGRES_TUPLES_OK))
  	{
! 		dblink_res_error(conname, res, "could not fetch from cursor", fail);
  		return (Datum) 0;
  	}
  	else if (PQresultStatus(res) == PGRES_COMMAND_OK)
--- 600,607 
  		(PQresultStatus(res) != PGRES_COMMAND_OK &&
  		 PQresultStatus(res) != PGRES_TUPLES_OK))
  	{
! 		dblink_res_error(conn, conname, res,
! 		 "could not fetch from cursor", fail);
  		return (Datum) 0;
  	}
  	else if (PQresultStatus(res) == PGRES_COMMAND_OK)
*** dblink_record_internal(FunctionCallInfo
*** 750,757 
  if (PQresultStatus(res) != PGRES_COMMAND_OK &&
  	PQresultStatus(res) != PGRES_TUPLES_OK)
  {
! 	dblink_res_error(conname, res, "could not execute query",
! 	 fail);
  	/* if fail isn't set, we'll return an empty query result */
  }
  else
--- 752,759 
  if (PQresultStatus(res) != PGRES_COMMAND_OK &&
  	PQresultStatus(res) != PGRES_TUPLES_OK)
  {
! 	dblink_res_error(conn, conname, res,
! 	 "could not execute query", fail);
  	/* if fail isn't set, we'll return an empty query result */
  }
  else
*** materializeQueryResult(FunctionCallInfo
*** 996,1002 
  			PGresult   *res1 = res;
  
  			res = NULL;
! 			dblink_res_error(conname, res1, "cou

Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.

2016-12-21 Thread Amit Langote
On 2016/12/22 0:31, Robert Haas wrote:
> On Tue, Dec 20, 2016 at 12:22 PM, Alvaro Herrera
>  wrote:
>> Robert Haas wrote:
>>> Implement table partitioning.
>>
>> I thought it was odd to use rd_rel->reloftype as a boolean in
>> ATExecAttachPartition, but apparently we do it elsewhere too, so let's
>> leave that complaint for another day.
> 
> Ugh.  I agree - that's bad style.

Agreed, fixed in the attached patch.

>> What I also found odd in the same function is that we use
>> SearchSysCacheCopyAttName() on each attribute and then don't free the
>> result, and don't ever use the returned tuple either.  A simple fix, I
>> thought, just remove the "Copy" and add a ReleaseSysCache().
> 
> Or use SearchSysCachExists.

Done, too.

>> But then I
>> noticed this whole thing is rather strange -- why not pass a boolean
>> flag down to CreateInheritance() and from there to
>> MergeAttributesIntoExisting() to implement the check?  That seems less
>> duplicative.
> 
> Hmm, that would be another way to do it.

MergeAttributesIntoExisting() is called by ATExecAddInherit() as well,
where we don't want to check that.  Sure, we could check only if
child_is_partition, but I thought it'd be better to keep the shared code
(between regular inheritance and partitioning) as close to the old code
as possible.

Attached patch also fixes a couple of other minor issues.

Thanks,
Amit
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 115b98313e..ee79b726f2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -12996,7 +12996,6 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
 			   *existConstraint;
 	SysScanDesc scan;
 	ScanKeyData skey;
-	HeapTuple	tuple;
 	AttrNumber	attno;
 	int			natts;
 	TupleDesc	tupleDesc;
@@ -13018,7 +13017,7 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
  errmsg("\"%s\" is already a partition",
 		RelationGetRelationName(attachRel;
 
-	if (attachRel->rd_rel->reloftype)
+	if (OidIsValid(attachRel->rd_rel->reloftype))
 		ereport(ERROR,
 (errcode(ERRCODE_WRONG_OBJECT_TYPE),
  errmsg("cannot attach a typed table as partition")));
@@ -13119,9 +13118,10 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
 		if (attribute->attisdropped)
 			continue;
 
-		/* Find same column in parent (matching on column name). */
-		tuple = SearchSysCacheCopyAttName(RelationGetRelid(rel), attributeName);
-		if (!HeapTupleIsValid(tuple))
+		/* Try to find the column in parent (matching on column name) */
+		if (!SearchSysCacheExists2(ATTNAME,
+   ObjectIdGetDatum(RelationGetRelid(rel)),
+   CStringGetDatum(attributeName)))
 			ereport(ERROR,
 	(errcode(ERRCODE_DATATYPE_MISMATCH),
 	 errmsg("table \"%s\" contains column \"%s\" not found in parent \"%s\"",
@@ -13167,7 +13167,6 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
 	 * There is a case in which we cannot rely on just the result of the
 	 * proof.
 	 */
-	tupleDesc = RelationGetDescr(attachRel);
 	attachRel_constr = tupleDesc->constr;
 	existConstraint = NIL;
 	if (attachRel_constr != NULL)



Re: [HACKERS] SET NOT NULL [NOT VALID / CONCURRENTLY]?

2016-12-21 Thread Joel Jacobson
Attached is the function SET_NOT_NULL(_Schema name, _Table name, _Column
name) which does the following:

1. LOCK TABLE %I.%I IN ACCESS EXCLUSIVE MODE
just like the normal DDL commands would do

2. SELECT EXISTS (SELECT 1 FROM %I.%I WHERE %I IS NULL)
which is fast if there is an index on the column

3. UPDATE pg_catalog.pg_attribute SET attnotnull = TRUE
WHERE attrelid = %L::oid
AND   attname  = %L

Pragmatically, would this be a safe approach?


On Wed, Dec 21, 2016 at 6:53 PM, Joel Jacobson  wrote:

> If you are fully confident you have no NULL values,
> e.g. if you have all your logic in db functions and you validate all
> INSERTs to a table won't pass any NULL values,
> and you have checked all the rows in a table are NOT NULL for the column,
> would it be completely crazy to just set pg_attribute.attnotnull to
> TRUE for the column?
>
> Is anything else happening "under the hood" other than locking all rows
> and verifying there are no NULL rows, and then setting attnotnull to
> TRUE?
>
>
> On Wed, Dec 21, 2016 at 6:37 PM, Craig Ringer 
> wrote:
> > On 21 December 2016 at 19:01, Joel Jacobson  wrote:
> >
> >> Similar to what we (Trustly) did when we sponsored the FOR KEY LOCK
> >> feature to improve concurrency,
> >> we would be very interested in also sponsoring this feature, as it
> >> would mean a great lot to us.
> >> I don't know if this is the right forum trying to find someone/some
> >> company to sign up for the task,
> >> please let me know if I should mail to some other list. Thanks.
> >
> > You'll probably get mail off list.
> >
> > For what it's worth, there's a bit of a complexity here. PostgreSQL
> > doesn't model NOT NULL as a true CONSTRAINT. Instead it's a column
> > attribute. I suspect we would need to change that in order to allow a
> > NOT VALID NOT NULL constraint to be created.
> >
> > That's at least partly why the docs say that "option NOT VALID [...]
> > is currently only allowed for foreign key and CHECK constraints".
> >
> > Note that "[VALIDATE] acquires only a SHARE UPDATE EXCLUSIVE lock on
> > the table being altered" so it's already suitable for what you need.
> > The challenge is making it possible to create a NOT VALID constraint
> > for NOT NULL.
> >
> > --
> >  Craig Ringer   http://www.2ndQuadrant.com/
> >  PostgreSQL Development, 24x7 Support, Training & Services
>
>
>
> --
> Joel Jacobson
>
> Mobile: +46703603801
> Trustly.com | Newsroom | LinkedIn | Twitter
>



-- 
Joel Jacobson

Mobile: +46703603801
Trustly.com | Newsroom | LinkedIn | Twitter


set_not_null.sql
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Getting rid of "unknown error" in dblink and postgres_fdw

2016-12-21 Thread Tom Lane
Joe Conway  writes:
> I did notice that postgres_fdw has the following stanza that I don't see
> in dblink:

> 8<--
> /*
>  * If we don't get a message from the PGresult, try the PGconn.  This
>  * is needed because for connection-level failures, PQexec may just
>  * return NULL, not a PGresult at all.
>  */
> if (message_primary == NULL)
>   message_primary = PQerrorMessage(conn);
> 8<--

> I wonder if the original issue on pgsql-bugs was a connection-level
> failure rather than OOM? Seems like dblink ought to do the same.

Oooh ... I had thought that code was in both, which was why I was having
a hard time explaining the OP's failure.  But I see you're right,
which provides a very straightforward explanation for the report.
I believe that if libpq is unable to malloc a PGresult, it will return
NULL but put an "out of memory" message into the PGconn's error buffer.
I had supposed that we'd capture and report the latter, but as the
dblink code stands, it won't.

In short, yes, please copy that bit into dblink.
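
Roughly speaking, dblink just needs the same fallback.  A minimal sketch,
as a standalone helper (the function name and exact placement inside
dblink are illustrative assumptions, not the committed change):

#include "libpq-fe.h"

/*
 * Prefer the PGresult's primary message; if PQexec returned NULL (a
 * connection-level failure) or the result carries no message, fall back
 * to the connection's error buffer, as postgres_fdw does.
 */
static const char *
best_error_message(PGconn *conn, PGresult *res)
{
	char	   *message_primary = NULL;

	if (res != NULL)
		message_primary = PQresultErrorField(res, PG_DIAG_MESSAGE_PRIMARY);

	if (message_primary == NULL)
		message_primary = PQerrorMessage(conn);

	return message_primary;
}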

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Getting rid of "unknown error" in dblink and postgres_fdw

2016-12-21 Thread Joe Conway
On 12/21/2016 10:08 AM, Tom Lane wrote:
> I wrote:
 I propose that we should change that string to "could not obtain message
 string for error on connection "foo"", or something along that line.
> 
> BTW, looking closer, I notice that the dblink case already has
> 
>   errcontext("Error occurred on dblink connection named \"%s\": %s.",
>  dblink_context_conname, dblink_context_msg)));
> 
> so we probably don't need the connection name in the primary error
> message.  Now I think "could not obtain message string for remote error"
> would be a sufficient message.
> 
> In the postgres_fdw case, I'd be inclined to use the same replacement
> primary message.  Maybe we should think about adding the server name
> to the errcontext there, but that seems like an independent improvement.

Committed that way.

I did notice that postgres_fdw has the following stanza that I don't see
in dblink:

8<--
/*
 * If we don't get a message from the PGresult, try the PGconn.  This
 * is needed because for connection-level failures, PQexec may just
 * return NULL, not a PGresult at all.
 */
if (message_primary == NULL)
message_primary = PQerrorMessage(conn);
8<--

I wonder if the original issue on pgsql-bugs was a connection-level
failure rather than OOM? Seems like dblink ought to do the same.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Why does plpython delay composite type resolution?

2016-12-21 Thread Jim Nasby

On 12/21/16 8:39 AM, Tom Lane wrote:

Jim Nasby  writes:

On 12/21/16 1:55 AM, Andreas Karlsson wrote:

Does your patch handle "ALTER TYPE name ADD ATTRIBUTE ..."? My immediate
guess would be that it could be a cache invalidation thing.



Won't that only happen at end of transaction?


No.

BEGIN;
SELECT plpython_function();
ALTER TYPE ...;
SELECT plpython_function();
COMMIT;

For that matter, the plpython function could execute the ALTER itself
through SPI, or call another function that does so.

(I'm not claiming that the existing code, either in plpython or other
PLs, necessarily handles such all scenarios nicely.  But we shouldn't
make it worse.)


Hmm... so I guess the only way we could safely handle this is if any 
caching of type info happened via fcinfo->flinfo->fn_extra? Would it 
also work if we verified pg_type.(tid,xmin) hadn't changed? (That's what 
plpython currently does to verify a row in pg_procedure hasn't changed.)

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2016-12-21 Thread Craig Ringer
On 22 December 2016 at 00:30, Robert Haas  wrote:

> That makes everything that happens between when we acquire that lock
> and when we release it non-interruptible, which seems undesirable.  I
> think that extra copy of oldestXid is a nicer approach.

That's a side-effect I didn't realise. Given that, yes, I agree.

Since we don't truncate clog much, do you think it's reasonable to
just take XidGenLock again before we proceed? I'm reluctant to add
another acquisition of a frequently contested lock for something 99.9%
of the codebase won't care about, so I think it's probably better to
add a new LWLock, and I'll resubmit on that basis, but figure it's
worth asking.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-12-21 Thread Michael Paquier
On Wed, Dec 21, 2016 at 10:37 PM, Stas Kelvich  wrote:
> On 21 Dec 2016, at 19:56, Michael Paquier  wrote:
>> That's indeed way simpler than before. Have you as well looked at the
>> most simple approach discussed? That would be just roughly replacing
>> the pg_fsync() calls currently in RecreateTwoPhaseFile() by a save
>> into a list as you are doing, then issue them all at checkpoint. Even for
>> 2PC files that are created and then removed before the next
>> checkpoint, those will likely be in system cache.
>
>   Yes, I tried that as well. But in such an approach another bottleneck
> arises: new file creation isn't a very cheap operation itself. A dual Xeon
> with 100 backends quickly hit that, and OS routines for file creation
> occupy the first places in perf top. Probably that depends on the
> filesystem (I used ext4), but avoiding file creation when it isn't
> necessary seems like a cleaner approach.
>   On the other hand it is possible to skip file creation by reusing files,
> for example naming them by dummy PGPROC offset, but that would require
> some changes to places that right now look only at filenames.

Interesting. Thanks for looking at it. Your latest approach looks more
promising based on that then.

>> And this saves lookups at the WAL segments
>> still present in pg_xlog, making the operation at checkpoint much
>> faster with many 2PC files to process.
>
> ISTM your reasoning about filesystem cache applies here as well, but just
> without spending time on file creation.

True. The more spread out the checkpoints and 2PC files are, the greater
the risk of needing disk access. Memory's cheap anyway. What was the system
memory? How many checkpoints did you trigger for how many 2PC files
created? Perhaps it would be a good idea to look for the 2PC files
from WAL records in a specific order. Did you try to use
dlist_push_head instead of dlist_push_tail? This may make a difference
on systems where WAL segments don't fit in system cache as the latest
files generated would be looked at first for 2PC data.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUG?] pg_event_trigger_ddl_commands() error with ALTER TEXT SEARCH CONFIGURATION

2016-12-21 Thread Artur Zakirov
Thank you for your comments, Stephen.

2016-12-21 20:34 GMT+03:00 Stephen Frost :
>
> Did you happen to look at adding a regression test for this to
> test_ddl_deparse?

Of course. I updated the patch.

>
>> This patch only fixes the bug. But I think I also can do a patch which
>> will give pg_ts_config_map entries with
>> pg_event_trigger_ddl_commands() if someone wants. It can be done using
>> new entry in the CollectedCommandType structure maybe.
>
> While that sounds like a good idea, it seems like it's more a feature
> addition rather than a bugfix, no?
>

Yes, agree with you. It should be added as a separate patch. I think I
will deal with it.


-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/commands/tsearchcmds.c b/src/backend/commands/tsearchcmds.c
index b24011371c..2b84848cf5 100644
--- a/src/backend/commands/tsearchcmds.c
+++ b/src/backend/commands/tsearchcmds.c
@@ -1215,10 +1215,10 @@ AlterTSConfiguration(AlterTSConfigurationStmt *stmt)
 	/* Update dependencies */
 	makeConfigurationDependencies(tup, true, relMap);
 
-	InvokeObjectPostAlterHook(TSConfigMapRelationId,
+	InvokeObjectPostAlterHook(TSConfigRelationId,
 			  HeapTupleGetOid(tup), 0);
 
-	ObjectAddressSet(address, TSConfigMapRelationId, cfgId);
+	ObjectAddressSet(address, TSConfigRelationId, cfgId);
 
 	heap_close(relMap, RowExclusiveLock);
 
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index fd4eff4907..b7a4c8e531 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1479,7 +1479,8 @@ ProcessUtilitySlow(ParseState *pstate,
 break;
 
 			case T_AlterTSConfigurationStmt:
-address = AlterTSConfiguration((AlterTSConfigurationStmt *) parsetree);
+AlterTSConfiguration((AlterTSConfigurationStmt *) parsetree);
+commandCollected = true;
 break;
 
 			case T_AlterTableMoveAllStmt:
diff --git a/src/test/modules/test_ddl_deparse/Makefile b/src/test/modules/test_ddl_deparse/Makefile
index 8ea6f39afd..3a57a95c84 100644
--- a/src/test/modules/test_ddl_deparse/Makefile
+++ b/src/test/modules/test_ddl_deparse/Makefile
@@ -23,6 +23,7 @@ REGRESS = test_ddl_deparse \
 	comment_on \
 	alter_function \
 	alter_sequence \
+	alter_ts_config \
 	alter_type_enum \
 	opfamily \
 	defprivs \
diff --git a/src/test/modules/test_ddl_deparse/expected/alter_ts_config.out b/src/test/modules/test_ddl_deparse/expected/alter_ts_config.out
new file mode 100644
index 00..afc352fc5f
--- /dev/null
+++ b/src/test/modules/test_ddl_deparse/expected/alter_ts_config.out
@@ -0,0 +1,8 @@
+--
+-- ALTER TEXT SEARCH CONFIGURATION
+--
+CREATE TEXT SEARCH CONFIGURATION en (copy=english);
+NOTICE:  DDL test: type simple, tag CREATE TEXT SEARCH CONFIGURATION
+ALTER TEXT SEARCH CONFIGURATION en
+  ALTER MAPPING FOR host, email, url, sfloat WITH simple;
+NOTICE:  DDL test: type alter text search configuration, tag ALTER TEXT SEARCH CONFIGURATION
diff --git a/src/test/modules/test_ddl_deparse/sql/alter_ts_config.sql b/src/test/modules/test_ddl_deparse/sql/alter_ts_config.sql
new file mode 100644
index 00..ac13e21dda
--- /dev/null
+++ b/src/test/modules/test_ddl_deparse/sql/alter_ts_config.sql
@@ -0,0 +1,8 @@
+--
+-- ALTER TEXT SEARCH CONFIGURATION
+--
+
+CREATE TEXT SEARCH CONFIGURATION en (copy=english);
+
+ALTER TEXT SEARCH CONFIGURATION en
+  ALTER MAPPING FOR host, email, url, sfloat WITH simple;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-21 Thread Peter Geoghegan
On Wed, Dec 21, 2016 at 10:21 AM, Peter Geoghegan  wrote:
> On Wed, Dec 21, 2016 at 6:00 AM, Robert Haas  wrote:
>> 3. Just live with the waste of space.
>
> I am loathe to create a special case for the parallel interface too,
> but I think it's possible that *no* caller will ever actually need to
> live with this restriction at any time in the future.

I just realized that you were actually talking about the waste of
space in workers here, as opposed to the theoretical waste of space
that would occur in the leader should there ever be a parallel
randomAccess tuplesort caller.

To be clear, I am totally against allowing a waste of logtape.c temp
file space in *workers*, because that implies a cost that will most
certainly be felt by users all the time.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-12-21 Thread David Steele

On 12/21/16 4:40 PM, Andres Freund wrote:

On 2016-12-21 16:35:28 -0500, Robert Haas wrote:



What I think "progress LSN"
is getting at -- actually fairly well -- is whether we're getting
anything *important* done, not whether we are consistent.  I don't
mind changing the name, but not to consistency LSN.


Well, progress could just as well be replay. Or the actual insertion
point. Or up to where we've written out. Or synced out. Or
replicated

Open to other suggestions - I'm not really happy with consistency LSN,
but definitely unhappy with progress LSN.


MinConsistencyLSN?

--
-David
da...@pgmasters.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-12-21 Thread David G. Johnston
On Wed, Dec 21, 2016 at 2:40 PM, Andres Freund  wrote:

> That's imo pretty much what progress LSN currently describes. Have there
> been any records which are important for durability/consistency and
> hence need to be archived and such.
>

The above, to me, describes a "milestone LSN"...

David J.


Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-12-21 Thread David Steele

Hi Andres,

On 12/21/16 4:28 PM, Andres Freund wrote:


Working on committing this (tomorrow morning, not tonight).  There's
some relatively minor things I want to change:

- I don't like the name XLogSetFlags() - it's completely unclear what
  that those flags refer to - it could just as well be replay
  related. XLogSetRecordFlags()?


That sounds a bit more clear.


- Similarly I don't like the name "progress LSN" much. What does
  "progress" really mean in that". Maybe "consistency LSN"?


Yes, please.  I think that really cuts to the core of what the patch is 
about.  Progress made perfect sense to me, but consistency is always the 
goal, and what we are saying here is that this is the last xlog record 
that is required to achieve consistency.  Anything that happens to be 
after it is informational only.



- It's currently required to avoid triggering archive timeouts and
  checkpoints triggering each other, but I'm nervous marking all xlog
  switches as unimportant. I think it'd be better to only mark timeout
  triggered switches as such.


That seems fine to me.  If the system is truly idle that might trigger 
one more xlog switch than is needed, but it seems like a reasonable
compromise.
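
For concreteness, I imagine the switch request ending up looking roughly
like this, with names taken from this thread; the flag name is an
assumption, not necessarily what the patch will use:

#include "access/xlog_internal.h"
#include "access/xloginsert.h"

XLogRecPtr
RequestXLogSwitch(bool mark_unimportant)
{
	/* XLOG SWITCH records carry no data */
	XLogBeginInsert();

	/* only timeout-driven switches are exempted from "progress" */
	if (mark_unimportant)
		XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);

	return XLogInsert(RM_XLOG_ID, XLOG_SWITCH);
}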


--
-David
da...@pgmasters.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-12-21 Thread Andres Freund
On 2016-12-21 16:35:28 -0500, Robert Haas wrote:
> On Wed, Dec 21, 2016 at 4:28 PM, Andres Freund  wrote:
> > - Similarly I don't like the name "progress LSN" much. What does
> >   "progress" really mean in that". Maybe "consistency LSN"?
> 
> Whoa.  -1 from me for "consistency LSN".  Consistency has to with
> whether the cluster has recovered up to the minimum recovery point or
> whatever -- that is -- questions like "am i going to run into torn
> pages?" and "should I expect some heap tuples to maybe be missing
> index tuples, or the other way around?".

That's imo pretty much what progress LSN currently describes. Have there
been any records which are important for durability/consistency and
hence need to be archived and such.


> What I think "progress LSN"
> is getting at -- actually fairly well -- is whether we're getting
> anything *important* done, not whether we are consistent.  I don't
> mind changing the name, but not to consistency LSN.

Well, progress could just as well be replay. Or the actual insertion
point. Or up to where we've written out. Or synced out. Or
replicated

Open to other suggestions - I'm not really happy with consistency LSN,
but definitely unhappy with progress LSN.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-12-21 Thread Robert Haas
On Wed, Dec 21, 2016 at 4:28 PM, Andres Freund  wrote:
> - Similarly I don't like the name "progress LSN" much. What does
>   "progress" really mean in that". Maybe "consistency LSN"?

Whoa.  -1 from me for "consistency LSN".  Consistency has to with
whether the cluster has recovered up to the minimum recovery point or
whatever -- that is -- questions like "am i going to run into torn
pages?" and "should I expect some heap tuples to maybe be missing
index tuples, or the other way around?".  What I think "progress LSN"
is getting at -- actually fairly well -- is whether we're getting
anything *important* done, not whether we are consistent.  I don't
mind changing the name, but not to consistency LSN.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-12-21 Thread Andres Freund
Hi,

A mime-type of invalid/octet-stream? That's an, uh, odd choice.

Working on committing this (tomorrow morning, not tonight).  There's
some relatively minor things I want to change:

- I don't like the name XLogSetFlags() - it's completely unclear what
  that those flags refer to - it could just as well be replay
  related. XLogSetRecordFlags()?
- Similarly I don't like the name "progress LSN" much. What does
  "progress" really mean in that". Maybe "consistency LSN"?
- It's currently required to avoid triggering archive timeouts and
  checkpoints triggering each other, but I'm nervous marking all xlog
  switches as unimportant. I think it'd be better to only mark timeout
  triggered switches as such.

Otherwise this seems to look good.

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Measuring replay lag

2016-12-21 Thread Thomas Munro
On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao  wrote:
> I agree that the capability to measure the remote_apply lag is very useful.
> Also I want to measure the remote_write and remote_flush lags, for example,
> in order to diagnose the cause of replication lag.

Good idea.  I will think about how to make that work.  There was a
proposal to make writing and flushing independent[1].  I'd like that
to go in.  Then the write_lag and flush_lag could diverge
significantly, and it would be nice to be able to see that effect as
time (though you could already see it with LSN positions).

> For that, what about maintaining the pairs of send-timestamp and LSN in
> *sender side* instead of receiver side? That is, walsender adds the pairs
> of send-timestamp and LSN into the buffer every sampling period.
> Whenever walsender receives the write, flush and apply locations from
> walreceiver, it calculates the write, flush and apply lags by comparing
> the received and stored LSN and comparing the current timestamp and
> stored send-timestamp.

I thought about that too, but I couldn't figure out how to make the
sampling work.  If the primary is choosing (LSN, time) pairs to store
in a buffer, and the standby is sending replies at times of its
choosing (when wal_receiver_status_interval has been exceeded), then
you can't accurately measure anything.

You could fix that by making the standby send a reply *every time* it
applies some WAL (like it does for transactions committing with
synchronous_commit = remote_apply, though that is only for commit
records), but then we'd be generating a lot of recovery->walreceiver
communication and standby->primary network traffic, even for people
who don't otherwise need it.  It seems unacceptable.

Or you could fix that by setting the XACT_COMPLETION_APPLY_FEEDBACK
bit in the xl_xinfo.xinfo for selected transactions, as a way to ask
the standby to send a reply when that commit record is applied, but
that only works for commit records.  One of my goals was to be able to
report lag accurately even between commits (very large data load
transactions etc).

Or you could fix that by sending a list of 'interesting LSNs' to the
standby, as a way to ask it to send a reply when those LSNs are
applied.  Then you'd need a circular buffer of (LSN, time) pairs in
the primary AND a circular buffer of LSNs in the standby to remember
which locations should generate a reply.  This doesn't seem to be an
improvement.

That's why I thought that the standby should have the (LSN, time)
buffer: it decides which samples to record in its buffer, using LSN
and time provided by the sending server, and then it can send replies
at exactly the right times.  The LSNs don't have to be commit records,
they're just arbitrary points in the WAL stream which we attach
timestamps to.  IPC and network overhead is minimised, and accuracy is
maximised.
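
To illustrate the shape of that buffer, here is a sketch; every name below
is invented for this description and not taken from the patch:

#include "access/xlogdefs.h"
#include "datatype/timestamp.h"

#define LAG_SAMPLES 8192

typedef struct LagSample
{
	XLogRecPtr	lsn;			/* arbitrary point in the WAL stream */
	TimestampTz send_time;		/* timestamp attached by the sending server */
} LagSample;

typedef struct LagSampleBuffer
{
	int			read_head;		/* oldest sample replay has not reached */
	int			write_head;		/* next free slot for the walreceiver */
	LagSample	samples[LAG_SAMPLES];
} LagSampleBuffer;

The walreceiver appends a sample whenever it decides to record one, and
recovery pops each sample once replay passes its LSN, triggering a reply
that carries the sample's timestamp back to the primary.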

> As a bonus of this approach, we don't need to add the field into the replay
> message that walreceiver can very frequently send back. Which might be
> helpful in terms of networking overhead.

For the record, these replies are only sent approximately every
replay_lag_sample_interval (with variation depending on replay speed)
and are only 42 bytes with the new field added.

[1] 
https://www.postgresql.org/message-id/CA%2BU5nMJifauXvVbx%3Dv3UbYbHO3Jw2rdT4haL6CCooEDM5%3D4ASQ%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minor correction in alter_table.sgml

2016-12-21 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> So maybe something like
>> 
>> All the forms of ALTER TABLE that act on a single table,
>> except RENAME and SET SCHEMA, can be combined into a
>> list of multiple alterations to be applied together.

> Committed and back-patch'd that way.

BTW, so far as the HEAD version of this patch goes, I notice that
ATTACH PARTITION and DETACH PARTITION were recently added to the
list of exceptions.  But they're not exceptions according to this
wording: they act on more than one table (the parent and the
partition), no?  So we could simplify the sentence some more by
removing them again.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minor correction in alter_table.sgml

2016-12-21 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> So maybe something like
> 
>   All the forms of ALTER TABLE that act on a single table,
>   except RENAME and SET SCHEMA, can be combined into a
>   list of multiple alterations to be applied together.

Committed and back-patch'd that way.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] pg_dump vs. TRANSFORMs

2016-12-21 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > That's a good point, we might be doing things wrong in other places in
> > the code by using FirstNormalObjectId on pre-8.1 servers.
> 
> > What I suggest then is an independent patch which uses a different
> > variable than FirstNormalObjectId and sets it correctly based on the
> > version of database that we're connecting to,
> 
> +1

Done.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-21 Thread Peter Geoghegan
On Wed, Dec 21, 2016 at 6:00 AM, Robert Haas  wrote:
> 3. Just live with the waste of space.

I am loathe to create a special case for the parallel interface too,
but I think it's possible that *no* caller will ever actually need to
live with this restriction at any time in the future. I am strongly
convinced that adopting tuplesort.c for parallelism should involve
partitioning [1]. With that approach, even randomAccess callers will
not want to read at random for one big materialized tape, since that's
at odds with the whole point of partitioning, which is to remove any
dependencies between workers quickly and early, so that as much work
as possible is pushed down into workers. If a merge join were
performed in a world where we have this kind of partitioning, we
definitely wouldn't require one big materialized tape that is
accessible within each worker.

What are the chances of any real user actually having to live with the
waste of space at some point in the future?

> Another tangentially-related problem I just realized is that we need
> to somehow handle the issues that tqueue.c does when transferring
> tuples between backends -- most of the time there's no problem, but if
> anonymous record types are involved then tuples require "remapping".
> It's probably harder to provoke a failure in the tuplesort case than
> with parallel query per se, but it's probably not impossible.

Thanks for pointing that out. I'll look into it.

BTW, I discovered a bug where there is very low memory available
within each worker -- tuplesort.c throws an error within workers
immediately. It's just a matter of making sure that they at least have
64KB of workMem, which is a pretty straightforward fix. Obviously it
makes no sense to use so little memory in the first place; this is a
corner case.

[1] 
https://www.postgresql.org/message-id/CAM3SWZR+ATYAzyMT+hm-Bo=1l1smtjbndtibwbtktyqs0dy...@mail.gmail.com
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Getting rid of "unknown error" in dblink and postgres_fdw

2016-12-21 Thread Tom Lane
I wrote:
>>> I propose that we should change that string to "could not obtain message
>>> string for error on connection "foo"", or something along that line.

BTW, looking closer, I notice that the dblink case already has

  errcontext("Error occurred on dblink connection named \"%s\": %s.",
 dblink_context_conname, dblink_context_msg)));

so we probably don't need the connection name in the primary error
message.  Now I think "could not obtain message string for remote error"
would be a sufficient message.

In the postgres_fdw case, I'd be inclined to use the same replacement
primary message.  Maybe we should think about adding the server name
to the errcontext there, but that seems like an independent improvement.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background contrib module proposal

2016-12-21 Thread Andrew Borodin
2016-12-21 20:42 GMT+05:00 Robert Haas :
> This whole subthread seems like a distraction to me.  I find it hard
> to believe that this test case would be stable enough to survive the
> buildfarm where, don't forget, we have things like
> CLOBBER_CACHE_ALWAYS machines where queries take 100x longer to run.
> But even if it is, surely we can pick a less contrived test case.  So
> why worry about this?

David Fetter's test is deterministic and shall pass no matter how slow
and unpredictable performance is on a server.

Best regards, Andrey Borodin.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background contrib module proposal

2016-12-21 Thread David Fetter
On Wed, Dec 21, 2016 at 10:42:18AM -0500, Robert Haas wrote:
> On Wed, Dec 21, 2016 at 10:29 AM, David Fetter  wrote:
> > On Wed, Dec 21, 2016 at 06:31:52PM +0800, Craig Ringer wrote:
> >> On 21 December 2016 at 14:26, Andrew Borodin  wrote:
> >>
> >> > I'm not sure every platform supports microsecond sleeps
> >>
> >> Windows at least doesn't by default, unless that changed in Win2k12
> >> and Win8 with the same platform/kernel improvements that delivered
> >> https://msdn.microsoft.com/en-us/library/hh706895(v=vs.85).aspx . I'm
> >> not sure. On older systems sleeps are 1ms to 15ms.
> >
> > Apparently, as of 2011, there were ways to do this.  It's not crystal
> > clear to me just how reliable they are.
> >
> > http://stackoverflow.com/questions/9116618/cpp-windows-is-there-a-sleep-function-in-microseconds
> 
> This whole subthread seems like a distraction to me.  I find it hard
> to believe that this test case would be stable enough to survive the
> buildfarm where, don't forget, we have things like
> CLOBBER_CACHE_ALWAYS machines where queries take 100x longer to run.
> But even if it is, surely we can pick a less contrived test case.
> So why worry about this?

I wasn't super worried about the actual sleep times, but I was having
trouble puzzling out what the test was actually doing, so I rewrote it
with what I thought of as more clarity.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Getting rid of "unknown error" in dblink and postgres_fdw

2016-12-21 Thread Joe Conway
On 12/21/2016 09:27 AM, Tom Lane wrote:
> Joe Conway  writes:
>> On 12/21/2016 08:49 AM, Tom Lane wrote:
>>> I propose that we should change that string to "could not obtain message
>>> string for error on connection "foo"", or something along that line.
>>>
>>> postgres_fdw has the same disease.  It wouldn't have the notion of a named
>>> connection, but maybe we could insert the foreign server name instead.
> 
>> Seems reasonable to me. I can work on it if you'd like. Do you think
>> this should be backpatched?
> 
> If you have time for it, please do, I have lots on my plate already.

Ok, will do.

> I'd vote for back-patching; the benefits of a clearer error message
> are obvious, and it hardly seems likely that any existing applications
> are depending on this specific error string.

Agreed.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] [ patch ] pg_dump: new --custom-fetch-table and --custom-fetch-value parameters

2016-12-21 Thread Stephen Frost
Andrea,

* Andrea Urbani (matfan...@mail.com) wrote:
>I had a problem with a Postgresql 9.3.5 on 32 bit linux, old 2.6.26
>kernel:

Ok, though, to be clear, this is a feature request, so we wouldn't
back-patch adding this to pg_dump.

    >I have solved it by adding two new parameters, --custom-fetch-table and
    >--custom-fetch-value, to fetch fewer records for the specified table(s).

Giving the user the ability to change the fetch size sounds interesting,
though do we really need to specify it per table?  What about just a
--fetch-size=X option?
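
For instance, presumably the hard-coded fetch count in pg_dump's data
dumper could just become something like the following (the variable and
helper names are invented here for illustration):

#include "pqexpbuffer.h"

/* default stays at the current value when the option is not given */
static int	fetch_size = 100;

static void
append_fetch_command(PQExpBuffer q)
{
	appendPQExpBuffer(q, "FETCH %d FROM _pg_dump_cursor", fetch_size);
}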

    >This does not completely solve the problem, but it gives you a better
    >chance of being able to dump your database.

That is certainly a worthwhile goal.

>    pg_dump --dbname=healthorganizer --username=hor --column-inserts
>--custom-fetch-table='"tDocumentsFiles"' --custom-fetch-value=25

I don't particularly like the use of 'custom' in the name of the option,
seems like it's just a noise word and not really necessary.

>I haven't tested the documentation: too many problems while building it
>(also the original version, without my changes; probably I have bogus
    >tools... and too little time to check/try...).
>Attached the patches for the master and REL9_6_STABLE.

I agree the documentation can be a bit of a pain, but there's a lot of
issues with the patch itself when it comes to the project style.  The
indentation doesn't look like it's all correct, and multi-line comments
should be of the form:

/*
 * text here
 */

Lastly, it'd be good to have this patch added to
https://commitfest.postgresql.org to have it formally reviewed in the
commitfest cycle coming up in January.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] [BUG?] pg_event_trigger_ddl_commands() error with ALTER TEXT SEARCH CONFIGURATION

2016-12-21 Thread Stephen Frost
Artur,

* Artur Zakirov (a.zaki...@postgrespro.ru) wrote:
> 2016-11-19 21:28 GMT+03:00 Michael Paquier :
> > On Thu, Nov 17, 2016 at 1:17 PM, Alvaro Herrera
> >  wrote:
> >> It's a bug.  You're right that we need to handle the object class
> >> somewhere.  Perhaps I failed to realize that tsconfigs could get
> >> altered.
> >
> > It seems to me that the thing to be careful of here is how a new
> > OBJECT_TSCONFIGURATIONMAP should use ObjectAddress. It does not seem
> > that complicated, but it needs some work.
> > --
> > Michael
> 
> After some investigation it seems to me that OBJECT_TSCONFIGURATIONMAP
> can't be added for pg_ts_config_map. Because this catalog hasn't Oids.

I started looking into this, in part because it's a bug fix.

> But this bug can be easily fixed (patch attached). I think in
> AlterTSConfiguration() TSConfigRelationId should be used instead of
> TSConfigMapRelationId. Secondly, in ProcessUtilitySlow() we can use
> commandCollected = true. Because configuration entry is added in
> EventTriggerCollectAlterTSConfig() into
> currentEventTriggerState->commandList.

We should definitely be using TSConfigRelationId there, as the tuple we
have the OID of is from pg_ts_config, not from pg_ts_config_map.  You're
also right that we need to set commandCollected = true since the
MakeConfigurationMapping() and DropConfigurationMapping() functions get
called from AlterTSConfiguration and, as you say, they call
EventTriggerCollectAlterTSConfig().

Looks like the InvokeObjectPostAlterHook() call has been around since
9.3, so we'll need to back-patch it that far, while the other changes go
back to 9.5.

Did you happen to look at adding a regression test for this to
test_ddl_deparse?

> This patch only fixes the bug. But I think I also can do a patch which
> will give pg_ts_config_map entries with
> pg_event_trigger_ddl_commands() if someone wants. It can be done using
> new entry in the CollectedCommandType structure maybe.

While that sounds like a good idea, it seems like it's more a feature
addition rather than a bugfix, no?

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Getting rid of "unknown error" in dblink and postgres_fdw

2016-12-21 Thread Tom Lane
Joe Conway  writes:
> On 12/21/2016 08:49 AM, Tom Lane wrote:
>> I propose that we should change that string to "could not obtain message
>> string for error on connection "foo"", or something along that line.
>>
>> postgres_fdw has the same disease.  It wouldn't have the notion of a named
>> connection, but maybe we could insert the foreign server name instead.

> Seems reasonable to me. I can work on it if you'd like. Do you think
> this should be backpatched?

If you have time for it, please do, I have lots on my plate already.

I'd vote for back-patching; the benefits of a clearer error message
are obvious, and it hardly seems likely that any existing applications
are depending on this specific error string.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL consistency check facility

2016-12-21 Thread Robert Haas
On Mon, Nov 28, 2016 at 11:31 PM, Michael Paquier
 wrote:
> Moved to CF 2017-01, as no committers have showed up yet :(

Seeing no other volunteers, here I am.

On a first read-through of this patch -- I have not studied it in
detail yet -- this looks pretty good to me.  One concern is that this
patch adds a bit of code to XLogInsert(), which is a very hot piece of
code.  Conceivably, that might produce a regression even when this is
disabled; if so, we'd probably need to make it a build-time option.  I
hope that's not necessary, because I think it would be great to
compile this into the server by default, but we better make sure it's
not a problem.  A bulk load into an existing table might be a good
test case.

Aside from that, I think the biggest issue here is that the masking
functions are virtually free of comments, whereas I think they should
have extensive and detailed comments.  For example, in heap_mask, you
have this:

+page_htup->t_infomask =
+HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;

For something like this, you could write "We want to ignore
differences in hint bits, since they can be set by SetHintBits without
emitting WAL.  Force them all to be set so that we don't notice
discrepancies."  Actually, though, I think that you could be a bit
more nuanced here: HEAP_XMIN_COMMITTED + HEAP_XMIN_INVALID =
HEAP_XMIN_FROZEN, so maybe what you should do is clear
HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID but only clear the others if
one is set but not both.
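
To make that concrete, a rough sketch of the nuance as a standalone helper
(the function name here is made up, not taken from the patch):

#include "postgres.h"
#include "access/htup_details.h"

static void
mask_hint_bits(HeapTupleHeader htup)
{
	uint16		infomask = htup->t_infomask;

	/* xmax hint bits can be set without WAL; always ignore them */
	infomask &= ~(HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID);

	/*
	 * Clear the xmin hint bits only when they do not spell "frozen";
	 * HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID == HEAP_XMIN_FROZEN, and
	 * freezing is WAL-logged, so a genuine difference there is a bug.
	 */
	if ((infomask & HEAP_XMIN_FROZEN) != HEAP_XMIN_FROZEN)
		infomask &= ~(HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID);

	htup->t_infomask = infomask;
}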

Anyway, leaving that aside, I think every single change that gets
masked in every single masking routine needs a similar comment,
explaining why that change can happen on the master without also
happening on the standby and hopefully referring to the code that
makes that unlogged change.

I think wal_consistency_checking, as proposed by Peter, is better than
wal_consistency_check, as implemented.

Having StartupXLOG() call pfree() on the masking buffers is a waste of
code.  The process is going to exit anyway.

+ "Inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",

Primary error messages aren't capitalized.

+if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+{
+/* Caller specified a bogus block_id. Do nothing. */
+continue;
+}

Why would the caller do something so dastardly?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Getting rid of "unknown error" in dblink and postgres_fdw

2016-12-21 Thread Joe Conway
On 12/21/2016 08:49 AM, Tom Lane wrote:
> We have a report in pgsql-general of a dblink query failing with
>   ERROR: unknown error
> This is, to say the least, unhelpful.  And it violates our error
> message style guidelines.
> 
> Where that is coming from is a situation where we've failed to extract
> any primary message from a remote error.  (I theorize that today's report
> is triggered by an out-of-memory situation, but that's only an unsupported
> guess at the moment.)
> 
> I propose that we should change that string to "could not obtain message
> string for error on connection "foo"", or something along that line.
> 
> postgres_fdw has the same disease.  It wouldn't have the notion of a named
> connection, but maybe we could insert the foreign server name instead.
> 
> A possible objection is that if we really are on the hairy edge of OOM,
> trying to construct such error strings might push us over the edge and
> then you get "out of memory" instead.  But IMO that's not worse; it
> could be argued to be a more useful description of the problem.
> 
> Comments?

Seems reasonable to me. I can work on it if you'd like. Do you think
this should be backpatched?

Joe


-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


[HACKERS] [ patch ] pg_dump: new --custom-fetch-table and --custom-fetch-value parameters

2016-12-21 Thread Andrea Urbani

Hello to everybody,

I had a problem with a Postgresql 9.3.5 on 32 bit linux, old 2.6.26 kernel:

Program: pg_dump
Problem: if you have tables with big blob fields and you try to dump them with --inserts you could get errors like
    pg_dump: [archiver (db)] query failed: lost synchronization with server: got message type "D", length 847712348
    pg_dump: [archiver (db)] query was: FETCH 100 FROM _pg_dump_cursor
    or
    pg_dump: [archiver (db)] query failed: ERROR:  out of memory
    DETAIL:  Failed on request of size 1073741823.
    pg_dump: [archiver (db)] query was: FETCH 100 FROM _pg_dump_cursor
    
I have solved it by adding two new parameters, --custom-fetch-table and --custom-fetch-value, to fetch fewer records for the specified table(s).
This does not completely solve the problem, but it gives you a better chance of being able to dump your database.

    pg_dump --dbname=healthorganizer --username=hor --column-inserts --custom-fetch-table='"tDocumentsFiles"' --custom-fetch-value=25

I haven't tested the documentation: too many problems while building it (also the original version, without my changes; probably I have bogus tools... and too little time to check/try...).

Attached the patches for the master and REL9_6_STABLE.

I hope it will help somebody else.

 

Bye
Andrea
matfan...@users.sf.net

 


9_6.patch
Description: Binary data


master.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Getting rid of "unknown error" in dblink and postgres_fdw

2016-12-21 Thread Tom Lane
"David G. Johnston"  writes:
> On Wed, Dec 21, 2016 at 9:49 AM, Tom Lane  wrote:
>> A possible objection is that if we really are on the hairy edge of OOM,
>> trying to construct such error strings might push us over the edge

> What am I missing here?  Constructing said string occurs on the local end
> of the connection but the memory problem exists on the remote end.

Um, that's not clear, in fact it's not likely.  The backend usually
manages to tell you so if it's out of memory --- or, worst case, it
crashes trying, in which case the local libpq ought to gin up a report
about loss of connection.  So the root cause I'm suspecting is that
the local libpq was unable to obtain memory for a PGresult or something
like that.  That theory has some holes of its own, because libpq also
keeps some cards up its sleeve that usually let it report out-of-memory
successfully, but it's the best I can do with the info at hand.

In any case, the point of the error style guidelines is that it's *always*
possible to do better than "unknown error"; now that it's been proven
that this case is reachable in the field, we should try harder.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Getting rid of "unknown error" in dblink and postgres_fdw

2016-12-21 Thread Tom Lane
I wrote:
> We have a report in pgsql-general of a dblink query failing with
>   ERROR: unknown error

Er, fingers faster than brain this morning.  Actually that report is in
pgsql-bugs, specifically bug #14471:
https://www.postgresql.org/message-id/20161221094443.25614.47807%40wrigleys.postgresql.org

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Getting rid of "unknown error" in dblink and postgres_fdw

2016-12-21 Thread David G. Johnston
On Wed, Dec 21, 2016 at 9:49 AM, Tom Lane  wrote:

> A possible objection is that if we really are on the hairy edge of OOM,
> trying to construct such error strings might push us over the edge


What am I missing here?  Constructing said string occurs on the local end
of the connection but the memory problem exists on the remote end.

David J.


Re: [HACKERS] multi-level partitions and partition-wise joins

2016-12-21 Thread Robert Haas
On Wed, Dec 21, 2016 at 6:36 AM, Ashutosh Bapat
 wrote:
> I am starting this as a separate thread for this since the declarative
> partitioning thread has many issues reported and it's better to keep
> this discussion separate from the issues reported on that thread.
>
> While expanding inheritance, any inheritance hierarchy is flattened
> out including partition hierarchy. Partition-wise joins can be
> employed if the joining tables have the same partitioning scheme and
> have equi-join clauses on the partition keys. If two multi-level
> partitioned tables are joined, the partition-wise join can be
> percolated down to the levels up to which the partition schemes match
> and suitable clauses are available. E.g. if two multi-level
> partitioned table have matching partitioning schemes at the top-most
> level, but not below that, we may join the topmost level partitions
> pair-wise, but not partitions on the lower levels. In general, we may
> use partition-wise join for the matching parts of partition hierarchy
> and in the parts that do not match, use join between append relations.
> It will not always be efficient to execute partition-wise joins down to
> the last levels of the partition hierarchy, even if a partition-wise join
> can be employed. It might be possible that executing partition-wise
> joins for only certain parts of the partition hierarchy is efficient,
> while a join of appends is more efficient in the rest of the parts.
>
> In order to decide whether a partition-wise join is efficient for a join
> between given partitioned partitions, we need to identify their
> subpartitions. Similarly, when a join between partitioned partitions
> cannot use a partition-wise join but some other partitions can, we need
> to identify the subpartitions of those partitions, so that they can be
> appended together before joining. That information is lost while
> expanding the RTE. It looks like we need to retain the partitioning
> hierarchy in order to implement partition-wise joins between multi-level
> partitioned tables.
>
> An earlier version of Amit's partition support patches had code to
> retain partitioning hierarchy but it seems it was removed per
> discussion at [1]. I agree with that decision.

I can't quite figure out what the point of this email is.  What did
you want to discuss?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-12-21 Thread Robert Haas
On Wed, Dec 21, 2016 at 5:33 AM, Amit Langote
 wrote:
> Breaking changes into multiple commits/patches does not seem to work for
> adding regression tests.  So, I've combined multiple patches into a single
> patch which is now patch 0002 in the attached set of patches.

Ugh, seriously?  It's fine to combine closely related bug fixes but
not all of these are.  I don't see why you can't add some regression
tests in one patch and then add some more in the next patch.

Meanwhile, committed the latest 0001 and the elog() patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Getting rid of "unknown error" in dblink and postgres_fdw

2016-12-21 Thread Tom Lane
We have a report in pgsql-general of a dblink query failing with
ERROR: unknown error
This is, to say the least, unhelpful.  And it violates our error
message style guidelines.

Where that is coming from is a situation where we've failed to extract
any primary message from a remote error.  (I theorize that today's report
is triggered by an out-of-memory situation, but that's only an unsupported
guess at the moment.)

I propose that we should change that string to "could not obtain message
string for error on connection "foo"", or something along that line.

postgres_fdw has the same disease.  It wouldn't have the notion of a named
connection, but maybe we could insert the foreign server name instead.

A possible objection is that if we really are on the hairy edge of OOM,
trying to construct such error strings might push us over the edge and
then you get "out of memory" instead.  But IMO that's not worse; it
could be argued to be a more useful description of the problem.

Comments?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2016-12-21 Thread Robert Haas
On Wed, Dec 21, 2016 at 3:02 AM, Craig Ringer  wrote:
> Instead, I've added a new LWLock, ClogTruncationLock, for that
> purpose. vac_truncate_clog() takes it if it decides to attempt clog
> truncation. This lock is held throughout the whole process of clog
> truncation and oldestXid advance, so there's no need for a new
> pendingOldestXid field in ShmemVariableCache. We just take the lock
> then look at oldestXid, knowing that it's guaranteed to correspond to
> an existing clog page that won't go away while we're looking.
> ClogTruncationLock is utterly uncontested so it's going to have no
> meaningful impact compared to all the work we do scanning the clog to
> decide whether we're even going to try truncating it, etc.

That makes everything that happens between when we acquire that lock
and when we release it non-interruptible, which seems undesirable.  I
think that extra copy of oldestXid is a nicer approach.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical decoding on standby

2016-12-21 Thread Robert Haas
On Tue, Dec 20, 2016 at 10:06 PM, Craig Ringer  wrote:
> On 20 December 2016 at 15:03, Petr Jelinek  
> wrote:
>>> The biggest change in this patch, and the main intrusive part, is that
>>> procArray->replication_slot_catalog_xmin is no longer directly used by
>>> vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
>>> added, with a corresponding CheckPoint field.
>>> [snip]
>>
>> If this mechanism would not be needed most of the time, wouldn't it be
>> better to not have it and just have a way to ask physical slot about
>> what's the current reserved catalog_xmin (in most cases the standby
>> should actually know what it is since it's sending the hs_feedback, but
>> first slot created on such standby may need to check).
>
> Yes, and that was actually my originally preferred approach, though
> the one above does offer the advantage that if something goes wrong we
> can detect it and fail gracefully. Possibly not worth the complexity
> though.
>
> Your approach requires us to make very sure that hot_standby_feedback
> does not get turned off by user or become ineffective once we're
> replicating, though, since we won't have any way to detect when needed
> tuples are removed. We'd probably just bail out with relcache/syscache
> lookup errors, but I can't guarantee we wouldn't crash if we tried
> logical decoding on WAL where needed catalogs have been removed.

I dunno, Craig, I think your approach sounds more robust.  It's not
very nice to introduce a bunch of random prohibitions on what works
with what, and it doesn't sound like it's altogether watertight
anyway.  Incorporating an occasional, small record into the WAL stream
to mark the advancement of the reserved catalog_xmin seems like a
cleaner and safer solution.  We certainly do NOT want to find out
about corruption only because of random relcache/syscache lookup
failures, let alone crashes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hang in pldebugger after git commit : 98a64d0

2016-12-21 Thread Robert Haas
On Tue, Dec 20, 2016 at 11:32 PM, Amit Kapila  wrote:
> Ashutosh Sharma has helped to test that pldebugger issue is fixed with
> attached version.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw bug in 9.6

2016-12-21 Thread Ashutosh Bapat
Some review comments

1. postgres_fdw doesn't push down semi and anti joins so you may want to
discount these two too.
+   jointype == JOIN_SEMI ||
+   jointype == JOIN_ANTI);

2. We should try to look for other not-so-cheap paths if the cheapest one is
parameterized. You might want to use get_cheapest_path_for_pathkeys() to find a
suitable unparameterized path by passing NULL for required_outer and NIL for
pathkeys, that's a very strange usage, but I think it will serve the purpose.
On the thread we discussed that we should save the epq_path created for the
lower join and use it here. That may be better than searching for a path.
+/* Give up if the cheapest-total-cost paths are parameterized. */
+if (!bms_is_empty(PATH_REQ_OUTER(outer_path)) ||
+!bms_is_empty(PATH_REQ_OUTER(inner_path)))
+return NULL;

3. Adding new members to JoinPathExtraData may save some work for postgres_fdw
and other FDWs which would use CreateLocalJoinPath(), but it will add a few
bytes to the structure even when there is no "FULL foreign join which requires
EPQ" involved in the query. That may not be much of a memory overhead since the
structure is local to add_paths_to_joinrel(), but it might be something to
think about. Instead, what if we call select_mergejoin_clauses() within
CreateLocalJoinPath() to get that information?

4. Talking about saving some CPU cycles: if a clauseless full join can be
implemented only using a merge join, that's probably the only path available in
the list of paths for the given relation. Instead of building the same path
anew, should we use the existing path, as GetExistingLocalJoinPath() does, for
that case? In fact, I am wondering whether we should look for an existing
nestloop path for all joins except full joins, for which we should look for a
merge or hash join path. We would build a new path only if there isn't an
existing one. That would certainly save some CPU cycles spent costing the path.

5. The following comment mentions only hash join, but the code creates a merge
join or hash join path.
 * If the jointype is JOIN_FULL, try to create a hashjoin join path from

6. Per the comment below, a clauseless full join can be implemented using only
merge join. Why are we checking extra->mergejoin_allowed? Shouldn't it be true
here?
/*
 * Special corner case: for "x FULL JOIN y ON true", there will be no
 * join clauses at all.  Note that mergejoin is our only join type
 * that supports FULL JOIN without any join clauses, and in that case
 * it doesn't require the input paths to be well ordered, so generate
 * a clauseless mergejoin path from the cheapest-total-cost paths.
 */
if (extra->mergejoin_allowed && !extra->mergeclause_list)
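
Coming back to point 2, the lookup could look roughly like the sketch below.
The exact signature of get_cheapest_path_for_pathkeys() should be checked
against the current tree, and "joinrel" here just stands for the relation
whose path list we are scanning:

	Path	   *unparam_path;

	/* cheapest unparameterized path over the joinrel, ignoring sort order */
	unparam_path = get_cheapest_path_for_pathkeys(joinrel->pathlist,
												   NIL,		/* no pathkeys */
												   NULL,	/* no required_outer */
												   TOTAL_COST);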

Rethinking the problem, the error comes about because the inner or outer plan
of a merge join plan doesn't have the pathkeys required by the merge join. This
happens because the merge join path had a foreign path with pathkeys as its
inner or outer path, and the corresponding fdw_outerpath didn't have those
pathkeys. I am wondering whether the easy and possibly correct solution here is
to not replace a ForeignPath with fdw_outerpath in GetExistingLocalJoinPath().
If we don't do that, there won't be an error building the merge join plan, and
postgresRecheckForeignScan() would correctly route the EPQ checks to the local
plan available as the outer plan. Attached is a patch with that code removed.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/foreign/foreign.c b/src/backend/foreign/foreign.c
index 242d6d2..872d2a1 100644
--- a/src/backend/foreign/foreign.c
+++ b/src/backend/foreign/foreign.c
@@ -775,30 +775,6 @@ GetExistingLocalJoinPath(RelOptInfo *joinrel)
 		if (!joinpath)
 			continue;
 
-		/*
-		 * If either inner or outer path is a ForeignPath corresponding to a
-		 * pushed down join, replace it with the fdw_outerpath, so that we
-		 * maintain path for EPQ checks built entirely of local join
-		 * strategies.
-		 */
-		if (IsA(joinpath->outerjoinpath, ForeignPath))
-		{
-			ForeignPath *foreign_path;
-
-			foreign_path = (ForeignPath *) joinpath->outerjoinpath;
-			if (foreign_path->path.parent->reloptkind == RELOPT_JOINREL)
-joinpath->outerjoinpath = foreign_path->fdw_outerpath;
-		}
-
-		if (IsA(joinpath->innerjoinpath, ForeignPath))
-		{
-			ForeignPath *foreign_path;
-
-			foreign_path = (ForeignPath *) joinpath->innerjoinpath;
-			if (foreign_path->path.parent->reloptkind == RELOPT_JOINREL)
-joinpath->innerjoinpath = foreign_path->fdw_outerpath;
-		}
-
 		return (Path *) joinpath;
 	}
 	return NULL;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache Hash Index meta page.

2016-12-21 Thread Robert Haas
On Tue, Dec 20, 2016 at 2:25 PM, Mithun Cy  wrote:
> -- I think, if it is okay, I can document for each member of
> HashMetaPageData whether to read it from the cached metapage or directly from
> the current metapage. Below I have briefly commented on each member. If you
> suggest I go with that approach, I will produce a neat patch for it.

Plain text emails are preferred on this list.

I don't have any confidence in this approach.  I'm not sure exactly
what needs to be changed here, but what you're doing right now is just
too error-prone.  There's a cached metapage available, and you've got
code accessing it directly, and that's OK except when it's not, and maybe
we can add some comments to explain, but I don't think that's going to
be good enough to really make it clear and maintainable.  We need some
kind of more substantive safeguard to prevent the cached metapage data
from being used in unsafe ways -- and while we're at it, we should try
to use it in as many of the places where it *is* safe as possible.  My
suggestion for a separate structure was one idea; another might be
providing some kind of API that's always used to access the metapage
cache.  Or maybe there's a third option.  But this, the way it is
right now, is just too ad-hoc, even with more comments.  IMHO, anyway.
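
For what it's worth, the accessor-style option might look something like the
sketch below. The function name and the cache-in-rd_amcache policy are
assumptions made here for illustration, not something the current patch does,
and callers that need exact values would still have to read and lock the real
metapage:

static HashMetaPage
_hash_getcachedmetap(Relation rel)
{
	if (rel->rd_amcache == NULL)
	{
		Buffer		metabuf;
		HashMetaPage metap;

		/* read the real metapage once, cache a private copy in the relcache entry */
		metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
		metap = HashPageGetMeta(BufferGetPage(metabuf));
		rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
											 sizeof(HashMetaPageData));
		memcpy(rel->rd_amcache, metap, sizeof(HashMetaPageData));
		_hash_relbuf(rel, metabuf);
	}

	/* the copy may be stale; use it only for hints that are safe to get wrong */
	return (HashMetaPage) rel->rd_amcache;
}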

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : Parallel Merge Join

2016-12-21 Thread Dilip Kumar
On Wed, Dec 21, 2016 at 8:39 PM, Robert Haas  wrote:
> Committed the refactoring patch with some mild cosmetic adjustments.

Thanks..
>
> As to the second patch:
>
> +if (jointype == JOIN_UNIQUE_INNER)
> +jointype = JOIN_INNER;
>
> Isn't this dead code?  save_jointype might have that value, but jointype won't.

Yes, it is.

I have fixed this, and also noticed that the header comment for
try_partial_mergejoin_path had a problem; fixed that too.

>
> Apart from that and some cosmetic issues it looks basically OK to me
> on a first read-through.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


parallel_mergejoin_v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background contrib module proposal

2016-12-21 Thread Robert Haas
On Wed, Dec 21, 2016 at 10:29 AM, David Fetter  wrote:
> On Wed, Dec 21, 2016 at 06:31:52PM +0800, Craig Ringer wrote:
>> On 21 December 2016 at 14:26, Andrew Borodin  wrote:
>>
>> > I'm not sure every platform supports microsecond sleeps
>>
>> Windows at least doesn't by default, unless that changed in Win2k12
>> and Win8 with the same platform/kernel improvements that delivered
>> https://msdn.microsoft.com/en-us/library/hh706895(v=vs.85).aspx . I'm
>> not sure. On older systems sleeps are 1ms to 15ms.
>
> Apparently, as of 2011, there were ways to do this.  It's not crystal
> clear to me just how reliable they are.
>
> http://stackoverflow.com/questions/9116618/cpp-windows-is-there-a-sleep-function-in-microseconds

This whole subthread seems like a distraction to me.  I find it hard
to believe that this test case would be stable enough to survive the
buildfarm where, don't forget, we have things like
CLOBBER_CACHE_ALWAYS machines where queries take 100x longer to run.
But even if it is, surely we can pick a less contrived test case.  So
why worry about this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Index Scans

2016-12-21 Thread Robert Haas
Thanks for reviewing!  A few quick thoughts from me since I wrote a
bunch of the design for this patch.

On Wed, Dec 21, 2016 at 10:16 AM, Anastasia Lubennikova
 wrote:
> 1. Can't we simply use "if (scan->parallel_scan != NULL)" instead of 
> xs_temp_snap flag?
>
> +   if (scan->xs_temp_snap)
> +   UnregisterSnapshot(scan->xs_snapshot);
>
> I must say that I'm quite new with all this parallel stuff. If you give me a 
> link,
> where to read about snapshots for parallel workers, my review will be more 
> helpful.
> Anyway, it would be great to have more comments about it in the code.

I suspect it would be better to keep those two things formally
separate, even though they may always be the same right now.

> 2. Don't you mind to rename 'amestimateparallelscan' to let's say 
> 'amparallelscan_spacerequired'
> or something like this? As far as I understand there is nothing to estimate, 
> we know this size
> for sure. I guess that you've chosen this name because of 
> 'heap_parallelscan_estimate'.
> But now it looks similar to 'amestimate' which refers to indexscan cost for 
> optimizer.
> That leads to the next question.

"estimate" is being used this way quite widely now, in places like
ExecParallelEstimate.  So if we're going to change the terminology we
should do it broadly.

> 3. Are there any changes in cost estimation? I didn't find related changes in 
> the patch.
> Parallel scan is expected to be faster and optimizer definitely should know 
> that.

Generally the way that's reflected in the optimizer is by having the
parallel scan have a lower row count.  See cost_seqscan() for an
example.
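
Roughly the pattern there, heavily simplified (the real code in costsize.c
also credits a fraction of the leader's effort into the divisor):

	if (path->parallel_workers > 0)
	{
		/* rows are split among the workers, so each path reports fewer of them */
		double		parallel_divisor = path->parallel_workers;

		path->rows = clamp_row_est(path->rows / parallel_divisor);
	}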

In general, you'll probably find a lot of parallels between this patch
and ee7ca559fcf404f9a3bd99da85c8f4ea9fbc2e92, which is probably a good
thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.

2016-12-21 Thread Robert Haas
On Tue, Dec 20, 2016 at 12:22 PM, Alvaro Herrera
 wrote:
> Robert Haas wrote:
>> Implement table partitioning.
>
> I thought it was odd to use rd_rel->reloftype as a boolean in
> ATExecAttachPartition, but apparently we do it elsewhere too, so let's
> leave that complaint for another day.

Ugh.  I agree - that's bad style.

> What I also found off in the same function is that we use
> SearchSysCacheCopyAttName() on each attribute and then don't free the
> result, and don't ever use the returned tuple either.  A simple fix, I
> thought, just remove the "Copy" and add a ReleaseSysCache().

Or use SearchSysCacheExists.

> But then I
> noticed this whole thing is rather strange -- why not pass a boolean
> flag down to CreateInheritance() and from there to
> MergeAttributesIntoExisting() to implement the check?  That seems less
> duplicative.

Hmm, that would be another way to do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background contrib module proposal

2016-12-21 Thread David Fetter
On Wed, Dec 21, 2016 at 06:31:52PM +0800, Craig Ringer wrote:
> On 21 December 2016 at 14:26, Andrew Borodin  wrote:
> 
> > I'm not sure every platform supports microsecond sleeps
> 
> Windows at least doesn't by default, unless that changed in Win2k12
> and Win8 with the same platform/kernel improvements that delivered
> https://msdn.microsoft.com/en-us/library/hh706895(v=vs.85).aspx . I'm
> not sure. On older systems sleeps are 1ms to 15ms.

Apparently, as of 2011, there were ways to do this.  It's not crystal
clear to me just how reliable they are.

http://stackoverflow.com/questions/9116618/cpp-windows-is-there-a-sleep-function-in-microseconds

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Index Scans

2016-12-21 Thread Anastasia Lubennikova
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hi, thank you for the patch.
Results are very promising. Do you see any drawbacks of this feature or
something that requires more testing?
I'm willing to do a review. I haven't done benchmarks yet, but I've read the
patch, and here are some notes and questions about it.

I saw the discussion about parameters in the thread above. And I agree that 
we'd better concentrate
on the patch itself and add them later if necessary.

1. Can't we simply use "if (scan->parallel_scan != NULL)" instead of 
xs_temp_snap flag?

+   if (scan->xs_temp_snap)
+   UnregisterSnapshot(scan->xs_snapshot);

I must say that I'm quite new with all this parallel stuff. If you give me a 
link,
where to read about snapshots for parallel workers, my review will be more 
helpful.
Anyway, it would be great to have more comments about it in the code.

2. Don't you mind to rename 'amestimateparallelscan' to let's say 
'amparallelscan_spacerequired'
or something like this? As far as I understand there is nothing to estimate, we 
know this size
for sure. I guess that you've chosen this name because of 
'heap_parallelscan_estimate'.
But now it looks similar to 'amestimate' which refers to indexscan cost for 
optimizer.
That leads to the next question.

3. Are there any changes in cost estimation? I didn't find related changes in 
the patch.
Parallel scan is expected to be faster and optimizer definitely should know 
that.

4. +uint8   ps_pageStatus;  /* state of scan, see below */
There is no description below. I'd make the comment more helpful:
/* state of scan. See possible flag values in nbtree.h */
And why do you call it pageStatus? What does it have to do with a page?

5. Comment for _bt_parallel_seize() says:
"False indicates that we have reached the end of scan for
 current scankeys and for that we return block number as P_NONE."

 What is the reason to check (blkno == P_NONE) after checking (status == false)
 in _bt_first() (see code below)? If the comment is correct,
 we'll never reach _bt_parallel_done().

+   blkno = _bt_parallel_seize(scan, &status);
+   if (status == false)
+   {
+   BTScanPosInvalidate(so->currPos);
+   return false;
+   }
+   else if (blkno == P_NONE)
+   {
+   _bt_parallel_done(scan);
+   BTScanPosInvalidate(so->currPos);
+   return false;
+   }

6. To avoid code duplication, I would wrap this into a function (a sketch
follows after these notes):

+   /* initialize moreLeft/moreRight appropriately for scan direction */
+   if (ScanDirectionIsForward(dir))
+   {
+   so->currPos.moreLeft = false;
+   so->currPos.moreRight = true;
+   }
+   else
+   {
+   so->currPos.moreLeft = true;
+   so->currPos.moreRight = false;
+   }
+   so->numKilled = 0;  /* just paranoia */
+   so->markItemIndex = -1; /* ditto */

And after that we can also get rid of _bt_parallel_readpage(), which only
brings another level of indirection to the code.

7. Just a couple of typos I've noticed:

 * Below flags are used indicate the state of parallel scan.
 * Below flags are used TO indicate the state of parallel scan.

* On success, release lock and pin on buffer on success.
* On success release lock and pin on buffer.

8. I didn't find a description of the feature in the documentation.
Probably we need to add a paragraph to the "Parallel Query" chapter. 
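
Coming back to point 6, a minimal sketch of such a wrapper (the function name
is made up here) could be:

static void
_bt_init_scan_state(BTScanOpaque so, ScanDirection dir)
{
	/* initialize moreLeft/moreRight appropriately for scan direction */
	if (ScanDirectionIsForward(dir))
	{
		so->currPos.moreLeft = false;
		so->currPos.moreRight = true;
	}
	else
	{
		so->currPos.moreLeft = true;
		so->currPos.moreRight = false;
	}
	so->numKilled = 0;			/* just paranoia */
	so->markItemIndex = -1;		/* ditto */
}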

I will send another review covering performance by the end of the week.

The new status of this patch is: Waiting on Author

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : Parallel Merge Join

2016-12-21 Thread Robert Haas
On Tue, Dec 13, 2016 at 10:04 AM, Dilip Kumar  wrote:
> On Tue, Dec 13, 2016 at 6:42 AM, Dilip Kumar  wrote:
>>> This patch is hard to read because it is reorganizing a bunch of code
>>> as well as adding new functionality.  Perhaps you could separate it
>>> into two patches, one with the refactoring and then the other with the
>>> new functionality.
>>
>> Okay, I can do that.
>
> I have created two patches as per the suggestion.
>
> 1. mergejoin_refactoring_v2.patch --> Move functionality of
> considering various merge join path into new function.
> 2. parallel_mergejoin_v2.patch -> This adds the functionality of
> supporting partial mergejoin paths. This will apply on top of
> mergejoin_refactoring_v2.patch.

Committed the refactoring patch with some mild cosmetic adjustments.

As to the second patch:

+if (jointype == JOIN_UNIQUE_INNER)
+jointype = JOIN_INNER;

Isn't this dead code?  save_jointype might have that value, but jointype won't.

Apart from that and some cosmetic issues it looks basically OK to me
on a first read-through.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why does plpython delay composite type resolution?

2016-12-21 Thread Tom Lane
Jim Nasby  writes:
> On 12/21/16 1:55 AM, Andreas Karlsson wrote:
>> Does your patch handle "ALTER TYPE name ADD ATTRIBUTE ..."? My immediate
>> guess would be that it could be a cache invalidation thing.

> Won't that only happen at end of transaction?

No.

BEGIN;
SELECT plpython_function();
ALTER TYPE ...;
SELECT plpython_function();
COMMIT;

For that matter, the plpython function could execute the ALTER itself
through SPI, or call another function that does so.

(I'm not claiming that the existing code, either in plpython or other
PLs, necessarily handles such all scenarios nicely.  But we shouldn't
make it worse.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Rethinking our fulltext phrase-search implementation

2016-12-21 Thread Tom Lane
Artur Zakirov  writes:
> Otherwise it seems that queries like 'a <-> (b & c)' will always return 
> false. Then we need maybe some warning message.

Well, the query as written is pointless, but it could be useful with
something other than "b" and "c" as the AND-ed terms.  In this usage
"&" is equivalent to "<0>", which we know has corner-case uses.

I'm not inclined to issue any sort of warning for unsatisfiable queries.
We don't issue a warning when a SQL WHERE condition collapses to constant
FALSE, and that seems like exactly the same sort of situation.

It strikes me though that the documentation should point this out.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why does plpython delay composite type resolution?

2016-12-21 Thread Jim Nasby

On 12/21/16 1:55 AM, Andreas Karlsson wrote:

On 12/21/2016 04:14 AM, Jim Nasby wrote:

Why do functions that accept composite types delay type resolution until
execution? I have a naive patch that speeds up plpy.execute() by 8% by
caching interned Python strings for the dictionary key names (which are
repeated over and over). The next step is to just pre-allocate those
strings as appropriate for the calling context, but it's not clear how
to handle that for input arguments.


Does your patch handle "ALTER TYPE name ADD ATTRIBUTE ..."? My immediate
guess would be that it could be a cache invalidation thing.


Won't that only happen at end of transaction?

After reading the tuple queue code I'm wondering if part of the issue is 
anonymous records, though that doesn't make much sense since plpython 
doesn't support those.


Given the lackluster support for arrays and composites in plpython, I 
suspect this is just a wart that hasn't been removed yet...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Faster methods for getting SPI results

2016-12-21 Thread Jim Nasby

On 12/20/16 10:14 PM, Jim Nasby wrote:

It would be a lot more efficient if we could just grab datums from the
executor and make a single copy into plpython (or R), letting the PL
deal with all the memory management overhead.

I briefly looked at using SPI cursors to do just that, but that looks
even worse: every fetch is executed in a subtransaction, and every fetch
creates an entire tuplestore even if it's just going to return a single
value. (But hey, we never claimed cursors were fast...)

Is there any way to avoid all of this? I'm guessing one issue might be
that we don't want to call an external interpreter while potentially
holding page pins, but even then couldn't we just copy a single tuple at
a time and save a huge amount of palloc overhead?


AFAICT that's exactly how DestRemote works: it grabs a raw slot from the 
executor, makes sure it's fully expanded, and sends it on its way via 
pq_send*(). So presumably the same could be done for SPI, by creating a 
new CommandDest (ISTM none of the existing ones would do what we want).


I'm not sure what the API for this should look like. One possibility is 
to have SPI_execute and friends accept a flag that indicates not to 
build a tupletable. I don't think a query needs to be read-only to allow 
for no tuplestore, so overloading read_only seems like a bad idea.


Another option is to treat this as a "lightweight cursor" that only 
allows forward fetches. One nice thing about that option is it leaves 
open the possibility of using a small tuplestore for each "fetch", 
without all the overhead that a full blown cursor has. This assumes 
there are some use cases where you want to operate on relatively small 
sets of tuples at a time, but you don't need to materialize the whole 
thing in one shot.
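
Purely as a strawman, the two options might have shapes like the declarations
below. Nothing like this exists today; the names and signatures are invented
only to illustrate the idea:

/* option 1: run a query, but stream tuples to a caller-supplied DestReceiver
 * instead of materializing an SPITupleTable */
extern int	SPI_execute_dest(const char *src, bool read_only, long tcount,
							 DestReceiver *dest);

/* option 2: a forward-only "lightweight cursor" that hands back small batches */
extern Portal SPI_open_lwcursor(const char *src, bool read_only);
extern uint64 SPI_fetch_lwcursor(Portal portal, uint64 nrows, DestReceiver *dest);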

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgstattuple documentation clarification

2016-12-21 Thread Andrew Dunstan



On 12/20/2016 11:41 PM, Robert Haas wrote:

On Tue, Dec 20, 2016 at 10:01 AM, Tom Lane  wrote:

Andrew Dunstan  writes:

Recently a client was confused because there was a substantial
difference between the reported table_len of a table and the sum of the
corresponding tuple_len, dead_tuple_len and free_space. The docs are
fairly silent on this point, and I agree that in the absence of
explanation it is confusing, so I propose that we add a clarification
note along the lines of:
 The table_len will always be greater than the sum of the tuple_len,
 dead_tuple_len and free_space. The difference is accounted for by
 page overhead and space that is not free but cannot be attributed to
 any particular tuple.
Or perhaps we should be more explicit and refer to the item pointers on
the page.

I find "not free but cannot be attributed to any particular tuple"
to be entirely useless weasel wording, not to mention wrong with
respect to item pointers in particular.

Perhaps we should start counting the item pointers in tuple_len.
We'd still have to explain about page header overhead, but that
would be a pretty small and fixed-size discrepancy.

It's pretty weird to count unused or dead line pointers as part of
tuple_len, and it would screw things up for anybody trying to
calculate the average width of their tuples, which is an entirely
reasonable thing to want to do.  I think if we're going to count item
pointers as anything, it needs to be some new category -- either item
pointers specifically, or an "other stuff" bucket.




Yes, I agree. In any case, before we change anything can we agree on a 
description of what we currently do?


Here's a second attempt:

   The table_len will always be greater than the sum of the tuple_len,
   dead_tuple_len and free_space. The difference is accounted for by
   fixed page overhead, the per-page table of pointers to tuples, and
   padding to ensure that tuples are correctly aligned.

I don't think any of that is weaselish :-)

cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-21 Thread Robert Haas
On Wed, Dec 21, 2016 at 7:04 AM, Heikki Linnakangas  wrote:
>> If the worker is always completely finished with the tape before the
>> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
>> the tape and overwrite it like any other?
>
> Currently, the logical tape code assumes that all tapes in a single
> LogicalTapeSet are allocated from the same BufFile. The logical tape's
> on-disk format contains block numbers, to point to the next/prev block of
> the tape [1], and they're assumed to refer to the same file. That allows
> reusing space efficiently during the merge. After you have read the first
> block from tapes A, B and C, you can immediately reuse those three blocks
> for output tape D.

I see.  Hmm.

> Now, if you read multiple tapes, from different LogicalTapeSet, hence backed
> by different BufFiles, you cannot reuse the space from those different tapes
> for a single output tape, because the on-disk format doesn't allow referring
> to blocks in other files. You could reuse the space of *one* of the input
> tapes, by placing the output tape in the same LogicalTapeSet, but not all of
> them.
>
> We could enhance that, by using "filename + block number" instead of just
> block number, in the pointers in the logical tapes. Then you could spread
> one logical tape across multiple files. Probably not worth it in practice,
> though.

OK, so the options as I understand them are:

1. Enhance the logical tape set infrastructure in the manner you
mention, to support filename (or more likely a proxy for filename) +
block number in the logical tape pointers.  Then, tapes can be
transferred from one LogicalTapeSet to another.

2. Enhance the BufFile infrastructure to support some notion of a
shared BufFile so that multiple processes can be reading and writing
blocks in the same BufFile.  Then, extend the logical tape
infrastruture so that we also have the notion of a shared LogicalTape.
This means that things like ltsGetFreeBlock() need to be re-engineered
to handle concurrency with other backends.

3. Just live with the waste of space.

I would guess that (1) is easier than (2).  Also, (2) might provoke
contention while writing tapes that is otherwise completely
unnecessary.  It seems silly to have multiple backends fighting over
the same end-of-file pointer for the same file when they could just
write to different files instead.
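
To make (1) a bit more concrete, the tape links would go from a bare block
number to something like the struct below; the field names are invented, and
some proxy (an index into the tape set's file array, say) would stand in for
the filename:

typedef struct TapeBlockRef
{
	int			fileindex;		/* which underlying BufFile (proxy for the filename) */
	long		blocknum;		/* block number within that file */
} TapeBlockRef;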

Another tangentially-related problem I just realized is that we need
to somehow handle the issues that tqueue.c does when transferring
tuples between backends -- most of the time there's no problem, but if
anonymous record types are involved then tuples require "remapping".
It's probably harder to provoke a failure in the tuplesort case than
with parallel query per se, but it's probably not impossible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Add pgstathashindex() to get hash index table statistics.

2016-12-21 Thread Ashutosh Sharma
Hi All,

I have introduced a new function 'pgstathashindex()' inside the pgstattuple
extension to view statistics for a hash index. I could have used the
'pgstattuple()' function to view hash index stats instead of adding this new
function, but there are certain limitations when using pgstattuple() for hash
indexes. Firstly, it doesn't work if a hash index contains zero pages or new
pages, which is very common for hash indexes. Secondly, it doesn't provide
information about the different types of pages in a hash index and their
counts. Considering these points, I have thought of introducing this function.
Attached is the patch for the same. Please have a look and let me know your
feedback. I would also like to mention that this idea basically came from my
colleague Kuntal Ghosh and I implemented it. I have also created a commitfest
entry for this submission. Thanks.

With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
From a82d350b48b9daee017d2f31dee136809333ea82 Mon Sep 17 00:00:00 2001
From: ashu 
Date: Wed, 21 Dec 2016 18:39:47 +0530
Subject: [PATCH] Add pgstathashindex() to pgstattuple extension v1

This allows us to see the hash index table statistics.
We could have directly used pgstattuple() to see the
hash table statistics but it doesn't include some of
the stats related to hash index like number of bucket
pages, overflow pages, zero pages etc. Moreover, it
can't be used if hash index table contains a zero page.

Patch by Ashutosh. Needs review.
---
 contrib/pgstattuple/Makefile  |   8 +-
 contrib/pgstattuple/pgstatindex.c | 225 ++
 contrib/pgstattuple/pgstattuple--1.4.sql  |  15 ++
 contrib/pgstattuple/pgstattuple--1.5--1.6.sql |  22 +++
 contrib/pgstattuple/pgstattuple.control   |   2 +-
 doc/src/sgml/pgstattuple.sgml | 108 +
 6 files changed, 375 insertions(+), 5 deletions(-)
 create mode 100644 contrib/pgstattuple/pgstattuple--1.5--1.6.sql

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 294077d..a1601ec 100644
--- a/contrib/pgstattuple/Makefile
+++ b/contrib/pgstattuple/Makefile
@@ -4,10 +4,10 @@ MODULE_big	= pgstattuple
 OBJS		= pgstattuple.o pgstatindex.o pgstatapprox.o $(WIN32RES)
 
 EXTENSION = pgstattuple
-DATA = pgstattuple--1.4.sql pgstattuple--1.4--1.5.sql \
-	pgstattuple--1.3--1.4.sql pgstattuple--1.2--1.3.sql \
-	pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql \
-	pgstattuple--unpackaged--1.0.sql
+DATA = pgstattuple--1.5--1.6.sql pgstattuple--1.4--1.5.sql \
+	pgstattuple--1.4.sql pgstattuple--1.3--1.4.sql \
+	pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql	\
+	pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
 PGFILEDESC = "pgstattuple - tuple-level statistics"
 
 REGRESS = pgstattuple
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index d9a722a..a3c8f23 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -29,6 +29,7 @@
 
 #include "access/gin_private.h"
 #include "access/heapam.h"
+#include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
 #include "catalog/namespace.h"
@@ -36,6 +37,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
@@ -53,6 +55,7 @@ PG_FUNCTION_INFO_V1(pgstatindexbyid);
 PG_FUNCTION_INFO_V1(pg_relpages);
 PG_FUNCTION_INFO_V1(pg_relpagesbyid);
 PG_FUNCTION_INFO_V1(pgstatginindex);
+PG_FUNCTION_INFO_V1(pgstathashindex);
 
 PG_FUNCTION_INFO_V1(pgstatindex_v1_5);
 PG_FUNCTION_INFO_V1(pgstatindexbyid_v1_5);
@@ -60,11 +63,17 @@ PG_FUNCTION_INFO_V1(pg_relpages_v1_5);
 PG_FUNCTION_INFO_V1(pg_relpagesbyid_v1_5);
 PG_FUNCTION_INFO_V1(pgstatginindex_v1_5);
 
+PG_FUNCTION_INFO_V1(pgstathashindex_v1_6);
+
 Datum pgstatginindex_internal(Oid relid, FunctionCallInfo fcinfo);
+Datum pgstathashindex_internal(Oid relid, FunctionCallInfo fcinfo);
 
 #define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
 #define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
 #define IS_GIN(r) ((r)->rd_rel->relam == GIN_AM_OID)
+#define IS_HASH(r) ((r)->rd_rel->relam == HASH_AM_OID)
+
+#define HASH_HEAD_BLKNO HASH_METAPAGE + 1
 
 /* 
  * A structure for a whole btree index statistics
@@ -101,7 +110,29 @@ typedef struct GinIndexStat
 	int64		pending_tuples;
 } GinIndexStat;
 
+/* 
+ * A structure for a whole HASH index statistics
+ * used by pgstathashindex().
+ * 
+ */
+typedef struct HashIndexStat
+{
+	uint32	version;
+	uint32	total_pages;
+	uint32	bucket_pages;
+	uint32	overflow_pages;
+	uint32	bitmap_pages;
+	uint32	zero_pages;
+	uint64	ntuples;
+	uint16	ffactor;
+	uint64	live_items;
+	uint64	dead_items;
+	uint64	free_space;
+} HashIndexStat;
+
 static Datum pgstatindex_impl(Relation rel, Functi

Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down

2016-12-21 Thread Craig Ringer
On 21 December 2016 at 21:23, Simon Riggs  wrote:

> Valid bug, but this still ain't right. We don't want to turn feedback
> on until HS is active, but we do want to turn it off once, whether or
> not HS is active yet.

I just posted an update, though I forgot to add you to the direct Cc
list. It passes the provided test script.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-12-21 Thread Stas Kelvich

> On 21 Dec 2016, at 19:56, Michael Paquier  wrote:
> 
> That's indeed way simpler than before. Have you as well looked at the
> most simple approach discussed? That would be just roughly replacing
> the pg_fsync() calls currently in RecreateTwoPhaseFile() by a save
> into a list as you are doing, then issue them all at checkpoint. Even for
> 2PC files that are created and then removed before the next
> checkpoint, those will likely be in system cache.

  Yes, I tried that as well. But with that approach another bottleneck arises:
new file creation isn’t a very cheap operation itself. A dual-Xeon machine with
100 backends quickly hits that, and the OS routines for file creation occupy
the top places in perf top. That probably depends on the filesystem (I used
ext4), but avoiding file creation when it isn’t necessary seems like the
cleaner approach.
  On the other hand it is possible to skip file creation by reusing files, for
example naming them by dummy PGPROC offset, but that would require some changes
to places that right now look only at filenames.

> This removes as well
> the need to have XlogReadTwoPhaseData() work in crash recovery, which
> makes me a bit nervous.

Hm, do you have any particular bad scenario for that case in mind?

> And this saves lookups at the WAL segments
> still present in pg_xlog, making the operation at checkpoint much
> faster with many 2PC files to process.

ISTM your reasoning about the filesystem cache applies here as well, but just
without spending time on file creation.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down

2016-12-21 Thread Craig Ringer
On 21 December 2016 at 21:03, Ants Aasma  wrote:

> There was a !HotStandbyActive() check there previously, I was not sure
> if it was only to avoid sending useless messages or was necessary
> because something was not initialized otherwise. Looks like it is not
> needed and can be removed. Revised patch that does so attached.

Ah, see, I'm blind. Too much multi-tasking. Turns out patch review
with a toddler's help is hard ;)

HotStandbyActive() call is actually just checking if we're still in
crash or archive recovery, per

/*
 * SharedHotStandbyActive indicates if we're still in crash or archive
 * recovery.  Protected by info_lck.
 */
boolSharedHotStandbyActive;


so it is appropriate to defer any hot standby feedback until then. By
that point we've definitely finished setting up replication slots for
one thing, and know for sure if we have something useful to say and
won't send the wrong thing.

It looks to me like we should continue to bail out if !HotStandbyActive(). The

Assert(!master_has_standby_xmin)

can go, since there's now a valid case where master_has_standby_xmin
can be true before HotStandbyActive() is true.

Here's a revised version. I haven't run it against your tests yet.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 5d86d42e30ccb3dd3e1b227b102384762d66e9dd Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Wed, 21 Dec 2016 21:28:32 +0800
Subject: [PATCH] Reset xmin if hot_standby_feedback is turned off while
 shutdown

If the user turns hot_standby_feedback off while we're shut down or the
walreceiver is restarting, we won't remember that we sent an xmin to the
server. This never used to matter since the xmin was tied to the walsender
backend PGPROC, but now that we have replication slots it's persistent across
connections.

Always send at least one hot_standby_feedback message after walreceiver
startup, even if hot_standby_feedback is off, to ensure any saved
slot xmin from a prior connection is cleared.

Ants Aasma, Craig Ringer
---
 src/backend/replication/walreceiver.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index cc3cf7d..11de996 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1157,7 +1157,10 @@ XLogWalRcvSendReply(bool force, bool requestReply)
  * in case they don't have a watch.
  *
  * If the user disables feedback, send one final message to tell sender
- * to forget about the xmin on this standby.
+ * to forget about the xmin on this standby. We also send this message
+ * on first connect because a previous connection might have set xmin
+ * on a replication slot. (If we're not using a slot it's harmless to
+ * send a feedback message explicitly setting InvalidTransactionId).
  */
 static void
 XLogWalRcvSendHSFeedback(bool immed)
@@ -1167,7 +1170,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 	uint32		nextEpoch;
 	TransactionId xmin;
 	static TimestampTz sendTime = 0;
-	static bool master_has_standby_xmin = false;
+	/* initially true so we always send at least one feedback message */
+	static bool master_has_standby_xmin = true;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -1192,14 +1196,17 @@ XLogWalRcvSendHSFeedback(bool immed)
 	}
 
 	/*
-	 * If Hot Standby is not yet active there is nothing to send. Check this
-	 * after the interval has expired to reduce number of calls.
+	 * If Hot Standby is not yet accepting connections there is nothing to
+	 * send. Check this after the interval has expired to reduce number of
+	 * calls.
+	 *
+	 * Bailing out here also ensures that we don't send feedback until we've
+	 * read our own replication slot state, so we don't tell the master to
+	 * discard needed xmin or catalog_xmin from any slots that may exist
+	 * on this replica.
 	 */
 	if (!HotStandbyActive())
-	{
-		Assert(!master_has_standby_xmin);
 		return;
-	}
 
 	/*
 	 * Make the expensive call to get the oldest xmin once we are certain
-- 
2.5.5


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical tape pause/resume

2016-12-21 Thread Heikki Linnakangas

On 12/21/2016 03:22 PM, Robert Haas wrote:

On Wed, Dec 21, 2016 at 7:25 AM, Heikki Linnakangas  wrote:

For now, barring objections, I'm going to commit the first patch. It seems
like a worthwhile simplification in any case, especially for Peter's
Parallel tuplesort patch set.


Well, it removes more code than it adds.  That's definitely something.
And saving memory per-empty-tape is good, too.  A few random comments:

"seeked" is kind of a lame variable name.  How about "seekpos" or
"newpos" or something like that?


Ok.


 /*
  * Even in minimum memory, use at least a MINORDER merge.  On the other
  * hand, even when we have lots of memory, do not use more than a MAXORDER
- * merge.  Tapes are pretty cheap, but they're not entirely free.  Each
- * additional tape reduces the amount of memory available to build runs,
- * which in turn can cause the same sort to need more runs, which makes
- * merging slower even if it can still be done in a single pass.  Also,
- * high order merges are quite slow due to CPU cache effects; it can be
- * faster to pay the I/O cost of a polyphase merge than to perform a single
- * merge pass across many hundreds of tapes.
+ * merge.  Tapes are pretty cheap, but they're not entirely free.
High order
+ * merges are quite slow due to CPU cache effects; it can be faster to pay
+ * the I/O cost of a polyphase merge than to perform a single merge pass
+ * across many hundreds of tapes.
  */

I think you could leave this comment as-is.  You haven't zeroed out
the overhead of a tape, and I like those additional bits I crammed in
there.  :-)


Yes, quite right. That was a mishap in rebasing; that change to remove
the comment really belongs to the pause/resume patch rather than the first
one.


Thanks for the review!

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down

2016-12-21 Thread Simon Riggs
On 21 December 2016 at 13:03, Ants Aasma  wrote:
> On Wed, Dec 21, 2016 at 2:09 PM, Craig Ringer  wrote:
>> On 21 December 2016 at 15:40, Ants Aasma  wrote:
>>
 So -1 on this part of the patch, unless there's something I've 
 misunderstood.
>>>
>>> Currently there was no feedback sent if hot standby was not active. I
>>> was not sure if it was safe to call GetOldestXmin() in that case.
>>> However I did not consider cascading replica slots wanting to hold
>>> back xmin, where resetting the parents xmin is indeed wrong. Do you
>>> know if GetOldestXmin() is safe at this point and we can just remove
>>> the HotStandbyActive() check? Otherwise I think the correct approach
>>> is to move the check and return inside the hot_standby_feedback case
>>> like this:
>>>
>>> if (hot_standby_feedback)
>>> {
>>> if (!HotStandbyActive())
>>>return;
>>
>> I feel like I'm missing something obvious here. If we force sending
>> hot standby feedback at least once, by assuming
>> master_has_standby_xmin = true at startup, why isn't that sufficient?
>> We'll send InvalidTransactionId if hot_standby_feedback is off. Isn't
>> that the point?
>>
>> It's safe to call GetOldestXmin pretty much as soon as we load the
>> recovery start checkpoint. It won't consider the state of replication
>> slots until later in startup, but that's a pre-existing flaw that
>> should be addressed separately.
>>
>> Why do we need to do more than master_has_standby_xmin = true ?
>
> There was a !HotStandbyActive() check there previously, I was not sure
> if it was only to avoid sending useless messages or was necessary
> because something was not initialized otherwise. Looks like it is not
> needed and can be removed. Revised patch that does so attached.

Valid bug, but this still ain't right. We don't want to turn feedback
on until HS is active, but we do want to turn it off once, whether or
not HS is active yet.

We need a full detailed comment explaining this.

Fix it up and I'll commit. Thanks for the report.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical tape pause/resume

2016-12-21 Thread Robert Haas
On Wed, Dec 21, 2016 at 7:25 AM, Heikki Linnakangas  wrote:
> For now, barring objections, I'm going to commit the first patch. It seems
> like a worthwhile simplification in any case, especially for Peter's
> Parallel tuplesort patch set.

Well, it removes more code than it adds.  That's definitely something.
And saving memory per-empty-tape is good, too.  A few random comments:

"seeked" is kind of a lame variable name.  How about "seekpos" or
"newpos" or something like that?

 /*
  * Even in minimum memory, use at least a MINORDER merge.  On the other
  * hand, even when we have lots of memory, do not use more than a MAXORDER
- * merge.  Tapes are pretty cheap, but they're not entirely free.  Each
- * additional tape reduces the amount of memory available to build runs,
- * which in turn can cause the same sort to need more runs, which makes
- * merging slower even if it can still be done in a single pass.  Also,
- * high order merges are quite slow due to CPU cache effects; it can be
- * faster to pay the I/O cost of a polyphase merge than to perform a single
- * merge pass across many hundreds of tapes.
+ * merge.  Tapes are pretty cheap, but they're not entirely free.
High order
+ * merges are quite slow due to CPU cache effects; it can be faster to pay
+ * the I/O cost of a polyphase merge than to perform a single merge pass
+ * across many hundreds of tapes.
  */

I think you could leave this comment as-is.  You haven't zeroed out
the overhead of a tape, and I like those additional bits I crammed in
there.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Measuring replay lag

2016-12-21 Thread Fujii Masao
On Mon, Dec 19, 2016 at 8:13 PM, Thomas Munro
 wrote:
> On Mon, Dec 19, 2016 at 4:03 PM, Peter Eisentraut
>  wrote:
>> On 11/22/16 4:27 AM, Thomas Munro wrote:
>>> Thanks very much for testing!  New version attached.  I will add this
>>> to the next CF.
>>
>> I don't see it there yet.
>
> Thanks for the reminder.  Added here:  
> https://commitfest.postgresql.org/12/920/
>
> Here's a rebased patch.

I agree that the capability to measure the remote_apply lag is very useful.
Also I want to measure the remote_write and remote_flush lags, for example,
in order to diagnose the cause of replication lag.

For that, what about maintaining the pairs of send-timestamp and LSN on the
*sender side* instead of the receiver side? That is, walsender adds a pair
of send-timestamp and LSN to a buffer every sampling period.
Whenever walsender receives the write, flush and apply locations from
walreceiver, it calculates the write, flush and apply lags by comparing
the received LSN against the stored LSNs and the current timestamp against
the stored send-timestamps.

As a bonus of this approach, we don't need to add a field to the reply
message that walreceiver can send back very frequently, which might be
helpful in terms of networking overhead.
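
A rough sketch of that bookkeeping, with invented names, assuming integer
timestamps (so subtracting two TimestampTz values yields microseconds); the
code that fills the ring every sampling period is omitted:

typedef struct LagSample
{
	XLogRecPtr	lsn;			/* WAL position sampled when it was sent */
	TimestampTz sent_at;		/* when that position was sent */
} LagSample;

#define LAG_SAMPLES 1024

static LagSample lag_ring[LAG_SAMPLES];	/* zeroed: lsn == InvalidXLogRecPtr */

/*
 * Lag in microseconds for a position reported back by walreceiver, or -1
 * if no sample at or before that position has been recorded yet.
 */
static int64
lag_for_position(XLogRecPtr reported, TimestampTz now)
{
	XLogRecPtr	best_lsn = InvalidXLogRecPtr;
	TimestampTz best_time = 0;
	int			i;

	for (i = 0; i < LAG_SAMPLES; i++)
	{
		LagSample  *s = &lag_ring[i];

		/* pick the newest sample not beyond the reported position */
		if (s->lsn != InvalidXLogRecPtr && s->lsn <= reported && s->lsn > best_lsn)
		{
			best_lsn = s->lsn;
			best_time = s->sent_at;
		}
	}

	if (best_lsn == InvalidXLogRecPtr)
		return -1;
	return (int64) (now - best_time);
}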

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down

2016-12-21 Thread Ants Aasma
On Wed, Dec 21, 2016 at 2:09 PM, Craig Ringer  wrote:
> On 21 December 2016 at 15:40, Ants Aasma  wrote:
>
>>> So -1 on this part of the patch, unless there's something I've 
>>> misunderstood.
>>
>> Currently there was no feedback sent if hot standby was not active. I
>> was not sure if it was safe to call GetOldestXmin() in that case.
>> However I did not consider cascading replica slots wanting to hold
>> back xmin, where resetting the parents xmin is indeed wrong. Do you
>> know if GetOldestXmin() is safe at this point and we can just remove
>> the HotStandbyActive() check? Otherwise I think the correct approach
>> is to move the check and return inside the hot_standby_feedback case
>> like this:
>>
>> if (hot_standby_feedback)
>> {
>> if (!HotStandbyActive())
>>return;
>
> I feel like I'm missing something obvious here. If we force sending
> hot standby feedback at least once, by assuming
> master_has_standby_xmin = true at startup, why isn't that sufficient?
> We'll send InvalidTransactionId if hot_standby_feedback is off. Isn't
> that the point?
>
> It's safe to call GetOldestXmin pretty much as soon as we load the
> recovery start checkpoint. It won't consider the state of replication
> slots until later in startup, but that's a pre-existing flaw that
> should be addressed separately.
>
> Why do we need to do more than master_has_standby_xmin = true ?

There was a !HotStandbyActive() check there previously, I was not sure
if it was only to avoid sending useless messages or was necessary
because something was not initialized otherwise. Looks like it is not
needed and can be removed. Revised patch that does so attached.

Regards,
Ants Aasma
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index cc3cf7d..84ffa91 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1157,7 +1157,9 @@ XLogWalRcvSendReply(bool force, bool requestReply)
  * in case they don't have a watch.
  *
  * If the user disables feedback, send one final message to tell sender
- * to forget about the xmin on this standby.
+ * to forget about the xmin on this standby. We also send this message
+ * on first connect because a previous connection might have set xmin
+ * on a replication slot.
  */
 static void
 XLogWalRcvSendHSFeedback(bool immed)
@@ -1167,7 +1169,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 	uint32		nextEpoch;
 	TransactionId xmin;
 	static TimestampTz sendTime = 0;
-	static bool master_has_standby_xmin = false;
+	/* initially true so we always send at least one feedback message */
+	static bool master_has_standby_xmin = true;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -1192,16 +1195,6 @@ XLogWalRcvSendHSFeedback(bool immed)
 	}
 
 	/*
-	 * If Hot Standby is not yet active there is nothing to send. Check this
-	 * after the interval has expired to reduce number of calls.
-	 */
-	if (!HotStandbyActive())
-	{
-		Assert(!master_has_standby_xmin);
-		return;
-	}
-
-	/*
 	 * Make the expensive call to get the oldest xmin once we are certain
 	 * everything else has been checked.
 	 */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Simplify LWLock tranche machinery by removing array_base/array_s

2016-12-21 Thread Pavel Stehule
2016-12-21 13:54 GMT+01:00 Robert Haas :

> On Wed, Dec 21, 2016 at 2:14 AM, Pavel Stehule 
> wrote:
> > Is there some help for extensions developers, how to fix extensions after
> > this change?
> >
> > Orafce hits this change.
>
> array_base and array_stride are gone; don't pass them.  The second
> argument to LWLockRegisterTranche() is now the tranche name.  Pass
> that instead of a structure.
>

Thank you

Pavel

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] Speedup twophase transactions

2016-12-21 Thread Michael Paquier
On Fri, Dec 16, 2016 at 8:00 PM, Stas Kelvich  wrote:
> So, here is brand new implementation of the same thing.
>
> Now instead of creating pgproc entry for prepared transaction during
> recovery,
> I just store recptr/xid correspondence in separate 2L-list and deleting
> entries in that
> list if redo process faced commit/abort. In case of checkpoint or end of
> recovery
> transactions remaining in that list are dumped to files in pg_twophase.
>
> Seems that current approach is way more simpler and patch has two times less
> LOCs then previous one.

That's indeed way simpler than before. Have you as well looked at the
most simple approach discussed? That would be just roughly replacing
the pg_fsync() calls currently in RecreateTwoPhaseFile() by a save
into a list as you are doing, then issue them all at checkpoint. Even for
2PC files that are created and then removed before the next
checkpoint, those will likely be in system cache. This removes as well
the need to have XlogReadTwoPhaseData() work in crash recovery, which
makes me a bit nervous. And this saves lookups at the WAL segments
still present in pg_xlog, making the operation at checkpoint much
faster with many 2PC files to process.
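
Roughly, the flow I have in mind is the sketch below (names invented, just to
illustrate; a file that is already gone because its transaction finished before
the checkpoint is simply skipped):

static List *pendingTwoPhaseSyncs = NIL;

/* called where RecreateTwoPhaseFile() fsyncs today: just remember the path */
static void
RememberTwoPhaseFileForSync(const char *path)
{
	MemoryContext oldcxt = MemoryContextSwitchTo(TopMemoryContext);

	pendingTwoPhaseSyncs = lappend(pendingTwoPhaseSyncs, pstrdup(path));
	MemoryContextSwitchTo(oldcxt);
}

/* called from CheckPointTwoPhase(): flush everything remembered since last time */
static void
SyncPendingTwoPhaseFiles(void)
{
	ListCell   *lc;

	foreach(lc, pendingTwoPhaseSyncs)
	{
		char	   *path = (char *) lfirst(lc);
		int			fd = OpenTransientFile(path, O_RDWR | PG_BINARY, 0);

		if (fd < 0)
		{
			/* already removed: the transaction finished before the checkpoint */
			if (errno == ENOENT)
				continue;
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not open file \"%s\": %m", path)));
		}
		if (pg_fsync(fd) != 0)
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not fsync file \"%s\": %m", path)));
		CloseTransientFile(fd);
	}

	list_free_deep(pendingTwoPhaseSyncs);
	pendingTwoPhaseSyncs = NIL;
}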
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Simplify LWLock tranche machinery by removing array_base/array_s

2016-12-21 Thread Robert Haas
On Wed, Dec 21, 2016 at 2:14 AM, Pavel Stehule  wrote:
> Is there some help for extensions developers, how to fix extensions after
> this change?
>
> Orafce hits this change.

array_base and array_stride are gone; don't pass them.  The second
argument to LWLockRegisterTranche() is now the tranche name.  Pass
that instead of a structure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw bug in 9.6

2016-12-21 Thread Etsuro Fujita

On 2016/12/20 0:37, Tom Lane wrote:

Etsuro Fujita  writes:

On 2016/12/17 1:13, Tom Lane wrote:

So I think the rule could be



"When first asked to produce a path for a given foreign joinrel, collect
the cheapest paths for its left and right inputs, and make a nestloop path
(or hashjoin path, if full join) from those, using the join quals needed
for the current input relation pair.



Seems reasonable.



Use this as the fdw_outerpath for
all foreign paths made for the joinrel."



I'm not sure that would work well for foreign joins with sort orders.
Consider a merge join, whose left input is a 2-way foreign join with a
sort order that implements a full join and whose right input is a sorted
local table scan.  If the EPQ subplan for the foreign join wouldn't
produce the right sort order, the merge join might break during EPQ
rechecks (note that in this case the EPQ subplan for the foreign join
might produce more than a single row during an EPQ recheck).



How so?  We only recheck one row at a time, therefore it can be claimed to
have any sort order you care about.


I'll give that some more thought.  I agree with you except for that point,
so I've created a patch; I removed GetExistingLocalJoinPath and added a
helper function, CreateLocalJoinPath, that generates a local join path
for a given foreign join, as described above.  Please find it attached.


Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 1519,1540  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t
 Output: t1.c1, t2.c1, t1.c3, t1.*, t2.*
 Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
 Remote SQL: SELECT r1."C 1", r1.c3, CASE WHEN (r1.*)::text IS NOT NULL THEN ROW(r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8) END, r2."C 1", CASE WHEN (r2.*)::text IS NOT NULL THEN ROW(r2."C 1", r2.c2, r2.c3, r2.c4, r2.c5, r2.c6, r2.c7, r2.c8) END FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY r1.c3 ASC NULLS LAST, r1."C 1" ASC NULLS LAST FOR UPDATE OF r1
!->  Merge Join
   Output: t1.c1, t1.c3, t1.*, t2.c1, t2.*
!  Merge Cond: (t1.c1 = t2.c1)
!  ->  Sort
 Output: t1.c1, t1.c3, t1.*
!Sort Key: t1.c1
!->  Foreign Scan on public.ft1 t1
!  Output: t1.c1, t1.c3, t1.*
!  Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" FOR UPDATE
!  ->  Sort
 Output: t2.c1, t2.*
!Sort Key: t2.c1
!->  Foreign Scan on public.ft2 t2
!  Output: t2.c1, t2.*
!  Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1"
! (23 rows)
  
  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10 FOR UPDATE OF t1;
   c1  | c1  
--- 1519,1534 
 Output: t1.c1, t2.c1, t1.c3, t1.*, t2.*
 Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
 Remote SQL: SELECT r1."C 1", r1.c3, CASE WHEN (r1.*)::text IS NOT NULL THEN ROW(r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8) END, r2."C 1", CASE WHEN (r2.*)::text IS NOT NULL THEN ROW(r2."C 1", r2.c2, r2.c3, r2.c4, r2.c5, r2.c6, r2.c7, r2.c8) END FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY r1.c3 ASC NULLS LAST, r1."C 1" ASC NULLS LAST FOR UPDATE OF r1
!->  Nested Loop
   Output: t1.c1, t1.c3, t1.*, t2.c1, t2.*
!  Join Filter: (t1.c1 = t2.c1)
!  ->  Foreign Scan on public.ft1 t1
 Output: t1.c1, t1.c3, t1.*
!Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" FOR UPDATE
!  ->  Foreign Scan on public.ft2 t2
 Output: t2.c1, t2.*
!Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1"
! (17 rows)
  
  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10 FOR UPDATE OF t1;
   c1  | c1  
***
*** 1563,1584  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t
 Output: t1.c1, t2.c1, t1.c3, t1.*, t2.*
 Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
 Remote SQL: SELECT r1."C 1", r1.c3, CASE WHEN (r1.*)::text IS NOT NULL THEN ROW(r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8) END, r2."C 1", CASE WHEN (r2.*)::text IS NOT NULL THEN ROW(r2."C 1", r2.c2, r2.c3

Re: [HACKERS] Rethinking our fulltext phrase-search implementation

2016-12-21 Thread Artur Zakirov

Hello Tom,

On 17.12.2016 21:36, Tom Lane wrote:


4. The transformations are wrong anyway.  The OR case I showed above is
all right, but as I argued in <24331.1480199...@sss.pgh.pa.us>, the AND
case is not:

regression=# select 'a <-> (b & c)'::tsquery;
  tsquery
---
 'a' <-> 'b' & 'a' <-> 'c'
(1 row)

This matches 'a b a c', because 'a <-> b' and 'a <-> c' can each be
matched at different places in that text; but it seems highly unlikely to
me that that's what the writer of such a query wanted.  (If she did want
that, she would write it that way to start with.)  NOT is not very nice
either:


If I'm not mistaken, PostgreSQL 9.6 and master with the patch 
"fix-phrase-search.patch" return false for the query:


select 'a b a c' @@ 'a <-> (b & c)'::tsquery;
 ?column?
--
 f
(1 row)

I agree that such a query is confusing. Maybe it is better to return true 
for such queries?
Otherwise it seems that queries like 'a <-> (b & c)' will always return 
false. In that case maybe we need some warning message.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical tape pause/resume

2016-12-21 Thread Heikki Linnakangas

On 10/12/2016 05:30 PM, Heikki Linnakangas wrote:

On 10/10/2016 10:31 PM, Peter Geoghegan wrote:

On Sun, Oct 9, 2016 at 11:52 PM, Heikki Linnakangas  wrote:

Ok, good. I think we're in agreement on doing this, then, even if we don't
agree on the justification :-).


Agreed. :-)


Attached are new patch versions. Rebased them over current head, fixed a
number of bugs in the seek and backspace code, and did a bunch of
stylistic cleanups and comment fixes.


A rebased set of patches attached.

In the meantime, Robert committed the cap on the number of tapes. Since 
that's in, I'm not sure whether the pause/resume part of this is still 
worthwhile. Maybe it is.


For now, barring objections, I'm going to commit the first patch. It 
seems like a worthwhile simplification in any case, especially for 
Peter's Parallel tuplesort patch set.


- Heikki

>From 70470e8903c5e9db8252463e1f0b063d209727dd Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 21 Dec 2016 14:12:37 +0200
Subject: [PATCH 1/2] Simplify tape block format.

No more indirect blocks. The blocks form a linked list instead.

This saves some memory, because we don't need to have a buffer in memory to
hold the indirect block (or blocks). To reflect that, TAPE_BUFFER_OVERHEAD
is reduced from 3 to 1 buffer, which allows using more memory for building
the initial runs.
---
 src/backend/utils/sort/logtape.c   | 627 +++--
 src/backend/utils/sort/tuplesort.c |  69 ++--
 src/include/utils/logtape.h|   4 +-
 3 files changed, 218 insertions(+), 482 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 316301061d..07d147dc47 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -31,15 +31,8 @@
  * in BLCKSZ-size blocks.  Space allocation boils down to keeping track
  * of which blocks in the underlying file belong to which logical tape,
  * plus any blocks that are free (recycled and not yet reused).
- * The blocks in each logical tape are remembered using a method borrowed
- * from the Unix HFS filesystem: we store data block numbers in an
- * "indirect block".  If an indirect block fills up, we write it out to
- * the underlying file and remember its location in a second-level indirect
- * block.  In the same way second-level blocks are remembered in third-
- * level blocks, and so on if necessary (of course we're talking huge
- * amounts of data here).  The topmost indirect block of a given logical
- * tape is never actually written out to the physical file, but all lower-
- * level indirect blocks will be.
+ * The blocks in each logical tape form a chain, with a prev- and next-
+ * pointer in each block.
  *
  * The initial write pass is guaranteed to fill the underlying file
  * perfectly sequentially, no matter how data is divided into logical tapes.
@@ -87,58 +80,65 @@
 #include "utils/memutils.h"
 
 /*
- * Block indexes are "long"s, so we can fit this many per indirect block.
- * NB: we assume this is an exact fit!
- */
-#define BLOCKS_PER_INDIR_BLOCK	((int) (BLCKSZ / sizeof(long)))
-
-/*
- * We use a struct like this for each active indirection level of each
- * logical tape.  If the indirect block is not the highest level of its
- * tape, the "nextup" link points to the next higher level.  Only the
- * "ptrs" array is written out if we have to dump the indirect block to
- * disk.  If "ptrs" is not completely full, we store -1L in the first
- * unused slot at completion of the write phase for the logical tape.
+ * A TapeBlockTrailer is stored at the end of each BLCKSZ block.
+ *
+ * The first block of a tape has prev == -1.  The last block of a tape
+ * stores the number of valid bytes on the block, inverted, in 'next'
+ * Therefore next < 0 indicates the last block.
  */
-typedef struct IndirectBlock
+typedef struct TapeBlockTrailer
 {
-	int			nextSlot;		/* next pointer slot to write or read */
-	struct IndirectBlock *nextup;		/* parent indirect level, or NULL if
-		 * top */
-	long		ptrs[BLOCKS_PER_INDIR_BLOCK];	/* indexes of contained blocks */
-} IndirectBlock;
+	long		prev;			/* previous block on this tape, or -1 on first
+ * block */
+	long		next;			/* next block on this tape, or # of valid
+ * bytes on last block (if < 0) */
+}	TapeBlockTrailer;
+
+#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
+#define TapeBlockGetTrailer(buf) \
+	((TapeBlockTrailer *) ((char *) buf + TapeBlockPayloadSize))
+
+#define TapeBlockIsLast(buf) (TapeBlockGetTrailer(buf)->next < 0)
+#define TapeBlockGetNBytes(buf) \
+	(TapeBlockIsLast(buf) ? \
+	 (- TapeBlockGetTrailer(buf)->next) : TapeBlockPayloadSize)
+#define TapeBlockSetNBytes(buf, nbytes) \
+	(TapeBlockGetTrailer(buf)->next = -(nbytes))
+
 
 /*
  * This data structure represents a single "logical tape" within the set
- * of logical tapes stored in the same file.  We must keep track of the
- * current partially-read-or-written data block as well as th

Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down

2016-12-21 Thread Craig Ringer
On 21 December 2016 at 15:40, Ants Aasma  wrote:

>> So -1 on this part of the patch, unless there's something I've misunderstood.
>
> Currently there was no feedback sent if hot standby was not active. I
> was not sure if it was safe to call GetOldestXmin() in that case.
> However I did not consider cascading replica slots wanting to hold
> back xmin, where resetting the parents xmin is indeed wrong. Do you
> know if GetOldestXmin() is safe at this point and we can just remove
> the HotStandbyActive() check? Otherwise I think the correct approach
> is to move the check and return inside the hot_standby_feedback case
> like this:
>
> if (hot_standby_feedback)
> {
> if (!HotStandbyActive())
>return;

I feel like I'm missing something obvious here. If we force sending
hot standby feedback at least once, by assuming
master_has_standby_xmin = true at startup, why isn't that sufficient?
We'll send InvalidTransactionId if hot_standby_feedback is off. Isn't
that the point?

It's safe to call GetOldestXmin pretty much as soon as we load the
recovery start checkpoint. It won't consider the state of replication
slots until later in startup, but that's a pre-existing flaw that
should be addressed separately.

Why do we need to do more than master_has_standby_xmin = true ?
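
To make that concrete, the control flow I have in mind is roughly the
sketch below. It is an illustration, not the actual walreceiver code or
anyone's patch; in particular the GetOldestXmin() arguments and the
message-sending step are placeholders only.

#include "postgres.h"
#include "access/transam.h"
#include "storage/procarray.h"

extern bool hot_standby_feedback;       /* the GUC */

/* Start as true so one feedback message always goes out after startup. */
static bool master_has_standby_xmin = true;

static void
send_hs_feedback_sketch(void)
{
    TransactionId xmin;

    if (hot_standby_feedback)
        xmin = GetOldestXmin(NULL, false);  /* arguments illustrative */
    else
        xmin = InvalidTransactionId;

    /* Nothing to advertise and the master already knows it: skip. */
    if (!TransactionIdIsValid(xmin) && !master_has_standby_xmin)
        return;

    /* ... build and send the feedback message carrying xmin here ... */

    /* Remember whether the master is currently holding back xmin for us. */
    master_has_standby_xmin = TransactionIdIsValid(xmin);
}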

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-21 Thread Heikki Linnakangas

On 12/21/2016 12:53 AM, Robert Haas wrote:

That leaves one problem, though: reusing space in the final merge phase. If
the tapes being merged belong to different LogicalTapeSets, and we create one
new tape to hold the result, the new tape cannot easily reuse the space of
the input tapes because they are on different tape sets.


If the worker is always completely finished with the tape before the
leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
the tape and overwrite it like any other?


Currently, the logical tape code assumes that all tapes in a single 
LogicalTapeSet are allocated from the same BufFile. The logical tape's 
on-disk format contains block numbers, to point to the next/prev block 
of the tape [1], and they're assumed to refer to the same file. That 
allows reusing space efficiently during the merge. After you have read 
the first block from tapes A, B and C, you can immediately reuse those 
three blocks for output tape D.


Now, if you read multiple tapes, from different LogicalTapeSet, hence 
backed by different BufFiles, you cannot reuse the space from those 
different tapes for a single output tape, because the on-disk format 
doesn't allow referring to blocks in other files. You could reuse the 
space of *one* of the input tapes, by placing the output tape in the 
same LogicalTapeSet, but not all of them.


We could enhance that, by using "filename + block number" instead of 
just block number, in the pointers in the logical tapes. Then you could 
spread one logical tape across multiple files. Probably not worth it in 
practice, though.
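
As a very rough sketch of that idea (purely illustrative, not part of any
posted patch), the block trailer from the logtape patch could carry a file
identifier next to each block number:

/* Hypothetical variant of TapeBlockTrailer: neighbours are identified by
 * (file, block) pairs so one logical tape could span several BufFiles.
 * For illustration only. */
typedef struct TapeBlockId
{
    int         fileno;         /* which underlying BufFile */
    long        blkno;          /* block number within that file */
} TapeBlockId;

typedef struct TapeBlockTrailerX
{
    TapeBlockId prev;           /* previous block; fileno == -1 on the first
                                 * block of a tape */
    TapeBlockId next;           /* next block; blkno < 0 encodes the number of
                                 * valid bytes on the last block, as before */
} TapeBlockTrailerX;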



[1] As the code stands, there are no next/prev pointers, but a tree of 
"indirect" blocks. But I'm planning to change that to simpler next/prev 
pointers, in 
https://www.postgresql.org/message-id/flat/55b3b7ae-8dec-b188-b8eb-e07604052351%40iki.fi


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SET NOT NULL [NOT VALID / CONCURRENTLY]?

2016-12-21 Thread Joel Jacobson
If you are fully confident you have no NULL values,
e.g. if you have all your logic in db functions and you validate that no
INSERTs into a table can pass NULL values,
and you have checked that all rows in the table are NOT NULL for the column,
would it be completely crazy to just set pg_attribute.attnotnull to
TRUE for the column?

Is anything else happening "under the hood" other than locking all rows,
verifying there are no NULL rows, and then setting attnotnull to
TRUE?


On Wed, Dec 21, 2016 at 6:37 PM, Craig Ringer  wrote:
> On 21 December 2016 at 19:01, Joel Jacobson  wrote:
>
>> Similar to what we (Trustly) did when we sponsored the FOR KEY LOCK
>> feature to improve concurrency,
>> we would be very interested in also sponsoring this feature, as it
>> would mean a great deal to us.
>> I don't know if this is the right forum for trying to find someone/some
>> company to sign up for the task;
>> please let me know if I should mail to some other list. Thanks.
>
> You'll probably get mail off list.
>
> For what it's worth, there's a bit of complexity here. PostgreSQL
> doesn't model NOT NULL as a true CONSTRAINT. Instead it's a column
> attribute. I suspect we would need to change that in order to allow a
> NOT VALID NOT NULL constraint to be created.
>
> That's at least partly why the docs say that "option NOT VALID [...]
> is currently only allowed for foreign key and CHECK constraints".
>
> Note that "[VALIDATE] acquires only a SHARE UPDATE EXCLUSIVE lock on
> the table being altered" so it's already suitable for what you need.
> The challenge is making it possible to create a NOT VALID constraint
> for NOT NULL.
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



-- 
Joel Jacobson

Mobile: +46703603801
Trustly.com | Newsroom | LinkedIn | Twitter


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SET NOT NULL [NOT VALID / CONCURRENTLY]?

2016-12-21 Thread Craig Ringer
On 21 December 2016 at 19:01, Joel Jacobson  wrote:

> Similar to what we (Trustly) did when we sponsored the FOR KEY LOCK
> feature to improve concurrency,
> we would be very interested in also sponsoring this feature, as it
> would mean a great deal to us.
> I don't know if this is the right forum for trying to find someone/some
> company to sign up for the task;
> please let me know if I should mail to some other list. Thanks.

You'll probably get mail off list.

For what it's worth, there's a bit of complexity here. PostgreSQL
doesn't model NOT NULL as a true CONSTRAINT. Instead it's a column
attribute. I suspect we would need to change that in order to allow a
NOT VALID NOT NULL constraint to be created.

That's at least partly why the docs say that "option NOT VALID [...]
is currently only allowed for foreign key and CHECK constraints".

Note that "[VALIDATE] acquires only a SHARE UPDATE EXCLUSIVE lock on
the table being altered" so it's already suitable for what you need.
The challenge is making it possible to create a NOT VALID constraint
for NOT NULL.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] multi-level partitions and partition-wise joins

2016-12-21 Thread Ashutosh Bapat
Hi,
I am starting a separate thread for this since the declarative
partitioning thread has many issues reported on it, and it's better to
keep this discussion separate from those.

While expanding inheritance, any inheritance hierarchy is flattened
out including partition hierarchy. Partition-wise joins can be
employed if the joining tables have the same partitioning scheme and
have equi-join clauses on the partition keys. If two multi-level
partitioned tables are joined, the partition-wise join can be
percolated down to the levels up to which the partition schemes match
and suitable clauses are available. E.g. if two multi-level
partitioned tables have matching partitioning schemes at the top-most
level, but not below that, we may join the topmost-level partitions
pair-wise, but not the partitions on the lower levels. In general, we may
use partition-wise join for the matching parts of partition hierarchy
and in the parts that do not match, use join between append relations.
It will not always be efficient to execute partition-wise joins down to
the last level of the partition hierarchy, even if a partition-wise join
can be employed. It may be that executing partition-wise joins for only
certain parts of the partition hierarchy is efficient, while a join of
appends is more efficient in the remaining parts.

In order to decide whether a partition-wise join is efficient for a join
between given partitioned partitions, we need to identify their
subpartitions. Similarly, when a join between partitioned partitions can
not use a partition-wise join but some other partitions can, we need to
identify the subpartitions of those partitions, so that they can be
appended together before joining. That information is lost while
expanding the RTEs. It looks like we need to retain the partitioning
hierarchy in order to implement partition-wise joins between multi-level
partitioned tables.

An earlier version of Amit's partition support patches had code to
retain partitioning hierarchy but it seems it was removed per
discussion at [1]. I agree with that decision.

[1]. 
https://www.postgresql.org/message-id/CA%2BTgmobMy%3DrqM%3DMTN_FUEfD-PiWSCSonH%2BZ1_SjL6ZmQ2GGz1w%40mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] invalid combination of options "-D - -F t -X stream" in pg_basebackup

2016-12-21 Thread Fujii Masao
On Tue, Dec 20, 2016 at 9:22 PM, Magnus Hagander  wrote:
>
>
> On Tue, Dec 20, 2016 at 6:56 AM, Fujii Masao  wrote:
>>
>> On Tue, Dec 20, 2016 at 1:43 AM, Magnus Hagander 
>> wrote:
>> >
>> >
>> > On Mon, Dec 19, 2016 at 5:39 PM, Fujii Masao 
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> Isn't it better to forbid the combination of the options "-D -", "-F t"
>> >> and
>> >> "-X stream" in pg_basebackup? This is obviously an invalid setting and the
>> >> docs
>> >> warn about this as follows. But currently users can specify such a setting and
>> >> pg_basebackup can exit unexpectedly with an error.
>> >>
>> >> ---
>> >> If the value - (dash) is specified as target directory, the tar
>> >> contents
>> >> will
>> >> be written to standard output, suitable for piping to for example gzip.
>> >> This is only possible if the cluster has no additional tablespaces.
>> >> ---
>> >
>> >
>> > Yes, definitely. I'd say that's an oversight in implementing the support
>> > for
>> > stream-to-tar that it did not detect this issue.
>> >
>> > Do you want to provide a patch for it, or should I?
>>
>> What about the attached patch?
>>
>> +fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>> +progname);
>>
>> I added the above hint message because other code checking invalid
>> options also has such hint messages. But there is no additional
>> useful information about valid combinations of options in the help
>> messages, so I'm a bit tempted to remove the above hint message.
>
>
> Looks good to me.

Thanks for the review! Pushed.

I left the hint message for consistency with other code.
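
For archive readers, the pushed check amounts to something along the lines
of the sketch below (the variable names are illustrative, not copied from
the committed hunk):

#include <stdbool.h>
#include <string.h>

/*
 * Sketch of the rejected combination: tar-format output to stdout ("-D -")
 * cannot be combined with streamed WAL ("-X stream"), because the streamed
 * WAL has to go into its own tar file.  Illustrative only.
 */
static bool
backup_target_valid(char format, bool streamwal, const char *basedir)
{
    if (format == 't' && streamwal && strcmp(basedir, "-") == 0)
        return false;           /* caller prints the error and the --help
                                 * hint, then exits with a non-zero status */
    return true;
}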

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SET NOT NULL [NOT VALID / CONCURRENTLY]?

2016-12-21 Thread Joel Jacobson
On Wed, Dec 21, 2016 at 4:24 PM, Craig Ringer  wrote:
>> Is anyone working on fixing this for PostgreSQL 10?
>
> Not as far as I know.
>
> IMO this and other similar cases should all be handled the same way:
> create the constraint NOT VALID, then VALIDATE it while holding a weak
> lock that only blocks concurrent schema changes.

Sounds like a good approach.

Similar to what we (Trustly) did when we sponsored the FOR KEY LOCK
feature to improve concurrency,
we would be very interested in also sponsoring this feature, as it
would mean a great deal to us.
I don't know if this is the right forum for trying to find someone/some
company to sign up for the task;
please let me know if I should mail to some other list. Thanks.

Joel Jacobson
Trustly


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-12-21 Thread Amit Langote
On 2016/12/21 1:53, Robert Haas wrote:
> On Mon, Dec 19, 2016 at 10:59 PM, Robert Haas  wrote:
>> On Sun, Dec 18, 2016 at 10:00 PM, Amit Langote
>>  wrote:
>>> Here are updated patches including the additional information.
>>
>> Thanks.  Committed 0001.  Will have to review the others when I'm less tired.
> 
> 0002. Can you add a test case for the bug fixed by this patch?

Done (also see below).

> 0003. Loses equalTupleDescs() check and various cases where
> ExecOpenIndexes can be skipped.  Isn't that bad?

I realized that, as far as checking whether a tuple conversion mapping is
required goes, the checks performed by convert_tuples_by_name() at the
beginning of the function following the comment /* Verify compatibility
and prepare attribute-number map */ are enough.  If equalTupleDescs()
returned false (which it always does, because the tdtypeid values are never
the same for the passed-in tuple descriptors), we would be repeating some
of the tests in convert_tuples_by_name() anyway.

As for the checks performed for ExecOpenIndices(), it seems would be
better to keep the following in place, so added back:

if (leaf_part_rri->ri_RelationDesc->rd_rel->relhasindex &&
leaf_part_rri->ri_IndexRelationDescs == NULL)
ExecOpenIndices(leaf_part_rri, false);

> Also, "We locked all
> the partitions above including the leaf partitions" should say "Caller
> must have locked all the partitions including the leaf partitions".

No, we do the locking in RelationGetPartitionDispatchInfo(), which is
called by ExecSetupPartitionTupleRouting() itself.

ExecSetupPartitionTupleRouting() is the first place where we lock all the
partitions.

> 0004. Unnecessary whitespace change in executor.h.  Still don't
> understand why we need to hoist RelationGetPartitionQual() into the
> caller.

We only need to check a result relation's (ri_RelationDesc's) partition
constraint if we are inserting into the result relation directly.  In the
case of tuple routing, we do not want to check the leaf partitions' partition
constraints, but if the direct target in that case is an internal
partition, we must check its partition constraint, which is the same for all
leaf partitions in that sub-tree.  It wouldn't be wrong per se to check
each leaf partition's constraint in that case, which includes the target
partitioned table's constraint as well, but that would be inefficient due to
both having to retrieve the constraints and having ExecConstraints()
*unnecessarily* check them for every row.

If we keep doing RelationGetPartitionQual() in InitResultRelInfo()
controlled by a bool argument (load_partition_check), we cannot avoid the
above mentioned inefficiency if we want to fix this bug.

> 0005. Can you add a test case for the bug fixed by this patch?

Done, but...

Breaking changes into multiple commits/patches does not seem to work for
adding regression tests.  So, I've combined multiple patches into a single
patch which is now patch 0002 in the attached set of patches.  Its commit
message is very long now.  To show an example of the bugs that 0002 is meant to fix:

create table p (a int, b int) partition by range (a, b);
create table p1 (b int, a int not null) partition by range (b);
create table p11 (like p1);
alter table p11 drop a;
alter table p11 add a int;
alter table p11 drop a;
alter table p11 add a int not null;

# select attrelid::regclass, attname, attnum
  from pg_attribute
  where attnum > 0
  and (attrelid = 'p'::regclass
or attrelid = 'p1'::regclass
or attrelid = 'p11'::regclass) and attname = 'a';
 attrelid | attname | attnum
--+-+
 p| a   |  1
 p1   | a   |  2
 p11  | a   |  4
(3 rows)

alter table p1 attach partition p11 for values from (1) to (5);
alter table p attach partition p1 for values from (1, 1) to (1, 10);

-- the following is wrong
# insert into p11 (a, b) values (10, 4);
INSERT 0 1

-- wrong too (using the wrong TupleDesc after tuple routing)
# insert into p1 (a, b) values (10, 4);
ERROR:  null value in column "a" violates not-null constraint
DETAIL:  Failing row contains (4, null).

-- once we fix the wrong TupleDesc issue
# insert into p1 (a, b) values (10, 4);
INSERT 0 1

which is wrong because p1, as a partition of p, should not accept 10 for
a.  But its partition constraint is not being applied to the leaf
partition p11 into which the tuple is routed (the bug).

Thanks,
Amit
>From d21ae74ae01e93f21df5b84ed097ebbc85e529dd Mon Sep 17 00:00:00 2001
From: amit 
Date: Thu, 15 Dec 2016 15:56:27 +0900
Subject: [PATCH 1/3] Refactor tuple-routing setup code

It's unnecessarily repeated in copy.c and nodeModifyTable.c, which makes
it a burden to maintain.  Should've been like this to begin with.

I moved the common code to ExecSetupPartitionTupleRouting() in execMain.c
that also houses ExecFindPartition() currently.  Hmm, should there be a
new src/backend/executor/execPartition.c?

Reported by: n/a
Patch by: Amit Langote
Reports: n/a
---
 src/backend/commands/copy.c

Re: [HACKERS] pg_background contrib module proposal

2016-12-21 Thread Craig Ringer
On 21 December 2016 at 14:26, Andrew Borodin  wrote:

> I'm not sure every platform supports microsecond sleeps

Windows at least doesn't by default, unless that changed in Win2k12
and Win8 with the same platform/kernel improvements that delivered
https://msdn.microsoft.com/en-us/library/hh706895(v=vs.85).aspx . I'm
not sure. On older systems, sleep granularity is 1ms to 15ms.
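
So any wait loop has to tolerate that. A minimal sketch of the usual idiom
(generic PostgreSQL style, not code from the pg_background patch; the
function and flag names are made up):

#include "postgres.h"
#include "miscadmin.h"

/*
 * Illustrative polling loop: pg_usleep() takes microseconds, but on some
 * platforms (notably older Windows) the effective resolution is 1-15 ms,
 * so nothing here may rely on sub-millisecond timing.
 */
static void
wait_until_ready(volatile bool *ready)
{
    while (!(*ready))
    {
        CHECK_FOR_INTERRUPTS();
        pg_usleep(1000L);       /* ask for 1 ms; may sleep much longer */
    }
}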

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-12-21 Thread Fujii Masao
On Wed, Dec 21, 2016 at 10:39 AM, Kyotaro HORIGUCHI
 wrote:
> At Tue, 20 Dec 2016 23:47:22 +0900, Fujii Masao  wrote 
> in 
>> On Tue, Dec 20, 2016 at 2:46 PM, Michael Paquier
>>  wrote:
>> > On Tue, Dec 20, 2016 at 2:31 PM, Masahiko Sawada  
>> > wrote:
>> >> Do we need to consider the sorting method and the selecting k-th
>> >> latest LSN method?
>> >
>> > Honestly, nah. Tests are showing that there are many more bottlenecks
>> > before that with just memory allocation and parsing.
>>
>> I think that it's worth prototyping an alternative algorithm and
>> measuring the performance of both the alternative and the current
>> algorithms. This measurement should check not only the bottleneck
>> but also how much each algorithm increases the time that backends
>> need to wait before they receive the ack from the walsender.
>>
>> If it's reported that the current algorithm is "efficient" enough,
>> we can just leave the code as it is. OTOH, if not, let's adopt
>> the alternative one.
>
> I'm personally interested in the difference between them, but it
> doesn't seem urgently required.

Yes, it's not urgent task.

> If we have nothing particular to
> do with this, considering another ordering method would be
> valuable.
>
> Though this is not a well-grounded thought, maintaining the top-k list
> by insertion sort looks more promising than running a top-k
> sort on the whole list. Sorting all walsenders is still needed
> the first time, and in some other situations, though.
>
>
> By the way, do we continue dispu^h^hcussing on the format of
> s_s_names and/or a successor right now?

Yes. If there is better approach, we should discuss that.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

