Re: cleaning perl code

2020-04-10 Thread Noah Misch
On Thu, Apr 09, 2020 at 11:44:11AM -0400, Andrew Dunstan wrote:
>  39 Always unpack @_ first

Requiring a "my @args = @_" does not improve this code:

sub CreateSolution
{
	...
	if ($visualStudioVersion eq '12.00')
	{
		return new VS2013Solution(@_);
	}
	...
}

>  30 Code before warnings are enabled

Sounds good.  We already require "use strict" before code.  Requiring "use
warnings" in the exact same place does not impose much burden.

>  12 Subroutine "new" called using indirect syntax

No, thanks.  "new VS2013Solution(@_)" and "VS2013Solution->new(@_)" are both
fine; enforcing the latter is an ongoing waste of effort.

>   9 Multiple "package" declarations

This is good advice if you're writing for CPAN, but it would make PostgreSQL
code worse by having us split affiliated code across multiple files.

>   9 Expression form of "grep"

No, thanks.  I'd be happier with the opposite, requiring grep(/x/, $arg)
instead of grep { /x/ } $arg.  Neither is worth enforcing.

>   7 Symbols are exported by default

This is good advice if you're writing for CPAN.  For us, it just adds typing.

>   5 Warnings disabled
>   4 Magic variable "$/" should be assigned as "local"
>   4 Comma used to separate statements
>   2 Readline inside "for" loop
>   2 Pragma "constant" used
>   2 Mixed high and low-precedence booleans
>   2 Don't turn off strict for large blocks of code
>   1 Magic variable "@a" should be assigned as "local"
>   1 Magic variable "$|" should be assigned as "local"
>   1 Magic variable "$\" should be assigned as "local"
>   1 Magic variable "$?" should be assigned as "local"
>   1 Magic variable "$," should be assigned as "local"
>   1 Magic variable "$"" should be assigned as "local"
>   1 Expression form of "map"

I looked less closely at the rest, but none give me a favorable impression.


In summary, among those warnings, I see non-negative value in "Code before
warnings are enabled" only.  While we're changing this, I propose removing
Subroutines::RequireFinalReturn.  Implicit return values were not a material
source of PostgreSQL bugs, yet we've allowed this to litter our code:

$ find src -name '*.p[lm]' | xargs grep -n '^.return;' | wc -l
194




Re: Cache relation sizes?

2020-04-10 Thread Thomas Munro
On Fri, Feb 14, 2020 at 1:50 PM Thomas Munro  wrote:
> On Thu, Feb 13, 2020 at 7:18 PM Thomas Munro  wrote:
> > ... (1) I'm pretty sure some systems would not be happy
> > about that (see claims in this thread) ...
>
> I poked a couple of people off-list and learned that, although the
> Linux and FreeBSD systems I tried could do a million lseek(SEEK_END)
> calls in 60-100ms, a couple of different Windows systems took between
> 1.1 and 3.4 seconds (worse times when running as non-administrator),
> which seems to be clearly in the territory that can put a dent in your
> recovery speeds on that OS.  I also learned that GetFileSizeEx() is
> "only" about twice as fast, which is useful information for that other
> thread about which syscall to use for this, but it's kind of
> irrelevant to this thread, which is about how we can get rid of these
> crazy syscalls altogether.

I received a report off-list from someone who experimented with the
patch I shared earlier on this thread[1], using a crash recovery test
similar to one I showed on the WAL prefetching thread[2] (which he was
also testing, separately).

He observed that the lseek() rate in recovery was actually a
significant problem for his environment on unpatched master, showing
up as the top sampled function in perf, and by using that patch he got
(identical) crash recovery to complete in 41s instead of 65s, with a
sane-looking perf profile (a 58% speedup).  His test system was an AWS
i3.16xlarge running an unspecified version of Linux.

I think it's possible that all of the above reports can be explained
by variations in the speculative execution bug mitigations that are
enabled by default on different systems, but I haven't tried to
investigate that yet.
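
For anyone who wants to reproduce the numbers above, the probe being
counted is nothing more exotic than the loop below (a minimal standalone
sketch; the file name and iteration count are illustrative, not taken
from the patch):

#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	/* "relfile" stands in for any relation segment file */
	int			fd = open("relfile", O_RDONLY);

	if (fd < 0)
		return 1;
	for (int i = 0; i < 1000000; i++)
	{
		/* the size probe whose per-call cost is being measured */
		off_t		sz = lseek(fd, 0, SEEK_END);

		(void) sz;
	}
	close(fd);
	return 0;
}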

[1] https://github.com/postgres/postgres/compare/master...macdice:cache-nblocks
[2] https://www.postgresql.org/message-id/CA%2BhUKG%2BOcWr8nHqa3%3DZoPTGgdDcrgjSC4R2sT%2BjrUunBua3rpg%40mail.gmail.com




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-04-10 Thread Etsuro Fujita
On Sat, Apr 11, 2020 at 1:00 AM Tom Lane  wrote:
> Ashutosh Bapat  writes:
> > On Fri, Apr 10, 2020 at 9:14 PM Tom Lane  wrote:
> >> I see no patch here ...
>
> > Sorry. Here it is
>
> LGTM, will push in a moment.

Thanks for taking care of this, Tom!  Thanks for the patch, Ashutosh!
Thanks for the report, Kuntal!

Best regards,
Etsuro Fujita




Re: Corruption during WAL replay

2020-04-10 Thread Alvaro Herrera
On 2020-Mar-30, Andres Freund wrote:

> If we are really concerned with truncation failing - I don't know why we
> would be, we accept that we have to be able to modify files etc to stay
> up - we can add a pre-check ensuring that permissions are set up
> appropriately to allow us to truncate.

I remember I saw a case where the datadir was NFS or some other network
filesystem thingy, and it lost connection just before autovacuum
truncation, or something like that -- so there was no permission
failure, but the truncate failed and yet PG soldiered on.  I think the
connection was re-established soon thereafter and things went back to
"normal", with nobody realizing that a truncate had been lost.
Corruption was discovered a long time afterwards IIRC (weeks or months,
I don't remember).

I didn't review Teja's patch carefully, but the idea of panicking on
failure (causing WAL replay) seems better than the current behavior.
I'd rather have the server wait until storage is really back.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_basebackup, manifests and backends older than ~12

2020-04-10 Thread Andres Freund
Hi,

On 2020-04-10 16:32:08 -0400, David Steele wrote:
> On 4/10/20 4:09 AM, Michael Paquier wrote:
> > 
> > I have noticed that attempting to use pg_basebackup from HEAD leads to
> > failures when using it with backend versions from 12 and older:
> > $ pg_basebackup -D hoge
> > pg_basebackup: error: backup manifests are not supported by server
> > version 12beta2
> > pg_basebackup: removing data directory "hoge"
> > 
> > This is a bit backwards with what we did in the past to maintain
> > compatibility silently when possible, for example look at the handling
> > of temporary replication slots.  Instead of an error, which means
> > forcing users to specify --no-manifest in this case, shouldn't
> > we silently disable the generation of the backup manifest?  We know
> > that this option won't work on older server versions anyway.
> 
> I'm a bit conflicted here. I see where you are coming from, but given that
> writing a manifest is now the default I'm not sure silently skipping it is
> ideal.

I think we at the very least should add a hint about how to perform a
backup without a manifest.

Greetings,

Andres Freund




Re: SyncRepLock acquired exclusively in default configuration

2020-04-10 Thread Masahiko Sawada
On Fri, 10 Apr 2020 at 21:52, Fujii Masao  wrote:
>
>
>
> On 2020/04/10 20:56, Masahiko Sawada wrote:
> > On Fri, 10 Apr 2020 at 18:57, Fujii Masao  
> > wrote:
> >>
> >>
> >>
> >> On 2020/04/10 14:11, Masahiko Sawada wrote:
> >>> On Fri, 10 Apr 2020 at 13:20, Fujii Masao  
> >>> wrote:
> 
> 
> 
>  On 2020/04/08 3:01, Ashwin Agrawal wrote:
> >
> > On Mon, Apr 6, 2020 at 2:14 PM Andres Freund wrote:
> >
> >> How about we change it to this ?
> >
> >   Hm. Better. But I think it might need at least a compiler barrier /
> >   volatile memory load?  Unlikely here, but otherwise the compiler could
> >   theoretically just stash the variable somewhere locally (it's not likely
> >   to be a problem because it'd not be long ago that we acquired an lwlock,
> >   which is a full barrier).
> >
> >
> > That's the part I am not fully sure about. But reading the comment
> > above SyncRepUpdateSyncStandbysDefined(), it seems fine.
> >
> >> Bring back the check which existed based on GUC but instead of just blindly
> >> returning based on just GUC not being set, check
> >> WalSndCtl->sync_standbys_defined. Thoughts?
> >
> >   Hm. Is there any reason not to just check
> >   WalSndCtl->sync_standbys_defined? rather than both !SyncStandbysDefined()
> >   and WalSndCtl->sync_standbys_defined?
> >
> >
> > Agree, just checking for WalSndCtl->sync_standbys_defined seems fine.
> 
>  So the consensus is something like the following? Patch attached.
> 
> 	/*
> -	 * Fast exit if user has not requested sync replication.
> +	 * Fast exit if user has not requested sync replication, or there are no
> +	 * sync replication standby names defined.
> 	 */
> -	if (!SyncRepRequested())
> +	if (!SyncRepRequested() ||
> +		!((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
> 		return;
> 
> >>>
> >>> I think we need more comments describing why checking
> >>> sync_standbys_defined without SyncRepLock is safe here. For example:
> >>
> >> Yep, agreed!
> >>
> >>> This routine gets called at every commit. So, to check as quickly as
> >>> possible whether any synchronous standbys are defined, we check
> >>> WalSndCtl->sync_standbys_defined without acquiring SyncRepLock. Since
> >>> we make this test unlocked, there's a chance we might fail to notice
> >>> that it has been turned off and continue processing.
> >>
> >> Does this really happen? I was thinking that the problem with not taking
> >> the lock here is that SyncRepWaitForLSN() can see that shared flag after
> >> SyncRepUpdateSyncStandbysDefined() wakes up all the waiters and
> >> before it sets the flag to false. Then if SyncRepWaitForLSN() adds itself
> >> into the wait queue because the flag was true, without the lock, it may
> >> keep sleeping indefinitely.
> >
> > I think that because a backend does the following check after
> > acquiring SyncRepLock, once it has taken SyncRepLock it can see
> > that sync_standbys_defined is false and return.
>
> Yes, but the backend can see that sync_standbys_defined indicates false
> whether holding SyncRepLock or not, after the checkpointer sets it to false.
>
> > But you meant that we do both checks without SyncRepLock?
>
> Maybe no. The change that the latest patch provides should be applied,
> I think. That is, sync_standbys_defined should be checked without the
> lock at first; then, only if it's true, it should be checked again with
> the lock held.

Yes. My understanding is the same.

After applying your patch, SyncRepWaitForLSN() is going to become
something like:

/*
 * Fast exit if user has not requested sync replication, or there are no
 * sync replication standby names defined.
 */
if (!SyncRepRequested() ||
!((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
return;

Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);

LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);

/*
 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
 * set.  See SyncRepUpdateSyncStandbysDefined.
 *
 * Also check that the standby hasn't already replied. Unlikely race
 * condition but we'll be fetching that cache line anyway so it's likely
 * to be a low cost check.
 */
if (!WalSndCtl->sync_standbys_defined ||
lsn <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
return;
}

/*
 * Set our waitLSN so WALSender will know when to wake us, and add
 * ourselves to the queue.
 */
MyProc->waitLSN = lsn;
 

Re: Corruption during WAL replay

2020-04-10 Thread Teja Mupparti
Thanks Andres and Kyotaro for the quick review.  I have fixed the typos and 
also included the critical section (emulated it with try-catch block since 
palloc()s are causing issues in the truncate code). This time I used git 
format-patch.

Regards
Teja




From: Andres Freund 
Sent: Monday, March 30, 2020 4:31 PM
To: Kyotaro Horiguchi 
Cc: tejesw...@hotmail.com ; pgsql-hack...@postgresql.org 
; hexexp...@comcast.net 
Subject: Re: Corruption during WAL replay

Hi,

On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:
> At Mon, 23 Mar 2020 20:56:59 +, Teja Mupparti wrote in
> > The original bug reporting-email and the relevant discussion is here
> ...
> > The crux of the fix is, in the current code, engine drops the buffer and 
> > then truncates the file, but a crash before the truncate and after the 
> > buffer-drop is causing the corruption. Patch reverses the order i.e. 
> > truncate the file and drop the buffer later.
>
> BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I don't think that's true. For any of this to be relevant the buffer has
to be dirty. In which case BufferAlloc() has to call
FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
set.

What path are you thinking of? Or alternatively, what am I missing?


> I'm not sure it's acceptable to remember all to-be-deleted buffers
> while truncating.

I don't see a real problem with it. Nor really a good alternative. Note
that for autovacuum truncations we'll only truncate a limited number of
buffers at once, and for most relation truncations we don't enter this
path (since we create a new relfilenode instead).


>
> +  /*START_CRIT_SECTION();*/

> Is this a point of argument?  It is not needed if we choose the
> strategy (c) in [1], since the protocol is aiming to allow the server
> to continue running after truncation failure.
>
> [1]: 
> https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de

I think it's entirely broken to continue running after a truncation
failure. We obviously have to first WAL log the truncation (since
otherwise we can crash just after doing the truncation). But we cannot
just continue running after WAL logging, but not performing the
associated action: The most obvious reason is that otherwise a replica
will execute the trunction, but the primary will not.

The whole justification for that behaviour "It would turn a usually
harmless failure to truncate, that might spell trouble at WAL replay,
into a certain PANIC." was always dubious (since on-disk and in-memory
state now can diverge), but it's clearly wrong once replication had
entered the picture. There's just no alternative to a critical section
here.
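
Spelled out, the ordering that follows from this is roughly the sketch
below. Every function name here is an illustrative stand-in, not the
real smgr/bufmgr API, and error handling is elided:

typedef unsigned int BlockNumber;
typedef struct Relation Relation;	/* opaque stand-in */

extern void wal_log_truncate(Relation *rel, BlockNumber nblocks);		/* hypothetical */
extern void start_crit_section(void);	/* stands in for START_CRIT_SECTION() */
extern void end_crit_section(void);		/* stands in for END_CRIT_SECTION() */
extern void truncate_file(Relation *rel, BlockNumber nblocks);			/* hypothetical */
extern void drop_cached_buffers(Relation *rel, BlockNumber nblocks);	/* hypothetical */

static void
truncate_relation_sketch(Relation *rel, BlockNumber nblocks)
{
	/* 1. WAL-log the truncation first: a crash after this point replays it */
	wal_log_truncate(rel, nblocks);

	/* 2. From here on, any failure must PANIC instead of soldiering on */
	start_crit_section();

	/* 3. Shrink the file on disk ... */
	truncate_file(rel, nblocks);

	/* 4. ... and only then discard the cached buffers (the reversed order
	 *    the patch proposes) */
	drop_cached_buffers(rel, nblocks);

	end_crit_section();
}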


If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

Greetings,

Andres Freund


0001-Wal-replay-corruption.patch
Description: 0001-Wal-replay-corruption.patch


Re: Properly mark NULL returns in numeric aggregates

2020-04-10 Thread Tom Lane
Jesse Zhang  writes:
> On the other hand, we examined the corresponding final functions
> (numeric_stddev_pop and friends), they all seem to carefully treat a
> NULL trans value the same as a "zero input" (as in, state.N == 0 &&
> state.NaNcount == 0). That does suggest to me that it should be fine to
> declare those combine functions as strict (barring the restriction that
> they should not be STRICT -- does anybody remember why?).

They can't be strict because the initial iteration needs to produce
something from a null state and non-null input.  nodeAgg's default
behavior won't work for those because nodeAgg doesn't know how to
copy a value of type "internal".

regards, tom lane




Re: spin_delay() for ARM

2020-04-10 Thread Tom Lane
I wrote:
> A more useful test would be to directly experiment with contended
> spinlocks.  As I recall, we had some test cases laying about when
> we were fooling with the spin delay stuff on Intel --- maybe
> resurrecting one of those would be useful?

The last really significant performance testing we did in this area
seems to have been in this thread:

https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com

A relevant point from that is Haas' comment

I think optimizing spinlocks for machines with only a few CPUs is
probably pointless.  Based on what I've seen so far, spinlock
contention even at 16 CPUs is negligible pretty much no matter what
you do.  Whether your implementation is fast or slow isn't going to
matter, because even an inefficient implementation will account for
only a negligible percentage of the total CPU time - much less than 1%
- as opposed to a 64-core machine, where it's not that hard to find
cases where spin-waits consume the *majority* of available CPU time
(recall previous discussion of lseek).

So I wonder whether this patch is getting ahead of the game.  It does
seem that ARM systems with a couple dozen cores exist, but are they
common enough to optimize for yet?  Can we even find *one* to test on
and verify that this is a win and not a loss?  (Also, seeing that
there are so many different ARM vendors, results from just one
chipset might not be too trustworthy ...)

regards, tom lane




Re: Properly mark NULL returns in numeric aggregates

2020-04-10 Thread Jesse Zhang
Hi Andres,

On Fri, Apr 10, 2020 at 12:14 PM Andres Freund  wrote:
>
> Shouldn't these just be marked as strict?
>

Are you suggesting that because none of the corresponding trans
functions (int8_avg_accum, int2_accum, and friends) ever output NULL?
That's what we thought, but then I realized that an input to a combine
function is not necessarily an output from a trans function invocation:
for example, when there is a "FILTER (WHERE ...)" clause that filters
out every tuple in a group, the partial aggregate might just throw a
NULL state for the final aggregate to combine.

On the other hand, when we examined the corresponding final functions
(numeric_stddev_pop and friends), they all seem to carefully treat a
NULL trans value the same as a "zero input" (as in, state.N == 0 &&
state.NaNcount == 0). That does suggest to me that it should be fine to
declare those combine functions as strict (barring the restriction that
they should not be STRICT -- does anybody remember why?).
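
For concreteness, the calling convention at issue looks roughly like the
sketch below. This is a hedged illustration, not the committed code:
"AggStateSketch" stands in for the real NumericAggState/PolyNumAggState
structs, and the merge step is elided. The point is only that returning
a NULL trans value must go through PG_RETURN_NULL(), which sets
fcinfo->isnull, rather than returning a bare NULL pointer as a Datum:

#include "postgres.h"
#include "fmgr.h"

typedef struct AggStateSketch AggStateSketch;	/* illustrative opaque state */

Datum
combine_sketch(PG_FUNCTION_ARGS)
{
	AggStateSketch *state1 = PG_ARGISNULL(0) ? NULL : (AggStateSketch *) PG_GETARG_POINTER(0);
	AggStateSketch *state2 = PG_ARGISNULL(1) ? NULL : (AggStateSketch *) PG_GETARG_POINTER(1);

	if (state1 == NULL)
	{
		if (state2 == NULL)
			PG_RETURN_NULL();	/* sets fcinfo->isnull; a bare "return
								 * (Datum) 0" would not, which is the bug */
		PG_RETURN_POINTER(state2);
	}
	/* ... otherwise merge state2, if any, into state1 ... */
	PG_RETURN_POINTER(state1);
}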

Cheers,
Jesse and Deep




Re: Report error position in partition bound check

2020-04-10 Thread Alexandra Wang
On Fri, Apr 10, 2020 at 8:37 AM Ashutosh Bapat
<ashutosh.ba...@2ndquadrant.com> wrote:
> for a multi-key value the ^
> points to the first column and the reader may think that that's the
> problematic column. Should it instead point to ( ?

I attached a v2 of Amit's 0002 patch to also report the exact column
for the partition overlap errors.
From f7cf86fcf7c1a26895e68f357717244f776fe5cd Mon Sep 17 00:00:00 2001
From: Alexandra Wang 
Date: Fri, 10 Apr 2020 13:51:53 -0700
Subject: [PATCH v2] Improve check new partition bound error position report

---
 src/backend/parser/parse_utilcmd.c |  3 ++
 src/backend/partitioning/partbounds.c  | 60 +++---
 src/test/regress/expected/alter_table.out  |  2 +-
 src/test/regress/expected/create_table.out | 28 +-
 4 files changed, 59 insertions(+), 34 deletions(-)

diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index 75c122fe34..72113bb7ff 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -4170,5 +4170,8 @@ transformPartitionBoundValue(ParseState *pstate, Node *val,
 	if (!IsA(value, Const))
 		elog(ERROR, "could not evaluate partition bound expression");
 
+	/* Preserve parser location information. */
+	((Const *) value)->location = exprLocation(val);
+
 	return (Const *) value;
 }
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index dd56832efc..0ef0f6ed07 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -223,7 +223,8 @@ static int32 partition_rbound_cmp(int partnatts, FmgrInfo *partsupfunc,
 static int	partition_range_bsearch(int partnatts, FmgrInfo *partsupfunc,
 	Oid *partcollation,
 	PartitionBoundInfo boundinfo,
-	PartitionRangeBound *probe, bool *is_equal);
+	PartitionRangeBound *probe, bool *is_equal,
+	int32 *cmpval);
 static int	get_partition_bound_num_indexes(PartitionBoundInfo b);
 static Expr *make_partition_op_expr(PartitionKey key, int keynum,
 	uint16 strategy, Expr *arg1, Expr *arg2);
@@ -2810,6 +2811,7 @@ check_new_partition_bound(char *relname, Relation parent,
 	PartitionBoundInfo boundinfo = partdesc->boundinfo;
 	int			with = -1;
 	bool		overlap = false;
+	int			overlap_location = 0;
 
 	if (spec->is_default)
 	{
@@ -2904,6 +2906,7 @@ check_new_partition_bound(char *relname, Relation parent,
 		if (boundinfo->indexes[remainder] != -1)
 		{
 			overlap = true;
+			overlap_location = spec->location;
 			with = boundinfo->indexes[remainder];
 			break;
 		}
@@ -2932,6 +2935,7 @@ check_new_partition_bound(char *relname, Relation parent,
 	{
 		Const	   *val = castNode(Const, lfirst(cell));
 
+		overlap_location = val->location;
 		if (!val->constisnull)
 		{
 			int			offset;
@@ -2965,6 +2969,7 @@ check_new_partition_bound(char *relname, Relation parent,
 			{
 PartitionRangeBound *lower,
 		   *upper;
+int			cmpval;
 
 Assert(spec->strategy == PARTITION_STRATEGY_RANGE);
 lower = make_one_partition_rbound(key, -1, spec->lowerdatums, true);
@@ -2974,10 +2979,16 @@ check_new_partition_bound(char *relname, Relation parent,
  * First check if the resulting range would be empty with
  * specified lower and upper bounds
  */
-if (partition_rbound_cmp(key->partnatts, key->partsupfunc,
-		 key->partcollation, lower->datums,
-		 lower->kind, true, upper) >= 0)
+cmpval = partition_rbound_cmp(key->partnatts,
+			  key->partsupfunc,
+			  key->partcollation, lower->datums,
+			  lower->kind, true, upper);
+if (cmpval >= 0)
 {
+	/* Fetch the problem bound from lower datums list. */
+	PartitionRangeDatum *datum = list_nth(spec->lowerdatums,
+		  cmpval - 1);
+
 	ereport(ERROR,
 			(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 			 errmsg("empty range bound specified for partition \"%s\"",
@@ -2985,7 +2996,7 @@ check_new_partition_bound(char *relname, Relation parent,
 			 errdetail("Specified lower bound %s is greater than or equal to upper bound %s.",
 	   get_range_partbound_string(spec->lowerdatums),
 	   get_range_partbound_string(spec->upperdatums)),
-			 parser_errposition(pstate, spec->location)));
+			 parser_errposition(pstate, datum->location)));
 }
 
 if (partdesc->nparts > 0)
@@ -3017,7 +3028,7 @@ check_new_partition_bound(char *relname, Relation parent,
 	 key->partsupfunc,
 	 key->partcollation,
 	 boundinfo, lower,
-	 &is_equal);
+	 &is_equal, &cmpval);
 
 	if (boundinfo->indexes[offset + 1] < 0)
 	{
@@ -3029,7 +3040,6 @@ check_new_partition_bound(char *relname, Relation parent,
 		 */
 		if (offset + 1 < boundinfo->ndatums)
 		{
-			int32		cmpval;
 			Datum	   *datums;
 			PartitionRangeDatumKind *kind;
 			bool		is_lower;
@@ 

Re: pg_validatebackup -> pg_verifybackup?

2020-04-10 Thread Andres Freund
Hi,

On 2020-04-10 17:23:58 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2020-04-10 16:40:02 -0400, Tom Lane wrote:
> >> It doesn't really seem like either name is problematic from that
> >> standpoint?  "Verify backup" isn't prejudging what aspect of the
> >> backup is going to be verified, AFAICS.
> 
> > My point is that I'd eventually like to see the same tool also be usable
> > to just verify the checksums of a normal, non-backup, data directory.
> 
> Meh.  I would argue that that's an actively BAD idea.  The use-cases
> are entirely different, the implementation is going to be quite a lot
> different, the relevant options are going to be quite a lot different.
> It will not be better for either implementors or users to force those
> into the same executable.

I don't agree with any of that. Combining the manifest validation with
checksum validation halves the I/O. It allows offloading some of the
expense of verifying page-level checksums from the primary.

And all of the operations require iterating through data directories,
classifying files that are / are not part of a normal data directory, etc.

Greetings,

Andres Freund




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-10 Thread Andres Freund
Hi,

On 2020-04-10 17:17:10 -0400, Tom Lane wrote:
> > ISTM that we can make it BEGIN AFTER 'xx/xx' or such, which'd not
> > require any keywords, it'd be easier to use than a procedure.
> 
> I still don't see a good argument for tying this to BEGIN.  If it
> has to be a statement, why not a standalone statement?

Because the goal is to start a transaction where a certain action from
the primary is visible.

I think there's also some advantages of having it in a single statement
for poolers. If a pooler analyzes BEGIN AFTER 'xx/xx' it could
e.g. redirect the transaction to a node that's caught up far enough,
instead of blocking. But that can't work even close to as easily if it's
something that has to be executed before transaction begin.


> (I also have a lurking suspicion that this shouldn't be SQL at all
> but part of the replication command set.)

Hm? I'm not quite following. The feature is useful to achieve
read-your-own-writes consistency. Consider

Primary: INSERT INTO app_users ...; SELECT pg_current_wal_lsn();
Standby: BEGIN AFTER 'returned/lsn';
Standby: SELECT i_am_a_set_of_very_expensive_queries FROM ..., app_users;

without the AFTER/WAIT whatnot, you cannot rely on the insert having
been replicated to the standby.

Offloading queries from the write node to replicas is a pretty standard
technique for scaling out databases (including PG). We just make it
harder than necessary.

How would this be part of the replication command set? This shouldn't
require replication permissions for the user executing the queries.
While I'm in favor of merging the replication protocol entirely with the
normal protocol, I've so far received very little support for that
proposition...

Greetings,

Andres Freund




Re: pg_validatebackup -> pg_verifybackup?

2020-04-10 Thread Tom Lane
Andres Freund  writes:
> On 2020-04-10 16:40:02 -0400, Tom Lane wrote:
>> It doesn't really seem like either name is problematic from that
>> standpoint?  "Verify backup" isn't prejudging what aspect of the
>> backup is going to be verified, AFAICS.

> My point is that I'd eventually like to see the same tool also be usable
> to just verify the checksums of a normal, non-backup, data directory.

Meh.  I would argue that that's an actively BAD idea.  The use-cases
are entirely different, the implementation is going to be quite a lot
different, the relevant options are going to be quite a lot different.
It will not be better for either implementors or users to force those
into the same executable.

regards, tom lane




Re: pg_validatebackup -> pg_verifybackup?

2020-04-10 Thread Andres Freund
Hi,

On 2020-04-10 16:40:02 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2020-04-10 16:13:18 -0400, Tom Lane wrote:
> >> Well, we're not getting there for v13.  Are you proposing that this
> >> patch just be reverted because it doesn't do everything at once?
> 
> > No. I suggest choosing a name that's compatible with moving more
> > capabilities under the same umbrella at a later time (and I suggested
> > the same pre freeze too). Possibly adding a toplevel --verify-manifest
> > option as the only change besides naming.
> 
> It doesn't really seem like either name is problematic from that
> standpoint?  "Verify backup" isn't prejudging what aspect of the
> backup is going to be verified, AFAICS.

My point is that I'd eventually like to see the same tool also be usable
to just verify the checksums of a normal, non-backup, data directory.

We shouldn't end up with pg_verifybackup, pg_checksums,
pg_dbdir_checknofilesmissing, pg_checkpageheaders, ...

Greetings,

Andres Freund




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-10 Thread Tom Lane
Andres Freund  writes:
> On 2020-04-10 16:29:39 -0400, Tom Lane wrote:
>> Good point, but we could address that by making it a procedure no?

> Probably. Don't think we have great infrastructure for builtin
> procedures yet though? We'd presumably not want to use plpgsql.

Don't think anyone's tried yet.  It's not instantly clear that the
amount of code needed would be more than comes along with new
syntax, though.

> ISTM that we can make it BEGIN AFTER 'xx/xx' or such, which'd not
> require any keywords, it'd be easier to use than a procedure.

I still don't see a good argument for tying this to BEGIN.  If it
has to be a statement, why not a standalone statement?

(I also have a lurking suspicion that this shouldn't be SQL at all
but part of the replication command set.)

regards, tom lane




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-10 Thread Andres Freund
Hi,

On 2020-04-10 16:29:39 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I don't think a function is a good idea - it'll cause a snapshot to be
> > held while waiting. Which in turn will cause hot_standby_feedback to not
> > be able to report an increased xmin up. And it will possibly hit
> > snapshot recovery conflicts.
>
> Good point, but we could address that by making it a procedure no?

Probably. Don't think we have great infrastructure for builtin
procedures yet though? We'd presumably not want to use plpgsql.

ISTM that we can make it BEGIN AFTER 'xx/xx' or such, which'd not
require any keywords, it'd be easier to use than a procedure.

With a separate procedure, you'd likely need more roundtrips / complex
logic at the client. You either need to check first if the procedure
errored out, and then send the BEGIN, or send both together and separate
out potential errors.

Greetings,

Andres Freund




Re: pg_basebackup, manifests and backends older than ~12

2020-04-10 Thread David Steele

On 4/10/20 4:41 PM, Stephen Frost wrote:

Greetings,

* David Steele (da...@pgmasters.net) wrote:

On 4/10/20 4:09 AM, Michael Paquier wrote:

I have noticed that attempting to use pg_basebackup from HEAD leads to
failures when using it with backend versions from 12 and older:
$ pg_basebackup -D hoge
pg_basebackup: error: backup manifests are not supported by server
version 12beta2
pg_basebackup: removing data directory "hoge"

This is a bit backwards with what we did in the past to maintain
compatibility silently when possible, for example look at the handling
of temporary replication slots.  Instead of an error when means to
force users to have to specify --no-manifest in this case, shouldn't
we silently disable the generation of the backup manifest?  We know
that this option won't work on older server versions anyway.


I'm a bit conflicted here. I see where you are coming from, but given that
writing a manifest is now the default I'm not sure silently skipping it is
ideal.


It's only the default in v13..  Surely when we connect to a v12 or
earlier system we should just keep working and accept that we don't get
a manifest as part of that.


Yeah, OK. It's certainly better than forcing the user to disable 
manifests, which might also disable them for v13 clusters.


--
-David
da...@pgmasters.net




Re: pg_basebackup, manifests and backends older than ~12

2020-04-10 Thread Stephen Frost
Greetings,

* David Steele (da...@pgmasters.net) wrote:
> On 4/10/20 4:09 AM, Michael Paquier wrote:
> >I have noticed that attempting to use pg_basebackup from HEAD leads to
> >failures when using it with backend versions from 12 and older:
> >$ pg_basebackup -D hoge
> >pg_basebackup: error: backup manifests are not supported by server
> >version 12beta2
> >pg_basebackup: removing data directory "hoge"
> >
> >This is a bit backwards with what we did in the past to maintain
> >compatibility silently when possible, for example look at the handling
> >of temporary replication slots.  Instead of an error, which means
> >forcing users to specify --no-manifest in this case, shouldn't
> >we silently disable the generation of the backup manifest?  We know
> >that this option won't work on older server versions anyway.
> 
> I'm a bit conflicted here. I see where you are coming from, but given that
> writing a manifest is now the default I'm not sure silently skipping it is
> ideal.

It's only the default in v13..  Surely when we connect to a v12 or
earlier system we should just keep working and accept that we don't get
a manifest as part of that.

Thanks,

Stephen




Re: pg_validatebackup -> pg_verifybackup?

2020-04-10 Thread Tom Lane
Andres Freund  writes:
> On 2020-04-10 16:13:18 -0400, Tom Lane wrote:
>> Well, we're not getting there for v13.  Are you proposing that this
>> patch just be reverted because it doesn't do everything at once?

> No. I suggest choosing a name that's compatible with moving more
> capabilities under the same umbrella at a later time (and I suggested
> the same pre freeze too). Possibly adding a toplevel --verify-manifest
> option as the only change besides naming.

It doesn't really seem like either name is problematic from that
standpoint?  "Verify backup" isn't prejudging what aspect of the
backup is going to be verified, AFAICS.

regards, tom lane




Re: pg_validatebackup -> pg_verifybackup?

2020-04-10 Thread Andres Freund
Hi,

On 2020-04-10 16:13:18 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > FWIW, I still think it's a mistake to accumulate all these bespoke
> > tools. We should go towards having one tool that can verify checksums,
> > validate backup manifests etc. Partially because it's more discoverable,
> > but also because it allows to verify multiple such properties in a
> > single pass, rather than reading the huge base backup twice.
> 
> Well, we're not getting there for v13.  Are you proposing that this
> patch just be reverted because it doesn't do everything at once?

No. I suggest choosing a name that's compatible with moving more
capabilities under the same umbrella at a later time (and I suggested
the same pre freeze too). Possibly adding a toplevel --verify-manifest
option as the only change besides naming.

Andres




Re: pg_basebackup, manifests and backends older than ~12

2020-04-10 Thread David Steele

On 4/10/20 4:09 AM, Michael Paquier wrote:


I have noticed that attempting to use pg_basebackup from HEAD leads to
failures when using it with backend versions from 12 and older:
$ pg_basebackup -D hoge
pg_basebackup: error: backup manifests are not supported by server
version 12beta2
pg_basebackup: removing data directory "hoge"

This is a bit backwards with what we did in the past to maintain
compatibility silently when possible, for example look at the handling
of temporary replication slots.  Instead of an error, which means
forcing users to specify --no-manifest in this case, shouldn't
we silently disable the generation of the backup manifest?  We know
that this option won't work on older server versions anyway.


I'm a bit conflicted here. I see where you are coming from, but given 
that writing a manifest is now the default I'm not sure silently 
skipping it is ideal.


Regards,
--
-David
da...@pgmasters.net




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-10 Thread Tom Lane
Andres Freund  writes:
> I don't think a function is a good idea - it'll cause a snapshot to be
> held while waiting. Which in turn will cause hot_standby_feedback to not
> be able to report an increased xmin up. And it will possibly hit
> snapshot recovery conflicts.

Good point, but we could address that by making it a procedure no?

regards, tom lane




Re: pg_dump issue with renamed system schemas

2020-04-10 Thread Tom Lane
"Bossart, Nathan"  writes:
> I think I've found a small bug in pg_dump that could cause some schema
> privileges to be missed.  In short, if you've renamed a schema that
> has an entry in pg_init_privs, pg_dump will skip dumping the initial
> ACL for the schema.  This results in missing privileges on restore.

This seems like a special case of the issue discussed in

https://www.postgresql.org/message-id/flat/f85991ad-bbd4-ad57-fde4-e12f0661d...@postgrespro.ru

AFAICT we didn't think we'd found a satisfactory solution yet.

regards, tom lane




Re: spin_delay() for ARM

2020-04-10 Thread Tom Lane
Andres Freund  writes:
> On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
>> On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
>> using a sample C program (attached spin.c).

> PAUSE doesn't operate on the level of the CPU scheduler. So the OS won't
> just schedule another process - you won't see different CPU usage if you
> measure it purely as the time running. You should be able to see a
> difference if you measure with a profiler that shows you data from the
> CPUs performance monitoring unit.

A more useful test would be to directly experiment with contended
spinlocks.  As I recall, we had some test cases laying about when
we were fooling with the spin delay stuff on Intel --- maybe
resurrecting one of those would be useful?

regards, tom lane




Re: Properly mark NULL returns in numeric aggregates

2020-04-10 Thread Tom Lane
Andres Freund  writes:
> On 2020-04-09 16:22:11 -0700, Jesse Zhang wrote:
>> We found that several functions -- namely numeric_combine,
>> numeric_avg_combine, numeric_poly_combine, and int8_avg_combine -- are
>> returning NULL without signaling the nullity of datum in fcinfo.isnull.
>> This is obscured by the fact that the only functions in core (finalfunc
>> for various aggregates) that those return values feed into happen to
>> tolerate (or rather, not quite distinguish) zero-but-not-NULL trans
>> values.

> Shouldn't these just be marked as strict?

No, certainly not --- they need to be able to act on null inputs.
The question is how careful do we need to be about representing
null results as "real" nulls rather than NULL pointers.

regards, tom lane




Re: pg_validatebackup -> pg_verifybackup?

2020-04-10 Thread Tom Lane
Andres Freund  writes:
> FWIW, I still think it's a mistake to accumulate all these bespoke
> tools. We should go towards having one tool that can verify checksums,
> validate backup manifests etc. Partially because it's more discoverable,
> but also because it allows to verify multiple such properties in a
> single pass, rather than reading the huge base backup twice.

Well, we're not getting there for v13.  Are you proposing that this
patch just be reverted because it doesn't do everything at once?

regards, tom lane




Re: weird hash plan cost, starting with pg10

2020-04-10 Thread Tom Lane
Konstantin Knizhnik  writes:
> On 25.03.2020 13:36, Richard Guo wrote:
>> I tried this recipe on different PostgreSQL versions, starting from
>> current master and going backwards. I was able to reproduce this issue
>> on all versions above 8.4. In 8.4 version, we do not output information
>> on hash buckets/batches. But manual inspection with gdb shows in 8.4 we
>> also have the dangling pointer for HashState->hashtable. I didn't check
>> versions below 8.4 though.

> I can propose the following patch for the problem.

I looked at this patch a bit, and I don't think it goes far enough.
What this issue is really pointing out is that EXPLAIN is not considering
the possibility of a Hash node having had several hashtable instantiations
over its lifespan.  I propose what we do about that is generalize the
policy that show_hash_info() is already implementing (in a rather
half-baked way) for multiple workers, and report the maximum field values
across all instantiations.  We can combine the code needed to do so
with the code for the parallelism case, as shown in the 0001 patch
below.

In principle we could probably get away with back-patching 0001,
at least into branches that already have the HashState.hinstrument
pointer.  I'm not sure it's worth any risk though.  A much simpler
fix is to make sure we clear the dangling hashtable pointer, as in
0002 below (a simplified form of Konstantin's patch).  The net
effect of that is that in the case where a hash table is destroyed
and never rebuilt, EXPLAIN ANALYZE would report no hash stats,
rather than possibly-garbage stats like it does today.  That's
probably good enough, because it should be an uncommon corner case.

Thoughts?

regards, tom lane

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 455f54e..ff7c592 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2964,14 +2964,16 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 	HashInstrumentation hinstrument = {0};
 
 	/*
+	 * Collect stats from the local process, even when it's a parallel query.
 	 * In a parallel query, the leader process may or may not have run the
 	 * hash join, and even if it did it may not have built a hash table due to
 	 * timing (if it started late it might have seen no tuples in the outer
 	 * relation and skipped building the hash table).  Therefore we have to be
 	 * prepared to get instrumentation data from all participants.
 	 */
-	if (hashstate->hashtable)
-		ExecHashGetInstrumentation(&hinstrument, hashstate->hashtable);
+	if (hashstate->hinstrument)
+		memcpy(&hinstrument, hashstate->hinstrument,
+			   sizeof(HashInstrumentation));
 
 	/*
 	 * Merge results from workers.  In the parallel-oblivious case, the
@@ -2979,7 +2981,10 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 	 * participants didn't run the join at all so have no data.  In the
 	 * parallel-aware case, we need to consider all the results.  Each worker
 	 * may have seen a different subset of batches and we want to find the
-	 * highest memory usage for any one batch across all batches.
+	 * highest memory usage for any one batch across all batches.  We take the
+	 * maxima of other values too, for safety.  (In principle all workers
+	 * should have the same nbuckets values, but workers that started late
+	 * might have seen fewer batches than others.)
 	 */
 	if (hashstate->shared_info)
 	{
 		SharedHashInfo *shared_info = hashstate->shared_info;
 		int			i;
 
 		for (i = 0; i < shared_info->num_workers; i++)
 		{
 			HashInstrumentation *worker_hi = &shared_info->hinstrument[i];
 
-			if (worker_hi->nbatch > 0)
-			{
-/*
- * Every participant should agree on the buckets, so to be
- * sure we have a value we'll just overwrite each time.
- */
-hinstrument.nbuckets = worker_hi->nbuckets;
-hinstrument.nbuckets_original = worker_hi->nbuckets_original;
-
-/*
- * Normally every participant should agree on the number of
- * batches too, but it's possible for a backend that started
- * late and missed the whole join not to have the final nbatch
- * number.  So we'll take the largest number.
- */
-hinstrument.nbatch = Max(hinstrument.nbatch, worker_hi->nbatch);
-hinstrument.nbatch_original = worker_hi->nbatch_original;
-
-/*
- * In a parallel-aware hash join, for now we report the
- * maximum peak memory reported by any worker.
- */
-hinstrument.space_peak =
-	Max(hinstrument.space_peak, worker_hi->space_peak);
-			}
+			hinstrument.nbuckets = Max(hinstrument.nbuckets,
+	   worker_hi->nbuckets);
+			hinstrument.nbuckets_original = Max(hinstrument.nbuckets_original,
+worker_hi->nbuckets_original);
+			hinstrument.nbatch = Max(hinstrument.nbatch,
+	 worker_hi->nbatch);
+			hinstrument.nbatch_original = Max(hinstrument.nbatch_original,
+			  worker_hi->nbatch_original);
+			hinstrument.space_peak = Max(hinstrument.space_peak,
+		 worker_hi->space_peak);
 		}
 	}
 

Re: pg_validatebackup -> pg_verifybackup?

2020-04-10 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2020-04-10 14:56:48 -0400, David Steele wrote:
> > On 4/10/20 11:37 AM, Tom Lane wrote:
> > > Robert Haas  writes:
> > > > Over at 
> > > > https://www.postgresql.org/message-id/172c9d9b-1d0a-1b94-1456-376b1e017...@2ndquadrant.com
> > > > Peter Eisentraut suggests that pg_validatebackup should be called
> > > > pg_verifybackup, with corresponding terminology changes throughout the
> > > > code and documentation.
> > > 
> > > > Here's a patch for that. I'd like to commit this quickly or abandon in
> > > > quickly, because large renaming patches like this are a pain to
> > > > maintain. I believe that there was a mild consensus in favor of this
> > > > on that thread, so I plan to go forward unless somebody shows up
> > > > pretty quickly to object.
> > > 
> > > +1, let's get it done.
> > 
> > I'm not sure that Peter suggested verify was the correct name, he just
> > pointed out that verify and validate are not necessarily the same thing (and
> > that we should be consistent in the docs one way or the other). It'd be nice
> > if Peter (now CC'd) commented since he's the one who brought it up.
> > 
> > Having said that, I'm +1 on verify.
> 
> FWIW, I still think it's a mistake to accumulate all these bespoke
> tools. We should go towards having one tool that can verify checksums,
> validate backup manifests etc. Partially because it's more discoverable,
> but also because it allows to verify multiple such properties in a
> single pass, rather than reading the huge base backup twice.

Would be kinda neat to have a single tool for doing backups and restores
too, as well as validating backup manifests and checksums, that can back
up to s3 or to a remote system with ssh, has multiple compression
options and a pretty sound architecture that's all written in C and is
OSS.

I also agree with Tom/David that verify probably makes sense for this
command, in its current form at least.

Thanks,

Stephen




Re: pg_validatebackup -> pg_verifybackup?

2020-04-10 Thread Andres Freund
Hi,

On 2020-04-10 14:56:48 -0400, David Steele wrote:
> On 4/10/20 11:37 AM, Tom Lane wrote:
> > Robert Haas  writes:
> > > Over at 
> > > https://www.postgresql.org/message-id/172c9d9b-1d0a-1b94-1456-376b1e017...@2ndquadrant.com
> > > Peter Eisentraut suggests that pg_validatebackup should be called
> > > pg_verifybackup, with corresponding terminology changes throughout the
> > > code and documentation.
> > 
> > > Here's a patch for that. I'd like to commit this quickly or abandon in
> > > quickly, because large renaming patches like this are a pain to
> > > maintain. I believe that there was a mild consensus in favor of this
> > > on that thread, so I plan to go forward unless somebody shows up
> > > pretty quickly to object.
> > 
> > +1, let's get it done.
> 
> I'm not sure that Peter suggested verify was the correct name, he just
> pointed out that verify and validate are not necessarily the same thing (and
> that we should be consistent in the docs one way or the other). It'd be nice
> if Peter (now CC'd) commented since he's the one who brought it up.
> 
> Having said that, I'm +1 on verify.

FWIW, I still think it's a mistake to accumulate all these bespoke
tools. We should go towards having one tool that can verify checksums,
validate backup manifests etc. Partially because it's more discoverable,
but also because it allows to verify multiple such properties in a
single pass, rather than reading the huge base backup twice.

Greetings,

Andres Freund




Re: WAL usage calculation patch

2020-04-10 Thread Julien Rouhaud
On Fri, Apr 10, 2020 at 8:17 AM Amit Kapila  wrote:
>
> On Tue, Apr 7, 2020 at 2:48 PM Julien Rouhaud  wrote:
> >
> > On Tue, Apr 7, 2020 at 4:36 AM Amit Kapila  wrote:
> > >
> > > On Mon, Apr 6, 2020 at 7:58 PM Euler Taveira
> > >  wrote:
> > > Few comments:
> > > 1.
> > > - int64 wal_num_fpw; /* # of WAL full page image records generated */
> > > + int64 wal_num_fpw; /* # of WAL full page images generated */
> > >
> > > Let's change comment as " /* # of WAL full page writes generated */"
> > > to be consistent with other places like instrument.h.  Also, make a
> > > similar change at other places if required.
> >
> > Agreed.  That's pg_stat_statements.c and instrument.h.  I'll send a
> > patch once we reach consensus with the rest of the comments.
> >
>
> Would you like to send a consolidated patch that includes Euler's
> suggestion and Justin's patch (by making changes for points we
> discussed.)?  I think we can keep the point related to number of
> spaces before each field open?

Sure, I'll take care of that tomorrow!

> > > 2.
> > >
> > > -Total amount of WAL bytes generated by the statement
> > > +Total number of WAL bytes generated by the statement
> > >
> > >
> > > I feel the previous text was better as this field can give us the size
> > > of WAL with which we can answer "how much WAL data is generated by a
> > > particular statement?".  Julien, do you have any thoughts on this?
> >
> > I also prefer "amount" as it feels more natural.
> >
>
> As we see no other opinion on this matter, we can use "amount" here.

Ok.




Re: where should I stick that backup?

2020-04-10 Thread Andres Freund
Hi,

On 2020-04-10 12:20:01 -0400, Robert Haas wrote:
> - We're only talking about writing a handful of tar files, and that's
> in the context of a full-database backup, which is a much
> heavier-weight operation than a query.
> - There is not really any state that needs to be maintained across calls.
> - The expected result is that a file gets written someplace, which is
> not an in-memory data structure but something that gets written to a
> place outside of PostgreSQL.

Wouldn't there be state like an S3/ssh/https/... connection? And perhaps
a 'backup_id' in the backup metadata DB that one would want to update
at the end?

Greetings,

Andres Freund




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-10 Thread Andres Freund
Hi,

On 2020-04-10 11:25:02 +0900, Fujii Masao wrote:
> > BEGIN
> > WAIT (LSN '16/B374D848', WHATEVER_OPTION_YOU_WANT);
> > ...
> > COMMIT;
> > 
> > It requires only one reserved keyword 'WAIT'. The advantage of this 
> > approach is that it can be extended to support xid, timestamp, csn or 
> > anything else, that may be invented in the future, without affecting the 
> > grammar.
> > 
> > What do you think?
> > 
> > Personally, I find this syntax to be more convenient and human-readable 
> > compared with function call:
> > 
> > SELECT pg_wait_for_lsn('16/B374D848');
> > BEGIN;
> 
> I can imagine that some users want to specify the LSN to wait for,
> from the result of another query, for example,
> SELECT pg_wait_for_lsn(lsn) FROM xxx. If this is valid use case,
> isn't the function better?

I don't think a function is a good idea - it'll cause a snapshot to be
held while waiting. Which in turn will cause hot_standby_feedback to not
be able to report an increased xmin up. And it will possibly hit
snapshot recovery conflicts.

Whereas explicit syntax, especially if a transaction control statement,
won't have that problem.

I'd personally look at 'AFTER' instead of 'WAIT'. Upthread you talked
about a reserved keyword - why does it have to be reserved?


FWIW, I'm not really convinced there needs to be bespoke timeout syntax
for this feature. I can see reasons why you'd not just want to rely on
statement_timeout, but at least that should be discussed.

Greetings,

Andres Freund




Re: pg_validatebackup -> pg_verifybackup?

2020-04-10 Thread David Steele

On 4/10/20 3:27 PM, Tom Lane wrote:

David Steele  writes:

Having said that, I'm +1 on verify.


Me too, if only because it's shorter.


I also think it is (probably) more correct but failing that it is 
*definitely* shorter!


--
-David
da...@pgmasters.net




Re: pg_validatebackup -> pg_verifybackup?

2020-04-10 Thread Tom Lane
David Steele  writes:
> Having said that, I'm +1 on verify.

Me too, if only because it's shorter.

regards, tom lane




Re: spin_delay() for ARM

2020-04-10 Thread Andres Freund
Hi,

On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
> On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
> using a sample C program (attached spin.c). Here, many child processes
> (much more than CPUs) wait in a tight loop for a shared variable to
> become 0, while the parent process continuously increments a sequence
> number for a fixed amount of time, after which, it sets the shared
> variable to 0. The child's tight loop calls PAUSE in each iteration.
> What I hoped was that because of PAUSE in children, the parent process
> would get more share of the CPU, due to which, in a given time, the
> sequence number will reach a higher value. Also, I expected the CPU
> cycles spent by child processes to drop down, thanks to PAUSE. None of
> these happened. There was no change.

> Possibly, this testcase is not right. Probably the process preemption
> occurs only within the set of hyperthreads attached to a single core.
> And in my testcase, the parent process is the only one who is ready to
> run. Still, I have anyway attached the program (spin.c) for archival;
> in case somebody with a YIELD-supporting ARM machine wants to use it
> to test YIELD.

PAUSE doesn't operate on the level of the CPU scheduler. So the OS won't
just schedule another process - you won't see different CPU usage if you
measure it purely as the time running. You should be able to see a
difference if you measure with a profiler that shows you data from the
CPUs performance monitoring unit.
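
For reference, the shape of loop in question is just the following -- a
minimal sketch assuming x86 and a GCC-compatible compiler; spin.c itself
is attached to the quoted mail and is not reproduced here:

#include <emmintrin.h>

/* Busy-wait until the parent clears the shared flag */
static void
spin_until_clear(volatile int *flag)
{
	while (*flag != 0)
		_mm_pause();		/* x86 PAUSE; the ARM analogue is YIELD */
}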

Greetings,

Andres Freund




Re: Properly mark NULL returns in numeric aggregates

2020-04-10 Thread Andres Freund
Hi,

On 2020-04-09 16:22:11 -0700, Jesse Zhang wrote:
> We found that several functions -- namely numeric_combine,
> numeric_avg_combine, numeric_poly_combine, and int8_avg_combine -- are
> returning NULL without signaling the nullity of datum in fcinfo.isnull.
> This is obscured by the fact that the only functions in core (finalfunc
> for various aggregates) that those return values feed into happen to
> tolerate (or rather, not quite distinguish) zero-but-not-NULL trans
> values.

Shouldn't these just be marked as strict?

Greetings,

Andres Freund




Re: pg_validatebackup -> pg_verifybackup?

2020-04-10 Thread David Steele

On 4/10/20 11:37 AM, Tom Lane wrote:

Robert Haas  writes:

Over at 
https://www.postgresql.org/message-id/172c9d9b-1d0a-1b94-1456-376b1e017...@2ndquadrant.com
Peter Eisentraut suggests that pg_validatebackup should be called
pg_verifybackup, with corresponding terminology changes throughout the
code and documentation.



Here's a patch for that. I'd like to commit this quickly or abandon in
quickly, because large renaming patches like this are a pain to
maintain. I believe that there was a mild consensus in favor of this
on that thread, so I plan to go forward unless somebody shows up
pretty quickly to object.


+1, let's get it done.


I'm not sure that Peter suggested verify was the correct name, he just 
pointed out that verify and validate are not necessarily the same thing 
(and that we should be consistent in the docs one way or the other). 
It'd be nice if Peter (now CC'd) commented since he's the one who 
brought it up.


Having said that, I'm +1 on verify.

Regards,
--
-David
da...@pgmasters.net




Re: Parallel copy

2020-04-10 Thread Andres Freund
Hi,

On 2020-04-10 07:40:06 -0400, Robert Haas wrote:
> On Thu, Apr 9, 2020 at 4:00 PM Andres Freund  wrote:
> > Imo, yes, there should be only one process doing the chunking. For ILP and
> > cache efficiency, but also because the leader is the only process with
> > access to the network socket. It should load input data into one large 
> > buffer that's shared across processes. There should be a separate 
> > ringbuffer with tuple/partial tuple (for huge tuples) offsets. Worker 
> > processes should grab large chunks of offsets from the offset ringbuffer. 
> > If the ringbuffer is not full, the worker chunks should be reduced in size.
> 
> My concern here is that it's going to be hard to avoid processes going
> idle. If the leader does nothing at all once the ring buffer is full,
> it's wasting time that it could spend processing a chunk. But if it
> picks up a chunk, then it might not get around to refilling the buffer
> before other processes are idle with no work to do.

An idle process doesn't cost much. Processes that use CPU inefficiently
however...


> Still, it might be the case that having the process that is reading
> the data also find the line endings is so fast that it makes no sense
> to split those two tasks. After all, whoever just read the data must
> have it in cache, and that helps a lot.

Yea. And if it's not fast enough to split lines, then we have a problem
regardless of which process does the splitting.
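
To make the quoted design concrete, a hedged sketch of the shared-memory
layout it implies -- all names and sizes below are illustrative, not
taken from any patch:

#include <stddef.h>
#include <stdint.h>

typedef struct ParallelCopySharedSketch
{
	/* one large buffer the leader fills with raw COPY input */
	char		input_data[64 * 1024 * 1024];

	/* ring of line-start offsets into input_data, published by the leader */
	size_t		line_offsets[65536];
	uint32_t	ring_insert;	/* next slot the leader writes */
	uint32_t	ring_consume;	/* next slot a worker claims, in chunks that
								 * shrink when the ring runs low */
} ParallelCopySharedSketch;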

Greetings,

Andres Freund




Re: proposal: schema variables

2020-04-10 Thread Pavel Stehule
Hi

rebase

Regards

Pavel

ne 22. 3. 2020 v 8:40 odesílatel Pavel Stehule 
napsal:

> Hi
>
> rebase
>
> Regards
>
> Pavel
>


schema-variables-20200310.patch.gz
Description: application/gzip


Re: Report error position in partition bound check

2020-04-10 Thread Alexandra Wang
On Fri, 10 Apr 2020 at 14:31, Amit Langote  wrote:
> I agree with that.  Tried that in the attached 0002, although trying
> to get the cursor to point to exactly the offending column seems a bit
> tough for partition overlap errors.  The patch does allow to single
> out which one of the lower and upper bounds is causing the overlap
> with an existing partition, which is better than now and seems helpful
> enough.
>
> Also, updated Alexandra's patch to incorporate Ashutosh's comment such
> that we get the same output with ATTACH PARTITION commands too.

Thank you Amit for updating the patches; the cursor looks much more helpful
now. I created the commitfest entry: https://commitfest.postgresql.org/28/2533/


Re: where should I stick that backup?

2020-04-10 Thread Robert Haas
On Fri, Apr 10, 2020 at 10:54 AM Stephen Frost  wrote:
> So, this goes to what I was just mentioning to Bruce independently- you
> could have made the same argument about FDWs, but it just doesn't
> actually hold any water.  Sure, some of the FDWs aren't great, but
> there's certainly no shortage of them, and the ones that are
> particularly important (like postgres_fdw) are well written and in core.

That's a fairly different use case. In the case of the FDW interface:

- The number of interface method calls is very high, at least one per
tuple and a bunch of extra ones for each query.
- There is a significant amount of complex state that needs to be
maintained across API calls.
- The return values are often tuples, which are themselves an
in-memory data structure.

But here:

- We're only talking about writing a handful of tar files, and that's
in the context of a full-database backup, which is a much
heavier-weight operation than a query.
- There is not really any state that needs to be maintained across calls.
- The expected result is that a file gets written someplace, which is
not an in-memory data structure but something that gets written to a
place outside of PostgreSQL.

> The concerns about there being too many possibilities and new ones
> coming up all the time could be applied equally to FDWs, but rather than
> ending up with a dearth of options and external solutions there, what
> we've actually seen is an explosion of options and externally written
> libraries for a large variety of options.

Sure, but a lot of those FDWs are relatively low-quality, and it's
often hard to find one that does what you want. And even if you do,
you don't really know how good it is. Unfortunately, in that case
there's no real alternative, because implementing something based on
shell commands couldn't ever have reasonable performance or a halfway
decent feature set. That's not the case here.

> How does this solution give them a good way to do the right thing
> though?  In a way that will work with large databases and complex
> requirements?  The answer seems to be "well, everyone will have to write
> their own tool to do that" and that basically means that, at best, we're
> only providing half of a solution and expecting all of our users to
> provide the other half, and to always do it correctly and in a well
> written way.  Acknowledging that most users aren't going to actually do
> that and instead they'll implement half measures that aren't reliable
> shouldn't be seen as an endorsement of this approach.

I don't acknowledge that. I think it's possible to use tools like the
proposed option in a perfectly reliable way, and I've already given a
bunch of examples of how it could be done. Writing a file is not such
a complex operation that every bit of code that writes one reliably
has to be written by someone associated with the PostgreSQL project. I
strongly suspect that people who use a cloud provider's tools to
upload their backup files will be quite happy with the results, and if
they aren't, I hope they will blame the cloud provider's tool for
eating the data rather than this option for making it easy to give the
data to the thing that ate it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-04-10 Thread Tom Lane
Ashutosh Bapat  writes:
> On Fri, Apr 10, 2020 at 9:14 PM Tom Lane  wrote:
>> I see no patch here ...

> Sorry. Here it is

LGTM, will push in a moment.

regards, tom lane




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-04-10 Thread Ashutosh Bapat
On Fri, Apr 10, 2020 at 9:14 PM Tom Lane  wrote:
>
> Ashutosh Bapat  writes:
> > On Fri, 10 Apr 2020 at 20:44, Jeff Janes  wrote:
> >> In that case, we really should add the PG_USED_FOR_ASSERTS_ONLY to make
> >> the compiler happy.
>
> > Attaching my patch again. It doesn't need PG_USED_FOR_ASSERTS_ONLY either.
> > Kuntal has confirmed that this fixes the warning for him.
>
> I see no patch here ...
>

Sorry. Here it is


-- 
Best Wishes,
Ashutosh Bapat
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 7607501fe7..4681441dcc 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1021,7 +1021,6 @@ partition_bounds_merge(int partnatts,
 	   List **outer_parts, List **inner_parts)
 {
 	PartitionBoundInfo outer_binfo = outer_rel->boundinfo;
-	PartitionBoundInfo inner_binfo = inner_rel->boundinfo;
 
 	/*
 	 * Currently, this function is called only from try_partitionwise_join(),
@@ -1032,7 +1031,7 @@ partition_bounds_merge(int partnatts,
 		   jointype == JOIN_ANTI);
 
 	/* The partitioning strategies should be the same. */
-	Assert(outer_binfo->strategy == inner_binfo->strategy);
+	Assert(outer_binfo->strategy == inner_rel->boundinfo->strategy);
 
 	*outer_parts = *inner_parts = NIL;
 	switch (outer_binfo->strategy)


Re: where should I stick that backup?

2020-04-10 Thread Bruce Momjian
On Fri, Apr 10, 2020 at 10:54:10AM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Robert Haas (robertmh...@gmail.com) wrote:
> > On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian  wrote:
> > > Good point, but if there are multiple APIs, it makes shell script
> > > flexibility even more useful.
> > 
> > This is really the key point for me. There are so many existing tools
> > that store a file someplace that we really can't ever hope to support
> > them all in core, or even to have well-written extensions that support
> > them all available on PGXN or wherever. We need to integrate with the
> > tools that other people have created, not try to reinvent them all in
> > PostgreSQL.
> 
> So, this goes to what I was just mentioning to Bruce independently- you
> could have made the same argument about FDWs, but it just doesn't
> actually hold any water.  Sure, some of the FDWs aren't great, but
> there's certainly no shortage of them, and the ones that are
> particularly important (like postgres_fdw) are well written and in core.

No, no one made that argument.  It isn't clear how a shell script API
would map to relational database queries.  The point is how well the
APIs match, and then if they are close, does it give us the flexibility
we need.  You can't just look at flexibility without an API match.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-04-10 Thread Tom Lane
Ashutosh Bapat  writes:
> On Fri, 10 Apr 2020 at 20:44, Jeff Janes  wrote:
>> In that case, we really should add the PG_USED_FOR_ASSERTS_ONLY to make
>> the compiler happy.

> Attaching my patch again. It doesn't need PG_USED_FOR_ASSERTS_ONLY either.
> Kuntal has confirmed that this fixes the warning for him.

I see no patch here ...

regards, tom lane




Re: Report error position in partition bound check

2020-04-10 Thread Ashutosh Bapat
On Fri, 10 Apr 2020 at 14:31, Amit Langote  wrote:

> On Thu, Apr 9, 2020 at 11:04 PM Tom Lane  wrote:
> > While I'm quite on board with providing useful error cursors,
> > the example cases in this patch don't seem all that useful:
> >
> >  -- trying to create range partition with empty range
> >  CREATE TABLE fail_part PARTITION OF range_parted2 FOR VALUES FROM (1)
> TO (0);
> >  ERROR:  empty range bound specified for partition "fail_part"
> > +LINE 1: ...E fail_part PARTITION OF range_parted2 FOR VALUES FROM (1)
> T...
> > + ^
> >  DETAIL:  Specified lower bound (1) is greater than or equal to upper
> bound (0).
> >
> > As best I can tell from these examples, the cursor will always
> > point at the FROM keyword, making it pretty unhelpful.  It seems
> > like in addition to getting the query string passed down, you
> > need to do some work on the code that's actually reporting the
> > error position.  I'd expect at a minimum that the pointer allows
> > identifying which column of a multi-column partition key is
> > giving trouble.  The phrasing of this particular message, for
> > example, suggests that it ought to point at the "1" expression.
>
> I agree with that.  Tried that in the attached 0002, although trying
> to get the cursor to point to exactly the offending column seems a bit
> tough for partition overlap errors.  The patch does allow singling
> out which one of the lower and upper bounds is causing the overlap
> with an existing partition, which is better than now and seems helpful
> enough.
>
> Also, updated Alexandra's patch to incorporate Ashutosh's comment such
> that we get the same output with ATTACH PARTITION commands too.
>

I looked at this briefly. It looks good, but I will review more in the next
CF. Do we have an entry there yet? To nit-pick: for a multi-key value the ^
points to the first column and the reader may think that that's the
problematic column. Should it instead point to the opening ( ?

-- 
Best Wishes,
Ashutosh


Re: pg_validatebackup -> pg_verifybackup?

2020-04-10 Thread Tom Lane
Robert Haas  writes:
> Over at 
> https://www.postgresql.org/message-id/172c9d9b-1d0a-1b94-1456-376b1e017...@2ndquadrant.com
> Peter Eisentraut suggests that pg_validatebackup should be called
> pg_verifybackup, with corresponding terminology changes throughout the
> code and documentation.

> Here's a patch for that. I'd like to commit this quickly or abandon it
> quickly, because large renaming patches like this are a pain to
> maintain. I believe that there was a mild consensus in favor of this
> on that thread, so I plan to go forward unless somebody shows up
> pretty quickly to object.

+1, let's get it done.

regards, tom lane




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-04-10 Thread Ashutosh Bapat
On Fri, 10 Apr 2020 at 20:44, Jeff Janes  wrote:

> On Thu, Apr 9, 2020 at 10:04 AM Ashutosh Bapat <
> ashutosh.bapat@gmail.com> wrote:
>
>> On Thu, Apr 9, 2020 at 12:03 PM Etsuro Fujita 
>> wrote:
>> >
>> > On Thu, Apr 9, 2020 at 2:36 PM Tom Lane  wrote:
>> > > Etsuro Fujita  writes:
>> > > > Yeah, partition_bounds_merge() is currently called only from
>> > > > try_partitionwise_join(), which guarantees that the strategies are
>> the
>> > > > same.
>> >
>> > > If there's only one caller and there's not likely to ever be more,
>> > > then I tend to agree that you don't need the assertion.
>> >
>> > It seems unlikely that partition_bounds_merge() will be called from
>> > more places in the foreseeable future, so I'd still vote for removing
>> > the assertion.
>>
>> When I wrote that function, I had UNION also in mind. A UNION across
>> multiple partitioned relations will be partitioned if we can merge the
>> partition bounds in a sensible manner. Of course the current structure
>> of that function looks more purposed for join, but it's not difficult
>> to convert it to be used for UNION as well. In that case that set of
>> functions will have many more callers. So, I will vote to keep that
>> assertion now that we have it there.
>>
>
> In that case, we really should add the PG_USED_FOR_ASSERTS_ONLY to make
> the compiler happy.
>
>
Attaching my patch again. It doesn't need PG_USED_FOR_ASSERTS_ONLY either.
Kuntal has confirmed that this fixes the warning for him.

-- 
Best Wishes,
Ashutosh


Re: Support for DATETIMEOFFSET

2020-04-10 Thread Andreas Karlsson

On 4/10/20 3:19 PM, Jeremy Morton wrote:
Oh well.  Guess I keep using SQL Server then.  datetimeoffset makes it 
impossible for developers to make the mistake of forgetting to use UTC 
instead of local datetime, and for that reason alone it makes it 
invaluable in my opinion.  It should be used universally instead of 
datetime.


I think that the timestamptz type already helps out a lot with that 
since it accepts input strings with a time zone offset (e.g. '2020-04-10 
17:19:39+02') and converts them to UTC after parsing the timestamp. In 
fact, I would argue that it does so with fewer pitfalls than the 
datetimeoffset type: with timestamptz everything you read will have 
the same time zone, while when you read a datetimeoffset column you will 
get the time zone used by the application which inserted it originally. 
If, say, one of the application servers has a different time zone 
(the sysadmin forgot to set it to UTC and it runs in local time), you 
will get a mix which will make bugs hard to spot.


I am not saying there isn't a use case for something like 
datetimeoffset; I think that there is. For example, in some kind of 
calendar or scheduling application. But as a generic type for storing 
points in time we already have timestamptz which is easy to use and 
handles most of the common use cases, e.g. storing when an event happened.


Andreas





Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-04-10 Thread Jeff Janes
On Thu, Apr 9, 2020 at 10:04 AM Ashutosh Bapat 
wrote:

> On Thu, Apr 9, 2020 at 12:03 PM Etsuro Fujita 
> wrote:
> >
> > On Thu, Apr 9, 2020 at 2:36 PM Tom Lane  wrote:
> > > Etsuro Fujita  writes:
> > > > Yeah, partition_bounds_merge() is currently called only from
> > > > try_partitionwise_join(), which guarantees that the strategies are
> the
> > > > same.
> >
> > > If there's only one caller and there's not likely to ever be more,
> > > then I tend to agree that you don't need the assertion.
> >
> > It seems unlikely that partition_bounds_merge() will be called from
> > more places in the foreseeable future, so I'd still vote for removing
> > the assertion.
>
> When I wrote that function, I had UNION also in mind. A UNION across
> multiple partitioned relations will be partitioned if we can merge the
> partition bounds in a sensible manner. Of course the current structure
> of that function looks more purposed for join, but it's not difficult
> to convert it to be used for UNION as well. In that case that set of
> functions will have many more callers. So, I will vote to keep that
> assertion now that we have it there.
>

In that case, we really should add the PG_USED_FOR_ASSERTS_ONLY to make the
compiler happy.

Cheers,

Jeff


Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-10 Thread Alexey Kondratov

On 2020-04-10 05:25, Fujii Masao wrote:

On 2020/04/10 3:16, Alexey Kondratov wrote:
Just another idea, in case one still decides to go with a 
separate statement + BEGIN integration instead of a function: we could 
use a parenthesized options list here. This is already implemented for 
VACUUM, REINDEX, etc. There was an idea to allow CONCURRENTLY in 
REINDEX there [1] and recently this was proposed again for new options 
[2], since it is much more extensible from the grammar perspective.


That way, the whole feature may look like:

WAIT (LSN '16/B374D848', TIMEOUT 100);

and/or

BEGIN
WAIT (LSN '16/B374D848', WHATEVER_OPTION_YOU_WANT);
...
COMMIT;

It requires only one reserved keyword 'WAIT'. The advantage of this 
approach is that it can be extended to support xid, timestamp, csn or 
anything else, that may be invented in the future, without affecting 
the grammar.


What do you think?

Personally, I find this syntax to be more convenient and 
human-readable compared with a function call:


SELECT pg_wait_for_lsn('16/B374D848');
BEGIN;


I can imagine that some users want to specify the LSN to wait for,
from the result of another query, for example,
SELECT pg_wait_for_lsn(lsn) FROM xxx. If this is a valid use case,
isn't the function better?



I think that the main purpose of the feature is to achieve 
read-your-writes consistency while using an async replica for reads. In 
that case the LSN of the last modification is stored inside the 
application, so there is no need to do any query for that. Moreover, you 
cannot store this LSN inside the database, since reads are distributed 
across all replicas (+ primary).


Thus, I could imagine that 'xxx' in your example stands for some kind of 
stored procedure that fetches the LSN from off-Postgres storage, but that 
looks like a very narrow case to count on, doesn't it?


Anyway, I am not against implementing this as a function. That was just 
another option to consider.


I just realized that the last patch I have seen does not allow use of 
the wait feature on the primary. That may be a problem if reads are 
balanced not only across replicas, but on the primary as well, which 
should be quite usual I guess. In that case the application does not know 
whether a request will be processed on a replica or on the primary. I 
think it should be allowed without any warning, or with a LOG/DEBUG 
message at most saying that no waiting was performed.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-10 Thread Juan José Santamaría Flecha
On Fri, Apr 10, 2020 at 2:25 PM Amit Kapila  wrote:

>
> I see that the kind of check you are talking about was recently added by
> commit 352f6f2d.  I think it is better to be consistent in all places.
> Let's pick one and use that if possible.


Currently there are two constructs to test the same logic, which is not
great. I think that using _MSC_VER makes it seem like MSVC-exclusive code,
when MinGW should also be considered.

In the long term, aligning Postgres with MS product obsolescence will make
these tests unneeded, but I can propose a patch for making the test
consistent in all cases, on a different thread since this has little to do
with $SUBJECT.

Regards,

Juan José Santamaría Flecha



pg_validatebackup -> pg_verifybackup?

2020-04-10 Thread Robert Haas
Over at 
https://www.postgresql.org/message-id/172c9d9b-1d0a-1b94-1456-376b1e017...@2ndquadrant.com
Peter Eisentraut suggests that pg_validatebackup should be called
pg_verifybackup, with corresponding terminology changes throughout the
code and documentation.

Here's a patch for that. I'd like to commit this quickly or abandon it
quickly, because large renaming patches like this are a pain to
maintain. I believe that there was a mild consensus in favor of this
on that thread, so I plan to go forward unless somebody shows up
pretty quickly to object.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


0001-Rename-pg_validatebackup-to-pg_verifybackup.patch
Description: Binary data


Re: where should I stick that backup?

2020-04-10 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian  wrote:
> > Good point, but if there are multiple APIs, it makes shell script
> > flexibility even more useful.
> 
> This is really the key point for me. There are so many existing tools
> that store a file someplace that we really can't ever hope to support
> them all in core, or even to have well-written extensions that support
> them all available on PGXN or wherever. We need to integrate with the
> tools that other people have created, not try to reinvent them all in
> PostgreSQL.

So, this goes to what I was just mentioning to Bruce independently- you
could have made the same argument about FDWs, but it just doesn't
actually hold any water.  Sure, some of the FDWs aren't great, but
there's certainly no shortage of them, and the ones that are
particularly important (like postgres_fdw) are well written and in core.

> Now what I understand Stephen to be saying is that a lot of those
> tools actually suck, and I think that's a completely valid point. But
> I also think that it's unwise to decide that such problems are our
> problems rather than problems with those tools. That's a hole with no
> bottom.

I don't really think 'bzip2' sucks as a tool, or that bash does.  They
weren't designed or intended to meet the expectations that we have for
data durability though, which is why relying on them for exactly that
ends up being a bad recipe.

> One thing I do think would be realistic would be to invent a set of
> tools that are perform certain local filesystem operations in a
> "hardened" way. Maybe a single tool with subcommands and options. So
> you could say, e.g. 'pgfile cp SOURCE TARGET' and it would create a
> temporary file in the target directory, write the contents of the
> source into that file, fsync the file, rename it into place, and do
> more fsyncs to make sure it's all durable in case of a crash. You
> could have a variant of this that instead of using the temporary file
> and rename in place approach, does the thing where you open the target
> file with O_CREAT|O_EXCL, writes the bytes, and then closes and fsyncs
> it. And you could have other things too, like 'pgfile mkdir DIR' to
> create a directory and fsync it for durability. A toolset like this
> would probably help people write better archive commands - it would
> certainly be an improvement over what we have now, anyway, and it
> could also be used with the feature that I proposed upthread.

This argument leads in a direction to justify anything as being sensible
to implement using shell scripts.  If we're open to writing the shell
level tools that would be needed, we could reimplement all of our
indexes that way, or FDWs, or TDE, or just about anything else.

What we would end up with though is that we'd have more complications
changing those interfaces because people will be using those tools, and
maybe those tools don't get updated at the same time as PG does, and
maybe there are critical changes that need to be made in back branches and
we can't really do that with these interfaces.

> It is of course not impossible to teach pg_basebackup to do all of
> that stuff internally, but I have a really difficult time imagining us
> ever getting it done. There are just too many possibilities, and new
> ones arise all the time.

I agree that it's certainly a fair bit of work, but it can be
accomplished incrementally and, with a good design, allow for adding in
new options in the future with relative ease.  Now is the time to
discuss what that design looks like, think about how we can implement it
in a way that all of the tools we have are able to work together, and
have them all support and be tested together with these different
options.

The concerns about there being too many possibilities and new ones
coming up all the time could be applied equally to FDWs, but rather than
ending up with a dearth of options and external solutions there, what
we've actually seen is an explosion of options and externally written
libraries for a large variety of options.

> A 'pgfile' utility wouldn't help at all for people who are storing to
> S3 or whatever. They could use 'aws s3' as a target for --pipe-output,
> but if it turns out that said tool is insufficiently robust in terms
> of overwriting files or doing fsyncs or whatever, then they might have
> problems. Now, Stephen or anyone else could choose to provide
> alternative tools with more robust behavior, and that would be great.
> But even if he didn't, people could take their chances with what's
> already out there. To me, that's a good thing. Yeah, maybe they'll do
> dumb things that don't work, but realistically, they can do dumb stuff
> without the proposed option too.

How does this solution give them a good way to do the right thing
though?  In a way that will work with large databases and complex
requirements?  The answer seems to be "well, everyone will have to write
their own tool 

Re: Implementing Incremental View Maintenance

2020-04-10 Thread Yugo NAGATA
Hi, 

Attached is the latest patch (v15) to add support for Incremental Materialized
View Maintenance (IVM). It applies to the current master branch.

Differences from the previous patch (v14) include:

* Fix to not use generate_series when views are queried

In the previous implementation, the multiplicity of each tuple was stored
in an ivm_count column in views. When a SELECT was issued against a view
with duplicates, the view was replaced with a subquery in which each tuple
was joined with the generate_series function in order to emit each tuple
ivm_count times.

This was problematic for the following reasons:

- The overhead was huge. When almost all of the tuples in a view were
  selected, it took much longer than the original query, which defeated
  the point of materialized views.

- The optimizer could not estimate row counts correctly because it would
  have had to know the ivm_count values stored in the tuples.

- System columns of materialized views such as cmin, xmin, and xmax could
  not be used because the view was replaced with a subquery.

To resolve this, the new implementation doesn't store multiplicities
for views with tuple duplicates, and doesn't use generate_series
when a SELECT query is issued against such views.

Note that we still have to use ivm_count to support DISTINCT and
aggregates.

*  Add query checks for IVM restrictions

Query checks for following restrictions are added:

- DISTINCT ON
- TABLESAMPLE parameter
- inheritance parent table
- window function
- some aggregate options (such as FILTER, DISTINCT, ORDER and GROUPING SETS)
- targetlist containing an IVM column
- only simple subqueries are supported
- FOR UPDATE/SHARE
- empty target list
- UNION/INTERSECT/EXCEPT
- GROUPING SETS clauses

* Improve error messages

Add error code ERRCODE_FEATURE_NOT_SUPPORTED to each IVM error message.
Also, the message format was unified.

* Support subqueries containing joins in the FROM clause

Previously, when multiple tables were updated simultaneously, incremental
view maintenance with subqueries including a JOIN didn't work correctly
due to a bug.

Best Regards,
Takuma Hoshiai

-- 
Yugo NAGATA 


IVM_patches_v15.tar.gz
Description: application/gzip


Re: Support for DATETIMEOFFSET

2020-04-10 Thread Neil


> On Apr 10, 2020, at 8:19 AM, Jeremy Morton  wrote:
> 
> Oh well.  Guess I keep using SQL Server then.  datetimeoffset makes it 
> impossible for developers to make the mistake of forgetting to use UTC 
> instead of local datetime, and for that reason alone it makes it invaluable 
> in my opinion.  It should be used universally instead of datetime.

1. Not sure I understand. I’ve never used datetimeoffset so please bear with 
me. How does storing a time zone with the date time “make it impossible for 
developers to make the mistake….”

2. I usually work with timestamps that have input and output across multiple 
time zones, why would one store a time zone in the database?  If I need a local 
time, then postgres does that automatically. 

3. At the end of the day a point in time in UTC is about as clear as it is 
possible to make it.

Not trying to be difficult, just trying to understand.

Neil

> 
> -- 
> Best regards,
> Jeremy Morton (Jez)
> 
> Andreas Karlsson wrote:
>> On 4/10/20 10:34 AM, Jeremy Morton wrote:
>>> I've noticed that Postgres doesn't have support for DATETIMEOFFSET (or any 
>>> functional equivalent data type) yet.  Is this on the roadmap to implement? 
>>>  I find it a very useful data type that I use all over the place in TSQL 
>>> databases.
>> Hi,
>> I do not think anyone is working on such a type.  And personally I think 
>> such a type is better suited for an extension rather than for core 
>> PostgreSQL. For most applications the timestamptz and date types are enough 
>> to solve everything time related (with some use of the timestamp type when 
>> doing calculations), but there are niche applications where other temporal 
>> types can be very useful, but I personally do not think those are common 
>> enough for inclusion in core PostgreSQL.
>> I suggest writing an extension with this type and see if there is any 
>> interest in it.
>> Andreas
> 
> 





Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-04-10 Thread James Coleman
One thing I just noticed and had a question about: in
preparePresortedCols (which sets up a function call context), do we
need to call pg_proc_aclcheck?

James




Re: Support for DATETIMEOFFSET

2020-04-10 Thread Tom Lane
Jeremy Morton  writes:
> Oh well.  Guess I keep using SQL Server then.  datetimeoffset makes it 
> impossible for developers to make the mistake of forgetting to use UTC 
> instead of local datetime,

Really?  That would be a remarkable feat for a mere datatype to
accomplish.

> and for that reason alone it makes it 
> invaluable in my opinion.  It should be used universally instead of 
> datetime.

What's it do that timestamptz together with setting timezone to UTC
doesn't?

regards, tom lane




Re: Fast DSM segments

2020-04-10 Thread Robert Haas
On Thu, Apr 9, 2020 at 1:46 AM Thomas Munro  wrote:
> The attached highly experimental patch adds a new GUC
> dynamic_shared_memory_main_size.  If you set it > 0, it creates a
> fixed sized shared memory region that supplies memory for "fast" DSM
> segments.  When there isn't enough free space, dsm_create() falls back
> to the traditional approach using eg shm_open().

I think this is a reasonable option to have available for people who
want to use it. I didn't want to have parallel query be limited to a
fixed-size amount of shared memory because I think there are some
cases where efficient performance really requires a large chunk of
memory, and it seemed impractical to keep the largest amount of memory
that any query might need to use permanently allocated, let alone that
amount multiplied by the maximum possible number of parallel queries
that could be running at the same time. But none of that is any
argument against giving people the option to preallocate some memory
for parallel query.

My guess is that on smaller boxes this won't find a lot of use, but on
bigger ones it will be handy. It's hard to imagine setting aside 1GB
of memory for this if you only have 8GB total, but if you have 512GB
total, it's pretty easy to imagine.
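
As a rough sketch of the fallback behaviour described in the quoted text
(the helper names below are hypothetical, not from the draft patch):

dsm_segment *
dsm_create(Size size, int flags)
{
	/* try the preallocated main region first */
	void	   *space = alloc_from_main_region(size);

	if (space != NULL)
		return make_fast_segment(space, size);

	/* pool exhausted, or the size GUC is 0: fall back to e.g. shm_open() */
	return dsm_create_traditional(size, flags);
}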

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: where should I stick that backup?

2020-04-10 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Thu, Apr  9, 2020 at 04:15:07PM -0400, Stephen Frost wrote:
> > * Bruce Momjian (br...@momjian.us) wrote:
> > > I think we need to step back and look at the larger issue.  The real
> > > argument goes back to the Unix command-line API vs the VMS/Windows API. 
> > > The former has discrete parts that can be stitched together, while the
> > > VMS/Windows API presents a more duplicative but more holistic API for
> > > every piece.  We have discussed using shell commands for
> > > archive_command, and even more recently, for the server pass phrase.  
> > 
> > When it comes to something like the server pass phrase, it seems much
> > more reasonable to consider using a shell script (though still perhaps
> > not ideal) because it's not involved directly in ensuring that the data
> > is reliably stored and it's pretty clear that if it doesn't work the
> > worst thing that happens is that the database doesn't start up, but it
> > won't corrupt any data or destroy it or do other bad things.
> 
> Well, the pass phrase relates to security, so it is important too.  I
> don't think the _importance_ of the action is the deciding
> issue.  Rather, I think it is how well the action fits the shell script
> API.

There isn't a single 'shell script API' though, and it's possible to
craft a 'shell script API' to fit nearly any use-case, but that doesn't
make it a good solution.  The amount we depend on the external code for
the correct operation of the system is relevant, and important to
consider.

> > > To get more specific, I think we have to understand how the
> > > _requirements_ of the job match the shell script API, with stdin,
> > > stdout, stderr, return code, and command-line arguments.  Looking at
> > > archive_command, the command-line arguments allow specification of file
> > > names, but quoting can be complex.  The error return code and stderr
> > > output seem to work fine.  There is no clean API for fsync and testing
> > > if the file exists, so that all that has to be hand done in one
> > > command-line.  This is why many users use pre-written archive_command
> > > shell scripts.
> > 
> > We aren't considering all of the use-cases really though, in specific,
> > things like pushing to s3 or gcs require, at least, good retry logic,
> > and that's without starting to think about things like high-rate systems
> > (spawning lots of new processes isn't free, particularly if they're
> written in shell script, though any interpreted language is expensive) and
> > wanting to parallelize.
> 
> Good point, but if there are multiple APIs, it makes shell script
> flexibility even more useful.

This doesn't seem to answer the concerns that I brought up.

Trying to understand it did make me think of another relevant question
that was brought up in this discussion- can we really expect users to
actually implement a C library for this, if we provided a way for them
to?  For that, I'd point to FDWs, where we certainly don't have any
shortage of external solutions written in C.  Another would be logical
decoding.

> > > This brings up a few questions:
> > > 
> > > *  Should we have split apart archive_command into file-exists, copy,
> > > fsync-file?  Should we add that now?
> > 
> > No..  The right approach to improving on archive command is to add a way
> > for an extension to take over that job, maybe with a complete background
> > worker of its own, or perhaps a shared library that can be loaded by the
> > archiver process, at least if we're talking about how to allow people to
> > extend it.
> 
> That seems quite vague, which is the issue we had years ago when
> considering doing archive_command as a link to a C library.

That prior discussion isn't really relevant though, as it was before we
had extensions, and before we had background workers that can run as part
of an extension.

> > Potentially a better answer is to just build this stuff into PG- things
> > like "archive WAL to s3/GCS with these credentials" are what an awful
> > lot of users want.  There's then some who want "archive first to this
> > other server, and then archive to s3/GCS", or more complex options.
> 
> Yes, we certainly know how to do a file system copy, but what about
> copying files to other things like S3?  I don't know how we would do
> that and allow users to change things like file paths or URLs.

There's a few different ways we could go about this.  The simple answer
would be to use GUCs, which would simplify things like dealing with the
restore side too.  Another option would be to have a concept of
'repository' objects in the system, not unlike tablespaces, but they'd
have more options.  To deal with that during recovery though, we'd need
a way to get the relevant information from the catalogs (maybe we write
the catalog out to a flat file on update, not unlike what we used to do
with pg_shadow), perhaps even in a format that users could modify if
they needed to.  The 

Re: where should I stick that backup?

2020-04-10 Thread Robert Haas
On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian  wrote:
> Good point, but if there are multiple APIs, it makes shell script
> flexibility even more useful.

This is really the key point for me. There are so many existing tools
that store a file someplace that we really can't ever hope to support
them all in core, or even to have well-written extensions that support
them all available on PGXN or wherever. We need to integrate with the
tools that other people have created, not try to reinvent them all in
PostgreSQL.

Now what I understand Stephen to be saying is that a lot of those
tools actually suck, and I think that's a completely valid point. But
I also think that it's unwise to decide that such problems are our
problems rather than problems with those tools. That's a hole with no
bottom.

One thing I do think would be realistic would be to invent a set of
tools that perform certain local filesystem operations in a
"hardened" way. Maybe a single tool with subcommands and options. So
you could say, e.g. 'pgfile cp SOURCE TARGET' and it would create a
temporary file in the target directory, write the contents of the
source into that file, fsync the file, rename it into place, and do
more fsyncs to make sure it's all durable in case of a crash. You
could have a variant of this that instead of using the temporary file
and rename in place approach, does the thing where you open the target
file with O_CREAT|O_EXCL, writes the bytes, and then closes and fsyncs
it. And you could have other things too, like 'pgfile mkdir DIR' to
create a directory and fsync it for durability. A toolset like this
would probably help people write better archive commands - it would
certainly be an improvement over what we have now, anyway, and it
could also be used with the feature that I proposed upthread.
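
For illustration, the write-fsync-rename-fsync recipe just described looks
roughly like this in C (a sketch only: pgfile is just a proposal, the
helper name is made up, and error handling is collapsed for brevity):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void
durable_write(const char *dir, const char *name, const char *buf, size_t len)
{
	char		tmp[1024], dst[1024];
	int			fd, dirfd;

	snprintf(tmp, sizeof(tmp), "%s/%s.tmp", dir, name);
	snprintf(dst, sizeof(dst), "%s/%s", dir, name);

	/* write the contents to a temporary file and make them durable */
	fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0 || write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
		exit(1);
	close(fd);

	/* atomically move the file into place */
	if (rename(tmp, dst) != 0)
		exit(1);

	/* fsync the directory so the rename itself survives a crash */
	dirfd = open(dir, O_RDONLY);
	if (dirfd < 0 || fsync(dirfd) != 0)
		exit(1);
	close(dirfd);
}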

For example, if you're concerned that bzip might overwrite an existing
file and that it might not fsync, then instead of saying:

pg_basebackup -Ft --pipe-output 'bzip > %f.bz2'

You could instead write:

pg_basebackup -Ft --pipe-output 'bzip | pgfile create-exclusive - %f.bz2'

or whatever we pick for actual syntax. And that provides a kind of
hardening that can be used with any other command line tool that can
be used as a filter.

If you want to compress with bzip, encrypt, and then copy the file to
a remote system, you could do:

pg_basebackup -Ft --pipe-output 'bzip | gpg -e | ssh someuser@somehost
pgfile create-exclusive - /backups/tuesday/%f.bz2'

It is of course not impossible to teach pg_basebackup to do all of
that stuff internally, but I have a really difficult time imagining us
ever getting it done. There are just too many possibilities, and new
ones arise all the time.

A 'pgfile' utility wouldn't help at all for people who are storing to
S3 or whatever. They could use 'aws s3' as a target for --pipe-output,
but if it turns out that said tool is insufficiently robust in terms
of overwriting files or doing fsyncs or whatever, then they might have
problems. Now, Stephen or anyone else could choose to provide
alternative tools with more robust behavior, and that would be great.
But even if he didn't, people could take their chances with what's
already out there. To me, that's a good thing. Yeah, maybe they'll do
dumb things that don't work, but realistically, they can do dumb stuff
without the proposed option too.

> Yes, we certainly know how to do a file system copy, but what about
> copying files to other things like S3?  I don't know how we would do
> that and allow users to change things like file paths or URLs.

Right. I think it's key that we provide people with tools that are
highly flexible and, ideally, also highly composable.

(Incidentally, pg_basebackup already has an option to output the
entire backup as a tarfile on standard output, and a user can already
pipe that into any tool they like. However, it doesn't work with
tablespaces. So you could think of this proposal as extending the
existing functionality to cover that case.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: doc review for parallel vacuum

2020-04-10 Thread Justin Pryzby
On Fri, Apr 10, 2020 at 12:56:08PM +0530, Amit Kapila wrote:
> On Wed, Apr 8, 2020 at 12:49 PM Masahiko Sawada
>  wrote:
> >
> > On Tue, 7 Apr 2020 at 13:55, Justin Pryzby  wrote:
> > >
> >
> > I don't have comments on your change other than the comments Amit
> > already sent. Thank you for reviewing this part!
> >
> 
> I have made the modifications as per my comments.  What do you think
> about the attached?

A couple more changes (in caps):

- The PARALLEL option is used only for vacuum PURPOSES.
- Even if this option is specified with THE ANALYZE option

Also, this part still doesn't read well:

-* amvacuumcleanup to the DSM segment if it's the first time to get it?
-* from them? because they? allocate it locally and it's possible that 
an
-* index will be vacuumed by the different vacuum process at the next

If you change "it" and "them" and "it" and say "*a* different", then it'll be
ok.

-- 
Justin




Re: Support for DATETIMEOFFSET

2020-04-10 Thread Jeremy Morton
Oh well.  Guess I keep using SQL Server then.  datetimeoffset makes it 
impossible for developers to make the mistake of forgetting to use UTC 
instead of local datetime, and for that reason alone it makes it 
invaluable in my opinion.  It should be used universally instead of 
datetime.


--
Best regards,
Jeremy Morton (Jez)

Andreas Karlsson wrote:

On 4/10/20 10:34 AM, Jeremy Morton wrote:
I've noticed that Postgres doesn't have support for DATETIMEOFFSET 
(or any functional equivalent data type) yet.  Is this on the 
roadmap to implement?  I find it a very useful data type that I use 
all over the place in TSQL databases.


Hi,

I do not think anyone is working on such a type.  And personally I 
think such a type is better suited for an extension rather than for 
core PostgreSQL. For most applications the timestamptz and date types 
are enough to solve everything time related (with some use of the 
timestamp type when doing calculations), but there are niche 
applications where other temporal types can be very useful, but I 
personally do not think those are common enough for inclusion in core 
PostgreSQL.


I suggest writing an extension with this type and see if there is any 
interest in it.


Andreas








Re: Vacuum o/p with (full 1, parallel 0) option throwing an error

2020-04-10 Thread Justin Pryzby
On Fri, Apr 10, 2020 at 10:34:02AM +0530, Amit Kapila wrote:
> On Thu, Apr 9, 2020 at 2:03 PM Justin Pryzby  wrote:
> > On Thu, Apr 09, 2020 at 05:07:48PM +0900, Masahiko Sawada wrote:
> > > Yes but the difference is that we cannot disable PARSER or COPY by
> > > specifying options whereas we can do something like "VACUUM (FULL
> > > false) tbl" to disable FULL option. I might be misunderstanding the
> > > meaning of "specify" though.
> >
> > You have it right.
> >
> > We should fix the behavior, but change the error message for consistency 
> > with
> > that change, like so.
> >
> 
> Okay, but I think the error message suggested by Robert "ERROR: VACUUM
> FULL cannot be performed in parallel" sounds better than what you have
> proposed.  What do you think?

No problem.  I think I was trying to make my text similar to that from
14a4f6f37.

I realized that I didn't squash my last patch, so it didn't include the
functional change (which is maybe what Robert was referring to).

-- 
Justin
>From 16ff7441791c481ab97b7eb8b2948e38626c7273 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 8 Apr 2020 11:38:36 -0500
Subject: [PATCH v3] Allow specifying "(parallel 0)" with "vacuum(full)"..

..even though full implies parallel 0

Check options using the same test as in vacuumlazy.c

See also similar change long ago:
14a4f6f3748df4ff63bb2d2d01146b2b98df20ef

Discussion:
https://www.postgresql.org/message-id/58c8d171-e665-6fa3-a9d3-d9423b694dae%40enterprisedb.com
---
 src/backend/commands/vacuum.c| 6 ++
 src/test/regress/expected/vacuum.out | 5 +++--
 src/test/regress/sql/vacuum.sql  | 3 ++-
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 351d5215a9..9cb65d76a9 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -104,7 +104,6 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	bool		freeze = false;
 	bool		full = false;
 	bool		disable_page_skipping = false;
-	bool		parallel_option = false;
 	ListCell   *lc;
 
 	/* Set default value */
@@ -145,7 +144,6 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 			params.truncate = get_vacopt_ternary_value(opt);
 		else if (strcmp(opt->defname, "parallel") == 0)
 		{
-			parallel_option = true;
 			if (opt->arg == NULL)
 			{
 ereport(ERROR,
@@ -199,10 +197,10 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		   !(params.options & (VACOPT_FULL | VACOPT_FREEZE)));
 	Assert(!(params.options & VACOPT_SKIPTOAST));
 
-	if ((params.options & VACOPT_FULL) && parallel_option)
+	if ((params.options & VACOPT_FULL) && params.nworkers > 0)
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot specify both FULL and PARALLEL options")));
+ errmsg("VACUUM FULL cannot be performed in parallel")));
 
 	/*
 	 * Make sure VACOPT_ANALYZE is specified if any column lists are present.
diff --git a/src/test/regress/expected/vacuum.out b/src/test/regress/expected/vacuum.out
index 0cfe28e63f..f90c498813 100644
--- a/src/test/regress/expected/vacuum.out
+++ b/src/test/regress/expected/vacuum.out
@@ -115,14 +115,15 @@ LINE 1: VACUUM (PARALLEL -1) pvactst;
 ^
 VACUUM (PARALLEL 2, INDEX_CLEANUP FALSE) pvactst;
 VACUUM (PARALLEL 2, FULL TRUE) pvactst; -- error, cannot use both PARALLEL and FULL
-ERROR:  cannot specify both FULL and PARALLEL options
+ERROR:  VACUUM FULL cannot be performed in parallel
+VACUUM (PARALLEL 0, FULL TRUE) pvactst; -- can specify parallel disabled (even though that's implied by FULL)
 VACUUM (PARALLEL) pvactst; -- error, cannot use PARALLEL option without parallel degree
 ERROR:  parallel option requires a value between 0 and 1024
 LINE 1: VACUUM (PARALLEL) pvactst;
 ^
 CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);
 CREATE INDEX tmp_idx1 ON tmp (a);
-VACUUM (PARALLEL 1) tmp; -- disables parallel vacuum option
+VACUUM (PARALLEL 1) tmp; -- parallel vacuum disabled for temp tables
 WARNING:  disabling parallel option of vacuum on "tmp" --- cannot vacuum temporary tables in parallel
 RESET min_parallel_index_scan_size;
 DROP TABLE pvactst;
diff --git a/src/test/regress/sql/vacuum.sql b/src/test/regress/sql/vacuum.sql
index cf741f7b11..dade1e5e57 100644
--- a/src/test/regress/sql/vacuum.sql
+++ b/src/test/regress/sql/vacuum.sql
@@ -99,10 +99,11 @@ VACUUM (PARALLEL 0) pvactst; -- disable parallel vacuum
 VACUUM (PARALLEL -1) pvactst; -- error
 VACUUM (PARALLEL 2, INDEX_CLEANUP FALSE) pvactst;
 VACUUM (PARALLEL 2, FULL TRUE) pvactst; -- error, cannot use both PARALLEL and FULL
+VACUUM (PARALLEL 0, FULL TRUE) pvactst; -- can specify parallel disabled (even though that's implied by FULL)
 VACUUM (PARALLEL) pvactst; -- error, cannot use PARALLEL option without parallel degree
 CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);
 CREATE INDEX tmp_idx1 ON tmp (a);
-VACUUM (PARALLEL 1) tmp; -- disables parallel vacuum 

Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

2020-04-10 Thread James Coleman
On Thu, Apr 9, 2020 at 10:08 PM Peter Geoghegan  wrote:
>
> On Thu, Apr 9, 2020 at 6:47 PM James Coleman  wrote:
> > I believe the write pattern to this table likely looks like:
> > - INSERT
> > - UPDATE
> > - DELETE
> > for every row. But tomorrow I can do some more digging if needed.
>
> The pg_stats.null_frac for the column/index might be interesting here. I
> believe that Active Record will sometimes generate created_at columns
> that end up containing NULL values. Not sure why.

null_frac is 0 for created_at (what I expected). Also (under current
data) all created_at values are unique except a single row duplicate.

That being said, remember the write pattern above: every row gets
deleted eventually, so there'd be a lot of dead tuples overall.

James




Re: Improve heavyweight locks instead of building new lock managers?

2020-04-10 Thread Robert Haas
On Wed, Feb 19, 2020 at 11:14 PM Andres Freund  wrote:
> Here's a *draft* patch series for removing all use of SHM_QUEUE, and
> subsequently removing SHM_QUEUE.

+1 for that idea.

But color me skeptical of what Thomas described as the "incidental
constification".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improve heavyweight locks instead of building new lock managers?

2020-04-10 Thread Robert Haas
On Mon, Feb 10, 2020 at 11:22 PM Andres Freund  wrote:
> To me this seems to go in the direction of having multiple bespoke lock
> managers with slightly different feature sets, different debugging /
> logging / monitoring support, with separate locking code each.  That's
> bad for maintainability.

I think it would help a lot if the lock manager were not such a
monolithic thing. For instance, suppose that instead of having the
deadlock detector be part of the lock manager, it's a separate thing
that integrates with the lock manager. Instead of only knowing about
waits for heavyweight locks, it knows about whatever you want to tell
it about. Then, for instance, you could possibly tell it that process
A is waiting for a cleanup lock while process B holds a pin, possibly
detecting a deadlock we can't notice today. There are likely other
cases as well.

> E.g. we could:
>
> - Have an extended LockAcquire API where the caller provides a stack
>   variable for caching information until the LockRelease call, avoiding
>   separate LockMethodLocalHash lookup for release.

Maybe. Seems like it might make things a little clunky for the caller,
though. If you just kept a stack of locks acquired and checked the
lock being released against the top of the stack, you'd probably catch
a large percentage of cases. Or, the caller could pass a flag
indicating whether they intend to release the lock prior to the end of
the transaction (which could be cross-checked by assertions). Any that
you intend to release early go on a stack.
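
A hypothetical shape for such an extended API (a sketch only; none of
these names exist in lock.c today):

typedef struct LockAcquireHandle
{
	LOCALLOCK  *locallock;		/* filled in by LockAcquireCached */
} LockAcquireHandle;

extern LockAcquireResult LockAcquireCached(const LOCKTAG *locktag,
										   LOCKMODE lockmode,
										   bool sessionLock,
										   LockAcquireHandle *handle);
extern void LockReleaseCached(LockAcquireHandle *handle);

With the handle kept on the caller's stack, LockReleaseCached could skip
the LockMethodLocalHash lookup entirely.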

> - Instead of always implementing HASH_REMOVE as a completely fresh
>   lookup in dynahash, we should provide an API for removing an object we
>   *know* is in the hashtable at a specific pointer. We're obviously
>   already relying on that address to stay constant, so we'd not lose
>   flexibility.  With this separate API we can avoid the bucket lookup,
>   walking through each element in that bucket, and having to compare the
>   hash key every time (which is quite expensive for a full LOCKTAG).

That makes a lot of sense, although I wonder if we should go further
and replace dynahash entirely.
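
One possible shape for the remove-at-pointer API being discussed
(hypothetical; dynahash has no such entry point today):

/*
 * Delete an entry through the pointer the caller already holds,
 * skipping the bucket walk and the (expensive) full key comparison.
 */
extern bool hash_delete_direct(HTAB *hashp, void *entry, uint32 hashvalue);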

> - We should try to figure out whether we can avoid needing the separate
>   lock table for PROCLOCKs. As far as I can tell there's not a single
>   reason for it to be a hashtable, rather than something like
>   NUM_LOCK_PARTITIONS lists of free PROCLOCKs.

I agree that the PROCLOCK thing seems ugly and inefficient. I'm not
sure that a bunch of lists is the best answer, though it might be. One
case to consider is when a lock is initially acquired via the
fast-path and then, because of a conflicting lock acquisition,
transferred by *some other process* to the main lock table. If this
occurs, the original locker must be able to find its PROCLOCK. It
doesn't have to be crazy efficient because it shouldn't happen very
often, but it shouldn't suck too much.

> - Whenever we do a relation extension, we better already hold a normal
>   relation lock. We don't actually need to have an entirely distinct
>   type of lock for extensions, that theoretically could support N
>   extension locks for each relation.  The fact that these are associated
>   could be utilized in different ways:
>
>   - We could define relation extension as a separate lock level only
> conflicting with itself (and perhaps treat AELs as including
> it). That'd avoid needing separate shared memory states in many
> cases.
>
> This might be too problematic, because extension locks don't have
> the same desired behaviour around group locking (and a small army of
> other issues).

Yeah, I don't think that's likely to work out very nicely.

>   - We could keep a separate extension lock cached inside the relation
> lock.  The first time a transaction needs to extend, it does the
> normal work, and after that stores the PROCLOCK in the LOCALLOCK (or
> something like that). Once extension is done, don't release the lock
> entirely, but instead just reduce the lock level to a new
> non-conflicting lock level
>
> Alternatively we could implement something very similar outside of
> lock.c, e.g. by caching the LOCALLOCK* (possibly identified by an
> integer or such) in RelationData.  That'd require enough support
> from lock.c to be able to make that really cheap.

Not sure I quite see what you mean here.

> We could e.g. split the maintenance of the hashtable(s) from protecting
> the state of individual locks. The locks protecting the hashtable would
> just be held long enough to change a "pin count" of each lock, and then
> a per LOCK lwlock would protect each heavyweight lock's state.  We'd not
> need to lock the partition locks for quite a few cases, because there's
> many locks in a loaded system that always have lockers.  There'd be cost
> to that, needing more atomic ops in some cases, but I think it'd reduce
> contention tremendously, and it'd open a lot more optimization
> potential. 

Re: SyncRepLock acquired exclusively in default configuration

2020-04-10 Thread Fujii Masao




On 2020/04/10 20:56, Masahiko Sawada wrote:

On Fri, 10 Apr 2020 at 18:57, Fujii Masao  wrote:




On 2020/04/10 14:11, Masahiko Sawada wrote:

On Fri, 10 Apr 2020 at 13:20, Fujii Masao  wrote:




On 2020/04/08 3:01, Ashwin Agrawal wrote:


On Mon, Apr 6, 2020 at 2:14 PM Andres Freund  wrote:

   > How about we change it to this ?

  Hm. Better. But I think it might need at least a compiler barrier /
  volatile memory load?  Unlikely here, but otherwise the compiler could
  theoretically just stash the variable somewhere locally (it's not likely
  to be a problem because we'd have acquired an lwlock, which is a full
  barrier, not long before).


That's the part I am not fully sure about. But reading the comment above 
SyncRepUpdateSyncStandbysDefined(), it seems fine.

   > Bring back the check which existed based on the GUC, but instead of
   > just blindly returning based on the GUC not being set, check
   > WalSndCtl->sync_standbys_defined. Thoughts?

  Hm. Is there any reason not to just check
  WalSndCtl->sync_standbys_defined? rather than both !SyncStandbysDefined()
  and WalSndCtl->sync_standbys_defined?


Agree, just checking for WalSndCtl->sync_standbys_defined seems fine.


So the consensus is something like the following? Patch attached.

   /*
-* Fast exit if user has not requested sync replication.
+* Fast exit if user has not requested sync replication, or there are no
+* sync replication standby names defined.
*/
-   if (!SyncRepRequested())
+   if (!SyncRepRequested() ||
+   !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
   return;



I think we need more comments describing why checking
sync_standbys_defined without SyncRepLock is safe here. For example:


Yep, agreed!


This routine gets called at every commit time. So, to check whether
synchronous standbys are defined as quickly as possible we check
WalSndCtl->sync_standbys_defined without acquiring SyncRepLock. Since
we make this test unlocked, there's a chance we might fail to notice
that it has been turned off and continue processing.


Does this really happen? I was thinking that the problem with not taking
the lock here is that SyncRepWaitForLSN() can see that shared flag after
SyncRepUpdateSyncStandbysDefined() wakes up all the waiters and
before it sets the flag to false. Then if SyncRepWaitForLSN() adds itself
into the wait queue because the flag was true, without the lock, it may keep
sleeping infinitely.


I think that is because a backend does the following check after
acquiring SyncRepLock: in that case, once the backend has taken
SyncRepLock, it can see that sync_standbys_defined is false and return.


Yes, but the backend can see that sync_standbys_defined indicates false
whether holding SyncRepLock or not, after the checkpointer sets it to false.


But did you mean that we do both checks without SyncRepLock?


Maybe no. The change that the latest patch provides should be applied, I think.
That is, sync_standbys_defined should be checked without the lock at first;
then, only if it's true, it should be checked again with the lock.

ISTM that basically SyncRepLock is used in SyncRepWaitForLSN() and
SyncRepUpdateSyncStandbysDefined() to make operations on the queue
and updates of sync_standbys_defined atomic. Without the lock, the issue
that the comment in SyncRepUpdateSyncStandbysDefined() explains would
happen. That is, the backend may keep waiting infinitely as follows.

1. checkpointer calls SyncRepUpdateSyncStandbysDefined()
2. checkpointer sees that the flag indicates true but the config indicates false
3. checkpointer takes the lock and wakes up all the waiters
4. backend calls SyncRepWaitForLSN() and sees that the flag still indicates true
5. checkpointer sets the flag to false and releases the lock
6. backend adds itself to the queue and waits to be woken up, but the wake-up
never comes

So after the backend sees that the flag indicates true without the lock,
it must check the flag again with the lock held, before operating on the
queue; a sketch combining both checks follows the quoted code below. If my
understanding is right, I was thinking that the comment should mention
these things.


 /*
  * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
  * set.  See SyncRepUpdateSyncStandbysDefined.
  *
  * Also check that the standby hasn't already replied. Unlikely race
  * condition but we'll be fetching that cache line anyway so it's likely
  * to be a low cost check.
  */
 if (!WalSndCtl->sync_standbys_defined ||
 lsn <= WalSndCtl->lsn[mode])
 {
 LWLockRelease(SyncRepLock);
 return;
 }
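
To restate the proposal as code: the unlocked test and the locked recheck
together form a double-checked pattern. Below is a minimal sketch reusing
the names from the quoted code; the function name and shape are illustrative
only, not the actual patch:

/* Sketch only; not the actual patch. */
static void
SyncRepWaitForLSN_sketch(XLogRecPtr lsn, int mode)
{
	/*
	 * Unlocked fast exit.  The volatile cast forces a fresh read; a stale
	 * "false" merely skips the wait, and a stale "true" is caught below.
	 */
	if (!((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
		return;

	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);

	/*
	 * Recheck under the lock before touching the queue.  Without this,
	 * steps 4-6 of the race above could enqueue us after the checkpointer
	 * has already woken all waiters, leaving us asleep forever.
	 */
	if (!WalSndCtl->sync_standbys_defined ||
		lsn <= WalSndCtl->lsn[mode])
	{
		LWLockRelease(SyncRepLock);
		return;
	}

	/* ... enqueue ourselves and wait, as the existing code does ... */
	LWLockRelease(SyncRepLock);
}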




But since the
subsequent check will check it again while holding SyncRepLock, it's
no problem. Similarly even if we fail to notice that it has been
turned on

Is this true? ISTM that after SyncRepUpdateSyncStandbysDefined()
sets the flag to true, SyncRepWaitForLSN() 

Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-10 Thread Amit Kapila
On Fri, Apr 10, 2020 at 5:35 PM Amit Kapila  wrote:
>
> On Tue, Apr 7, 2020 at 8:30 PM Juan José Santamaría Flecha
>  wrote:
> >
> > * I think you could save a couple of code lines, and make it clearer, by 
> > merging both tests on  _MSC_VER  into a single #if... #else and leaving as 
> > common code:
> > + }
> > + else
> > + return NULL;
> > +#endif /*_MSC_VER && _MSC_VER < 1900*/
> >
> > * The logic on "defined(_MSC_VER) && (_MSC_VER >= 1900)" is defined as 
> > "_WIN32_WINNT >= 0x0600" on other parts of the code. I would recommend 
> > using the latter.
> >
>
> I see that we have used _MSC_VER form of checks in win32_langinfo
> (chklocale.c) for a similar kind of handling.  So, isn't it better to
> be consistent with that?  Which exact part of the code are you
> referring to?
>

I see that the kind of check you are talking about was recently added by
commit 352f6f2d.  I think it is better to be consistent in all places.
Let's pick one and use that if possible.
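
For reference, the two guard styles being weighed look like the sketch below
(not actual PostgreSQL code). Note that they are not strictly equivalent:
_MSC_VER keys on the compiler version, while _WIN32_WINNT keys on the minimum
target Windows version, which is arguably part of the consistency question.

/* Style 1: compiler-version test, as in win32_langinfo() in chklocale.c */
#if defined(_MSC_VER) && (_MSC_VER >= 1900)
	/* code path for VS2015 and later */
#else
	/* legacy path */
#endif

/* Style 2: target-OS test, as added by commit 352f6f2d */
#if defined(_WIN32_WINNT) && (_WIN32_WINNT >= 0x0600)
	/* code path when targeting Windows Vista and later */
#endif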

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Support for DATETIMEOFFSET

2020-04-10 Thread Andreas Karlsson

On 4/10/20 10:34 AM, Jeremy Morton wrote:
I've noticed that Postgres doesn't have support for DATETIMEOFFSET (or 
any functional equivalent data type) yet.  Is this on the roadmap to 
implement?  I find it a very useful data type that I use all over the 
place in TSQL databases.


Hi,

I do not think anyone is working on such a type.  And personally I think
such a type is better suited to an extension than to core
PostgreSQL. For most applications the timestamptz and date types are
enough to solve everything time-related (with some use of the timestamp
type when doing calculations). There are niche applications where
other temporal types can be very useful, but I personally do not think
those are common enough for inclusion in core PostgreSQL.


I suggest writing an extension with this type and seeing if there is any
interest in it.


Andreas




Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-10 Thread Amit Kapila
On Tue, Apr 7, 2020 at 8:30 PM Juan José Santamaría Flecha
 wrote:
>
> * I think you could save a couple of code lines, and make it clearer, by 
> merging both tests on  _MSC_VER  into a single #if... #else and leaving as 
> common code:
> + }
> + else
> + return NULL;
> +#endif /*_MSC_VER && _MSC_VER < 1900*/
>
> * The logic on "defined(_MSC_VER) && (_MSC_VER >= 1900)" is defined as 
> "_WIN32_WINNT >= 0x0600" on other parts of the code. I would recommend using 
> the latter.
>

I see that we have used _MSC_VER form of checks in win32_langinfo
(chklocale.c) for a similar kind of handling.  So, isn't it better to
be consistent with that?  Which exact part of the code are you
referring to?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-10 Thread Amit Kapila
On Mon, Apr 6, 2020 at 4:38 PM davinder singh
 wrote:
>
> Hi All,
>
> I am working on the “pg_locale compilation error with Visual Studio 2017”;
> related threads: [1], [2].
> We are getting a compilation error in the static char *IsoLocaleName(const char
> *winlocname) function in pg_locale.c. This function tries to convert a
> locale name into the Unix style; for example, it changes the locale
> name "en-US" into "en_US".
> It creates a locale using _locale_t _create_locale(int category, const
> char *locale) and then tries to access the name of that locale by pointing
> to an internal element of the structure, loct->locinfo->locale_name[LC_CTYPE],
> but that member has been missing from _locale_t since VS2015, which causes the
> compilation error. I found a few useful APIs that can be used here.
>
> ResolveLocaleName and GetLocaleInfoEx both can take locale in the following 
> format.
> -
> 
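
For illustration, a hypothetical sketch (not the actual patch) of how
GetLocaleInfoEx() could produce the Unix-style name without reaching into
_locale_t internals:

#include <windows.h>

/* Hypothetical sketch: map "en-US" to "en_US" via GetLocaleInfoEx(). */
static char *
iso_locale_name_sketch(const WCHAR *winlocname)
{
	static char result[LOCALE_NAME_MAX_LENGTH];
	WCHAR		buf[LOCALE_NAME_MAX_LENGTH];

	if (GetLocaleInfoEx(winlocname, LOCALE_SNAME, buf,
						LOCALE_NAME_MAX_LENGTH) == 0)
		return NULL;			/* unknown or unsupported locale */

	/* Narrow the name and replace the BCP 47 hyphen with an underscore. */
	for (int i = 0; i < LOCALE_NAME_MAX_LENGTH; i++)
	{
		result[i] = (buf[i] == L'-') ? '_' : (char) buf[i];
		if (buf[i] == L'\0')
			break;
	}
	return result;
}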

Re: SyncRepLock acquired exclusively in default configuration

2020-04-10 Thread Masahiko Sawada
On Fri, 10 Apr 2020 at 18:57, Fujii Masao  wrote:
>
>
>
> On 2020/04/10 14:11, Masahiko Sawada wrote:
> > On Fri, 10 Apr 2020 at 13:20, Fujii Masao  
> > wrote:
> >>
> >>
> >>
> >> On 2020/04/08 3:01, Ashwin Agrawal wrote:
> >>>
> >>> On Mon, Apr 6, 2020 at 2:14 PM Andres Freund <and...@anarazel.de> wrote:
> >>>
> >>>   > How about we change it to this ?
> >>>
> >>>  Hm. Better. But I think it might need at least a compiler barrier /
> >>>  volatile memory load?  Unlikely here, but otherwise the compiler 
> >>> could
> >>>  theoretically just stash the variable somewhere locally (it's not 
> >>> likely
> >>>  to be a problem because it'd not be long ago that we acquired an 
> >>> lwlock,
> >>>  which is a full barrier).
> >>>
> >>>
> >>> That's the part, I am not fully sure about. But reading the comment above 
> >>> SyncRepUpdateSyncStandbysDefined(), it seems fine.
> >>>
> >>>   > Bring back the check which existed based on GUC but instead of 
> >>> just blindly
> >>>   > returning based on just GUC not being set, check
> >>>   > WalSndCtl->sync_standbys_defined. Thoughts?
> >>>
> >>>  Hm. Is there any reason not to just check
> >>>  WalSndCtl->sync_standbys_defined, rather than both 
> >>> !SyncStandbysDefined()
> >>>  and WalSndCtl->sync_standbys_defined?
> >>>
> >>>
> >>> Agree, just checking for WalSndCtl->sync_standbys_defined seems fine.
> >>
> >> So the consensus is something like the following? Patch attached.
> >>
> >>   /*
> >> -* Fast exit if user has not requested sync replication.
> >> +* Fast exit if user has not requested sync replication, or there 
> >> are no
> >> +* sync replication standby names defined.
> >>*/
> >> -   if (!SyncRepRequested())
> >> +   if (!SyncRepRequested() ||
> >> +   !((volatile WalSndCtlData *) 
> >> WalSndCtl)->sync_standbys_defined)
> >>   return;
> >>
> >
> > I think we need more comments describing why checking
> > sync_standbys_defined without SyncRepLock is safe here. For example:
>
> Yep, agreed!
>
> > This routine gets called at every commit. So, to check as quickly as
> > possible whether synchronous standbys are defined, we check
> > WalSndCtl->sync_standbys_defined without acquiring SyncRepLock. Since
> > we make this test unlocked, there's a chance we might fail to notice
> > that it has been turned off and continue processing.
>
> Does this really happen? I was thinking that the problem with not taking
> the lock here is that SyncRepWaitForLSN() can see that shared flag after
> SyncRepUpdateSyncStandbysDefined() wakes up all the waiters and
> before it sets the flag to false. Then if SyncRepWaitForLSN() adds itself
> to the wait queue because the flag was true, without the lock, it may keep
> sleeping indefinitely.

I think that because a backend does the following check after
acquiring SyncRepLock, once the backend has taken
SyncRepLock it can see that sync_standbys_defined is false and return.
But did you mean that we do both checks without SyncRepLock?

/*
 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
 * set.  See SyncRepUpdateSyncStandbysDefined.
 *
 * Also check that the standby hasn't already replied. Unlikely race
 * condition but we'll be fetching that cache line anyway so it's likely
 * to be a low cost check.
 */
if (!WalSndCtl->sync_standbys_defined ||
lsn <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
return;
}

>
> > But since the
> > subsequent check will check it again while holding SyncRepLock, it's
> > no problem. Similarly even if we fail to notice that it has been
> > turned on
> Is this true? ISTM that after SyncRepUpdateSyncStandbysDefined()
> sets the flag to true, SyncRepWaitForLSN() basically doesn't seem
> to fail to notice that. No?

What I wanted to say is: in the current code, while the checkpointer
process is holding SyncRepLock to turn on sync_standbys_defined,
backends that reach SyncRepWaitForLSN() wait for the lock. Then, after
the checkpointer process releases SyncRepLock, these backends can
enqueue themselves to the wait queue because they can see that
sync_standbys_defined is turned on. On the other hand, if we do the
check without SyncRepLock, backends that reach SyncRepWaitForLSN() will
return instead of waiting, in spite of the checkpointer process
turning on sync_standbys_defined. Which means these backends are
failing to notice that it has been turned on, I thought.
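
For reference, the checkpointer side that this reasoning relies on is
roughly the following; this is a sketch of the existing
SyncRepUpdateSyncStandbysDefined() logic (the _sketch suffix only marks it
as such), where the queue wake-up and the flag update happen under the same
lock:

/* Sketch of the existing logic; flag and queue change under one lock. */
static void
SyncRepUpdateSyncStandbysDefined_sketch(void)
{
	bool		sync_standbys_defined = SyncStandbysDefined();

	if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
	{
		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);

		/*
		 * When turning the flag off, wake everyone currently queued before
		 * clearing it, all under the same lock, so a backend can never
		 * enqueue itself against a flag that is already false.
		 */
		if (!sync_standbys_defined)
		{
			for (int mode = 0; mode < NUM_SYNC_REP_WAIT_MODE; mode++)
				SyncRepWakeQueue(true, mode);
		}

		WalSndCtl->sync_standbys_defined = sync_standbys_defined;

		LWLockRelease(SyncRepLock);
	}
}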

Regards,

-- 
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Parallel copy

2020-04-10 Thread Robert Haas
On Thu, Apr 9, 2020 at 4:00 PM Andres Freund  wrote:
> I've not yet read the whole thread. So I'm probably restating ideas.

Yeah, but that's OK.

> Imo, yes, there should be only one process doing the chunking. For ILP, cache 
> efficiency, but also because the leader is the only process with access to 
> the network socket. It should load input data into one large buffer that's 
> shared across processes. There should be a separate ringbuffer with 
> tuple/partial tuple (for huge tuples) offsets. Worker processes should grab 
> large chunks of offsets from the offset ringbuffer. If the ringbuffer is not 
> full, the worker chunks should be reduced in size.

My concern here is that it's going to be hard to avoid processes going
idle. If the leader does nothing at all once the ring buffer is full,
it's wasting time that it could spend processing a chunk. But if it
picks up a chunk, then it might not get around to refilling the buffer
before other processes are idle with no work to do.

Still, it might be the case that having the process that is reading
the data also find the line endings is so fast that it makes no sense
to split those two tasks. After all, whoever just read the data must
have it in cache, and that helps a lot.
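
To make the scheme concrete, here is a minimal sketch of a shared ring of
line-start offsets that a single leader fills and multiple workers drain in
batches. All names here are invented for illustration; this is not part of
any posted patch:

/* Hypothetical sketch of the offset ring described above; not a patch. */
#define OFFSET_RING_SIZE 8192	/* must be a power of two */

typedef struct ChunkOffsetRing
{
	pg_atomic_uint64 write_pos;	/* next slot the leader will fill */
	pg_atomic_uint64 read_pos;	/* next slot a worker will claim */
	uint64		offsets[OFFSET_RING_SIZE];	/* line starts in the shared
											 * input buffer */
} ChunkOffsetRing;

/* Leader (single producer): publish one offset; false if the ring is full. */
static bool
ring_put(ChunkOffsetRing *ring, uint64 off)
{
	uint64		w = pg_atomic_read_u64(&ring->write_pos);

	if (w - pg_atomic_read_u64(&ring->read_pos) >= OFFSET_RING_SIZE)
		return false;			/* full: leader might process a chunk itself */
	ring->offsets[w % OFFSET_RING_SIZE] = off;
	pg_write_barrier();			/* offset must be visible before publication */
	pg_atomic_write_u64(&ring->write_pos, w + 1);
	return true;
}

/* Worker (multiple consumers): claim a batch of offsets [*start, *end). */
static bool
ring_get_batch(ChunkOffsetRing *ring, uint64 batch, uint64 *start, uint64 *end)
{
	uint64		r;

	do
	{
		r = pg_atomic_read_u64(&ring->read_pos);
		*start = r;
		*end = Min(r + batch, pg_atomic_read_u64(&ring->write_pos));
		if (*end == *start)
			return false;		/* empty: nothing split yet */
	} while (!pg_atomic_compare_exchange_u64(&ring->read_pos, &r, *end));
	return true;
}

A worker could also shrink its batch when write_pos - read_pos is small,
which matches the adaptive chunk sizing suggested above.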

> Given that everything stalls if the leader doesn't accept further input data, 
> as well as when there are no available split chunks, it doesn't seem like 
> a good idea to have the leader do other work.
>
> I don't think optimizing/targeting copy from local files, where multiple 
> processes could read, is useful. COPY STDIN is the only thing that 
> practically matters.

Yeah, I think Amit has been thinking primarily in terms of COPY from
files, and I've been encouraging him to at least consider the STDIN
case. But I think you're right, and COPY FROM STDIN should be the
design center for this feature.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: SyncRepLock acquired exclusively in default configuration

2020-04-10 Thread Fujii Masao




On 2020/04/10 14:11, Masahiko Sawada wrote:

On Fri, 10 Apr 2020 at 13:20, Fujii Masao  wrote:




On 2020/04/08 3:01, Ashwin Agrawal wrote:


On Mon, Apr 6, 2020 at 2:14 PM Andres Freund <and...@anarazel.de> wrote:

  > How about we change it to this ?

 Hm. Better. But I think it might need at least a compiler barrier /
 volatile memory load?  Unlikely here, but otherwise the compiler could
 theoretically just stash the variable somewhere locally (it's not likely
 to be a problem because it'd not be long ago that we acquired an lwlock,
 which is a full barrier).


That's the part, I am not fully sure about. But reading the comment above 
SyncRepUpdateSyncStandbysDefined(), it seems fine.

  > Bring back the check which existed based on GUC but instead of just 
blindly
  > returning based on just GUC not being set, check
  > WalSndCtl->sync_standbys_defined. Thoughts?

 Hm. Is there any reason not to just check
 WalSndCtl->sync_standbys_defined, rather than both !SyncStandbysDefined()
 and WalSndCtl->sync_standbys_defined?


Agree, just checking for WalSndCtl->sync_standbys_defined seems fine.


So the consensus is something like the following? Patch attached.

  /*
-* Fast exit if user has not requested sync replication.
+* Fast exit if user has not requested sync replication, or there are no
+* sync replication standby names defined.
   */
-   if (!SyncRepRequested())
+   if (!SyncRepRequested() ||
+   !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
  return;



I think we need more comments describing why checking
sync_standbys_defined without SyncRepLock is safe here. For example:


Yep, agreed!


This routine gets called at every commit. So, to check as quickly as
possible whether synchronous standbys are defined, we check
WalSndCtl->sync_standbys_defined without acquiring SyncRepLock. Since
we make this test unlocked, there's a chance we might fail to notice
that it has been turned off and continue processing.


Does this really happen? I was thinking that the problem with not taking
the lock here is that SyncRepWaitForLSN() can see that shared flag after
SyncRepUpdateSyncStandbysDefined() wakes up all the waiters and
before it sets the flag to false. Then if SyncRepWaitForLSN() adds itself
to the wait queue because the flag was true, without the lock, it may keep
sleeping indefinitely.


But since the
subsequent check will check it again while holding SyncRepLock, it's
no problem. Similarly even if we fail to notice that it has been
turned on

Is this true? ISTM that after SyncRepUpdateSyncStandbysDefined()
sets the flag to true, SyncRepWaitForLSN() basically doesn't seem
to fail to notice that. No?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Report error position in partition bound check

2020-04-10 Thread Amit Langote
On Thu, Apr 9, 2020 at 11:04 PM Tom Lane  wrote:
> While I'm quite on board with providing useful error cursors,
> the example cases in this patch don't seem all that useful:
>
>  -- trying to create range partition with empty range
>  CREATE TABLE fail_part PARTITION OF range_parted2 FOR VALUES FROM (1) TO (0);
>  ERROR:  empty range bound specified for partition "fail_part"
> +LINE 1: ...E fail_part PARTITION OF range_parted2 FOR VALUES FROM (1) T...
> + ^
>  DETAIL:  Specified lower bound (1) is greater than or equal to upper bound 
> (0).
>
> As best I can tell from these examples, the cursor will always
> point at the FROM keyword, making it pretty unhelpful.  It seems
> like in addition to getting the query string passed down, you
> need to do some work on the code that's actually reporting the
> error position.  I'd expect at a minimum that the pointer allows
> identifying which column of a multi-column partition key is
> giving trouble.  The phrasing of this particular message, for
> example, suggests that it ought to point at the "1" expression.

I agree with that.  Tried that in the attached 0002, although trying
to get the cursor to point to exactly the offending column seems a bit
tough for partition overlap errors.  The patch does allow singling
out which one of the lower and upper bounds is causing the overlap
with an existing partition, which is better than now and seems helpful
enough.

Also, updated Alexandra's patch to incorporate Ashutosh's comment such
that we get the same output with ATTACH PARTITION commands too.
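
Mechanically, the improvement amounts to threading a ParseState down to the
bound checks and pointing the cursor at the offending datum's own parse
location. A rough sketch of the reporting side, where relname, lower_str,
upper_str, pstate, and datum are placeholder names:

/* Sketch: point the error cursor at the offending bound expression. */
ereport(ERROR,
		(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
		 errmsg("empty range bound specified for partition \"%s\"",
				relname),
		 errdetail("Specified lower bound %s is greater than or equal to upper bound %s.",
				   lower_str, upper_str),
		 parser_errposition(pstate, datum->location)));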

-- 

Amit Langote
EnterpriseDB: http://www.enterprisedb.com


v3-0001-Report-error-position-in-partition-bound-check.patch
Description: Binary data


v1-0002-Improve-check_new_partition_bound-error-position-.patch
Description: Binary data


Support for DATETIMEOFFSET

2020-04-10 Thread Jeremy Morton
I've noticed that Postgres doesn't have support for DATETIMEOFFSET (or 
any functional equivalent data type) yet.  Is this on the roadmap to 
implement?  I find it a very useful data type that I use all over the 
place in TSQL databases.


--
Best regards,
Jeremy Morton (Jez)




RE: Complete data erasure

2020-04-10 Thread asaba.takan...@fujitsu.com
Hello,

I was off the point earlier. Let me organize the discussion and suggest a
feature design.

There are two opinions:
1. COMMIT should not take a long time, because the longer it takes, the more
likely errors are to occur.
2. The data area should be released when COMMIT completes, because COMMIT has
to be an atomic action.

Both opinions are valid, but it is difficult to satisfy them at the same time.
So I suggest that users have the option to choose. DROP TABLE would work in
one of the following two patterns:

1. Rename the data file to "...del" instead of ftruncate(fd,0).
  After that, a bgworker scans the directory and runs erase_command.
  (erase_command is a command set by the user, like archive_command;
   for example, shred on Linux.)

2. Run erase_command on the data file immediately before ftruncate(fd,0).
  Wait until it completes, then reply COMMIT to the client.
  After that, processing is the same as normal.

If erase_command fails, it issues a WARNING and doesn't request the unlink
from the checkpointer.  That is not a security failure, because the risk
arises only when the data area is returned to the OS.

I will implement pattern 2 first, because it is closer to the expected user
experience than pattern 1.
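
A rough sketch of what pattern 2 could look like at the point the file would
be truncated; erase_command as a GUC and the helper below are hypothetical,
not existing PostgreSQL code:

/* Hypothetical sketch of pattern 2: erase, then truncate; warn on failure. */
extern char *erase_command;		/* hypothetical GUC string */

static bool
run_erase_command(const char *path)
{
	StringInfoData cmd;
	const char *sp;

	initStringInfo(&cmd);
	/* substitute %f with the file path, as archive_command does with %p */
	for (sp = erase_command; *sp; sp++)
	{
		if (sp[0] == '%' && sp[1] == 'f')
		{
			appendStringInfoString(&cmd, path);
			sp++;
		}
		else
			appendStringInfoChar(&cmd, *sp);
	}

	if (system(cmd.data) != 0)
	{
		ereport(WARNING,
				(errmsg("erase_command failed for file \"%s\"", path)));
		pfree(cmd.data);
		return false;			/* caller then skips the unlink request */
	}

	pfree(cmd.data);
	return true;				/* safe to ftruncate() and unlink later */
}

With erase_command set to something like 'shred -z "%f"', the file contents
would be overwritten before the space is ever returned to the OS.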
This method has drawn the following comment.

From Stephen:
> We certainly can't run external commands during transaction COMMIT, so
> this can't be part of a regular DROP TABLE.

I think this means that errors from external commands can't be handled.
If so, that's not a problem, because I have defined the behavior after an
error. Are there any other problems?

Regards,

--
Takanori Asaba






pg_basebackup, manifests and backends older than ~12

2020-04-10 Thread Michael Paquier
Hi,

I have noticed that attempting to use pg_basebackup from HEAD leads to
failures when using it with backend versions from 12 and older:
$ pg_basebackup -D hoge
pg_basebackup: error: backup manifests are not supported by server
version 12beta2
pg_basebackup: removing data directory "hoge"

This is a bit backwards compared to what we did in the past to maintain
compatibility silently when possible; for example, look at the handling
of temporary replication slots.  Instead of an error, which means
forcing users to specify --no-manifest in this case, shouldn't
we silently disable the generation of the backup manifest?  We know
that this option won't work on older server versions anyway.

Thanks,
--
Michael




Re: doc review for parallel vacuum

2020-04-10 Thread Masahiko Sawada
On Fri, 10 Apr 2020 at 16:26, Amit Kapila  wrote:
>
> On Wed, Apr 8, 2020 at 12:49 PM Masahiko Sawada
>  wrote:
> >
> > On Tue, 7 Apr 2020 at 13:55, Justin Pryzby  wrote:
> > >
> >
> > I don't have comments on your change other than the comments Amit
> > already sent. Thank you for reviewing this part!
> >
>
> I have made the modifications as per my comments.  What do you think
> about the attached?

Thank you for updating the patch! Looks good to me.

Regards,

-- 
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




spin_delay() for ARM

2020-04-10 Thread Amit Khandekar
Hi,

We use (an equivalent of) the PAUSE instruction in spin_delay() for
Intel architectures. The goal is to slow down the spinlock tight loop
and thus prevent it from eating CPU and causing CPU starvation, so
that other processes get their fair share of the CPU time. Intel
documentation [1] clearly mentions this, along with other benefits of
PAUSE, like, low power consumption, and avoidance of memory order
violation while exiting the loop.

Similar to PAUSE, the ARM architecture has YIELD instruction, which is
also clearly documented [2]. It explicitly says that it is a way to
hint the CPU that it is being called in a spinlock loop and this
process can be preempted out.  But for ARM, we are not using any kind
of spin delay.

For PG spinlocks, the goal of both of these instructions is the same,
and both architectures recommend using them in spinlock loops.
Also, I found multiple places where YIELD is already used in the same
situations: Linux kernel [3]; OpenJDK [4], [5].

Now, for ARM implementations that don't implement YIELD, it runs as a
no-op.  Unfortunately the ARM machine I have does not implement YIELD.
But recently there have been some ARM implementations that are
hyperthreaded, so they are expected to actually do the YIELD, although
the docs do not explicitly say that YIELD has to be implemented only
by hyperthreaded implementations.

I ran some pgbench tests to test PAUSE/YIELD on the respective
architectures, once with the instruction present, and once with the
instruction removed.  I didn't see a change in the TPS numbers; they were
more or less the same. For ARM, this was expected because my ARM machine
does not implement it.

On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
using a sample C program (attached spin.c). Here, many child processes
(much more than CPUs) wait in a tight loop for a shared variable to
become 0, while the parent process continuously increments a sequence
number for a fixed amount of time, after which, it sets the shared
variable to 0. The child's tight loop calls PAUSE in each iteration.
What I hoped was that because of the PAUSE in the children, the parent
process would get a larger share of the CPU, due to which, in a given time,
the sequence number would reach a higher value. I also expected the CPU
cycles spent by the child processes to drop, thanks to PAUSE. Neither of
these happened; there was no change.

Possibly, this test case is not right. The process preemption probably
occurs only within the set of hyperthreads attached to a single core,
and in my test case, the parent process is the only one ready to
run. Still, I have attached the program (spin.c) for archival,
in case somebody with a YIELD-supporting ARM machine wants to use it
to test YIELD.

Nevertheless, I think that because we have clear documentation that
strongly recommends using it, and because it has been used in other
places such as the Linux kernel and the JDK, we should start using YIELD
for spin_delay() on ARM.

Attached is a trivial patch (spin_delay_for_arm.patch). To start
with, it contains changes only for aarch64. I haven't yet added
changes in configure[.in] to make sure yield compiles successfully
(YIELD is present in manuals from ARMv6 onwards); I wanted to get
some comments first, so I left the configure changes for later.
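
For concreteness, the aarch64 part of such a patch mirrors the existing x86
PAUSE hint in s_lock.h; roughly the following sketch, which still needs the
configure test mentioned above:

/* Sketch: give aarch64 a spin_delay(), mirroring the x86 "rep; nop" hint. */
#if defined(__aarch64__)
#define SPIN_DELAY() spin_delay()

static __inline__ void
spin_delay(void)
{
	/* YIELD hints the CPU that this is a spinlock wait loop. */
	__asm__ __volatile__(" yield \n");
}
#endif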


[1] https://c9x.me/x86/html/file_module_x86_id_232.html
[2] 
https://developer.arm.com/docs/100076/0100/instruction-set-reference/a64-general-instructions/yield
[3] 
https://elixir.bootlin.com/linux/latest/source/arch/arm64/include/asm/processor.h#L259
[4] http://cr.openjdk.java.net/~dchuyko/8186670/yield/spinwait.html
[5] 
http://mail.openjdk.java.net/pipermail/aarch64-port-dev/2017-August/004880.html


--
Thanks,
-Amit Khandekar
Huawei Technologies
/*
 * Sample program to test the effect of PAUSE/YIELD instruction in a highly
 * contended scenario.  The Intel and ARM docs recommend the use of PAUSE and
 * YIELD respectively, in spinlock tight loops.
 *
 * This program can be run with :
 * gcc -O3 -o spin spin.c -lrt ; ./spin [number_of_processes]
 * By default, 4 processes are spawned.
 *
 * Child processes wait in a tight loop for a shared variable to become 0,
 * while the parent process continuously increments a sequence number for a
 * fixed amount of time, after which, it sets the shared variable to 0. The
 * child tight loop calls YIELD/PAUSE in each iteration.
 *
 * The intention is to create a number of processes much larger than the
 * available CPUs, so that the scheduler hopefully preempts the processes
 * because of the PAUSE, and the main process gets a larger CPU share, due
 * to which it will increment its sequence number more times. So the
 * expectation is that with PAUSE, the program will end up with a much higher
 * sequence number than without PAUSE. Similarly, the child processes should
 * consume fewer CPU cycles with PAUSE than without PAUSE.
 *
 * Author: Amit Khandekar
 */

#include <stdio.h>		/* header names were lost in archiving; */
#include <stdlib.h>		/* this include list is a plausible reconstruction */
#include <unistd.h>
#include <sys/wait.h>

Re: doc review for parallel vacuum

2020-04-10 Thread Amit Kapila
On Wed, Apr 8, 2020 at 12:49 PM Masahiko Sawada
 wrote:
>
> On Tue, 7 Apr 2020 at 13:55, Justin Pryzby  wrote:
> >
>
> I don't have comments on your change other than the comments Amit
> already sent. Thank you for reviewing this part!
>

I have made the modifications as per my comments.  What do you think
about the attached?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


v4-0001-Comments-and-doc-fixes-for-commit-40d964ec99.patch
Description: Binary data


Re: Function to track shmem reinit time

2020-04-10 Thread Kyotaro Horiguchi
FWIW, I don't object to adding a feature like this (in other words,
+1), since I see the advantages Alex mentioned as valid.  (Even though
most of our customers notice a server restart via client
disconnections..)

At Wed, 8 Apr 2020 11:49:11 -0400, David Steele  wrote in 
> On 2/24/20 10:57 PM, Robert Haas wrote:
> > On Sat, Feb 22, 2020 at 10:31 PM Tom Lane  wrote:
> >> I'm still going to object to it, on the grounds that
> >>
> >> (1) it's exposing an implementation detail that clients should not be
> >> concerned with, and that we might change in future.  The name isn't
> >> even well chosen --- if the argument is that it's useful to monitor
> >> server uptime, why isn't the name "pg_server_uptime"?
> >>
> >> (2) if your server is crashing often enough that postmaster start
> >> time isn't an adequate substitute, you have way worse problems than
> >> whether you can measure it.  I'd rather see us put effort into
> >> fixing whatever the underlying problem is.
> > I don't accept the first argument, because I think the chances that
> > we'll change our basic approach to this problem are indistinguishable
> > from zero.  A few years back, you criticized some idea of mine as
> > tantamount to rm -rf src/backend/optimizer, which you were not ready
> > to endorse. That proposal was surely vastly less ambitious than some
> > proposal which would remove the idea of shared memory reinitialization
> > after an unsafe backend exit.

I'm not sure the name is good, mainly for reasons of
user-friendliness. Alexander's proposed "pg_server_up_since()",
returning an absolute time, makes sense to me.

> > I don't accept the second argument either, because it's a false
> > dichotomy. Adding this function won't reduce the amount of energy that
> > gets spent fixing crashes. There are always more crashes.
> > I realize that several other people voted against this proposal so I
> > don't think it's OK to commit a patch in this area unless some more
> > positive votes turn up, but I'm still +1 on the idea. Granted, we
> > don't want an infinite amount of clutter in the system catalogs, but I
> > have a hard time believing that this proposed function would get any
> > less use than the hyperbolic trig functions.

I'm on Robert and Alexander's side on this.  I don't think
introducing this necessarily means we are obliged to accept all
requests for "last $anything" for other events, just as introducing the
hyperbolic functions didn't oblige us to accept all similar new
functions.

> Marking this Returned with Feedback again since we are still at an
> impasse.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: doc review for v13

2020-04-10 Thread Michael Paquier
On Thu, Apr 09, 2020 at 10:01:51PM -0500, Justin Pryzby wrote:
> On Fri, Apr 10, 2020 at 11:27:46AM +0900, Michael Paquier wrote:
>>  required pages to remove both downlink and rightlink are already  locked.  
>> That
>> -evades potential right to left page locking order, which could deadlock with
>> +avoid potential right to left page locking order, which could deadlock with
>> Not sure which one is better, but the new change is grammatically
>> incorrect.
> 
> "Evades" usually means to act to avoid detection by the government.  Like tax
> evasion.

This change is from Alexander Korotkov as of 32ca32d, so I am adding
him in CC to get his opinion.

>>auto_explain.log_settings controls whether 
>> information
>> -  about modified configuration options are printed when execution plan 
>> is logged.
>> -  Only options affecting query planning with value different from the 
>> built-in
>> +  about modified configuration options is printed when an execution 
>> plan is logged.
>> +  Only those options which affect query planning and whose value 
>> differs from its built-in
>> Depends on how you read the sentence, but here it seems to me that 
>> "statistics" is the correct subject, no?
> 
> Statistics ?

Oops.  I may have messed up with a different part of the patch set.
Your suggestion is right as the subject is "information" here.
--
Michael




Re: WAL usage calculation patch

2020-04-10 Thread Amit Kapila
On Tue, Apr 7, 2020 at 2:48 PM Julien Rouhaud  wrote:
>
> On Tue, Apr 7, 2020 at 4:36 AM Amit Kapila  wrote:
> >
> > On Mon, Apr 6, 2020 at 7:58 PM Euler Taveira
> >  wrote:
> > >
> > > On Mon, 6 Apr 2020 at 10:37, Julien Rouhaud  wrote:
> > >>
> > >> On Mon, Apr 06, 2020 at 10:12:55AM -0300, Euler Taveira wrote:
> > >> > On Mon, 6 Apr 2020 at 00:25, Amit Kapila  
> > >> > wrote:
> > >> >
> > >> > >
> > >> > > I have pushed pg_stat_statements and Explain related patches.  I am
> > >> > > now looking into (auto)vacuum patch and have few comments.
> > >> > >
> > >> > > I wasn't paying much attention to this thread. May I suggest changing
> > >> > wal_num_fpw to wal_fpw? wal_records and wal_bytes do not have a 
> > >> > prefix
> > >> > 'num'. It seems inconsistent to me.
> > >> >
> > >>
> > >> If we want to be consistent shouldn't we rename it to wal_fpws?  FTR I 
> > >> don't
> > >> like much either version.
> > >
> > >
> > > Since FPW is an acronym, plural form reads better when you are using 
> > > uppercase (such as FPWs or FPW's); thus, I prefer singular form because 
> > > parameter names are lowercase. Function description will clarify that 
> > > this is "number of WAL full page writes".
> > >
> >
> > I like Euler's suggestion to change wal_num_fpw to wal_fpw.  It is
> > better if others who didn't like this name can also share their
> > opinion now because changing multiple times the same thing is not a
> > good idea.
>
> +1
>
> About Justin and your comments on the other thread:
>
> On Tue, Apr 7, 2020 at 4:31 AM Amit Kapila  wrote:
> >
> > On Mon, Apr 6, 2020 at 10:04 PM Justin Pryzby  wrote:
> > >
> > > On Thu, Apr 02, 2020 at 08:29:31AM +0200, Julien Rouhaud wrote:
> > > > > > "full page records" seems to be showing the number of full page
> > > > > > images, not the record having full page images.
> > > > >
> > > > > I am not sure what exactly the difference is, but it is the records
> > > > > having full page images.  Julien, correct me if I am wrong.
> > >
> > > > Obviously previous complaints about the meaning and parsability of
> > > > "full page writes" should be addressed here for consistency.
> > >
> > > There's a couple places that say "full page image records" which I think 
> > > is
> > > language you were trying to avoid.  It's the number of pages, not the 
> > > number of
> > > records, no ?  I see explain and autovacuum say what I think is wanted, 
> > > but
> > > these say the wrong thing?  Find attached slightly larger patch.
> > >
> > > $ git grep 'image record'
> > > contrib/pg_stat_statements/pg_stat_statements.c:int64   
> > > wal_num_fpw;/* # of WAL full page image records generated */
> > > doc/src/sgml/ref/explain.sgml:  number of records, number of full 
> > > page image records and amount of WAL
> > >
> >
> > Few comments:
> > 1.
> > - int64 wal_num_fpw; /* # of WAL full page image records generated */
> > + int64 wal_num_fpw; /* # of WAL full page images generated */
> >
> > Let's change comment as " /* # of WAL full page writes generated */"
> > to be consistent with other places like instrument.h.  Also, make a
> > similar change at other places if required.
>
> Agreed.  That's pg_stat_statements.c and instrument.h.  I'll send a
> patch once we reach consensus with the rest of the comments.
>

Would you like to send a consolidated patch that includes Euler's
suggestion and Justin's patch (making the changes for the points we
discussed)?  I think we can leave the point about the number of
spaces before each field open.

> > 2.
> >
> > -Total amount of WAL bytes generated by the statement
> > +Total number of WAL bytes generated by the statement
> >
> >
> > I feel the previous text was better as this field can give us the size
> > of WAL with which we can answer "how much WAL data is generated by a
> > particular statement?".  Julien, do you have any thoughts on this?
>
> I also prefer "amount" as it feels more natural.
>

As we see no other opinion on this matter, we can use "amount" here.
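
Putting the naming decisions together, the counters would end up looking
roughly like the sketch below, assuming the wal_fpw rename and the comment
wording agreed above (the final committed form may differ):

/* Sketch: WAL usage counters with the agreed field name and comments. */
typedef struct WalUsage
{
	long		wal_records;	/* # of WAL records generated */
	long		wal_fpw;		/* # of WAL full page writes generated */
	uint64		wal_bytes;		/* amount of WAL generated, in bytes */
} WalUsage;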

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com