Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-20 Thread Michael Paquier
On Tue, Apr 21, 2020 at 02:27:20PM +0900, Fujii Masao wrote:
> On 2020/04/21 10:59, Michael Paquier wrote:
>> With your patch, this code
>> now means that in order to finish recovery you need to send SIGUSR2 to
>> the startup process *and* to create the promote signal file.
> 
> Yes, but isn't this the same as the way to trigger fast promotion in HEAD?

Yep, but my point is that some users who have been relying only on
SIGUSR2 sent to the startup process for a promotion may be surprised
to see that doing the same operation does not trigger a promotion
anymore.
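
For reference, the supported ways to trigger a promotion remain "pg_ctl
promote" and the SQL-callable pg_promote(); both create the promote
signal file and then signal the server rather than relying on a raw
SIGUSR2.  A minimal example of the latter, with its default arguments
spelled out:

    SELECT pg_promote(wait => true, wait_seconds => 60);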
--
Michael




pgbench testing with contention scenarios

2020-04-20 Thread Amit Khandekar
Subject changed. Earlier it was: spin_delay() for ARM

On Fri, 17 Apr 2020 at 22:54, Robert Haas  wrote:
>
> On Thu, Apr 16, 2020 at 3:18 AM Amit Khandekar  wrote:
> > Not relevant to the PAUSE stuff  Note that when the parallel
> > clients reach from 24 to 32 (which equals the machine CPUs), the TPS
> > shoots from 454189 to 1097592 which is more than double speed gain
> > with just a 30% increase in parallel sessions.
For reference, the TPS can be seen here:
https://www.postgresql.org/message-id/CAJ3gD9e86GY%3DQfyfZQkb11Z%2BCVWowDiGgGThzKKwHDGU9uA2yA%40mail.gmail.com
>
> I've seen stuff like this too. For instance, check out the graph from
> this 2012 blog post:
>
> http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-about-64.html
>
> You can see that the performance growth is basically on a straight
> line up to about 16 cores, but then it kinks downward until about 28,
> after which it kinks sharply upward until about 36 cores.
>
> I think this has something to do with the process scheduling behavior
> of Linux, because I vaguely recall some discussion where somebody did
> benchmarking on the same hardware on both Linux and one of the BSD
> systems, and the effect didn't appear on BSD. They had other problems,
> like a huge drop-off at higher core counts, but they didn't have that
> effect.

Ah, I see.

By the way, I have observed this behaviour on both x86 and ARM,
regardless of whether the machine has as few as 8 CPUs or as many as 32.

In my case, though, I suspect it's a combination of the Linux scheduler
and the interactions between the backends and the pgbench clients.

So I used a custom script: the same point query that is used with the
-S option, but run again and again on the server side without the
client having to resend it each time, so the pgbench clients are idle
most of the time.

Query used:
select foo(30);
where foo(int) is defined as:
create or replace function foo(iterations int) returns int as $$
declare
   id int; ret int; counter int = 0;
begin
   WHILE counter < iterations
   LOOP
      counter = counter + 1;
      id = random() * 300;
      select into ret aid from pgbench_accounts where aid = id;
   END LOOP;
   return ret;
end $$ language plpgsql;

Below are the results for scale factor 30, with 8 CPUs:

Clients   TPS
2         1.255327
4         2.414139
6         3.532937
8         4.586583
10        4.557575
12        4.517226
14        4.551455
18        4.593271

You can see that the TPS rises almost linearly in proportion to the
number of clients, with no deviation in between, up to 8 clients, beyond
which it stops rising because the CPUs are fully utilized.

In this custom case as well, the behaviour is the same on both x86 and
ARM, regardless of whether there are 8 CPUs or 32.


--
Thanks,
-Amit Khandekar
Huawei Technologies




Re: design for parallel backup

2020-04-20 Thread Andres Freund
Hi,

On 2020-04-21 10:20:01 +0530, Amit Kapila wrote:
> It is quite likely that compression can benefit more from parallelism
> as compared to the network I/O as that is mostly a CPU intensive
> operation but I am not sure if we can just ignore the benefit of
> utilizing the network bandwidth.  In our case, after copying from the
> network we do write that data to disk, so during filesystem I/O the
> network can be used if there is some other parallel worker processing
> other parts of data.

Well, as I said, network and FS IO as done by server / pg_basebackup are
both fully buffered by the OS. Unless the OS throttles the userland
process, a large chunk of the work will be done by the kernel, in
separate kernel threads.

My workstation and my laptop can each, in a single thread, get close to
20GBit/s of network IO (bidirectional 10GBit, I don't have anything
faster - it's a thunderbolt 10GbE card), and iperf3 is at 55% CPU while
doing so. Just connecting locally it's 45Gbit/s. Or over 8GByte/s of
buffered filesystem IO. And it doesn't even have a particularly high
per-core clock speed.

I just don't see this being the bottleneck for now.


> Also, there may be some users who don't want their data to be
> compressed due to some reason like the overhead of decompression is so
> high that restore takes more time and they are not comfortable with
> that as for them faster restore is much more critical than compressed
> or fast backup.  So, for such things, the parallelism during backup
> as being discussed in this thread will still be helpful.

I am not even convinced it'll be helpful in a large fraction of
cases. The added overhead of more connections / processes isn't free.

I believe there are some cases where it'd help. E.g. if there are
multiple tablespaces on independent storage, parallelism as described
here could lead to significantly better utilization of the different
tablespaces. But that'd require sorting out the work between processes
appropriately.


> OTOH, I think without some measurements it is difficult to say that we
> have significant benefit by parallelizing the backup without compression.
> I have scanned the other thread [1] where the patch for parallel
> backup was discussed and didn't find any performance numbers, so
> probably having some performance data with that patch might give us a
> better understanding of introducing parallelism in the backup.

Agreed, we need some numbers.

Greetings,

Andres Freund




Re: WIP: WAL prefetch (another approach)

2020-04-20 Thread Thomas Munro
On Sun, Apr 19, 2020 at 11:46 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Thanks for working on this patch, it seems like a great feature. I'm
> probably a bit late to the party, but still want to make couple of
> commentaries.

Hi Dmitry,

Thanks for your feedback and your interest in this work!

> The patch indeed looks good, I couldn't find any significant issues so
> far and almost all my questions I had while reading it were actually
> answered in this thread. I'm still busy with benchmarking, mostly to see
> how prefetching would work with different workload distributions and how
> much the kernel will actually prefetch.

Cool.

One report I heard recently said that if you get rid of I/O stalls,
pread() becomes cheap enough that the much higher frequency lseek()
calls I've complained about elsewhere[1] become the main thing
recovery is doing, at least on some systems, but I haven't pieced
together the conditions required yet.  I'd be interested to know if
you see that.

> In the meantime I have a few questions:
>
> > 1.  It now uses effective_io_concurrency to control how many
> > concurrent prefetches to allow.  It's possible that we should have a
> > different GUC to control "maintenance" users of concurrency I/O as
> > discussed elsewhere[1], but I'm staying out of that for now; if we
> > agree to do that for VACUUM etc, we can change it easily here.  Note
> > that the value is percolated through the ComputeIoConcurrency()
> > function which I think we should discuss, but again that's off topic,
> > I just want to use the standard infrastructure here.
>
> This totally makes sense, I believe the question "how much to prefetch"
> eventually depends equally on a type of workload (correlates with how
> far in WAL to read) and how much resources are available for prefetching
> (correlates with queue depth). But in the documentation it looks like
> maintenance-io-concurrency is just an "unimportant" option, and I'm
> almost sure will be overlooked by many readers:
>
> The maximum distance to look ahead in the WAL during recovery, to find
> blocks to prefetch.  Prefetching blocks that will soon be needed can
> reduce I/O wait times.  The number of concurrent prefetches is limited
> by this setting as well as maintenance_io_concurrency.  Setting it too high
> might be counterproductive, if it means that data falls out of the
> kernel cache before it is needed.  If this value is specified without
> units, it is taken as bytes.  A setting of -1 disables prefetching
> during recovery.
>
> Maybe it makes also sense to emphasize that maintenance-io-concurrency
> directly affects resource consumption and it's a "primary control"?

You're right.  I will add something in the next version to emphasise that.

> > On Wed, Mar 18, 2020 at 06:18:44PM +1300, Thomas Munro wrote:
> >
> > Here's a new version that changes that part just a bit more, after a
> > brief chat with Andres about his async I/O plans.  It seems clear that
> > returning an enum isn't very extensible, so I decided to try making
> > PrefetchBufferResult a struct whose contents can be extended in the
> > future.  In this patch set it's still just used to distinguish 3 cases
> > (hit, miss, no file), but it's now expressed as a buffer and a flag to
> > indicate whether I/O was initiated.  You could imagine that the second
> > thing might be replaced by a pointer to an async I/O handle you can
> > wait on or some other magical thing from the future.
>
> I like the idea of extensible PrefetchBufferResult. Just one commentary,
> if I understand correctly the way how it is being used together with
> prefetch_queue assumes one IO operation at a time. This limits potential
> extension of the underlying code, e.g. one can't implement some sort of
> buffering of requests and submitting an iovec to a syscall, then
> prefetch_queue will no longer correctly represent inflight IO. Also,
> taking into account that "we don't have any awareness of when I/O really
> completes", maybe in the future it makes to reconsider having queue in
> the prefetcher itself and rather ask for this information from the
> underlying code?

Yeah, you're right that it'd be good to be able to do some kind of
batching up of these requests to reduce system calls.  Of course
posix_fadvise() doesn't support that, but clearly in the AIO future[2]
it would indeed make sense to buffer up a few of these and then make a
single call to io_uring_enter() on Linux[3] or lio_listio() on a
hypothetical POSIX AIO implementation[4].  (I'm not sure if there is a
thing like that on Windows; at a glance, ReadFileScatter() is
asynchronous ("overlapped") but works only on a single handle so it's
like a hypothetical POSIX aio_readv(), not like POSIX lio_listio()).

Perhaps there could be an extra call PrefetchBufferSubmit() that you'd
call at appropriate times, but you obviously can't call it too
infrequently.

As for how to make the prefetch queue a reusable component, rather
than having a 

Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-20 Thread Fujii Masao




On 2020/04/21 10:59, Michael Paquier wrote:

On Mon, Apr 20, 2020 at 03:26:16PM +0900, Fujii Masao wrote:

Patch attached. I will add this into the first CF for v14.


Thanks!


-if (IsPromoteSignaled())
+/*
+ * In 9.1 and 9.2 the postmaster unlinked the promote file inside the
+ * signal handler. It now leaves the file in place and lets the
+ * Startup process do the unlink.
+ */
+if (IsPromoteSignaled() && stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0)
  {
-/*
- * In 9.1 and 9.2 the postmaster unlinked the promote file inside the
- * signal handler. It now leaves the file in place and lets the
- * Startup process do the unlink. This allows Startup to know whether
- * it should create a full checkpoint before starting up (fallback
- * mode). Fast promotion takes precedence.
- */
-if (stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0)
-{
-unlink(PROMOTE_SIGNAL_FILE);
-unlink(FALLBACK_PROMOTE_SIGNAL_FILE);
-fast_promote = true;
-}
-else if (stat(FALLBACK_PROMOTE_SIGNAL_FILE, &stat_buf) == 0)
-{
-unlink(FALLBACK_PROMOTE_SIGNAL_FILE);
-fast_promote = false;
-}
-
  ereport(LOG, (errmsg("received promote request")));
-
+unlink(PROMOTE_SIGNAL_FILE);


Thanks for reviewing the patch!


On HEAD, this code means that it is possible to end recovery just by
sending SIGUSR2 to the startup process.


Yes, in this case, non-fast promotion is triggered.


With your patch, this code
now means that in order to finish recovery you need to send SIGUSR2 to
the startup process *and* to create the promote signal file.


Yes, but isn't this the same as the way to trigger fast promotion in HEAD?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Parallel Append can break run-time partition pruning

2020-04-20 Thread Amit Langote
Hi David,

On Tue, Apr 21, 2020 at 12:03 PM David Rowley  wrote:
> On Fri, 17 Apr 2020 at 19:08, Amit Langote  wrote:
> > On Wed, Apr 15, 2020 at 4:18 PM David Rowley  wrote:
> > > I'm a bit divided on what the correct fix is.  If I blame Parallel
> > > Append for not trying hard enough to pull up the lower Append in
> > > accumulate_append_subpath(), then clearly the parallel append code is
> > > to blame.
> >
> > I spent some time trying to understand how Append parallelism works
> > and I am tempted to agree with you that there might be problems with
> > how accumulate_append_subpath() interacts with parallelism. Maybe it
> > would be better to disregard a non-parallel-aware partial Append if it
> > requires us to fail on flattening a child Append.  I have as attached
> > a PoC fix to show that.  While a nested Append is not really a problem
> > in general, it appears to me that our run-time code is not in position
> > to work correctly with them, or at least not with how things stand
> > today...
>
> Thanks for taking a look at this. I've now looked at this in more
> detail and below is my understanding of what's going on:
>
> It seems, in this case, what's going on is, on the following line:
>
> accumulate_append_subpath(cheapest_partial_path,
>   &partial_subpaths, NULL);
>
> we don't manage to pullup the sub-Append due to passing a NULL pointer
> for the final special_subpaths argument.  This results in just taking
> the child's Append path verbatim. i.e. nested Append
>
> Later, when we do:
>
> else if (nppath == NULL ||
> (cheapest_partial_path != NULL &&
>   cheapest_partial_path->total_cost < nppath->total_cost))
> {
> /* Partial path is cheaper or the only option. */
> Assert(cheapest_partial_path != NULL);
> accumulate_append_subpath(cheapest_partial_path,
>   &pa_partial_subpaths,
>   &pa_nonpartial_subpaths);
>
> we do pass a non-NULL special_subpaths argument to allow the
> sub-Append to be pulled up.
>
> So, now we have 2 paths, one with a nested Append and one with a
> flattened Append.  Both paths have the same cost, but due to the fact
> that we call add_partial_path() for the nested Append version first,
> the logic in add_partial_path() accepts that path. However, the
> subsequent call of add_partial_path(), the one for the non-nested
> Append, is rejected due to the total cost being too similar to one of
> the existing partial paths. We just end up keeping the nested Append
> as the cheapest partial path... That path, since in the example case
> it only has a single subpath, is pulled up into the main Append by
> the logic added in 8edd0e794.
>
> I think you've realised this and that's why your PoC patch just
> rejected the first path when it's unable to do the pullup.

Right.

> We'll get a
> better path later when we allow mixed partial and non-partial paths.

Yes, but only if parallel-aware Append is allowed (pa_subpaths_valid).
So it's possible that the Append may not participate in any
parallelism whatsoever if we reject partial Append on failing to fold
a child Append, which does somewhat suck.

> (We'll never fail to do a pullup when calling
> accumulate_append_subpath() for "nppath", since that's a non-parallel
> path and accumulate_append_subpath() will always pull Append paths up
> when they're not parallel aware.)
>
> I wonder if the fix should be more something along the lines of trying
> to merge things so that we only generate a single partial path.  That way
> we wouldn't be at the mercy of the logic in add_partial_path() to
> accept or reject the path based on the order the paths are added.

So as things stand, parallel-aware partial Append (Parallel Append)
path competes with non-parallel partial Append path on cost grounds.
As far as I can see, it's only the latter that can contain among its
subpaths a (nested) Append which can be problematic.  Given that, our
choice between the two types of partial Append paths becomes based on
something that is not cost, but is that okay?

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: It is not documented that pg_promote can exit standby mode

2020-04-20 Thread Fujii Masao




On 2020/04/20 20:38, Fujii Masao wrote:



On 2020/04/18 2:46, Tom Lane wrote:

Fujii Masao  writes:

Thanks for the report and the patch! It looks good to me.
Barring any objection, I will commit this patch.


It might be worth writing "pg_promote() is called"
(adding parentheses) to make it clearer that a function is being
referred to.  No objection otherwise.


Yes. Also, Masahiro-san reported to me, off-list, that there are other places
where pg_promote is mentioned without parentheses. I think it's better to
add parentheses there. Attached is the updated version of the patch.


Pushed. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Fix for pg_statio_all_tables

2020-04-20 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Apr 21, 2020 at 02:44:45AM +0300, Alexander Korotkov wrote:
>> As a bugfix, I think this should be backpatched.  But this patch
>> requires a catalog change.  Were there similar cases before?  If so,
>> how did we resolve them?

> A backpatch can happen in such cases, see for example b6e39ca9.  In
> this case, the resolution was done with a backpatch to
> system_views.sql and the release notes include an additional note
> saying that the fix applies itself only on already-initialized
> clusters.  For other clusters, it was necessary to apply a SQL query,
> given also in the release notes, to fix the issue (just grep for 
> CVE-2017-7547 in release-9.6.sgml on the REL9_6_STABLE branch).

Yeah, but that was for a security hole.  I am doubtful that the
severity of this problem is bad enough to justify jumping through
similar hoops.  Even if we fixed it and documented it, how many
users would bother to apply the manual correction?

regards, tom lane




Re: Do we need to handle orphaned prepared transactions in the server?

2020-04-20 Thread Michael Paquier
On Mon, Apr 20, 2020 at 10:35:15PM -0400, Bruce Momjian wrote:
> On Thu, Apr 16, 2020 at 03:11:51PM -0400, Tom Lane wrote:
>> If we were going to go down the path of periodically logging warnings
>> about old prepared transactions, some single-instance background task
>> like the checkpointer would be a better place to do the work in.  But
>> I'm not really recommending that, because I agree with Robert that
>> we just plain don't want this functionality.
> 
> I thought we would just emit a warning at boot time.

That's more tricky than boot time (did you mean postmaster context?),
especially if you are starting a cluster from a base backup, as you
have no guarantee that the 2PC information is consistent just by
looking at what's on disk (some of the 2PC files may still be in
WAL records yet to be replayed).  So a natural candidate for gathering
the information wanted here would be RecoverPreparedTransactions() for
a primary, and StandbyRecoverPreparedTransactions() for a standby.
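
For what it's worth, the information is already easy to monitor from
SQL without any server-side warnings; for example (the one-day
threshold is just an arbitrary choice):

    SELECT gid, prepared, owner, database
      FROM pg_prepared_xacts
     WHERE prepared < now() - interval '1 day'
     ORDER BY prepared;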
--
Michael




Re: design for parallel backup

2020-04-20 Thread Amit Kapila
On Tue, Apr 21, 2020 at 2:40 AM Andres Freund  wrote:
>
> On 2020-04-20 16:36:16 -0400, Robert Haas wrote:
>
> > If a backup client - either current or hypothetical - is compressing
> > and encrypting, then it doesn't have either a filesystem I/O or a
> > network I/O in progress while it's doing so. You take not only the hit
> > of the time required for compression and/or encryption, but also use
> > that much less of the available network and/or I/O capacity.
>
> I don't think it's really the time for network/file I/O that's the
> issue. Sure memcpy()'ing from the kernel takes time, but compared to
> encryption/compression it's not that much.  Especially for compression,
> it's not really lack of cycles for networking that prevent a higher
> throughput, it's that after buffering a few MB there's just no point
> buffering more, given compression will plod along with 20-100MB/s.
>

It is quite likely that compression can benefit more from parallelism
as compared to the network I/O as that is mostly a CPU intensive
operation but I am not sure if we can just ignore the benefit of
utilizing the network bandwidth.  In our case, after copying from the
network we do write that data to disk, so during filesystem I/O the
network can be used if there is some other parallel worker processing
other parts of data.

Also, there may be some users who don't want their data to be
compressed due to some reason like the overhead of decompression is so
high that restore takes more time and they are not comfortable with
that as for them faster restore is much more critical than compressed
or fast backup.  So, for such things, the parallelism during backup
as being discussed in this thread will still be helpful.

OTOH, I think without some measurements it is difficult to say that we
have significant benefit by parallelizing the backup without compression.
I have scanned the other thread [1] where the patch for parallel
backup was discussed and didn't find any performance numbers, so
probably having some performance data with that patch might give us a
better understanding of introducing parallelism in the backup.

[1] - 
https://www.postgresql.org/message-id/CADM=JehKgobEknb+_nab9179HzGj=9eitzwmod2mpqr_rif...@mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Bogus documentation for bogus geometric operators

2020-04-20 Thread Tom Lane
While revising the docs for the geometric operators, I came across
these entries:

<^  Is below (allows touching)? circle '((0,0),1)' <^ circle '((0,5),1)'
>^  Is above (allows touching)? circle '((0,5),1)' >^ circle '((0,0),1)'

These have got more than a few problems:

1. There are no such operators for circles, so the examples are pure
fantasy.

2. What these operators do exist for is points (point_below, point_above
respectively) and boxes (box_below_eq, box_above_eq).  However, the
stated behavior is not what the point functions actually do:

point_below(PG_FUNCTION_ARGS)
...
PG_RETURN_BOOL(FPlt(pt1->y, pt2->y));

point_above(PG_FUNCTION_ARGS)
...
PG_RETURN_BOOL(FPgt(pt1->y, pt2->y));

So point_below would be more accurately described as "is strictly below",
so its operator name really ought to be <<|.  And point_above is "is
strictly above", so its operator name ought to be |>>.

3. The box functions do seem to be correctly documented:

box_below_eq(PG_FUNCTION_ARGS)
...
PG_RETURN_BOOL(FPle(box1->high.y, box2->low.y));

box_above_eq(PG_FUNCTION_ARGS)
...
PG_RETURN_BOOL(FPge(box1->low.y, box2->high.y));

But there are comments in the source code to the effect of

 * box_below_eq and box_above_eq are obsolete versions that (probably
 * erroneously) accept the equal-boundaries case.  Since these are not
 * in sync with the box_left and box_right code, they are deprecated and
 * not supported in the PG 8.1 rtree operator class extension.

I'm not sure how seriously to take this deprecation comment, but it
is true that box_below (<<|) and box_above (|>>) have analogs for
other data types while these functions don't.

4. Just for extra fun, these point operators are listed in some
GIST and SP-GIST opclasses; though the box ones are not, as per
that code comment.
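
To make points 2 and 3 concrete, a quick demonstration (expected
results shown in comments):

    -- point <^ point is point_below, i.e. "is strictly below":
    SELECT point '(0,0)' <^ point '(0,1)';   -- true
    SELECT point '(0,1)' <^ point '(0,1)';   -- false, touching doesn't count
    -- box <^ box is box_below_eq, which does accept the equal-boundaries case:
    SELECT box '((0,0),(1,1))' <^ box '((0,1),(1,2))';   -- true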

Perhaps it's too late in the v13 cycle to actually do anything
about this code-wise, but what should I do documentation-wise?
I'm certainly not eager to document that these operators behave
inconsistently depending on which type you're talking about.

regards, tom lane




Re: WIP/PoC for parallel backup

2020-04-20 Thread Amit Kapila
On Tue, Apr 14, 2020 at 8:07 PM Asif Rehman  wrote:

>
> I forgot to make a check for no-manifest. Fixed. Attached is the updated
> patch.
>
>
Have we done any performance testing with this patch to see the benefits?
If so, can you point me to the results? If not, then can we perform some
tests on large backups to see the benefits of this patch/idea?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: spin_delay() for ARM

2020-04-20 Thread Amit Khandekar
On Sat, 18 Apr 2020 at 03:30, Thomas Munro  wrote:
>
> On Sat, Apr 18, 2020 at 2:00 AM Ants Aasma  wrote:
> > On Thu, 16 Apr 2020 at 10:33, Pavel Stehule  wrote:
> > > what I know, pgbench cannot be used for testing spinlocks problems.
> > >
> > > Maybe you can see this issue when a) use higher number clients - 
> > > hundreds, thousands. Decrease share memory, so there will be press on 
> > > related spin lock.
> >
> > There really aren't many spinlocks left that could be tickled by a
> > normal workload. I looked for a way to trigger spinlock contention
> > when I prototyped a patch to replace spinlocks with futexes. The only
> > one that I could figure out a way to make contended was the lock
> > protecting parallel btree scan. A highly parallel index only scan on a
> > fully cached index should create at least some spinlock contention.
>
> I suspect the snapshot-too-old "mutex_threshold" spinlock can become
> contended under workloads that generate a high rate of
> heap_page_prune_opt() calls with old_snapshot_threshold enabled.  One
> way to do that is with a bunch of concurrent index scans that hit the
> heap in random order.  Some notes about that:
>
> https://www.postgresql.org/message-id/flat/CA%2BhUKGKT8oTkp5jw_U4p0S-7UG9zsvtw_M47Y285bER6a2gD%2Bg%40mail.gmail.com

Thanks all for the inputs. Will keep these two particular scenarios in
mind, and try to get some bandwidth on this soon.
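
For the record, a rough sketch of the "highly parallel index-only scan
on a fully cached index" case, assuming a pgbench database; whether
this actually produces measurable spinlock contention is something I
still need to verify:

    SET max_parallel_workers_per_gather = 16;
    SET min_parallel_index_scan_size = 0;
    SET parallel_setup_cost = 0;
    SET parallel_tuple_cost = 0;
    SET enable_seqscan = off;
    VACUUM (ANALYZE) pgbench_accounts;   -- make index-only scans possible
    SELECT count(*) FROM pgbench_accounts WHERE aid > 0;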


-- 
Thanks,
-Amit Khandekar
Huawei Technologies




Re: fixing old_snapshot_threshold's time->xid mapping

2020-04-20 Thread Dilip Kumar
On Mon, Apr 20, 2020 at 11:31 PM Robert Haas  wrote:
>
> On Mon, Apr 20, 2020 at 12:10 AM Dilip Kumar  wrote:
> > I have started reviewing these patches.  I think the fixes look
> > right to me.
> >
> > + LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
> > + mapping->head_offset = oldSnapshotControl->head_offset;
> > + mapping->head_timestamp = oldSnapshotControl->head_timestamp;
> > + mapping->count_used = oldSnapshotControl->count_used;
> > + for (int i = 0; i < OLD_SNAPSHOT_TIME_MAP_ENTRIES; ++i)
> > + mapping->xid_by_minute[i] = oldSnapshotControl->xid_by_minute[i];
> > + LWLockRelease(OldSnapshotTimeMapLock);
> >
> > I think memcpy would be a better choice instead of looping it for all
> > the entries, since we are doing this under a lock?
>
> When I did it that way, it complained about "const" and I couldn't
> immediately figure out how to fix it.

I think we can typecast to (const void *).  After the below change, I
did not get the warning.

diff --git a/contrib/old_snapshot/time_mapping.c
b/contrib/old_snapshot/time_mapping.c
index 37e0055..cc53bdd 100644
--- a/contrib/old_snapshot/time_mapping.c
+++ b/contrib/old_snapshot/time_mapping.c
@@ -94,8 +94,9 @@ GetOldSnapshotTimeMapping(void)
mapping->head_offset = oldSnapshotControl->head_offset;
mapping->head_timestamp = oldSnapshotControl->head_timestamp;
mapping->count_used = oldSnapshotControl->count_used;
-   for (int i = 0; i < OLD_SNAPSHOT_TIME_MAP_ENTRIES; ++i)
-   mapping->xid_by_minute[i] = oldSnapshotControl->xid_by_minute[i];
+   memcpy(mapping->xid_by_minute,
+  (const void *) oldSnapshotControl->xid_by_minute,
+  sizeof(TransactionId) * OLD_SNAPSHOT_TIME_MAP_ENTRIES);
LWLockRelease(OldSnapshotTimeMapLock);

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel Append can break run-time partition pruning

2020-04-20 Thread David Rowley
On Fri, 17 Apr 2020 at 19:08, Amit Langote  wrote:
>
> On Wed, Apr 15, 2020 at 4:18 PM David Rowley  wrote:
> > I'm a bit divided on what the correct fix is.  If I blame Parallel
> > Append for not trying hard enough to pull up the lower Append in
> > accumulate_append_subpath(), then clearly the parallel append code is
> > to blame.
>
> I spent some time trying to understand how Append parallelism works
> and I am tempted to agree with you that there might be problems with
> how accumulate_append_subpath() interacts with parallelism. Maybe it
> would be better to disregard a non-parallel-aware partial Append if it
> requires us to fail on flattening a child Append.  I have as attached
> a PoC fix to show that.  While a nested Append is not really a problem
> in general, it appears to me that our run-time code is not in position
> to work correctly with them, or at least not with how things stand
> today...

Thanks for taking a look at this. I've now looked at this in more
detail and below is my understanding of what's going on:

It seems, in this case, what's going on is, on the following line:

accumulate_append_subpath(cheapest_partial_path,
  &partial_subpaths, NULL);

we don't manage to pullup the sub-Append due to passing a NULL pointer
for the final special_subpaths argument.  This results in just taking
the child's Append path verbatim. i.e. nested Append

Later, when we do:

else if (nppath == NULL ||
(cheapest_partial_path != NULL &&
  cheapest_partial_path->total_cost < nppath->total_cost))
{
/* Partial path is cheaper or the only option. */
Assert(cheapest_partial_path != NULL);
accumulate_append_subpath(cheapest_partial_path,
  &pa_partial_subpaths,
  &pa_nonpartial_subpaths);

we do pass a non-NULL special_subpaths argument to allow the
sub-Append to be pulled up.

So, now we have 2 paths, one with a nested Append and one with a
flattened Append.  Both paths have the same cost, but due to the fact
that we call add_partial_path() for the nested Append version first,
the logic in add_partial_path() accepts that path. However, the
subsequent call of add_partial_path(), the one for the non-nested
Append, is rejected due to the total cost being too similar to one of
the existing partial paths. We just end up keeping the nested Append
as the cheapest partial path... That path, since in the example case
it only has a single subpath, is pulled up into the main Append by
the logic added in 8edd0e794.

I think you've realised this and that's why your PoC patch just
rejected the first path when it's unable to do the pullup. We'll get a
better path later when we allow mixed partial and non-partial paths.

(We'll never fail to do a pullup when calling
accumulate_append_subpath() for "nppath", since that's a non-parallel
path and accumulate_append_subpath() will always pull Append paths up
when they're not parallel aware.)

I wonder if the fix should be more something along the lines of trying
to merge things so that we only generate a single partial path.  That way
we wouldn't be at the mercy of the logic in add_partial_path() to
accept or reject the path based on the order the paths are added.

David




Re: Do we need to handle orphaned prepared transactions in the server?

2020-04-20 Thread Bruce Momjian
On Thu, Apr 16, 2020 at 03:11:51PM -0400, Tom Lane wrote:
> If we were going to go down the path of periodically logging warnings
> about old prepared transactions, some single-instance background task
> like the checkpointer would be a better place to do the work in.  But
> I'm not really recommending that, because I agree with Robert that
> we just plain don't want this functionality.

I thought we would just emit a warning at boot time.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: 001_rep_changes.pl stalls

2020-04-20 Thread Michael Paquier
On Mon, Apr 20, 2020 at 07:24:28PM +0900, Fujii Masao wrote:
> I was misreading this as something like "any other blocking than
> the blocking in WalSndLoop()". Ok, I have no more comments on
> the patch.

Patch looks rather sane to me at quick glance.  I can see that WAL
senders are now not stuck at 100% CPU per process when sitting idle, 
for both physical and logical replication.  Thanks.
--
Michael




Re: fixing old_snapshot_threshold's time->xid mapping

2020-04-20 Thread Thomas Munro
On Mon, Apr 20, 2020 at 8:02 PM Dilip Kumar  wrote:
> On Mon, Apr 20, 2020 at 12:29 PM Thomas Munro  wrote:
> > I mean I want to verify that VACUUM or heap prune actually removed a
> > tuple that was visible to an old snapshot.  An idea I just had: maybe
> > sto_using_select.spec should check the visibility map (somehow).  For
> > example, the sto_using_select.spec (the version in the patch I just
> > posted) just checks that after time 00:11, the old snapshot gets a
> > snapshot-too-old error.  Perhaps we could add a VACUUM before that,
> > and then check that the page has become all visible, meaning that the
> > dead tuple our snapshot could see has now been removed.
>
> Okay, got your point.  Can we try to implement some test functions
> that can just call visibilitymap_get_status function internally?  I
> agree that we will have to pass the correct block number but that we
> can find using TID.   Or for testing, we can create a very small
> relation that just has 1 block?

I think it's enough to check SELECT EVERY(all_visible) FROM
pg_visibility_map('sto1'::regclass).  I realised that
src/test/module/snapshot_too_old is allowed to install
contrib/pg_visibility with EXTRA_INSTALL, so here's a new version to
try that idea.  It extends sto_using_select.spec to VACUUM and check
the vis map at key times.  That allows us to check that early pruning
really happens once the snapshot becomes too old.  There are other
ways you could check that but this seems quite "light touch" compared
to something based on page inspection.
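
Spelled out, the assertion is along these lines (run from the spec
after the VACUUM step):

    CREATE EXTENSION IF NOT EXISTS pg_visibility;
    SELECT every(all_visible) FROM pg_visibility_map('sto1'::regclass);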

I also changed src/test/module/snapshot_too_old/t/001_truncate.pl back
to using Robert's contrib/old_snapshot extension to know the size of
the time/xid map, allowing an introspection function to be dropped
from test_sto.c.

As before, these two apply on top of Robert's patches (or at least his
0001 and 0002).
From 28f81e93e8058d64c80d2b2b605a57f48ceebc0b Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 20 Apr 2020 16:23:02 +1200
Subject: [PATCH v4 4/5] Rewrite the "snapshot_too_old" tests.

Previously the snapshot too old feature used a special test value for
old_snapshot_threshold.  Instead, use a new approach based on explicitly
advancing the "current" timestamp used in snapshot-too-old book keeping,
so that we can control the timeline precisely without having to resort
to sleeping or special test branches in the code.

Also check that early pruning actually happens, by vacuuming and
inspecting the visibility map at key points in the test schedule.
---
 src/backend/utils/time/snapmgr.c  |  24 ---
 src/test/modules/snapshot_too_old/Makefile|  23 +--
 .../expected/sto_using_cursor.out |  75 -
 .../expected/sto_using_hash_index.out |  26 ++-
 .../expected/sto_using_select.out | 157 +++---
 .../specs/sto_using_cursor.spec   |  30 ++--
 .../specs/sto_using_hash_index.spec   |  19 ++-
 .../specs/sto_using_select.spec   |  50 --
 src/test/modules/snapshot_too_old/sto.conf|   2 +-
 .../snapshot_too_old/test_sto--1.0.sql|  14 ++
 src/test/modules/snapshot_too_old/test_sto.c  |  74 +
 .../modules/snapshot_too_old/test_sto.control |   5 +
 12 files changed, 366 insertions(+), 133 deletions(-)
 create mode 100644 src/test/modules/snapshot_too_old/test_sto--1.0.sql
 create mode 100644 src/test/modules/snapshot_too_old/test_sto.c
 create mode 100644 src/test/modules/snapshot_too_old/test_sto.control

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 72b2c61a07..19e6c52b80 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1739,26 +1739,6 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 		update_ts = oldSnapshotControl->next_map_update;
 		SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
 
-		/*
-		 * Zero threshold always overrides to latest xmin, if valid.  Without
-		 * some heuristic it will find its own snapshot too old on, for
-		 * example, a simple UPDATE -- which would make it useless for most
-		 * testing, but there is no principled way to ensure that it doesn't
-		 * fail in this way.  Use a five-second delay to try to get useful
-		 * testing behavior, but this may need adjustment.
-		 */
-		if (old_snapshot_threshold == 0)
-		{
-			if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
-&& TransactionIdFollows(latest_xmin, xlimit))
-xlimit = latest_xmin;
-
-			ts -= 5 * USECS_PER_SEC;
-			SetOldSnapshotThresholdTimestamp(ts, xlimit);
-
-			return xlimit;
-		}
-
 		ts = AlignTimestampToMinuteBoundary(ts)
 			- (old_snapshot_threshold * USECS_PER_MINUTE);
 
@@ -1860,10 +1840,6 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 	if (!map_update_required)
 		return;
 
-	/* No further tracking needed for 0 (used for testing). */
-	if (old_snapshot_threshold == 0)
-		return;
-
 	/*
 	 * We don't want to do something stupid with unusual values, but we 

Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-20 Thread Michael Paquier
On Mon, Apr 20, 2020 at 03:26:16PM +0900, Fujii Masao wrote:
> Patch attached. I will add this into the first CF for v14.

Thanks!

> -if (IsPromoteSignaled())
> +/*
> + * In 9.1 and 9.2 the postmaster unlinked the promote file inside the
> + * signal handler. It now leaves the file in place and lets the
> + * Startup process do the unlink.
> + */
> +if (IsPromoteSignaled() && stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0)
>  {
> -/*
> - * In 9.1 and 9.2 the postmaster unlinked the promote file inside the
> - * signal handler. It now leaves the file in place and lets the
> - * Startup process do the unlink. This allows Startup to know whether
> - * it should create a full checkpoint before starting up (fallback
> - * mode). Fast promotion takes precedence.
> - */
> -if (stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0)
> -{
> -unlink(PROMOTE_SIGNAL_FILE);
> -unlink(FALLBACK_PROMOTE_SIGNAL_FILE);
> -fast_promote = true;
> -}
> -else if (stat(FALLBACK_PROMOTE_SIGNAL_FILE, &stat_buf) == 0)
> -{
> -unlink(FALLBACK_PROMOTE_SIGNAL_FILE);
> -fast_promote = false;
> -}
> -
>  ereport(LOG, (errmsg("received promote request")));
> -
> +unlink(PROMOTE_SIGNAL_FILE);

On HEAD, this code means that it is possible to end recovery just by
sending SIGUSR2 to the startup process.  With your patch, this code
now means that in order to finish recovery you need to send SIGUSR2 to
the startup process *and* to create the promote signal file.  Is that
really what you want?
--
Michael




Re: Adding missing object access hook invocations

2020-04-20 Thread Michael Paquier
On Mon, Apr 20, 2020 at 01:32:31PM -0400, Alvaro Herrera wrote:
> I think it's fine to put this in at this time.  It's not a new feature.
> The only thing this needs is to go through a new release cycle so that
> people can adjust to the new hook invocations as necessary.

Okay.  Any other opinions?  I am in a 50/50 state about that stuff.
--
Michael




Re: Fix for pg_statio_all_tables

2020-04-20 Thread Michael Paquier
On Tue, Apr 21, 2020 at 02:44:45AM +0300, Alexander Korotkov wrote:
> Among all the joined tables, only "pg_index I" is expected to have
> multiple rows associated with single relation.  But we do sum() for
> toast index "pg_index X" as well.  As the result, we multiply
> statistics for toast index by the number of relation indexes.  This is
> obviously wrong.

Oops.

> As a bugfix, I think this should be backpatched.  But this patch
> requires a catalog change.  Were there similar cases before?  If so,
> how did we resolve them?

A backpatch can happen in such cases, see for example b6e39ca9.  In
this case, the resolution was done with a backpatch to
system_views.sql and the release notes include an additional note
saying that the fix applies itself only on already-initialized
clusters.  For other clusters, it was necessary to apply a SQL query,
given also in the release notes, to fix the issue (just grep for 
CVE-2017-7547 in release-9.6.sgml on the REL9_6_STABLE branch).
--
Michael




Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-04-20 Thread James Coleman
Over in "execExprInterp() questions / How to improve scalar array op
expr eval?" [1] I'd mused about how we might be able to optimize
scalar array ops with OR'd semantics.

This patch implements a binary search for such expressions when the
array argument is a constant so that we can avoid needing to teach
expression execution to cache stable values or know when a param has
changed.

The speed-up for the target case can be pretty impressive: in my
admittedly contrived and relatively unscientific test with a query in
the form:

select count(*) from generate_series(1,10) n(i) where i in (<1000
random integers in the series>)

shows ~30ms for the patch versus ~640ms on master.
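
In case it's useful, one way to generate a query of that shape (the
series bound and list length here are arbitrary, not the exact values I
used):

    SELECT 'select count(*) from generate_series(1, 1000000) n(i) '
           || 'where i in (' || string_agg((1 + random() * 999999)::int::text, ',') || ')'
      FROM generate_series(1, 1000);

The resulting query text can then be executed, e.g. with psql's \gexec.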

James

[1]: 
https://www.postgresql.org/message-id/flat/CAAaqYe-UQBba7sScrucDOyHb7cDoNbWf_rcLrOWeD4ikP3_qTQ%40mail.gmail.com
From 08742543d7865d5f25c24c26bf1014924035c9eb Mon Sep 17 00:00:00 2001
From: jcoleman 
Date: Fri, 10 Apr 2020 21:40:50 +
Subject: [PATCH v1] Binary search const arrays in OR'd ScalarArrayOps

Currently all scalar array op expressions execute as a linear search
through the array argument. However when OR semantics are desired it's
possible to instead use a binary search. Here we apply that optimization
to constant arrays (so we don't need to worry about teaching expression
execution when params change) of at least length 9 (since very short
arrays average to the same number of comparisons for linear searches and
thus avoid the preprocessing necessary for a binary search).
---
 src/backend/executor/execExpr.c   |  79 +++--
 src/backend/executor/execExprInterp.c | 193 ++
 src/include/executor/execExpr.h   |  14 ++
 src/test/regress/expected/expressions.out |  39 +
 src/test/regress/sql/expressions.sql  |  11 ++
 5 files changed, 326 insertions(+), 10 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index c6a77bd66f..c202cc7e89 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -49,6 +49,7 @@
 #include "utils/lsyscache.h"
 #include "utils/typcache.h"
 
+#define MIN_ARRAY_SIZE_FOR_BINARY_SEARCH 9
 
 typedef struct LastAttnumInfo
 {
@@ -947,11 +948,13 @@ ExecInitExprRec(Expr *node, ExprState *state,
 		case T_ScalarArrayOpExpr:
 			{
 ScalarArrayOpExpr *opexpr = (ScalarArrayOpExpr *) node;
+Oid			func;
 Expr	   *scalararg;
 Expr	   *arrayarg;
 FmgrInfo   *finfo;
 FunctionCallInfo fcinfo;
 AclResult	aclresult;
+bool		useBinarySearch = false;
 
 Assert(list_length(opexpr->args) == 2);
 scalararg = (Expr *) linitial(opexpr->args);
@@ -964,12 +967,58 @@ ExecInitExprRec(Expr *node, ExprState *state,
 if (aclresult != ACLCHECK_OK)
 	aclcheck_error(aclresult, OBJECT_FUNCTION,
    get_func_name(opexpr->opfuncid));
-InvokeFunctionExecuteHook(opexpr->opfuncid);
 
 /* Set up the primary fmgr lookup information */
 finfo = palloc0(sizeof(FmgrInfo));
 fcinfo = palloc0(SizeForFunctionCallInfo(2));
-fmgr_info(opexpr->opfuncid, finfo);
+func = opexpr->opfuncid;
+
+/*
+ * If we have a constant array and want OR semantics, then we
+ * can use a binary search to implement the op instead of
+ * looping through the entire array for each execution.
+ */
+if (opexpr->useOr && arrayarg && IsA(arrayarg, Const) &&
+	!((Const *) arrayarg)->constisnull)
+{
+	Datum		arrdatum = ((Const *) arrayarg)->constvalue;
+	ArrayType  *arr = (ArrayType *) DatumGetPointer(arrdatum);
+	Oid			orderingOp;
+	Oid			orderingFunc;
+	Oid			opfamily;
+	Oid			opcintype;
+	int16		strategy;
+	int			nitems;
+
+	/*
+	 * Only do the optimization if we have a large enough
+	 * array to make it worth it.
+	 */
+	nitems = ArrayGetNItems(ARR_NDIM(arr), ARR_DIMS(arr));
+	if (nitems >= MIN_ARRAY_SIZE_FOR_BINARY_SEARCH)
+	{
+		/*
+		 * Find the ordering op that matches the originally
+		 * planned equality op.
+		 */
+		orderingOp = get_ordering_op_for_equality_op(opexpr->opno, NULL);
+		get_ordering_op_properties(orderingOp, &opfamily, &opcintype, &strategy);
+		orderingFunc = get_opfamily_proc(opfamily, opcintype, opcintype, BTORDER_PROC);
+
+		/*
+		 * But we may not have one, so fall back to the
+		 * default implementation if necessary.
+		 */
+		if (OidIsValid(orderingFunc))
+		{
+			useBinarySearch = true;
+			func = orderingFunc;
+		}
+	}
+}
+
+InvokeFunctionExecuteHook(func);
+fmgr_info(func, finfo);
 fmgr_info_set_expr((Node *) node, finfo);
 InitFunctionCallInfoData(*fcinfo, finfo, 2,
 		 opexpr->inputcollid, NULL, NULL);
@@ -981,18 +1030,28 @@ ExecInitExprRec(Expr *node, ExprState *state,
 /*
  * Evaluate array argument into our return value.  There's no
  * danger in that, because the return value is guaranteed to
- * be overwritten by EEOP_SCALARARRAYOP, and will not be
- * 

Fix for pg_statio_all_tables

2020-04-20 Thread Alexander Korotkov
Hi!

It appears that definition of pg_statio_all_tables has bug.

CREATE VIEW pg_statio_all_tables AS
SELECT
C.oid AS relid,
N.nspname AS schemaname,
C.relname AS relname,
pg_stat_get_blocks_fetched(C.oid) -
pg_stat_get_blocks_hit(C.oid) AS heap_blks_read,
pg_stat_get_blocks_hit(C.oid) AS heap_blks_hit,
sum(pg_stat_get_blocks_fetched(I.indexrelid) -
pg_stat_get_blocks_hit(I.indexrelid))::bigint AS
idx_blks_read,
sum(pg_stat_get_blocks_hit(I.indexrelid))::bigint AS idx_blks_hit,
pg_stat_get_blocks_fetched(T.oid) -
pg_stat_get_blocks_hit(T.oid) AS toast_blks_read,
pg_stat_get_blocks_hit(T.oid) AS toast_blks_hit,
sum(pg_stat_get_blocks_fetched(X.indexrelid) -
pg_stat_get_blocks_hit(X.indexrelid))::bigint AS
tidx_blks_read,
sum(pg_stat_get_blocks_hit(X.indexrelid))::bigint AS tidx_blks_hit
FROM pg_class C LEFT JOIN
pg_index I ON C.oid = I.indrelid LEFT JOIN
pg_class T ON C.reltoastrelid = T.oid LEFT JOIN
pg_index X ON T.oid = X.indrelid
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE C.relkind IN ('r', 't', 'm')
GROUP BY C.oid, N.nspname, C.relname, T.oid, X.indrelid;

Among all the joined tables, only "pg_index I" is expected to have
multiple rows associated with single relation.  But we do sum() for
toast index "pg_index X" as well.  As the result, we multiply
statistics for toast index by the number of relation indexes.  This is
obviously wrong.
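
A hypothetical query to observe the multiplication (table name made
up): with the current view definition, tidx_blks_hit comes out as the
raw toast index counter multiplied by the number of indexes on the
relation.

    SELECT s.relname,
           s.tidx_blks_hit,
           pg_stat_get_blocks_hit(x.indexrelid) AS raw_toast_idx_blks_hit,
           (SELECT count(*) FROM pg_index i WHERE i.indrelid = c.oid) AS rel_index_count
      FROM pg_statio_all_tables s
      JOIN pg_class c ON c.oid = s.relid
      JOIN pg_class t ON t.oid = c.reltoastrelid
      JOIN pg_index x ON x.indrelid = t.oid
     WHERE s.relname = 'some_table';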

Attached patch fixes the view definition to count toast index statistics once.

As a bugfix, I think this should be backpatched.  But this patch
requires a catalog change.  Were there similar cases before?  If so,
how did we resolve them?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


fix_pg_statio_all_tables.patch
Description: Binary data


Re: DETACH PARTITION and FOR EACH ROW triggers on partitioned tables

2020-04-20 Thread Alvaro Herrera
> + deleteDependencyRecordsFor(TriggerRelationId,
> + pg_trigger->oid,
> + false);
> + deleteDependencyRecordsFor(RelationRelationId,
> + pg_trigger->oid,
> + false);
> +
> + CommandCounterIncrement();
> + ObjectAddressSet(object, TriggerRelationId, 
> pg_trigger->oid);
> + performDeletion(&object, DROP_RESTRICT, 
> PERFORM_DELETION_INTERNAL);
> + }
> +
> + systable_endscan(scan);
> + table_close(tgrel, RowExclusiveLock);
> + }

Two small issues here.  First, your second call to
deleteDependencyRecordsFor did nothing, because your first call deleted
all the dependency records.  I changed that to two
deleteDependencyRecordsForClass() calls, that actually do what you
intended.

The other is that instead of deleting each trigger, we can accumulate
them to delete with a single performMultipleDeletions call; this also
means we get to do CommandCounterIncrement just once.

v6 fixes those things and AFAICS is ready to push.

I haven't reviewed your 0002 carefully, but (as inventor of the "TABLE
t" marker for FK constraints) I agree with Amit that we should imitate
that instead of coming up with a new way to show it.
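
For reference, the behaviour the patch defines can be illustrated like
this (object names made up):

    CREATE TABLE parted (a int) PARTITION BY LIST (a);
    CREATE TABLE part1 PARTITION OF parted FOR VALUES IN (1);
    CREATE FUNCTION trig_fn() RETURNS trigger LANGUAGE plpgsql
      AS $$ BEGIN RETURN NEW; END $$;
    CREATE TRIGGER parted_trig AFTER INSERT ON parted
      FOR EACH ROW EXECUTE FUNCTION trig_fn();
    ALTER TABLE parted DETACH PARTITION part1;
    -- with the patch, the trigger that was cloned to part1 is gone:
    SELECT tgname FROM pg_trigger
     WHERE tgrelid = 'part1'::regclass AND NOT tgisinternal;  -- zero rows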

-- 
Álvaro Herrera            https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From b8aeae162e03ccd0346212e19ae2d75ec6495288 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 20 Apr 2020 18:39:59 -0400
Subject: [PATCH v6] Fix detaching tables with inherited row triggers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The old behavior is buggy, and the intended behavior was not previously
documented.  So define the behavior that the trigger is removed if the table is
detached.  It is an error if a table being attached already has a trigger of
the same.  This differs from the behavior for indexes and constraints.

Should backpatch to v11 with s/OidIsValid(tgparentid)/isPartitionTrigger/.

See also: 86f575948c77

Author: Justin Pryzby 
Reviewed-by: Amit Langote 
Reviewed-by: Álvaro Herrera 
Discussion: https://www.postgresql.org/message-id/flat/20200408152412.GZ2228%40telsasoft.com
---
 doc/src/sgml/ref/create_trigger.sgml   |  1 +
 src/backend/commands/tablecmds.c   | 66 ++
 src/test/regress/expected/triggers.out | 45 ++
 src/test/regress/sql/triggers.sql  | 21 
 4 files changed, 133 insertions(+)

diff --git a/doc/src/sgml/ref/create_trigger.sgml b/doc/src/sgml/ref/create_trigger.sgml
index bde3a63f90..303881fcfb 100644
--- a/doc/src/sgml/ref/create_trigger.sgml
+++ b/doc/src/sgml/ref/create_trigger.sgml
@@ -526,6 +526,7 @@ UPDATE OF column_name1 [, column_name2INSTEAD OF.
   
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 037d457c3d..3ebbd5d013 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -547,6 +547,7 @@ static void QueuePartitionConstraintValidation(List **wqueue, Relation scanrel,
 			   List *partConstraint,
 			   bool validate_default);
 static void CloneRowTriggersToPartition(Relation parent, Relation partition);
+static void DropClonedTriggersFromPartition(Oid partitionId);
 static ObjectAddress ATExecDetachPartition(Relation rel, RangeVar *name);
 static ObjectAddress ATExecAttachPartitionIdx(List **wqueue, Relation rel,
 			  RangeVar *name);
@@ -16797,6 +16798,9 @@ ATExecDetachPartition(Relation rel, RangeVar *name)
 	}
 	table_close(classRel, RowExclusiveLock);
 
+	/* Drop any triggers that were cloned on creation/attach. */
+	DropClonedTriggersFromPartition(RelationGetRelid(partRel));
+
 	/*
 	 * Detach any foreign keys that are inherited.  This includes creating
 	 * additional action triggers.
@@ -16881,6 +16885,68 @@ ATExecDetachPartition(Relation rel, RangeVar *name)
 	return address;
 }
 
+/*
+ * DropClonedTriggersFromPartition
+ *		subroutine for ATExecDetachPartition to remove any triggers that were
+ *		cloned to the partition when it was created-as-partition or attached.
+ *		This undoes what CloneRowTriggersToPartition did.
+ */
+static void
+DropClonedTriggersFromPartition(Oid partitionId)
+{
+	ScanKeyData skey;
+	SysScanDesc	scan;
+	HeapTuple	trigtup;
+	Relation	tgrel;
+	ObjectAddresses *objects;
+
+	objects = new_object_addresses();
+
+	/*
+	 * Scan pg_trigger to search for all triggers on this rel.
+	 */
+	ScanKeyInit(&skey, Anum_pg_trigger_tgrelid, BTEqualStrategyNumber,
+F_OIDEQ, ObjectIdGetDatum(partitionId));
+	tgrel = table_open(TriggerRelationId, RowExclusiveLock);
+	scan = systable_beginscan(tgrel, TriggerRelidNameIndexId,
+			  true, NULL, 1, &skey);
+	while (HeapTupleIsValid(trigtup = systable_getnext(scan)))
+	{
+		

Re: Poll: are people okay with function/operator table redesign?

2020-04-20 Thread Alvaro Herrera
On 2020-Apr-20, Tom Lane wrote:

> Alvaro Herrera  writes:
> > There's one with a separate column for the operator, without types, at
> > the left (the "with names" example at
> > https://postgr.es/m/14380.1587242...@sss.pgh.pa.us ).  That seemed
> > pretty promising -- not sure why it was discarded.
> 
> Well, I wouldn't say it was discarded --- but there sure wasn't
> a groundswell of support.

Ah.

> Looking at it again, I'd be inclined not to bother with the
> morerows trick but just to have an operator name entry in each row.
> This table is a bit of an outlier anyway, I'm finding --- very few
> of the operator tables have multiple entries per operator name.

No disagreement here.  'morerows' attribs are always a messy business.

-- 
Álvaro Herrera            https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Poll: are people okay with function/operator table redesign?

2020-04-20 Thread Tom Lane
Alvaro Herrera  writes:
> There's one with a separate column for the operator, without types, at
> the left (the "with names" example at
> https://postgr.es/m/14380.1587242...@sss.pgh.pa.us ).  That seemed
> pretty promising -- not sure why it was discarded.

Well, I wouldn't say it was discarded --- but there sure wasn't
a groundswell of support.

Looking at it again, I'd be inclined not to bother with the
morerows trick but just to have an operator name entry in each row.
This table is a bit of an outlier anyway, I'm finding --- very few
of the operator tables have multiple entries per operator name.

regards, tom lane




Re: new heapcheck contrib module

2020-04-20 Thread Peter Geoghegan
On Mon, Apr 20, 2020 at 1:40 PM Robert Haas  wrote:
> On Mon, Apr 20, 2020 at 4:30 PM Andres Freund  wrote:
> > A few billion CLogTruncationLock acquisitions in short order will likely
> > have at least as big an impact as ShareUpdateExclusiveLock held for the
> > duration of the check. That's not really a relevant concern for
> > txid_status().  Per-tuple lock acquisitions aren't great.
>
> Yeah, that's true. Doing it for every tuple is going to be too much, I
> think. I was hoping we could avoid that.

What about the visibility map? It would be nice if pg_visibility was
merged into amcheck, since it mostly provides integrity checking for
the visibility map. Maybe we could just merge the functions that
perform verification, and leave other functions (like
pg_truncate_visibility_map()) where they are. We could keep the
current interface for functions like pg_check_visible(), but also
allow the same verification to occur in passing, as part of a higher
level check.

It wouldn't be so bad if pg_visibility was an expert-only tool. But
ISTM that the verification performed by code like
collect_corrupt_items() could easily take place at the same time as
the new checks that Mark proposes. Possibly only some of the time. It
can work in a totally additive way. (Though like Andres I don't really
like the current "helper" functions used to iterate through a heap
relation; they seem like they'd make this harder.)
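
For context, as things stand those checks live in two extensions and
have to be run separately; something along these lines (object names
made up):

    CREATE EXTENSION IF NOT EXISTS amcheck;
    CREATE EXTENSION IF NOT EXISTS pg_visibility;
    SELECT bt_index_check('some_index'::regclass, heapallindexed => true);
    -- TIDs that are not all-visible but sit on pages marked all-visible in the VM:
    SELECT * FROM pg_check_visible('some_table'::regclass);
    -- likewise for the all-frozen bit:
    SELECT * FROM pg_check_frozen('some_table'::regclass);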

-- 
Peter Geoghegan




Re: Poll: are people okay with function/operator table redesign?

2020-04-20 Thread Alvaro Herrera
On 2020-Apr-20, Tom Lane wrote:

> Victor Yegorov  writes:
> > While table 9.5 with functions looks quite nice, I quite dislike 9.4 with
> > operators.
> > Previously, I could lookup operator in the leftmost column and read on.
> > Right now I have to look through the whole table (well, not really, but
> > still) to find the operator.
> 
> Aside from the alternatives already discussed,

There's one with a separate column for the operator, without types, at
the left (the "with names" example at
https://postgr.es/m/14380.1587242...@sss.pgh.pa.us ).  That seemed
pretty promising -- not sure why it was discarded.

-- 
Álvaro Herrera            https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: new heapcheck contrib module

2020-04-20 Thread Peter Geoghegan
On Mon, Apr 20, 2020 at 12:40 PM Mark Dilger
 wrote:
> Ok, I'll work in that direction and repost when I have something along those 
> lines.

Great, thanks!

It also occurs to me that the B-Tree checks that amcheck already has
have one remaining blindspot: While the heapallindexed verification
option has the ability to detect the absence of an index tuple that
the dummy CREATE INDEX that we perform under the hood says should be
in the index, it cannot do the opposite: It cannot detect the presence
of a malformed tuple that shouldn't be there at all, unless the index
tuple itself is corrupt. That could miss an inconsistent page image
when a few tuples have been VACUUMed away, but still appear in the
index.

In order to do that, we'd have to have something a bit like the
validate_index() heap scan that CREATE INDEX CONCURRENTLY uses. We'd
have to get a list of heap TIDs that any index tuple might be pointing
to, and then make sure that there were no TIDs in the index that were
not in that list -- tuples that were pointing to nothing in the heap
at all. This could use the index_bulk_delete() interface. This is the
kind of verification option that I might work on for debugging
purposes, but not the kind of thing I could really recommend to
ordinary users outside of exceptional cases. This is the kind of thing
that argues for more or less providing all of the verification
functionality we have through both high level and low level
interfaces. This isn't likely to be all that valuable most of the
time, and users shouldn't have to figure that out for themselves the
hard way. (BTW, I think that this could be implemented in an
index-AM-agnostic way, so perhaps you can consider adding it
too, if you have time.)
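
As a rough illustration of how the index_bulk_delete() route might look,
here is a minimal sketch; the CheckState struct and the tid_set_contains()
helper are hypothetical stand-ins for whatever structure collects the heap
TIDs beforehand:

/* Hypothetical state: the set of heap TIDs collected by the heap pass. */
typedef struct CheckState
{
	void	   *heap_tids;		/* e.g. a sorted array of known-good TIDs */
	int64		missing;		/* index TIDs with no matching heap tuple */
} CheckState;

/*
 * IndexBulkDeleteCallback, invoked once per heap TID stored in the index.
 * We always return false so that nothing is actually deleted; the call is
 * only used to walk every TID the index contains.
 */
static bool
check_index_tid_callback(ItemPointer itemptr, void *state)
{
	CheckState *cs = (CheckState *) state;

	if (!tid_set_contains(cs->heap_tids, itemptr))	/* hypothetical helper */
	{
		cs->missing++;
		ereport(WARNING,
				(errmsg("index TID (%u,%u) points to no heap tuple",
						ItemPointerGetBlockNumber(itemptr),
						ItemPointerGetOffsetNumber(itemptr))));
	}

	return false;				/* never delete anything */
}

The scan itself would then be driven by something like
index_bulk_delete(&ivinfo, NULL, check_index_tid_callback, &cs), with
ivinfo set up much the same way VACUUM does it.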

One last thing for now: take a look at amcheck's
bt_tuple_present_callback() function. It has comments about HOT chain
corruption that you may find interesting. Note that this check played
a role in the "freeze the dead" corruption bug [1] -- it detected that
our initial fix for that was broken. It seems like it would be a good
idea to go back through the reproducers we've seen for some of the
more memorable corruption bugs, and actually make sure that your tool
detects them where that isn't clear. History doesn't repeat itself,
but it often rhymes.

[1] 
https://postgr.es/m/cah2-wznm4rcrhfaiwkpwtpew2bxdtgrozk7jwwgucxeh3d1...@mail.gmail.com
-- 
Peter Geoghegan




Re: design for parallel backup

2020-04-20 Thread Andres Freund
Hi,

On 2020-04-20 16:36:16 -0400, Robert Haas wrote:
> My suspicion is that it has mostly to do with adequately utilizing the
> hardware resources on the server side. If you are network-constrained,
> adding more connections won't help, unless there's something shaping
> the traffic which can be gamed by having multiple connections.
> However, as things stand today, at any given point in time the base
> backup code on the server will EITHER be attempting a single
> filesystem I/O or a single network I/O, and likewise for the client.

Well, kinda, but not really. Both file reads (server)/writes(client) and
network send(server)/recv(client) are buffered by the OS, and the file
IO is entirely sequential.

That's not true for checksum computations / compressions to the same
degree. They're largely bottlenecked in userland, without the kernel
doing as much async work.


> If a backup client - either current or hypothetical - is compressing
> and encrypting, then it doesn't have either a filesystem I/O or a
> network I/O in progress while it's doing so. You take not only the hit
> of the time required for compression and/or encryption, but also use
> that much less of the available network and/or I/O capacity.

I don't think it's really the time for network/file I/O that's the
issue. Sure memcpy()'ing from the kernel takes time, but compared to
encryption/compression it's not that much.  Especially for compression,
it's not really lack of cycles for networking that prevents a higher
throughput; it's that after buffering a few MB there's just no point
buffering more, given compression will plod along with 20-100MB/s.


> While I agree that some of these problems could likely be addressed in
> other ways, parallelism seems to offer an approach that could solve
> multiple issues at the same time. If you want to address it without
> that, you need asynchronous filesystem I/O and asynchronous network
> I/O and both of those on both the client and server side, plus
> multithreaded compression and multithreaded encryption and maybe some
> other things. That sounds pretty hairy and hard to get right.

I'm not really convinced. You're complicating the wire protocol by
having multiple tar files with overlapping contents. With the
consequence that clients need additional logic to deal with that. We'll
not get one manifest, but multiple ones, etc.

We already do network IO in a non-blocking way, and since we leave the
copying to the kernel, the kernel does the actual network work
asynchronously. Except at file boundaries the kernel does asynchronous
read IO for us (but we should probably hint it to do that even at the
start of a new file).

I think we're quite a bit away from where we need to worry about making
encryption multi-threaded:
andres@awork3:~/src/postgresql$ openssl speed -evp aes-256-ctr
Doing aes-256-ctr for 3s on 16 size blocks: 81878709 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 64 size blocks: 71062203 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 256 size blocks: 31738391 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 1024 size blocks: 10043519 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 8192 size blocks: 1346933 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 16384 size blocks: 674680 aes-256-ctr's in 3.00s
OpenSSL 1.1.1f  31 Mar 2020
built on: Tue Mar 31 21:59:59 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g 
-O2 -fdebug-prefix-map=/build/openssl-hsg853/openssl-1.1.1f=. 
-fstack-protector-strong -Wformat -Werror=format-security 
-DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ 
-DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 
-DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM 
-DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM 
-DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type            16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-ctr    436686.45k  1515993.66k  2708342.70k  3428187.82k  3678025.05k  3684652.37k


So that really just leaves compression (and perhaps cryptographic
checksumming). Given that we can provide nearly all of the benefits of
multi-stream parallelism in a compatible way by using
parallelism/threads at that level, I just have a hard time believing the
complexity of doing those tasks in parallel is bigger than multi-stream
parallelism. And I'd be fairly unsurprised if you'd end up with a lot
more "bubbles" in the pipeline when using multi-stream parallelism.

Greetings,

Andres Freund




Re: new heapcheck contrib module

2020-04-20 Thread Robert Haas
On Mon, Apr 20, 2020 at 4:30 PM Andres Freund  wrote:
> A few billion CLogTruncationLock acquisitions in short order will likely
> have at least as big an impact as ShareUpdateExclusiveLock held for the
> > duration of the check. That's not really a relevant concern for
> > txid_status().  Per-tuple lock acquisitions aren't great.

Yeah, that's true. Doing it for every tuple is going to be too much, I
think. I was hoping we could avoid that.

> I think it might be doable to not need either. E.g. we could set the
> checking backend's xmin to relfrozenxid, and set something like
> PROC_IN_VACUUM. That should, I think, prevent clog from being truncated
> in a problematic way (clog truncations look at PROC_IN_VACUUM backends),
> while not blocking vacuum.

Hmm, OK, I don't know if that would be OK or not.

> The similar concern for ReadNewTransactionId() can probably more easily
> be addressed, by only calling ReadNewTransactionId() when encountering
> an xid that's newer than the last value read.

Yeah, if we can cache some things to avoid repetitive calls, that would be good.
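
Just to sketch the shape of that caching (the function and variable names
here are invented for illustration; only ReadNewTransactionId() and the
TransactionId comparison helpers are real):

/*
 * Refresh the cached "next XID" horizon only when we see an xid that is
 * not already known to precede it, instead of calling
 * ReadNewTransactionId() once per tuple.
 */
static TransactionId cached_next_xid = InvalidTransactionId;

static bool
xid_precedes_next_xid(TransactionId xid)
{
	if (!TransactionIdIsValid(cached_next_xid) ||
		!TransactionIdPrecedes(xid, cached_next_xid))
		cached_next_xid = ReadNewTransactionId();

	return TransactionIdPrecedes(xid, cached_next_xid);
}

For a table full of old xids this drops to a single ReadNewTransactionId()
call, while an xid from the future still fails the check after the refresh.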

> I think it'd be good to set PROC_IN_VACUUM (or maybe a separate version
> of it) while checking anyway. Reading the full relation can take quite a
> while, and we shouldn't prevent hot pruning while doing so.
>
> There's some things we'd need to figure out to be able to use
> PROC_IN_VACUUM, as that's really only safe in some
> circumstances. Possibly it'd be easiest to address that if we'd make the
> check a procedure...

I think we sure want to set things up so that we do this check without
holding a snapshot, if we can. Not sure exactly how to get there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: new heapcheck contrib module

2020-04-20 Thread Peter Geoghegan
On Mon, Apr 20, 2020 at 12:42 PM Andres Freund  wrote:
> This is something we really really really need. I'm very excited to see
> progress!

+1

My experience with amcheck was that the requirement that we document
and verify pretty much every invariant (the details of which differ
slightly based on the B-Tree version in use) has had intangible
benefits. It helped me come up with a simpler, better design in the
first place. Also, many of the benchmarks that I perform get to be a
stress-test of the feature itself. It saves quite a lot of testing
work in the long run.

> I wonder if a mode where heapcheck optionally would only check
> non-frozen (perhaps also non-all-visible) regions of a table would be a
> good idea? Would make it a lot more viable to run this regularly on
> bigger databases. Even if there's a window to not check some data
> (because it's frozen before the next heapcheck run).

That's a great idea. It could also make it practical to use the
rootdescend verification option to verify indexes selectively -- if
you don't have too many blocks to check on average, the overhead is
tolerable. This is the kind of thing that naturally belongs in the
higher level interface that I sketched already.

> We also had a *lot* of bugs that we'd have found a lot earlier, possibly
> even during development, if we had a way to easily perform these checks.

I can think of a case where it was quite unclear what the invariants
for the heap even were, at least temporarily. And this was in the
context of fixing a bug that was really quite nasty. Formally defining
the invariants in one place, and taking a position on exactly what
correct looks like seems like a very valuable exercise. Even without
the tool catching a single bug.

> I have a hard time believing this is going to be really
> reliable. E.g. the alignment requirements will vary between platforms,
> leading to different layouts. In particular, MAXALIGN differs between
> platforms.

Over on another thread, I suggested that Mark might want to have a
corruption test framework that exposes some of the bufpage.c routines.
The idea is that you can destructively manipulate a page using the
logical page interface. Something that works one level below the
access method, but one level above the raw page image. It probably
wouldn't test everything that Mark wants to test, but it would test
some things in a way that seems maintainable to me.

-- 
Peter Geoghegan




Re: design for parallel backup

2020-04-20 Thread Robert Haas
On Mon, Apr 20, 2020 at 4:19 PM Andres Freund  wrote:
> Why do we want parallelism here. Or to be more precise: What do we hope
> to accelerate by making what part of creating a base backup
> parallel. There's several potential bottlenecks, and I think it's
> important to know the design priorities to evaluate a potential design.
>
> Bottlenecks (not ordered by importance):
> - compression performance (likely best solved by multiple compression
>   threads and a better compression algorithm)
> - unencrypted network performance (I'd like to see benchmarks showing in
>   which cases multiple TCP streams help / at which bandwidth it starts
>   to help)
> - encrypted network performance, i.e. SSL overhead (not sure this is an
>   important problem on modern hardware, given hardware accelerated AES)
> - checksumming overhead (a serious problem for cryptographic checksums,
>   but presumably not for others)
> - file IO (presumably multiple facets here, number of concurrent
>   in-flight IOs, kernel page cache overhead when reading TBs of data)
>
> I'm not really convinced that a design addressing the more crucial
> bottlenecks really needs multiple fe/be connections. But that seems to
> have been the focus of the discussion so far.

I haven't evaluated this. Both BART and pgBackRest offer parallel
backup options, and I'm pretty sure both were performance tested and
found to be very significantly faster, but I didn't write the code for
either, nor have I evaluated either to figure out exactly why it was
faster.

My suspicion is that it has mostly to do with adequately utilizing the
hardware resources on the server side. If you are network-constrained,
adding more connections won't help, unless there's something shaping
the traffic which can be gamed by having multiple connections.
However, as things stand today, at any given point in time the base
backup code on the server will EITHER be attempting a single
filesystem I/O or a single network I/O, and likewise for the client.
If a backup client - either current or hypothetical - is compressing
and encrypting, then it doesn't have either a filesystem I/O or a
network I/O in progress while it's doing so. You take not only the hit
of the time required for compression and/or encryption, but also use
that much less of the available network and/or I/O capacity.

While I agree that some of these problems could likely be addressed in
other ways, parallelism seems to offer an approach that could solve
multiple issues at the same time. If you want to address it without
that, you need asynchronous filesystem I/O and asynchronous network
I/O and both of those on both the client and server side, plus
multithreaded compression and multithreaded encryption and maybe some
other things. That sounds pretty hairy and hard to get right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: new heapcheck contrib module

2020-04-20 Thread Andres Freund
Hi,

On 2020-04-20 15:59:49 -0400, Robert Haas wrote:
> On Mon, Apr 20, 2020 at 3:42 PM Andres Freund  wrote:
> > I don't think random interspersed uses of CLogTruncationLock are a good
> > idea. If you move to only checking visibility after the tuple fits into
> > [relfrozenxid, nextXid), then you don't need to take any locks here, as
> > long as a lock against vacuum is taken (which I think this should do
> > anyway).
> 
> I think it would be *really* good to avoid ShareUpdateExclusiveLock
> here. Running with only AccessShareLock would be a big advantage. I
> agree that any use of CLogTruncationLock should not be "random", but I
> don't see why the same method we use to make txid_status() safe to
> expose to SQL shouldn't also be used here.

A few billion CLogTruncationLock acquisitions in short order will likely
have at least as big an impact as ShareUpdateExclusiveLock held for the
duration of the check. That's not really a relevant concern for
txid_status().  Per-tuple lock acquisitions aren't great.

I think it might be doable to not need either. E.g. we could set the
checking backend's xmin to relfrozenxid, and set something like
PROC_IN_VACUUM. That should, I think, prevent clog from being truncated
in a problematic way (clog truncations look at PROC_IN_VACUUM backends),
while not blocking vacuum.

The similar concern for ReadNewTransactionId() can probably more easily
be addressed, by only calling ReadNewTransactionId() when encountering
an xid that's newer than the last value read.


I think it'd be good to set PROC_IN_VACUUM (or maybe a separate version
of it) while checking anyway. Reading the full relation can take quite a
while, and we shouldn't prevent hot pruning while doing so.


There's some things we'd need to figure out to be able to use
PROC_IN_VACUUM, as that's really only safe in some
circumstances. Possibly it'd be easiest to address that if we'd make the
check a procedure...

Greetings,

Andres Freund




Re: design for parallel backup

2020-04-20 Thread Robert Haas
Thanks for your thoughts.

On Mon, Apr 20, 2020 at 4:02 PM Peter Eisentraut
 wrote:
> That would clearly be a good goal.  Non-parallel backup should ideally
> be parallel backup with one worker.

Right.

> But it doesn't follow that the proposed design is wrong.  It might just
> be that the design of the existing backup should change.
>
> I think making the wire format so heavily tied to the tar format is
> dubious.  There is nothing particularly fabulous about the tar format.
> If the server just sends a bunch of files with metadata for each file,
> the client can assemble them in any way they want: unpacked, packed in
> several tarballs like now, packed all in one tarball, packed in a zip
> file, sent to S3, etc.

Yeah, that's true, and I agree that there's something a little
unsatisfying and dubious about the current approach. However, I am not
sure that there is sufficient reason to change it to something else,
either. After all, what purpose would such a change serve? The client
can already do any of the things you mention here, provided that it
can interpret the data sent by the server, and pg_basebackup already
has code to do exactly this. Right now, we have pretty good
pg_basebackup compatibility across server versions, and if we change
the format, then we won't, unless we make both the client and the
server understand both formats. I'm not completely averse to such a
change if it has sufficient benefits to make it worthwhile, but it's
not clear to me that it does.

> Another thing I would like to see sometime is this: Pull a minimal
> basebackup, start recovery and possibly hot standby before you have
> received all the files.  When you need to access a file that's not there
> yet, request that as a priority from the server.  If you nudge the file
> order a little with perhaps prewarm-like data, you could get a mostly
> functional standby without having to wait for the full basebackup to
> finish.  Pull a file on request is a requirement for this.

True, but that can always be implemented as a separate feature. I
won't be sad if that feature happens to fall out of work in this area,
but I don't think the possibility that we'll some day have such
advanced wizardry should bias the design of this feature very much.
One pretty major problem with this is that you can't open for
connections until you've reached a consistent state, and you can't say
that you're in a consistent state until you've replayed all the WAL
generated during the backup, and you can't say that you're at the end
of the backup until you've copied all the files. So, without some
clever idea, this would only allow you to begin replay sooner; it
would not allow you to accept connections sooner. I suspect that makes
it significantly less appealing.

> > So, my new idea for parallel backup is that the server will return
> > tarballs, but just more of them. Right now, you get base.tar and
> > ${tablespace_oid}.tar for each tablespace. I propose that if you do a
> > parallel backup, you should get base-${N}.tar and
> > ${tablespace_oid}-${N}.tar for some or all values of N between 1 and
> > the number of workers, with the server deciding which files ought to
> > go in which tarballs.
>
> I understand the other side of this:  Why not compress or encrypt the
> backup already on the server side?  Makes sense.  But this way seems
> weird and complicated.  If I want a backup, I want one file, not an
> unpredictable set of files.  How do I even know I have them all?  Do we
> need a meta-manifest?

Yes, that's a problem, but...

> A format such as ZIP would offer more flexibility, I think.  You can
> build a single target file incrementally, you can compress or encrypt
> each member file separately, thus allowing some compression etc. on the
> server.  I'm not saying it's perfect for this, but some more thinking
> about the archive formats would potentially give some possibilities.

...I don't think this really solves anything. I expect you would have
to write the file more or less sequentially, and I think that Amdahl's
law will not be kind to us.

> All things considered, we'll probably want more options and more ways of
> doing things.

Yes. That's why I'm trying to figure out how to create a flexible framework.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: design for parallel backup

2020-04-20 Thread Andres Freund
Hi,

On 2020-04-15 11:57:29 -0400, Robert Haas wrote:
> Over at 
> http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9eitzwmod2mpqr_rif...@mail.gmail.com
> there's a proposal for a parallel backup patch which works in the way
> that I have always thought parallel backup would work: instead of
> having a monolithic command that returns a series of tarballs, you
> request individual files from a pool of workers. Leaving aside the
> quality-of-implementation issues in that patch set, I'm starting to
> think that the design is fundamentally wrong and that we should take a
> whole different approach. The problem I see is that it makes a
> parallel backup and a non-parallel backup work very differently, and
> I'm starting to realize that there are good reasons why you might want
> them to be similar.
>
> Specifically, as Andres recently pointed out[1], almost anything that
> you might want to do on the client side, you might also want to do on
> the server side. We already have an option to let the client compress
> each tarball, but you might also want the server to, say, compress
> each tarball[2]. Similarly, you might want either the client or the
> server to be able to encrypt each tarball, or compress but with a
> different compression algorithm than gzip. If, as is presently the
> case, the server is always returning a set of tarballs, it's pretty
> easy to see how to make this work in the same way on either the client
> or the server, but if the server returns a set of tarballs in
> non-parallel backup cases, and a set of tarballs in parallel backup
> cases, it's a lot harder to see how that any sort of server-side
> processing should work, or how the same mechanism could be used on
> either the client side or the server side.
>
> So, my new idea for parallel backup is that the server will return
> tarballs, but just more of them. Right now, you get base.tar and
> ${tablespace_oid}.tar for each tablespace. I propose that if you do a
> parallel backup, you should get base-${N}.tar and
> ${tablespace_oid}-${N}.tar for some or all values of N between 1 and
> the number of workers, with the server deciding which files ought to
> go in which tarballs. This is more or less the naming convention that
> BART uses for its parallel backup implementation, which, incidentally,
> I did not write. I don't really care if we pick something else, but it
> seems like a sensible choice. The reason why I say "some or all" is
> that some workers might not get any of the data for a given
> tablespace. In fact, it's probably desirable to have different workers
> work on different tablespaces as far as possible, to maximize parallel
> I/O, but it's quite likely that you will have more workers than
> tablespaces. So you might end up, with pg_basebackup -j4, having the
> server send you base-1.tar and base-2.tar and base-4.tar, but not
> base-3.tar, because worker 3 spent all of its time on user-defined
> tablespaces, or was just out to lunch.

One question I have not really seen answered well:

Why do we want parallelism here. Or to be more precise: What do we hope
to accelerate by making what part of creating a base backup
parallel. There's several potential bottlenecks, and I think it's
important to know the design priorities to evaluate a potential design.

Bottlenecks (not ordered by importance):
- compression performance (likely best solved by multiple compression
  threads and a better compression algorithm)
- unencrypted network performance (I'd like to see benchmarks showing in
  which cases multiple TCP streams help / at which bandwidth it starts
  to help)
- encrypted network performance, i.e. SSL overhead (not sure this is an
  important problem on modern hardware, given hardware accelerated AES)
- checksumming overhead (a serious problem for cryptographic checksums,
  but presumably not for others)
- file IO (presumably multiple facets here, number of concurrent
  in-flight IOs, kernel page cache overhead when reading TBs of data)

I'm not really convinced that a design addressing the more crucial
bottlenecks really needs multiple fe/be connections. But that seems to
have been the focus of the discussion so far.

Greetings,

Andres Freund




Re: new heapcheck contrib module

2020-04-20 Thread Robert Haas
[ retrying from the email address I intended to use ]

On Mon, Apr 20, 2020 at 3:42 PM Andres Freund  wrote:
> I don't think random interspersed uses of CLogTruncationLock are a good
> idea. If you move to only checking visibility after the tuple fits into
> [relfrozenxid, nextXid), then you don't need to take any locks here, as
> long as a lock against vacuum is taken (which I think this should do
> anyway).

I think it would be *really* good to avoid ShareUpdateExclusiveLock
here. Running with only AccessShareLock would be a big advantage. I
agree that any use of CLogTruncationLock should not be "random", but I
don't see why the same method we use to make txid_status() safe to
expose to SQL shouldn't also be used here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: design for parallel backup

2020-04-20 Thread Peter Eisentraut

On 2020-04-15 17:57, Robert Haas wrote:

> Over at 
> http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9eitzwmod2mpqr_rif...@mail.gmail.com
> there's a proposal for a parallel backup patch which works in the way
> that I have always thought parallel backup would work: instead of
> having a monolithic command that returns a series of tarballs, you
> request individual files from a pool of workers. Leaving aside the
> quality-of-implementation issues in that patch set, I'm starting to
> think that the design is fundamentally wrong and that we should take a
> whole different approach. The problem I see is that it makes a
> parallel backup and a non-parallel backup work very differently, and
> I'm starting to realize that there are good reasons why you might want
> them to be similar.


That would clearly be a good goal.  Non-parallel backup should ideally 
be parallel backup with one worker.


But it doesn't follow that the proposed design is wrong.  It might just 
be that the design of the existing backup should change.


I think making the wire format so heavily tied to the tar format is 
dubious.  There is nothing particularly fabulous about the tar format. 
If the server just sends a bunch of files with metadata for each file, 
the client can assemble them in any way they want: unpacked, packed in 
several tarballs like now, packed all in one tarball, packed in a zip 
file, sent to S3, etc.


Another thing I would like to see sometime is this: Pull a minimal 
basebackup, start recovery and possibly hot standby before you have 
received all the files.  When you need to access a file that's not there 
yet, request that as a priority from the server.  If you nudge the file 
order a little with perhaps prewarm-like data, you could get a mostly 
functional standby without having to wait for the full basebackup to 
finish.  Pull a file on request is a requirement for this.



> So, my new idea for parallel backup is that the server will return
> tarballs, but just more of them. Right now, you get base.tar and
> ${tablespace_oid}.tar for each tablespace. I propose that if you do a
> parallel backup, you should get base-${N}.tar and
> ${tablespace_oid}-${N}.tar for some or all values of N between 1 and
> the number of workers, with the server deciding which files ought to
> go in which tarballs.


I understand the other side of this:  Why not compress or encrypt the 
backup already on the server side?  Makes sense.  But this way seems 
weird and complicated.  If I want a backup, I want one file, not an 
unpredictable set of files.  How do I even know I have them all?  Do we 
need a meta-manifest?


A format such as ZIP would offer more flexibility, I think.  You can 
build a single target file incrementally, you can compress or encrypt 
each member file separately, thus allowing some compression etc. on the 
server.  I'm not saying it's perfect for this, but some more thinking 
about the archive formats would potentially give some possibilities.


All things considered, we'll probably want more options and more ways of 
doing things.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: DETACH PARTITION and FOR EACH ROW triggers on partitioned tables

2020-04-20 Thread Justin Pryzby
On Mon, Apr 20, 2020 at 06:35:44PM +0900, Amit Langote wrote:
> On Mon, Apr 20, 2020 at 5:49 AM Justin Pryzby  wrote:
> > On Sun, Apr 19, 2020 at 03:13:29PM -0400, Alvaro Herrera wrote:
> > > What happens if you detach the parent?  I mean, should the trigger
> > > removal recurse to children?
> >
> > I think it should probably exactly undo what CloneRowTriggersToPartition 
> > does.
> > ..and I guess you're trying to politely say that it didn't.  I tried to fix 
> > in
> > v4 - please check if that's right.
> 
> Looks correct to me.  Maybe have a test cover that?

I included such a test with the v4 patch.

> > > > But if we remove trigger during DETACH, then it's *not* the same as a 
> > > > trigger
> > > > that was defined on the child, and I suggest that in v13 we should make 
> > > > that
> > > > visible.
> > >
> > > Hmm, interesting point -- whether the trigger is partition or not is
> > > important because it affects what happens on detach.  I agree that we
> > > should make it visible.  Is the proposed single bit "PARTITION" good
> > > enough, or should we indicate what's the ancestor table that defines the
> > > partition?
> >
> > Yea, it's an obvious thing to do.
> 
> This:
> 
> +  "false AS tgisinternal"),
> + (pset.sversion >= 13000 ?
> +  "pg_partition_root(t.tgrelid) AS parent" :
> +  "'' AS parent"),
> + oid);
> 
> 
> looks wrong, because the actual partition root may not also be the
> trigger parent root, for example:

Sigh, right.

> The following gets the correct parent for me:
> 
> - (pset.sversion >= 13000 ?
> -  "pg_partition_root(t.tgrelid) AS parent" :
> -  "'' AS parent"),
> + (pset.sversion >= 13 ?
> +  "(SELECT relid"
> +  " FROM pg_trigger, pg_partition_ancestors(t.tgrelid)"
> +  " WHERE tgname = t.tgname AND tgrelid = relid"
> +  " AND tgparentid = 0) AS parent" :
> +  " null AS parent"),

I'm happy to see that this doesn't require a recursive cte, at least.
I was trying to think how to break it by returning multiple results or results
out of order, but I think that can't happen.

> Also, how about, for consistency, making the parent table labeling of
> the trigger look similar to that for the foreign constraint, so
> Triggers:
> TABLE "f1" TRIGGER "trig" BEFORE INSERT ON f11 FOR EACH ROW EXECUTE 
> FUNCTION trigfunc()

I'll leave that for committer to decide.

I split into separate patches since only 0001 should be backpatched (with
s/OidIsValid(tgparentid)/isPartitionTrigger/).

-- 
Justin
>From dc24d8fdbe4435feaea9a99fcf3e1e1121cc8950 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 3 Apr 2020 22:43:26 -0500
Subject: [PATCH v5 1/2] fix detaching tables with inherited row triggers

The old behavior is buggy, and the intended behavior was not previously
documented.  So define the behavior that the trigger is removed if the table is
detached.  It is an error if a table being attached already has a trigger of
the same name.  This differs from the behavior for indexes and constraints.

Should backpatch to v11 with s/OidIsValid(tgparentid)/isPartitionTrigger/.

See also:
86f575948c773b0ec5b0f27066e37dd93a7f0a96

Discussion:
https://www.postgresql.org/message-id/flat/20200408152412.GZ2228%40telsasoft.com
---
 doc/src/sgml/ref/create_trigger.sgml   |  1 +
 src/backend/commands/tablecmds.c   | 42 
 src/test/regress/expected/triggers.out | 45 ++
 src/test/regress/sql/triggers.sql  | 21 
 4 files changed, 109 insertions(+)

diff --git a/doc/src/sgml/ref/create_trigger.sgml b/doc/src/sgml/ref/create_trigger.sgml
index bde3a63f90..303881fcfb 100644
--- a/doc/src/sgml/ref/create_trigger.sgml
+++ b/doc/src/sgml/ref/create_trigger.sgml
@@ -526,6 +526,7 @@ UPDATE OF column_name1 [, column_name2INSTEAD OF.
   
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 037d457c3d..514c5ff844 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -16797,6 +16797,48 @@ ATExecDetachPartition(Relation rel, RangeVar *name)
 	}
 	table_close(classRel, RowExclusiveLock);
 
+	/*
+	 * Drop any ROW triggers inherited from partitioned table.
+	 * This undoes what CloneRowTriggersToPartition did.
+	 */
+	{
+		ScanKeyData skey;
+		SysScanDesc	scan;
+		HeapTuple	trigtup;
+		Relation	tgrel;
+
+		ScanKeyInit(&skey, Anum_pg_trigger_tgrelid, BTEqualStrategyNumber,
+					F_OIDEQ, ObjectIdGetDatum(RelationGetRelid(partRel)));
+		tgrel = table_open(TriggerRelationId, RowExclusiveLock);
+		scan = systable_beginscan(tgrel, TriggerRelidNameIndexId,
+								  true, NULL, 1, &skey);
+
+		while (HeapTupleIsValid(trigtup = systable_getnext(scan)))
+		{
+			

Re: new heapcheck contrib module

2020-04-20 Thread Andres Freund
Hi,

On 2020-04-20 10:59:28 -0700, Mark Dilger wrote:
> I have been talking with Robert about table corruption that occurs
> from time to time. The page checksum feature seems sufficient to
> detect most random corruption problems, but it can't detect "logical"
> corruption, where the page is valid but inconsistent with the rest of
> the database cluster. This can happen due to faulty or ill-conceived
> backup and restore tools, or bad storage, or user error, or bugs in
> the server itself. (Also, not everyone enables checksums.)

This is something we really really really need. I'm very excited to see
progress!


> From 2a1bc0bb9fa94bd929adc1a408900cb925ebcdd5 Mon Sep 17 00:00:00 2001
> From: Mark Dilger 
> Date: Mon, 20 Apr 2020 08:05:58 -0700
> Subject: [PATCH v2] Adding heapcheck contrib module.
> 
> The heapcheck module introduces a new function for checking a heap
> relation and associated toast relation, if any, for corruption.

Why not add it to amcheck?


I wonder if a mode where heapcheck optionally would only check
non-frozen (perhaps also non-all-visible) regions of a table would be a
good idea? Would make it a lot more viable to run this regularly on
bigger databases. Even if there's a window to not check some data
(because it's frozen before the next heapcheck run).
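
For the block loop, that could presumably consult the visibility map along
these lines (a sketch only; the skip_all_frozen flag and the surrounding
loop variables are assumptions about how the checker iterates the relation,
and access/visibilitymap.h provides the functions used):

	Buffer		vmbuffer = InvalidBuffer;
	BlockNumber blkno;

	for (blkno = 0; blkno < nblocks; blkno++)
	{
		/* Skip blocks the visibility map claims are all-frozen. */
		if (skip_all_frozen &&
			(visibilitymap_get_status(rel, blkno, &vmbuffer) &
			 VISIBILITYMAP_ALL_FROZEN) != 0)
			continue;

		/* ... read the block and check its tuples here ... */
	}

	if (BufferIsValid(vmbuffer))
		ReleaseBuffer(vmbuffer);

Of course the VM bit itself could be wrong, which is an argument for making
the skipping optional rather than the default.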


> The attached module provides the means to scan a relation and sanity
> check it. Currently, it checks xmin and xmax values against
> relfrozenxid and relminmxid, and also validates TOAST pointers. If
> people like this, it could be expanded to perform additional checks.


> The postgres backend already defends against certain forms of
> corruption, by checking the page header of each page before allowing
> it into the page cache, and by checking the page checksum, if enabled.
> Experience shows that broken or ill-conceived backup and restore
> mechanisms can result in a page, or an entire file, being overwritten
> with an earlier version of itself, restored from backup.  Pages thus
> overwritten will appear to have valid page headers and checksums,
> while potentially containing xmin, xmax, and toast pointers that are
> invalid.

We also had a *lot* of bugs that we'd have found a lot earlier, possibly
even during development, if we had a way to easily perform these checks.


> contrib/heapcheck introduces a function, heapcheck_relation, that
> takes a regclass argument, scans the given heap relation, and returns
> rows containing information about corruption found within the table.
> The main focus of the scan is to find invalid xmin, xmax, and toast
> pointer values.  It also checks for structural corruption within the
> page (such as invalid t_hoff values) that could lead to the backend
> aborting should the function blindly trust the data as it finds it.


> +typedef struct CorruptionInfo
> +{
> + BlockNumber blkno;
> + OffsetNumber offnum;
> + int16   lp_off;
> + int16   lp_flags;
> + int16   lp_len;
> + int32   attnum;
> + int32   chunk;
> + char   *msg;
> +}CorruptionInfo;

Adding a short comment explaining what this is for would be good.


> +/* Internal implementation */
> +void record_corruption(HeapCheckContext * ctx, char *msg);
> +TupleDescheapcheck_relation_tupdesc(void);
> +
> +void beginRelBlockIteration(HeapCheckContext * ctx);
> +bool relBlockIteration_next(HeapCheckContext * ctx);
> +void endRelBlockIteration(HeapCheckContext * ctx);
> +
> +void beginPageTupleIteration(HeapCheckContext * ctx);
> +bool pageTupleIteration_next(HeapCheckContext * ctx);
> +void endPageTupleIteration(HeapCheckContext * ctx);
> +
> +void beginTupleAttributeIteration(HeapCheckContext * ctx);
> +bool tupleAttributeIteration_next(HeapCheckContext * ctx);
> +void endTupleAttributeIteration(HeapCheckContext * ctx);
> +
> +void beginToastTupleIteration(HeapCheckContext * ctx,
> +  struct 
> varatt_external *toast_pointer);
> +void endToastTupleIteration(HeapCheckContext * ctx);
> +bool toastTupleIteration_next(HeapCheckContext * ctx);
> +
> +bool TransactionIdStillValid(TransactionId xid, FullTransactionId 
> *fxid);
> +bool HeapTupleIsVisible(HeapTupleHeader tuphdr, HeapCheckContext * 
> ctx);
> +void check_toast_tuple(HeapCheckContext * ctx);
> +bool check_tuple_attribute(HeapCheckContext * ctx);
> +void check_tuple(HeapCheckContext * ctx);
> +
> +List*check_relation(Oid relid);
> +void check_relation_relkind(Relation rel);

Why aren't these static?


> +/*
> + * record_corruption
> + *
> + *   Record a message about corruption, including information
> + *   about where in the relation the corruption was found.
> + */
> +void
> +record_corruption(HeapCheckContext * ctx, char *msg)
> +{

Given that you went through 

Re: new heapcheck contrib module

2020-04-20 Thread Mark Dilger



> On Apr 20, 2020, at 12:37 PM, Peter Geoghegan  wrote:
> 
> I mean an interface that's friendly to DBAs, that verifies an entire 
> database. No custom sql query required. Something that provides a reasonable 
> mix of verification options based on high level directives. All verification 
> methods can be combined in a granular, possibly randomized fashion. Maybe we 
> can make this run in parallel. 
> 
> For example, maybe your heap checker code sometimes does index probes for a 
> subset of indexes and heap tuples. It's not hard to combine it with the 
> rootdescend stuff from amcheck. It should be composable. 
> 
> The interface you've chosen is a good starting point. But let's not miss an 
> opportunity to make everything work together. 

Ok, I'll work in that direction and repost when I have something along those 
lines.

Thanks again for your input.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: new heapcheck contrib module

2020-04-20 Thread Peter Geoghegan
I mean an interface that's friendly to DBAs, that verifies an entire
database. No custom sql query required. Something that provides a
reasonable mix of verification options based on high level directives. All
verification methods can be combined in a granular, possibly randomized
fashion. Maybe we can make this run in parallel.

For example, maybe your heap checker code sometimes does index probes for a
subset of indexes and heap tuples. It's not hard to combine it with the
rootdescend stuff from amcheck. It should be composable.

The interface you've chosen is a good starting point. But let's not miss an
opportunity to make everything work together.

Peter Geoghegan
(Sent from my phone)


Re: relocating the server's backup manifest code

2020-04-20 Thread Robert Haas
On Sat, Apr 18, 2020 at 8:12 PM Michael Paquier  wrote:
> I would suggest to
> still use BackupManifest instead of Manifest in those functions and
> structures though, ...

Done in the attached, which also adds "backup_" to the type names.

After further examination, I think the Copyright header issue is
entirely separate. If someone wants to standardize that across the
source tree, cool, but this patch just duplicated the header from the
file out of which it was moving code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


0001-Rename-exposed-identifiers-to-say-backup-manifest.patch
Description: Binary data


Re: new heapcheck contrib module

2020-04-20 Thread Mark Dilger



> On Apr 20, 2020, at 11:31 AM, Peter Geoghegan  wrote:
> 
> IMV, the problem that we have with amcheck is that it's too hard to
> use in a top down kind of way. Perhaps there is an opportunity to
> provide a more top-down interface to an expanded version of amcheck
> that does heap checking. Something with a high level practical focus,
> in addition to the low level functions. I'm not saying that Mark
> should be required to solve that problem, but it certainly seems worth
> considering now.

Thanks for your quick response and interest in this submission!

Can you elaborate on "top-down"?  I'm not sure what that means in this context.

I don't mind going further with this project if I understand what you are 
suggesting.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: new heapcheck contrib module

2020-04-20 Thread Peter Geoghegan
On Mon, Apr 20, 2020 at 11:19 AM Robert Haas  wrote:
> I wondered if people would suggest that. Didn't take long.

You were the one that pointed out that my first version of
contrib/amcheck, which was called "contrib/btreecheck", should have
a more general name. And rightly so!

The basic interface used for the heap checker functions seem very
similar to what amcheck already offers for B-Tree indexes, so it seems
very natural to distribute them together.

IMV, the problem that we have with amcheck is that it's too hard to
use in a top down kind of way. Perhaps there is an opportunity to
provide a more top-down interface to an expanded version of amcheck
that does heap checking. Something with a high level practical focus,
in addition to the low level functions. I'm not saying that Mark
should be required to solve that problem, but it certainly seems worth
considering now.

> The documentation would need some updating, but that's doable.

It would also probably need a bit of renaming, so that analogous
function names are used.


--
Peter Geoghegan




Re: new heapcheck contrib module

2020-04-20 Thread Robert Haas
On Mon, Apr 20, 2020 at 2:09 PM Peter Geoghegan  wrote:
> Cool. Why not make it part of contrib/amcheck?

I wondered if people would suggest that. Didn't take long.

The documentation would need some updating, but that's doable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: new heapcheck contrib module

2020-04-20 Thread Peter Geoghegan
On Mon, Apr 20, 2020 at 10:59 AM Mark Dilger
 wrote:
> The attached module provides the means to scan a relation and sanity check 
> it. Currently, it checks xmin and xmax values against relfrozenxid and 
> relminmxid, and also validates TOAST pointers. If people like this, it could 
> be expanded to perform additional checks.

Cool. Why not make it part of contrib/amcheck?

We talked about the kinds of checks that we'd like to have for a tool
like this before:

https://postgr.es/m/20161017014605.ga1220...@tornado.leadboat.com

-- 
Peter Geoghegan




Re: design for parallel backup

2020-04-20 Thread Robert Haas
On Mon, Apr 20, 2020 at 8:50 AM Amit Kapila  wrote:
> It is not apparent how you are envisioning this division on the
> server-side.  I think in the currently proposed patch, each worker on
> the client-side requests the specific files. So, how are workers going
> to request such numbered files, and how will we ensure that the work
> division among workers is fair?

I think that the workers would just say "give me my share of the base
backup" and then the server would divide up the files as it went. It
would probably keep a queue of whatever files still need to be
processed in shared memory and each process would pop items from the
queue to send to its client.
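
A minimal sketch of that pop operation, assuming a fixed list of file names
laid out in shared memory (the BackupFileQueue struct and its field names
are made up for illustration):

/* Spinlock-protected cursor over a shared, pre-built list of files. */
typedef struct BackupFileQueue
{
	slock_t		mutex;
	int			next_file;		/* next index to hand out */
	int			num_files;		/* total number of files in the list */
} BackupFileQueue;

/* Returns the next file index to send, or -1 once the queue is empty. */
static int
backup_queue_pop(BackupFileQueue *queue)
{
	int			idx = -1;

	SpinLockAcquire(&queue->mutex);
	if (queue->next_file < queue->num_files)
		idx = queue->next_file++;
	SpinLockRelease(&queue->mutex);

	return idx;
}

Each worker would loop on backup_queue_pop() and stream the corresponding
file into its own tarball until it gets -1 back.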

> I think it also depends to some extent what we decide in the nearby
> thread [1] related to support of compression/encryption.  Say, if we
> want to support a new compression on client-side then we need to
> anyway process the contents of each tar file in which case combining
> into single tar file might be okay but not sure what is the right
> thing here. I think this part needs some more thoughts.

Yes, it needs more thought, but the central idea is to try to create
something that is composable. For example, if we have code to do LZ4
compression, and code to do GPG encryption, then we should be able to
do both without adding any more code. Ideally, we should also be able
to do either of those operations either on the client side or on the
server side, using the same code either way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: fixing old_snapshot_threshold's time->xid mapping

2020-04-20 Thread Robert Haas
On Mon, Apr 20, 2020 at 12:10 AM Dilip Kumar  wrote:
> I have started reviewing these patches.  I think, the fixes looks right to me.
>
> + LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
> + mapping->head_offset = oldSnapshotControl->head_offset;
> + mapping->head_timestamp = oldSnapshotControl->head_timestamp;
> + mapping->count_used = oldSnapshotControl->count_used;
> + for (int i = 0; i < OLD_SNAPSHOT_TIME_MAP_ENTRIES; ++i)
> + mapping->xid_by_minute[i] = oldSnapshotControl->xid_by_minute[i];
> + LWLockRelease(OldSnapshotTimeMapLock);
>
> I think memcpy would be a better choice instead of looping it for all
> the entries, since we are doing this under a lock?

When I did it that way, it complained about "const" and I couldn't
immediately figure out how to fix it.
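
For what it's worth, if the warning actually comes from the volatile
qualifier on oldSnapshotControl rather than const proper, an explicit cast
on the source pointer might be enough (an untested sketch; the element type
of xid_by_minute is deliberately not assumed, hence the sizeof on the
destination):

	/*
	 * Assumes the complaint is about copying from the volatile-qualified
	 * oldSnapshotControl pointer; the explicit cast drops the qualifier
	 * for memcpy's benefit, which is safe while the lock is held.
	 */
	memcpy(mapping->xid_by_minute,
		   (const void *) oldSnapshotControl->xid_by_minute,
		   sizeof(mapping->xid_by_minute[0]) * OLD_SNAPSHOT_TIME_MAP_ENTRIES);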

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




new heapcheck contrib module

2020-04-20 Thread Mark Dilger
Hackers,

I have been talking with Robert about table corruption that occurs from time to 
time. The page checksum feature seems sufficient to detect most random 
corruption problems, but it can't detect "logical" corruption, where the page 
is valid but inconsistent with the rest of the database cluster. This can 
happen due to faulty or ill-conceived backup and restore tools, or bad storage, 
or user error, or bugs in the server itself. (Also, not everyone enables 
checksums.)

The attached module provides the means to scan a relation and sanity check it. 
Currently, it checks xmin and xmax values against relfrozenxid and relminmxid, 
and also validates TOAST pointers. If people like this, it could be expanded to 
perform additional checks.

There was a prior v1 patch, discussed offlist with Robert, not posted.  Here is 
v2:



v2-0001-Adding-heapcheck-contrib-module.patch
Description: Binary data


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Re: more ALTER .. DEPENDS ON EXTENSION fixes

2020-04-20 Thread Alvaro Herrera
On 2020-Mar-06, Ibrar Ahmed wrote:

> I think we need a tab-completion patch too for this.

Thanks, I pushed this.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: more ALTER .. DEPENDS ON EXTENSION fixes

2020-04-20 Thread Alvaro Herrera
I pushed this (to pg13 only) using the originally proposed "NO DEPENDS"
syntax.  It's trivial to change to REVOKE DEPENDS or REMOVE DEPENDS if
we decide to do that.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Adding missing object access hook invocations

2020-04-20 Thread Alvaro Herrera
On 2020-Apr-20, Michael Paquier wrote:

> Unfortunately, we are past feature freeze so this will have to wait
> until v14 opens for business to be merged, and I'll take care of it.
> Or would others prefer to not wait one extra year for those changes to
> be released?

I think it's fine to put this in at this time.  It's not a new feature.
The only thing this needs is to go through a new release cycle so that
people can adjust to the new hook invocations as necessary.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Poll: are people okay with function/operator table redesign?

2020-04-20 Thread Tom Lane
Victor Yegorov  writes:
> While table 9.5 with functions looks quite nice, I quite dislike 9.4 with
> operators.

BTW, I think a big part of the problem with table 9.4 as it's being
rendered in the web style right now is that the type placeholders
(numeric_type etc) are being rendered in a ridiculously overemphasized
fashion, causing them to overwhelm all else.  Do we really want
<replaceable> to be rendered that way?  I'd think plain italic,
comparable to the rendering of <parameter>, would be more appropriate.

I could make this page use <parameter> for that purpose of course,
but it seems like semantically the wrong thing.

regards, tom lane




Re: More efficient RI checks - take 2

2020-04-20 Thread Corey Huinker
>
> I can imagine removal of the SPI from the current implementation (and
> constructing the plans "manually"), but note that the queries I use in my
> patch are no longer that trivial. So the SPI makes sense to me because it
> ensures regular query planning.
>

As an intermediate step, in the case where we have one row, it should be
simple enough to extract that row manually, and do an SPI call with fixed
values rather than the join to the ephemeral table, yes?
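
A sketch of what that single-row form could look like through
SPI_execute_with_args(); the query text, the pk_table name, and the way
fk_key_value is extracted from the new row are placeholders, not the real
RI query:

	Oid			argtypes[1] = {INT4OID};
	Datum		values[1];
	char		nulls[1] = {' '};
	int			ret;

	values[0] = Int32GetDatum(fk_key_value);	/* taken from the new row */

	ret = SPI_execute_with_args(
			"SELECT 1 FROM ONLY public.pk_table x"
			" WHERE x.id = $1 FOR KEY SHARE OF x",
			1, argtypes, values, nulls,
			false,		/* not read-only: FOR KEY SHARE takes a row lock */
			1);			/* one matching row is enough */

	if (ret != SPI_OK_SELECT)
		elog(ERROR, "SPI_execute_with_args failed: %d", ret);
	if (SPI_processed == 0)
		ereport(ERROR,
				(errcode(ERRCODE_FOREIGN_KEY_VIOLATION),
				 errmsg("insert or update violates foreign key constraint")));

That avoids both the transient-table join and the planner's row-estimate
problem for the one-row case, at the cost of a separate code path.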


> As for the tuplestore, I'm not sure the startup cost is a problem: if
> you're
> concerned about the 1-row case, the row should usually be stored in memory.
>



> > and once that is done, we could see about step #2.
>
> As I said during my review of your patch last year, I think the RI
> semantics
> has too much in common with that of triggers. I'd need more info to imagine
> such a change.
>

As a general outline, I think that DML would iterate over the 2 sets of
potentially relevant RI definitions rather than iterating over the
triggers.

The similarities between RI and general triggers are obvious, which
explains why they went that route initially, but they're also a crutch:
all RI operations boil down to either an iteration over a tuplestore
to do lookups in an index (when checking for referenced rows), or a hash
join of the transient data against the un-indexed table (when checking for
referencing rows), and people who know this stuff far better than me seem to
think that SPI overhead is best avoided when possible. I'm looking forward
to having more time to spend on this.


Re: More efficient RI checks - take 2

2020-04-20 Thread Antonin Houska
Corey Huinker  wrote:

> These numbers are very promising, and much more in line with my initial
> expectations. Obviously the impact on single-row DML is of major concern,
> though.

Yes, I agree.

> In doing my initial attempt, the feedback I was getting was that the people
> who truly understood the RI checks fell into the following groups:

> 1. people who wanted to remove the SPI calls from the triggers
> 2. people who wanted to completely refactor RI to not use triggers
> 3. people who wanted to completely refactor triggers
> 
> While #3 is clearly beyond the scope for an endeavor like this, #1 seems
> like it would nearly eliminate the 1-row penalty (we'd still have the
> TupleStore initi penalty, but it would just be a handy queue structure, and
> maybe that cost would be offset by removing the SPI overhead),

I can imagine removal of the SPI from the current implementation (and
constructing the plans "manually"), but note that the queries I use in my
patch are no longer that trivial. So the SPI makes sense to me because it
ensures regular query planning.

As for the tuplestore, I'm not sure the startup cost is a problem: if you're
concerned about the 1-row case, the row should usually be stored in memory.

> and once that is done, we could see about step #2.

As I said during my review of your patch last year, I think the RI semantics
has too much in common with that of triggers. I'd need more info to imagine
such a change.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Poll: are people okay with function/operator table redesign?

2020-04-20 Thread Tom Lane
Victor Yegorov  writes:
> While table 9.5 with functions looks quite nice, I quite dislike 9.4 with
> operators.
> Previously, I could lookup operator in the leftmost column and read on.
> Right now I have to look through the whole table (well, not really, but
> still) to find the operator.

Aside from the alternatives already discussed, the only other idea
that's come to my mind is to write operator entries in a style like

|| as in: text || text → text
Concatenates the two strings.
'Post' || 'greSQL' → PostgreSQL

Not sure that that's any better, but it is another alternative.

regards, tom lane




Re: Allow pg_read_all_stats to read pg_stat_progress_*

2020-04-20 Thread Magnus Hagander
On Mon, Apr 20, 2020 at 4:12 PM Tom Lane  wrote:

> Stephen Frost  writes:
> > Ugh.  That doesn't make it correct though..  We really should be using
> > has_privs_of_role() for these cases (and that goes for all of the
> > default role cases- some of which are correct and others are not, it
> > seems).
>
> I have a different concern about this patch: while reading statistical
> values is fine, do we REALLY want pg_read_all_stats to enable
> pg_stat_get_activity(), ie viewing other sessions' command strings?
> That opens security considerations that don't seem to me to be covered
> by the description of the role.
>

It already did allow that, and that's fully documented.

The patch only adds the ability to get at it through functions, but not
through views. (And the pg_stat_progress_* views).

The pg_stat_activity change is only:
@@ -669,8 +671,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 			nulls[16] = true;
 
 			/* Values only available to role member or pg_read_all_stats */
-			if (has_privs_of_role(GetUserId(), beentry->st_userid) ||
-				is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))
+			if (HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
 			{
 				SockAddr	zero_clientaddr;
 				char	   *clipped_activity;


Which moves the check into the macro, but doesn't change how it works.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Allow pg_read_all_stats to read pg_stat_progress_*

2020-04-20 Thread Tom Lane
Stephen Frost  writes:
> Ugh.  That doesn't make it correct though..  We really should be using
> has_privs_of_role() for these cases (and that goes for all of the
> default role cases- some of which are correct and others are not, it
> seems).

I have a different concern about this patch: while reading statistical
values is fine, do we REALLY want pg_read_all_stats to enable
pg_stat_get_activity(), ie viewing other sessions' command strings?
That opens security considerations that don't seem to me to be covered
by the description of the role.

regards, tom lane




Re: More efficient RI checks - take 2

2020-04-20 Thread Antonin Houska
Pavel Stehule  wrote:

>> On Wed, Apr 8, 2020 at 18:36, Antonin Houska  wrote:
> 
>>  Some performance comparisons are below. (Besides the execution time, please
>>  note the difference in the number of trigger function executions.) In 
>> general,
>>  the checks are significantly faster if there are many rows to process, and a
>>  bit slower when we only need to check a single row. However I'm not sure 
>> about
>>  the accuracy if only a single row is measured (if a single row check is
>>  performed several times, the execution time appears to fluctuate).
> 
> It is hard task to choose good strategy for immediate constraints, but for
> deferred constraints you know how much rows should be checked, and then you
> can choose better strategy.
> 
> Is possible to use estimation for choosing method of RI checks?

The exact number of rows ("batch size") is always known before the query is
executed. So one problem to solve is that, when only one row is affected, we
need to convince the planner that the "transient table" really contains a
single row. Otherwise it can, for example, produce a hash join where the hash
eventually contains a single row.
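
To make the deferred case concrete, here is a minimal sketch (table names are
made up, and it only illustrates stock PostgreSQL behaviour, not the patch):
with a deferred foreign key, the queued RI checks fire at COMMIT, at which
point the number of rows to verify is already known.

create table parent (id int primary key);
create table child  (pid int references parent deferrable initially deferred);
insert into parent select g from generate_series(1, 10000) g;
begin;
insert into child select g from generate_series(1, 10000) g;  -- checks are queued, not run
commit;                                                       -- the queued checks run here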

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Problem with logical replication

2020-04-20 Thread Masahiko Sawada
On Thu, 16 Apr 2020 at 17:48, Dilip Kumar  wrote:
>
> While trying to set up cascading replication, I have observed that if we
> set the REPLICA IDENTITY to FULL on the subscriber side then there is
> an Assert hit.
>
> After analysis I have found that when we set the REPLICA IDENTITY to
> FULL on the subscriber side (because I wanted to make this a publisher for
> another subscriber), it will set relation->rd_replidindex to InvalidOid;
> refer to the code snippet below:
> RelationGetIndexList()
> {
> 
> if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex))
> relation->rd_replidindex = pkeyIndex;
> else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex))
> relation->rd_replidindex = candidateIndex;
> else
> relation->rd_replidindex = InvalidOid;
> }
>
> But while applying the update, if the table has an index, we hit
> this assert in build_replindex_scan_key:
>
> static bool
> build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
> TupleTableSlot *searchslot)
> {
> ...
> Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
> }
>
> To me it appears that this assert is not correct.  The attached patch
> removes this assert and things work fine.
>
>
> #0  0x7ff2a0c8d5d7 in raise () from /lib64/libc.so.6
> #1  0x7ff2a0c8ecc8 in abort () from /lib64/libc.so.6
> #2  0x00aa7c7d in ExceptionalCondition (conditionName=0xc1bb30
> "RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel)",
> errorType=0xc1bad9 "FailedAssertion",
> fileName=0xc1bb1c "execReplication.c", lineNumber=60) at assert.c:67
> #3  0x007153c3 in build_replindex_scan_key
> (skey=0x7fff25711560, rel=0x7ff2a1b0b800, idxrel=0x7ff2a1b0bd98,
> searchslot=0x21328c8) at execReplication.c:60
> #4  0x007156ac in RelationFindReplTupleByIndex
> (rel=0x7ff2a1b0b800, idxoid=16387, lockmode=LockTupleExclusive,
> searchslot=0x21328c8, outslot=0x2132bb0) at execReplication.c:141
> #5  0x008aeba5 in FindReplTupleInLocalRel (estate=0x2150170,
> localrel=0x7ff2a1b0b800, remoterel=0x214a7c8, remoteslot=0x21328c8,
> localslot=0x7fff25711f28) at worker.c:989
> #6  0x008ae6f2 in apply_handle_update_internal
> (relinfo=0x21327b0, estate=0x2150170, remoteslot=0x21328c8,
> newtup=0x7fff25711fd0, relmapentry=0x214a7c8) at worker.c:820
> #7  0x008ae609 in apply_handle_update (s=0x7fff25719560) at 
> worker.c:788
> #8  0x008af8b1 in apply_dispatch (s=0x7fff25719560) at worker.c:1362
> #9  0x008afd52 in LogicalRepApplyLoop (last_received=22926832)
> at worker.c:1570
> #10 0x008b0c3a in ApplyWorkerMain (main_arg=0) at worker.c:2114
> #11 0x00869c15 in StartBackgroundWorker () at bgworker.c:813
> #12 0x0087d28f in do_start_bgworker (rw=0x20a07a0) at 
> postmaster.c:5852
> #13 0x0087d63d in maybe_start_bgworkers () at postmaster.c:6078
> #14 0x0087c685 in sigusr1_handler (postgres_signal_arg=10) at
> postmaster.c:5247
> #15 
> #16 0x7ff2a0d458d3 in __select_nocancel () from /lib64/libc.so.6
> #17 0x00878153 in ServerLoop () at postmaster.c:1691
> #18 0x00877b42 in PostmasterMain (argc=3, argv=0x2079120) at
> postmaster.c:1400
> #19 0x0077f256 in main (argc=3, argv=0x2079120) at main.c:210
>
> To reproduce this issue
> run start1.sh
> then execute the below commands on the publisher:
> insert into pgbench_accounts values(1,2);
> update pgbench_accounts set b=30 where a=1;
>

I could reproduce this issue by following the steps you shared. For the bug fix
patch, I basically agree with removing that assertion from
build_replindex_scan_key(), but I think it's better to update the
assertion instead of removing it, and to update the following comment:

* This is not generic routine, it expects the idxrel to be replication
* identity of a rel and meet all limitations associated with that.
*/
static bool
build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
 TupleTableSlot *searchslot)
{

An alternative solution would be that logical replication worker
determines the access path based on its replica identity instead of
seeking the chance to use the primary key as follows:

@@ -981,7 +981,7 @@ FindReplTupleInLocalRel(EState *estate, Relation localrel,

*localslot = table_slot_create(localrel, &estate->es_tupleTable);

-   idxoid = GetRelationIdentityOrPK(localrel);
+   idxoid = RelationGetReplicaIndex(localrel);
Assert(OidIsValid(idxoid) ||
   (remoterel->replident == REPLICA_IDENTITY_FULL));

That way, we can avoid such a mismatch between the replica identity and an
index for index scans. But a downside is that it will end up with a
sequential scan even if the local table has a primary key. IIUC, if
the table has a primary key, a logical replication worker can use
the primary key for updates and deletes even if its replica identity
is FULL, because the columns of the primary key are always a subset of
all columns. So I'll look at this closely but I agree with your 

Re: [Proposal] Global temporary tables

2020-04-20 Thread tushar

On 4/20/20 2:59 PM, 曾文旌 wrote:

Please check my new patch.


Thanks Wenjing. Please refer to the below scenario, where we are getting this error:
ERROR:  could not read block 0 in file "base/16466/t4_16472": read only
0 of 8192 bytes


Steps to reproduce

Connect to a psql terminal and create a table:
(create global temp table t2 (n int primary key) on commit delete rows;)

Exit from the psql terminal and execute (./clusterdb -t t2 -d postgres -v)
Connect to a psql terminal again and execute these SQL statements one by one:
(
cluster verbose t2 using t2_pkey;
cluster verbose t2 ;
alter table t2 add column i int;
cluster verbose t2 ;
cluster verbose t2 using t2_pkey;
create unique index ind on t2(n);
create unique index concurrently  ind1 on t2(n);
select * from t2;
)
The last SQL statement throws this error: ERROR:  could not read block
0 in file "base/16466/t4_16472": read only 0 of 8192 bytes


--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-20 Thread Amit Kapila
On Sun, Apr 19, 2020 at 3:46 PM Juan José Santamaría Flecha
 wrote:
>
>
> On Sat, Apr 18, 2020 at 6:07 AM Amit Kapila  wrote:
>>
>> On Sat, Apr 18, 2020 at 12:14 AM Juan José Santamaría Flecha
>>  wrote:
>> >
>> > We can get a match for those locales in non-ISO format by enumerating 
>> > available locales with EnumSystemLocalesEx(), and trying to find a match.
>>
>> I have not reviewed or tested the new patch but one thing I would like
>> to see is the impact of setting LC_MESSAGES with different locale
>> information.  Basically, the error messages after setting the locale
>> with _create_locale and with the new method being discussed.  This
>> will help us in ensuring that we didn't break anything which was
>> working with prior versions of MSVC.  Can you or someone try to test
>> and share the results of the same?
>
>
> I cannot find a single place where all supported locales are listed, but I 
> have created a small test program (WindowsNLSLocales.c) based on: 
> [_] format locales [1], additional supported language 
> strings [2], and additional supported country and region strings [3]. Based 
> on the results from this test program, it is possible to do a good job 
> with the [_] types using the proposed logic, but the two 
> latter cases are Windows specific, and there is no way around a lookup table.
>
> The attached results (WindowsNLSLocales.ods) come from Windows 10 (1903) and 
> Visual C++ build 1924, 64-bit.
>

I think these are quite intensive tests, but I wonder whether we need to
test some locales with a code page as well?  Generally speaking, in this case it
should not matter, as we are trying to get the locale name, but there is no harm
in testing.  Also, I think it would be good if we could test how this
impacts the error messages, as Davinder is trying to do.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: design for parallel backup

2020-04-20 Thread Amit Kapila
On Wed, Apr 15, 2020 at 9:27 PM Robert Haas  wrote:
>
> Over at 
> http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9eitzwmod2mpqr_rif...@mail.gmail.com
> there's a proposal for a parallel backup patch which works in the way
> that I have always thought parallel backup would work: instead of
> having a monolithic command that returns a series of tarballs, you
> request individual files from a pool of workers. Leaving aside the
> quality-of-implementation issues in that patch set, I'm starting to
> think that the design is fundamentally wrong and that we should take a
> whole different approach. The problem I see is that it makes a
> parallel backup and a non-parallel backup work very differently, and
> I'm starting to realize that there are good reasons why you might want
> them to be similar.
>
> Specifically, as Andres recently pointed out[1], almost anything that
> you might want to do on the client side, you might also want to do on
> the server side. We already have an option to let the client compress
> each tarball, but you might also want the server to, say, compress
> each tarball[2]. Similarly, you might want either the client or the
> server to be able to encrypt each tarball, or compress but with a
> different compression algorithm than gzip. If, as is presently the
> case, the server is always returning a set of tarballs, it's pretty
> easy to see how to make this work in the same way on either the client
> or the server, but if the server returns a set of tarballs in
> non-parallel backup cases, and a set of tarballs in parallel backup
> cases, it's a lot harder to see how that any sort of server-side
> processing should work, or how the same mechanism could be used on
> either the client side or the server side.
>
> So, my new idea for parallel backup is that the server will return
> tarballs, but just more of them. Right now, you get base.tar and
> ${tablespace_oid}.tar for each tablespace. I propose that if you do a
> parallel backup, you should get base-${N}.tar and
> ${tablespace_oid}-${N}.tar for some or all values of N between 1 and
> the number of workers, with the server deciding which files ought to
> go in which tarballs.
>

It is not apparent how you are envisioning this division on the
server side.  I think in the currently proposed patch, each worker on
the client side requests specific files. So, how are workers going
to request such numbered files, and how will we ensure that the work
division among workers is fair?

> This is more or less the naming convention that
> BART uses for its parallel backup implementation, which, incidentally,
> I did not write. I don't really care if we pick something else, but it
> seems like a sensible choice. The reason why I say "some or all" is
> that some workers might not get any of the data for a given
> tablespace. In fact, it's probably desirable to have different workers
> work on different tablespaces as far as possible, to maximize parallel
> I/O, but it's quite likely that you will have more workers than
> tablespaces. So you might end up, with pg_basebackup -j4, having the
> server send you base-1.tar and base-2.tar and base-4.tar, but not
> base-3.tar, because worker 3 spent all of its time on user-defined
> tablespaces, or was just out to lunch.
>
> Now, if you use -Fp, those tar files are just going to get extracted
> anyway by pg_basebackup itself, so you won't even know they exist.
> However, if you use -Ft, you're going to end up with more files than
> before. This seems like something of a wart, because you wouldn't
> necessarily expect that the set of output files produced by a backup
> would depend on the degree of parallelism used to take it. However,
> I'm not sure I see a reasonable alternative. The client could try to
> glue all of the related tar files sent by the server together into one
> big tarfile, but that seems like it would slow down the process of
> writing the backup by forcing the different server connections to
> compete for the right to write to the same file.
>

I think it also depends to some extent on what we decide in the nearby
thread [1] related to support for compression/encryption.  Say, if we
want to support a new compression method on the client side, then we
anyway need to process the contents of each tar file, in which case
combining them into a single tar file might be okay, but I am not sure
what the right thing is here. I think this part needs some more thought.

[1] - 
https://www.postgresql.org/message-id/CA%2BTgmoYr7%2B-0_vyQoHbTP5H3QGZFgfhnrn6ewDteF%3DkUqkG%3DFw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: It is not documented that pg_promote can exit standby mode

2020-04-20 Thread Fujii Masao



On 2020/04/18 2:46, Tom Lane wrote:

Fujii Masao  writes:

Thanks for the report and the patch! It looks good to me.
Barring any objection, I will commit this patch.


It might be worth writing "pg_promote() is called"
(adding parentheses) to make it clearer that a function is being
referred to.  No objection otherwise.


Yes. Also Masahiro-san reported me, off-list, that there are other places
where pg_promote is mentioned without parentheses. I think it's better to
add parentheses there. Attached is the updated version of the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a71ca62463..a14df06292 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4114,7 +4114,7 @@ ANY num_sync 
( 

Re: Allow pg_read_all_stats to read pg_stat_progress_*

2020-04-20 Thread Stephen Frost
Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> On Mon, Apr 20, 2020 at 12:43 PM Andrey M. Borodin 
> wrote:
> > > On 16 Apr 2020, at 17:46, Magnus Hagander 
> > wrote:
> > > If we do that, it may be better that we define "PGSTAT_VIEW_PRIV()" or
> > > something like that and replace all occurrences of the idiomatic
> > > condition with it.
> > >
> > > You mean something like the attached?
> > >
> > > 
> >
> > Is it correct that we use DEFAULT_ROLE_READ_ALL_STATS regardless of
> > inheritance? I'm not familiar with what is inherited and what is not, so I
> > think it's better to ask explicitly.
> >
> > +#define HAS_PGSTAT_PERMISSIONS(role)(is_member_of_role(GetUserId(),
> > DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role))
> 
>  It is consistent with all the other uses of DEFAULT_ROLE_READ_ALL_STATS
> that I can find.

Ugh.  That doesn't make it correct though..  We really should be using
has_privs_of_role() for these cases (and that goes for all of the
default role cases- some of which are correct and others are not, it
seems).
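
As a rough illustration of the distinction (the role name "monitor" is made
up), pg_has_role()'s MEMBER and USAGE modes mirror is_member_of_role() and
has_privs_of_role() respectively:

create role monitor login noinherit;
grant pg_read_all_stats to monitor;
select pg_has_role('monitor', 'pg_read_all_stats', 'MEMBER');  -- true: bare membership
select pg_has_role('monitor', 'pg_read_all_stats', 'USAGE');   -- false: NOINHERIT, so no inherited privileges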

Thanks,

Stephen




Re: Allow pg_read_all_stats to read pg_stat_progress_*

2020-04-20 Thread Magnus Hagander
On Mon, Apr 20, 2020 at 12:43 PM Andrey M. Borodin 
wrote:

>
>
> > On 16 Apr 2020, at 17:46, Magnus Hagander 
> wrote:
> >
> >
> > If we do that, it may be better that we define "PGSTAT_VIEW_PRIV()" or
> > something like that and replace all occurrences of the idiomatic
> > condition with it.
> >
> > You mean something like the attached?
> >
> > 
>
> Is it correct that we use DEFAULT_ROLE_READ_ALL_STATS regardless of
> inheritance? I'm not familiar with what is inherited and what is not, so I
> think it's better to ask explicitly.
>
> +#define HAS_PGSTAT_PERMISSIONS(role)(is_member_of_role(GetUserId(),
> DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role))
>

 It is consistent with all the other uses of DEFAULT_ROLE_READ_ALL_STATS
that I can find.


Besides this, the patch looks good to me.
>

Thanks, I've pushed it now.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Allow pg_read_all_stats to read pg_stat_progress_*

2020-04-20 Thread Andrey M. Borodin



> On 16 Apr 2020, at 17:46, Magnus Hagander  wrote:
> 
> 
> If we do that, it may be better that we define "PGSTAT_VIEW_PRIV()" or
> something like that and replace all occurrences of the idiomatic
> condition with it.
> 
> You mean something like the attached? 
> 
> 

Is it correct that we use DEFAULT_ROLE_READ_ALL_STATS regardless of 
inheritance? I'm not familiar with what is inherited and what is not, so I 
think it's better to ask explicitly.

+#define HAS_PGSTAT_PERMISSIONS(role)(is_member_of_role(GetUserId(), 
DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role))

Besides this, the patch looks good to me.
Thanks!

Best regards, Andrey Borodin,



Re: Poll: are people okay with function/operator table redesign?

2020-04-20 Thread Victor Yegorov
вс, 19 апр. 2020 г. в 20:00, Tom Lane :

> In the meantime I plan to push forward with the markup approach we've
> got.  The editorial content should still work if we find a better
> markup answer, and I'm willing to do the work of replacing the markup
> as long as somebody else figures out what it should be.
>

I am following this thread as a frequent documentation user.

While table 9.5 with functions looks quite nice, I quite dislike 9.4 with
operators.
Previously, I could look up an operator in the leftmost column and read on.
Right now I have to look through the whole table (well, not really, but
still) to find the operator.

-- 
Victor Yegorov


Re: 001_rep_changes.pl stalls

2020-04-20 Thread Fujii Masao




On 2020/04/20 16:02, Noah Misch wrote:

On Mon, Apr 20, 2020 at 02:30:08PM +0900, Fujii Masao wrote:

+* Block if we have unsent data.  XXX For logical replication, 
let
+* WalSndWaitForWal(), handle any other blocking; idle 
receivers need
+* its additional actions.  For physical replication, also 
block if
+* caught up; its send_data does not block.

It might be better to s/WalSndWaitForWal()/send_data()? Because not only
WalSndWaitForWal() but also WalSndWriteData() seems to handle the blocking.
WalSndWriteData() is called also under send_data, i.e., XLogSendLogical().


Thanks for reviewing.  WalSndWriteData() blocks when we have unsent data,
which is the same cause for blocking in WalSndLoop().  Since the comment you
quote says we let WalSndWaitForWal() "handle any other blocking", I don't
think your proposed change makes it more correct.


I was misreading this as something like "any other blocking than
the blocking in WalSndLoop()". Ok, I have no more comments on
the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [Proposal] Global temporary tables

2020-04-20 Thread Prabhat Sahu
> I think this is expected, and user test_gtt does not have permission to
> vacuum the system table.
> This has nothing to do with GTT.
>
> Hi Wenjing, Thanks for the explanation.
Thanks for the new patch. I have verified the crash; it is now resolved.

-- 

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com


Re: where should I stick that backup?

2020-04-20 Thread Amit Kapila
On Thu, Apr 16, 2020 at 3:44 AM Andres Freund  wrote:
>
> > I think having a simple framework in pg_basebackup for plugging in new
> > algorithms would make it noticeably simpler to add LZ4 or whatever
> > your favorite compression algorithm is. And I think having that
> > framework also be able to use shell commands, so that users don't have
> > to wait a decade or more for new choices to show up, is also a good
> > idea.
>
> As long as here's sensible defaults, and so that the user doesn't have
> to specify paths to binaries for the common cases, I'm OK with that. I'm
> not ok with requiring the user to specify shell fragments for things
> that should be built in.
>
> If we think the appropriate way to implement extensible compression is
> by piping to commandline binaries ([1]),
>

I can see how such a scheme could be useful for backups but how do we
restore such a backup?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: DETACH PARTITION and FOR EACH ROW triggers on partitioned tables

2020-04-20 Thread Amit Langote
On Mon, Apr 20, 2020 at 5:49 AM Justin Pryzby  wrote:
> On Sun, Apr 19, 2020 at 03:13:29PM -0400, Alvaro Herrera wrote:
> > On 2020-Apr-18, Justin Pryzby wrote:
> > > I haven't heard a compelling argument for or against either way.
> > >
> > > Maybe the worst behavior might be if, during ATTACH, we searched for a 
> > > matching
> > > trigger, and "merged" it (marked it inherited) if it matched.  That could 
> > > be
> > > bad if someone *wanted* two triggers, which seems unlikely, but to each 
> > > their
> > > own.
> >
> > I think the simplicity argument trumps the other ones, so I agree to go
> > with your v3 patch proposed downthread.
> >
> > What happens if you detach the parent?  I mean, should the trigger
> > removal recurse to children?
>
> I think it should probably exactly undo what CloneRowTriggersToPartition 
> does.
> ..and I guess you're trying to politely say that it didn't.  I tried to fix in
> v4 - please check if that's right.

Looks correct to me.  Maybe have a test cover that?

> > > It occured to me that we don't currently distinguish between a trigger on 
> > > a
> > > child table, and a trigger on a parent table which was recursively 
> > > created on a
> > > child.  That makes sense for indexes and constraints, since the objects 
> > > persist
> > > if the table is detached, so it doesn't matter how it was defined.
> > >
> > > But if we remove trigger during DETACH, then it's *not* the same as a 
> > > trigger
> > > that was defined on the child, and I suggest that in v13 we should make 
> > > that
> > > visible.
> >
> > Hmm, interesting point -- whether the trigger is partition or not is
> > important because it affects what happens on detach.  I agree that we
> > should make it visible.  Is the proposed single bit "PARTITION" good
> > enough, or should we indicate what's the ancestor table that defines the
> > partition?
>
> Yea, it's an obvious thing to do.

This:

+  "false AS tgisinternal"),
+ (pset.sversion >= 13000 ?
+  "pg_partition_root(t.tgrelid) AS parent" :
+  "'' AS parent"),
+ oid);


looks wrong, because the actual partition root may not also be the
trigger parent root, for example:

create table f (a int references p) partition by list (a);
create table f1 partition of f for values in (1) partition by list (a);
create table f11 partition of f1 for values in (1);
create function trigfunc() returns trigger language plpgsql as $$
begin raise notice '%', new; return new; end; $$;
create trigger trig before insert on f1 for each row execute function
trigfunc();
\d f11
 Table "public.f11"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 a  | integer |   |  |
Partition of: f1 FOR VALUES IN (1)
Triggers:
trig BEFORE INSERT ON f11 FOR EACH ROW EXECUTE FUNCTION
trigfunc(), ON TABLE f

Here, ON TABLE should say "f1".

The following gets the correct parent for me:

- (pset.sversion >= 13000 ?
-  "pg_partition_root(t.tgrelid) AS parent" :
-  "'' AS parent"),
+ (pset.sversion >= 130000 ?
+  "(SELECT relid"
+  " FROM pg_trigger, pg_partition_ancestors(t.tgrelid)"
+  " WHERE tgname = t.tgname AND tgrelid = relid"
+  " AND tgparentid = 0) AS parent" :
+  " null AS parent"),

The server version number being compared against was missing a zero in
your patch.

Also, how about, for consistency, making the parent table labeling of
the trigger look similar to that for the foreign constraint, so
instead of:

Triggers:
trig BEFORE INSERT ON f11 FOR EACH ROW EXECUTE FUNCTION
trigfunc(), ON TABLE f1

how about:

Triggers:
TABLE "f1" TRIGGER "trig" BEFORE INSERT ON f11 FOR EACH ROW
EXECUTE FUNCTION trigfunc()

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: [Proposal] Global temporary tables

2020-04-20 Thread 曾文旌


> On 17 Apr 2020, at 7:26 PM, Prabhat Sahu  wrote:
> 
> On Fri, Apr 17, 2020 at 2:44 PM 曾文旌  > wrote:
> 
> I improved the logic of the warning message so that when the gap from the
> relfrozenxid of the GTT is small,
> the warning message is no longer raised.
> 
> Hi Wenjing,
> Thanks for the patch (v26). I have verified the previously reported issues, and 
> they are working fine now.
> Please check the below scenario VACUUM from a non-super user.
> 
> -- Create user "test_gtt", connect it , create gtt, VACUUM gtt and VACUUM / 
> VACUUM FULL
> postgres=# CREATE USER test_gtt;
> CREATE ROLE
> postgres=# \c postgres test_gtt
> You are now connected to database "postgres" as user "test_gtt".
> postgres=> CREATE GLOBAL TEMPORARY TABLE gtt1(c1 int);
> CREATE TABLE
> 
> -- VACUUM gtt is working fine, whereas we are getting a huge number of WARNINGs for VACUUM 
> / VACUUM FULL, as below:
> postgres=> VACUUM gtt1 ;
> VACUUM
> postgres=> VACUUM;
> WARNING:  skipping "pg_statistic" --- only superuser or database owner can 
> vacuum it
> WARNING:  skipping "pg_type" --- only superuser or database owner can vacuum 
> it
> WARNING:  skipping "pg_toast_2600" --- only table or database owner can 
> vacuum it
> WARNING:  skipping "pg_toast_2600_index" --- only table or database owner can 
> vacuum it
> 
> ... ... 
> ... ... 
> 
> WARNING:  skipping "_pg_foreign_tables" --- only table or database owner can 
> vacuum it
> WARNING:  skipping "foreign_table_options" --- only table or database owner 
> can vacuum it
> WARNING:  skipping "user_mapping_options" --- only table or database owner 
> can vacuum it
> WARNING:  skipping "user_mappings" --- only table or database owner can 
> vacuum it
> VACUUM 
I think this is expected, and user test_gtt does not have permission to vacuum 
the system table.
This has nothing to do with GTT.


Wenjing

> 
> -- 
> With Regards,
> Prabhat Kumar Sahu
> EnterpriseDB: http://www.enterprisedb.com 





Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-20 Thread davinder singh
On Mon, Apr 20, 2020 at 10:10 AM Amit Kapila 
wrote:

> Yes, I am planning to look into it.  Actually, I think the main thing
> here is to ensure that we don't break something which was working with
> _create_locale API.

I am trying to understand how lc_messages affects the error messages on
Windows, but I haven't seen any change in the error messages there the
way I do on a Linux system when I change lc_messages.
Can someone help me with this? Please let me know if there is any
configuration setting that I need to adjust.
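
In case it helps, the kind of check being attempted can be sketched as follows
(the locale string is an assumption; it only has an effect in an NLS-enabled
build with the corresponding translation installed, lc_messages is a
superuser-only setting, and on Windows the name may need to be spelled like
'German_Germany.1252'):

set lc_messages = 'de_DE';
select 1/0;   -- the "division by zero" error text should now come out translated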

-- 
Regards,
Davinder
EnterpriseDB: http://www.enterprisedb.com


Re: 001_rep_changes.pl stalls

2020-04-20 Thread Kyotaro Horiguchi
At Mon, 20 Apr 2020 00:59:54 -0700, Noah Misch  wrote in 
> On Mon, Apr 20, 2020 at 04:15:40PM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 18 Apr 2020 00:01:42 -0700, Noah Misch  wrote in 
> > > On Fri, Apr 17, 2020 at 05:06:29PM +0900, Kyotaro Horiguchi wrote:
> > > > At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi 
> > > >  wrote in 
> > > > > By the way, if latch is consumed in WalSndLoop, succeeding call to
> > > > > WalSndWaitForWal cannot be woke-up by the latch-set.  Doesn't that
> > > > > cause missing wakeups? (in other words, overlooking of wakeup latch).
> > > > 
> > > > - Since the only source other than timeout of walsender wakeup is latch,
> > > > - we should avoid wasteful consuming of latch. (It is the same issue
> > > > - with [1]).
> > > > 
> > > > + Since walsender is wokeup by LSN advancement via latch, we should
> > > > + avoid wasteful consuming of latch. (It is the same issue with [1]).
> > > > 
> > > > 
> > > > > If wakeup signal is not remembered on walsender (like
> > > > > InterruptPending), WalSndPhysical cannot enter a sleep with
> > > > > confidence.
> > > 
> > > No; per latch.h, "What must be avoided is placing any checks for 
> > > asynchronous
> > > events after WaitLatch and before ResetLatch, as that creates a race
> > > condition."  In other words, the thing to avoid is calling ResetLatch()
> > > without next examining all pending work that a latch would signal.  Each
> > > walsender.c WaitLatch call does follow the rules.
> > 
> > I didn't mean that, of course.  I thought of more or less the same,
> > with moving the trigger from latch to signal, where the handler sets a
> > flag and calls SetLatch().  If we use the bare latch, we should avoid
> > falsely entering sleep, which also makes things complex.
> 
> I don't understand.  If there's a defect, can you write a test case or
> describe a sequence of events (e.g. at line X, variable Y has value Z)?

Indeed.  Anyway, the current version cannot have such an issue.

Thanks.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Possible cache reference leak by removeExtObjInitPriv

2020-04-20 Thread Kyotaro Horiguchi
At Fri, 17 Apr 2020 13:07:15 -0400, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > Recently a cache reference leak was reported then fixed [1].
> > I happened to notice a similar possible leakage in
> > removeEtObjInitPriv. I haven't found a way to reach the code, but can
> > be forcibly caused by tweaking the condition.
> > Please find the attached.
> 
> Ugh.  recordExtObjInitPriv has the same problem.

Thanks for commit it.

> I wonder whether there is any way to teach Coverity, or some other
> static analyzer, to look for code paths that leak cache refcounts.
> It seems isomorphic to detecting memory leaks, which Coverity is
> reasonably good at.

Indeed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: fixing old_snapshot_threshold's time->xid mapping

2020-04-20 Thread Dilip Kumar
On Mon, Apr 20, 2020 at 12:29 PM Thomas Munro  wrote:
>
> On Mon, Apr 20, 2020 at 6:35 PM Dilip Kumar  wrote:
> > On Mon, Apr 20, 2020 at 11:24 AM Thomas Munro  
> > wrote:
> > >
> > > On Sat, Apr 18, 2020 at 9:27 PM Dilip Kumar  wrote:
> > > > On Sat, Apr 18, 2020 at 11:47 AM Thomas Munro  
> > > > wrote:
> > > Is this an improvement?  I realise that there is still nothing to
> > > actually verify that early pruning has actually happened.  I haven't
> > > thought of a good way to do that yet (stats, page inspection, ...).
> >
> > Could we test the early pruning using the xid-burn patch?  Basically, in
> > xid_by_minute we have some xids with the current epoch.  Now, we burn
> > more than 2 billion xids, and then if we try to vacuum we might hit the
> > case of early pruning, no?  Did you want to test this case, or did you
> > have some other case in mind that you wanted to test?
>
> I mean I want to verify that VACUUM or heap prune actually removed a
> tuple that was visible to an old snapshot.  An idea I just had: maybe
> sto_using_select.spec should check the visibility map (somehow).  For
> example, the sto_using_select.spec (the version in the patch I just
> posted) just checks that after time 00:11, the old snapshot gets a
> snapshot-too-old error.  Perhaps we could add a VACUUM before that,
> and then check that the page has become all visible, meaning that the
> dead tuple our snapshot could see has now been removed.

Okay, got your point.  Can we try to implement some test functions
that just call the visibilitymap_get_status() function internally?  I
agree that we will have to pass the correct block number, but we can
find that using the TID.   Or, for testing, we can create a very small
relation that has just 1 block?
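
For what it's worth, one possible way to check this from SQL without writing a
new C function is the contrib pg_visibility extension; a minimal sketch (the
table name is made up, and the single inserted row keeps the table at one block):

create extension pg_visibility;
create table vm_check (id int);
insert into vm_check values (1);
select (ctid::text::point)[0]::int as blkno from vm_check where id = 1;  -- block number derived from the TID
vacuum vm_check;
select all_visible from pg_visibility_map('vm_check', 0);  -- true once the page is marked all-visible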

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: 001_rep_changes.pl stalls

2020-04-20 Thread Noah Misch
On Mon, Apr 20, 2020 at 04:15:40PM +0900, Kyotaro Horiguchi wrote:
> At Sat, 18 Apr 2020 00:01:42 -0700, Noah Misch  wrote in 
> > On Fri, Apr 17, 2020 at 05:06:29PM +0900, Kyotaro Horiguchi wrote:
> > > At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi 
> > >  wrote in 
> > > > By the way, if latch is consumed in WalSndLoop, succeeding call to
> > > > WalSndWaitForWal cannot be woke-up by the latch-set.  Doesn't that
> > > > cause missing wakeups? (in other words, overlooking of wakeup latch).
> > > 
> > > - Since the only source other than timeout of walsender wakeup is latch,
> > > - we should avoid wasteful consuming of latch. (It is the same issue
> > > - with [1]).
> > > 
> > > + Since walsender is wokeup by LSN advancement via latch, we should
> > > + avoid wasteful consuming of latch. (It is the same issue with [1]).
> > > 
> > > 
> > > > If wakeup signal is not remembered on walsender (like
> > > > InterruptPending), WalSndPhysical cannot enter a sleep with
> > > > confidence.
> > 
> > No; per latch.h, "What must be avoided is placing any checks for 
> > asynchronous
> > events after WaitLatch and before ResetLatch, as that creates a race
> > condition."  In other words, the thing to avoid is calling ResetLatch()
> > without next examining all pending work that a latch would signal.  Each
> > walsender.c WaitLatch call does follow the rules.
> 
> I didn't mean that, of course.  I thought of more or less the same,
> with moving the trigger from latch to signal, where the handler sets a
> flag and calls SetLatch().  If we use the bare latch, we should avoid
> falsely entering sleep, which also makes things complex.

I don't understand.  If there's a defect, can you write a test case or
describe a sequence of events (e.g. at line X, variable Y has value Z)?




Re: 001_rep_changes.pl stalls

2020-04-20 Thread Kyotaro Horiguchi
At Mon, 20 Apr 2020 14:30:08 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2020/04/18 16:01, Noah Misch wrote:
> > On Sat, Apr 18, 2020 at 12:29:58AM +0900, Fujii Masao wrote:
> >>> 4. Keep the WalSndLoop() wait, but condition it on !logical.  This is
> >>> the
> >>> minimal fix, but it crudely punches through the abstraction between
> >>> WalSndLoop() and its WalSndSendDataCallback.
> >>
> >> (4) also looks good because it's simple, if we can redesign those
> >> functions in good shape.
> > Let's do that.  I'm attaching the replacement implementation and the
> > revert of
> > v1.
> 
> Thanks for the patch! Though referencing XLogSendLogical inside
> WalSndLoop()
> might be a bit ugly,, I'm fine with this change because it's simple
> and easier
> to understand.

I thought that if we do this, read_data could return a boolean that
indicates whether to wait for the latch or for an incoming packet, or
return a wake event mask.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: 001_rep_changes.pl stalls

2020-04-20 Thread Kyotaro Horiguchi
At Sat, 18 Apr 2020 00:01:42 -0700, Noah Misch  wrote in 
> On Fri, Apr 17, 2020 at 05:06:29PM +0900, Kyotaro Horiguchi wrote:
> > At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in 
> > > By the way, if latch is consumed in WalSndLoop, succeeding call to
> > > WalSndWaitForWal cannot be woke-up by the latch-set.  Doesn't that
> > > cause missing wakeups? (in other words, overlooking of wakeup latch).
> > 
> > - Since the only source other than timeout of walsender wakeup is latch,
> > - we should avoid wasteful consuming of latch. (It is the same issue
> > - with [1]).
> > 
> > + Since walsender is wokeup by LSN advancement via latch, we should
> > + avoid wasteful consuming of latch. (It is the same issue with [1]).
> > 
> > 
> > > If wakeup signal is not remembered on walsender (like
> > > InterruptPending), WalSndPhysical cannot enter a sleep with
> > > confidence.
> 
> No; per latch.h, "What must be avoided is placing any checks for asynchronous
> events after WaitLatch and before ResetLatch, as that creates a race
> condition."  In other words, the thing to avoid is calling ResetLatch()
> without next examining all pending work that a latch would signal.  Each
> walsender.c WaitLatch call does follow the rules.

I didn't mean that, of course.  I thought of more or less the same,
with moving the trigger from latch to signal, where the handler sets a
flag and calls SetLatch().  If we use the bare latch, we should avoid
falsely entering sleep, which also makes things complex.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: 001_rep_changes.pl stalls

2020-04-20 Thread Noah Misch
On Mon, Apr 20, 2020 at 02:30:08PM +0900, Fujii Masao wrote:
> +  * Block if we have unsent data.  XXX For logical replication, 
> let
> +  * WalSndWaitForWal(), handle any other blocking; idle 
> receivers need
> +  * its additional actions.  For physical replication, also 
> block if
> +  * caught up; its send_data does not block.
> 
> It might be better to s/WalSndWaitForWal()/send_data()? Because not only
> WalSndWaitForWal() but also WalSndWriteData() seems to handle the blocking.
> WalSndWriteData() is called also under send_data, i.e., XLogSendLogical().

Thanks for reviewing.  WalSndWriteData() blocks when we have unsent data,
which is the same cause for blocking in WalSndLoop().  Since the comment you
quote says we let WalSndWaitForWal() "handle any other blocking", I don't
think your proposed change makes it more correct.  Also, if someone wants to
refactor this, the place to look is WalSndWaitForWal(), not any other part of
send_data().




Re: fixing old_snapshot_threshold's time->xid mapping

2020-04-20 Thread Thomas Munro
On Mon, Apr 20, 2020 at 6:35 PM Dilip Kumar  wrote:
> On Mon, Apr 20, 2020 at 11:24 AM Thomas Munro  wrote:
> >
> > On Sat, Apr 18, 2020 at 9:27 PM Dilip Kumar  wrote:
> > > On Sat, Apr 18, 2020 at 11:47 AM Thomas Munro  
> > > wrote:
> > Is this an improvement?  I realise that there is still nothing to
> > actually verify that early pruning has actually happened.  I haven't
> > thought of a good way to do that yet (stats, page inspection, ...).
>
> Could we test the early pruning using the xid-burn patch?  Basically, in
> xid_by_minute we have some xids with the current epoch.  Now, we burn
> more than 2 billion xids, and then if we try to vacuum we might hit the
> case of early pruning, no?  Did you want to test this case, or did you
> have some other case in mind that you wanted to test?

I mean I want to verify that VACUUM or heap prune actually removed a
tuple that was visible to an old snapshot.  An idea I just had: maybe
sto_using_select.spec should check the visibility map (somehow).  For
example, the sto_using_select.spec (the version in the patch I just
posted) just checks that after time 00:11, the old snapshot gets a
snapshot-too-old error.  Perhaps we could add a VACUUM before that,
and then check that the page has become all visible, meaning that the
dead tuple our snapshot could see has now been removed.




Re: where should I stick that backup?

2020-04-20 Thread Amit Kapila
On Mon, Apr 13, 2020 at 5:57 AM Stephen Frost  wrote:
>
> There's a couple of other pieces here that I think bear mentioning.  The
> first is that pgBackRest has an actual 'restore' command- and that works
> with the filters and works with the storage drivers, so what you're
> looking at when it comes to these interfaces isn't just "put a file" but
> it's also "get a file".  That's actually quite important to have when
> you start thinking about these more complicated methods of doing
> backups.
>

I also think it is important to provide a way or interface to restore
the data the user has backed up using whatever new API we provide here.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: fixing old_snapshot_threshold's time->xid mapping

2020-04-20 Thread Dilip Kumar
On Mon, Apr 20, 2020 at 11:24 AM Thomas Munro  wrote:
>
> On Sat, Apr 18, 2020 at 9:27 PM Dilip Kumar  wrote:
> > On Sat, Apr 18, 2020 at 11:47 AM Thomas Munro  
> > wrote:
> > > I think I found another bug in MaintainOldSnapshotTimeMapping(): if
> > > you make time jump by more than old_snapshot_threshold in one go, then
> > > the map gets cleared and then no early pruning or snapshot-too-old
> > > errors happen.  That's why in 002_too_old.pl it currently advances
> > > time by 10 minutes twice, instead of 20 minutes once.  To be
> > > continued.
> >
> > IMHO that doesn't seem to be a problem.  Because even if we jump more
> > than old_snapshot_threshold in one go, we don't clean the complete map,
> > right?  The latest snapshot timestamp will become the head timestamp.
> > So in TransactionIdLimitedForOldSnapshots, if (current_ts -
> > old_snapshot_threshold) is still >= head_timestamp then we can still do
> > early pruning.
>
> Right, thanks.  I got confused about that, and misdiagnosed something
> I was seeing.
>
> Here's a new version:
>
> 0004: Instead of writing a new kind of TAP test to demonstrate
> snapshot-too-old errors, I adjusted the existing isolation tests to
> use the same absolute time control technique.  Previously I had
> invented a way to do isolation tester-like stuff in TAP tests, which
> might be interesting but strange new perl is not necessary for this.
>
> 0005: Truncates the time map when the CLOG is truncated.  Its test is
> now under src/test/module/snapshot_too_old/t/001_truncate.sql.
>
> These apply on top of Robert's patches, but the only dependency is on
> his patch 0001 "Expose oldSnapshotControl.", because now I have stuff
> in src/test/module/snapshot_too_old/test_sto.c that wants to mess with
> that object too.
>
> Is this an improvement?  I realise that there is still nothing to
> actually verify that early pruning has actually happened.  I haven't
> thought of a good way to do that yet (stats, page inspection, ...).

Could we test the early pruning using the xid-burn patch?  Basically, in
xid_by_minute we have some xids with the current epoch.  Now, we burn
more than 2 billion xids, and then if we try to vacuum we might hit the
case of early pruning, no?  Did you want to test this case, or did you
have some other case in mind that you wanted to test?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-20 Thread Fujii Masao



On 2020/03/10 6:56, Andres Freund wrote:

Hi,

On 2020-03-06 16:33:18 -0300, Alvaro Herrera wrote:

On 2020-Mar-06, Michael Paquier wrote:

On Thu, Mar 05, 2020 at 09:40:54AM -0500, Robert Haas wrote:

Seems reasonable, but it would be better if people proposed these
kinds of changes closer to the beginning of the release cycle rather
than in the crush at the end.


+1, to both points.


Why?  Are you saying that there's some actual risk of breaking
something?  We're not even near beta or feature freeze yet.

I'm not seeing the reason for the "please propose this sooner in the
cycle" argument.  It has already been proposed sooner -- seven years
sooner.  We're not waiting for users to complain anymore; clearly nobody
cared.


Yea. There are changes that are so invasive that it's useful to go very
early, but in this case I'm not seeing it?

+1 for removing non-fast promotions.


Patch attached. I will add this into the first CF for v14.


FWIW, I find "fallback promotion" a confusing description.


Yeah, so I changed the subject.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 11e32733c4..35fc700b7b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -298,9 +298,6 @@ boolwal_receiver_create_temp_slot = false;
 /* are we currently in standby mode? */
 bool   StandbyMode = false;
 
-/* whether request for fast promotion has been made yet */
-static bool fast_promote = false;
-
 /*
  * if recoveryStopsBefore/After returns true, it saves information of the stop
  * point here
@@ -6311,7 +6308,7 @@ StartupXLOG(void)
DBState dbstate_at_startup;
XLogReaderState *xlogreader;
XLogPageReadPrivate private;
-   boolfast_promoted = false;
+   boolpromoted = false;
struct stat st;
 
/*
@@ -7698,14 +7695,14 @@ StartupXLOG(void)
 * the rule that TLI only changes in shutdown checkpoints, which
 * allows some extra error checking in xlog_redo.
 *
-* In fast promotion, only create a lightweight end-of-recovery 
record
+* In promotion, only create a lightweight end-of-recovery 
record
 * instead of a full checkpoint. A checkpoint is requested 
later,
 * after we're fully out of recovery mode and already accepting
 * queries.
 */
if (bgwriterLaunched)
{
-   if (fast_promote)
+   if (LocalPromoteIsTriggered)
{
checkPointLoc = ControlFile->checkPoint;
 
@@ -7716,7 +7713,7 @@ StartupXLOG(void)
record = ReadCheckpointRecord(xlogreader, 
checkPointLoc, 1, false);
if (record != NULL)
{
-   fast_promoted = true;
+   promoted = true;
 
/*
 * Insert a special WAL record to mark 
the end of
@@ -7733,7 +7730,7 @@ StartupXLOG(void)
}
}
 
-   if (!fast_promoted)
+   if (!promoted)
RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
  
CHECKPOINT_IMMEDIATE |
  
CHECKPOINT_WAIT);
@@ -7924,12 +7921,12 @@ StartupXLOG(void)
WalSndWakeup();
 
/*
-* If this was a fast promotion, request an (online) checkpoint now. 
This
+* If this was a promotion, request an (online) checkpoint now. This
 * isn't required for consistency, but the last restartpoint might be 
far
 * back, and in case of a crash, recovering from it might take a longer
 * than is appropriate now that we're not in standby mode anymore.
 */
-   if (fast_promoted)
+   if (promoted)
RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
@@ -12532,29 +12529,15 @@ CheckForStandbyTrigger(void)
if (LocalPromoteIsTriggered)
return true;
 
-   if (IsPromoteSignaled())
+   /*
+* In 9.1 and 9.2 the postmaster unlinked the promote file inside the
+* signal handler. It now leaves the file in place and lets the
+* Startup process do the unlink.
+*/
+   if (IsPromoteSignaled() && stat(PROMOTE_SIGNAL_FILE, _buf) == 0)
{
-   /*
-* In 9.1 and 9.2 the postmaster unlinked the promote file 
inside the
-* signal handler. It now leaves the file in place and 

Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-20 Thread Fujii Masao




On 2020/04/18 0:31, Tom Lane wrote:

Kyotaro Horiguchi  writes:

At Fri, 17 Apr 2020 16:03:34 +0900, Fujii Masao  
wrote in

I agree that it might be worth considering the removal of am_sync for
the master branch or v14. But I think that it should not be
back-patched.



Ah! Agreed.


Yeah, that's not necessary to fix the bug.  I'd be inclined to leave
it for v14 at this point.

I don't much like the patch Fujii-san posted, though.  An important part
of the problem, IMO, is that SyncRepGetSyncStandbysPriority is too
complicated and it's unclear what dependencies it has on the set of
priorities in shared memory being consistent.  His patch does not improve
that situation; if anything it makes it worse.


Understood.



If we're concerned about not breaking ABI in the back branches, what
I propose we do about that is just leave SyncRepGetSyncStandbys in
place but not used by the core code, and remove it only in HEAD.
We can do an absolutely minimal fix for the assertion failure, in
case anybody is calling that code, by just dropping the Assert and
letting SyncRepGetSyncStandbys return NIL if it falls out.  (Or we
could let it return the incomplete list, which'd be the behavior
you get today in a non-assert build.)

Also, I realized while re-reading my patch that Kyotaro-san is onto
something about the is_sync_standby flag not being necessary: instead
we can just have the new function SyncRepGetCandidateStandbys return
a reduced count.  I'd initially believed that it was necessary for
that function to return the rejected candidate walsenders along with
the accepted ones, but that was a misunderstanding.  I still don't
want its API spec to say anything about ordering of the result array,
but we don't need to.

So that leads me to the attached.  I propose applying this to the
back branches except for the rearrangement of WALSnd field order.
In HEAD, I'd remove SyncRepGetSyncStandbys and subroutines altogether.


Thanks for making and committing the patch!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: where should I stick that backup?

2020-04-20 Thread Amit Kapila
On Mon, Apr 20, 2020 at 8:18 AM Amit Kapila  wrote:
>
> On Sat, Apr 18, 2020 at 8:35 PM Robert Haas  wrote:
> >
> > On Fri, Apr 17, 2020 at 7:44 PM Andres Freund  wrote:
> > > This suggest that pipes do have a considerably higher overhead on
> > > windows, but that it's not all that terrible if one takes care to use
> > > large buffers in each pipe element.
> > >
> > > It's notable though that even the simplest use of a pipe does add a
> > > considerable overhead compared to using the files directly.
> >
> > Thanks for these results. I think that this shows that it's probably
> > not a great idea to force everything to go through pipes in every
> > case, but on the other hand, there's no reason to be a particularly
> > scared of the performance implications of letting some things go
> > through pipes. For instance, if we decide that LZ4 compression is
> > going to be a good choice for most users, we might want to do that
> > in-process rather than via pipes.
> >
>
> How will the user know how to use this compressed backup?  I mean to
> say, if we use some compression algorithm to compress the data, then the
> user should know how to decompress and use the backup.   IIUC, currently,
> if the user uses the tar format for backup, they can simply untar it
> and start the server, but will that be possible if we provide some
> built-in compression methods like LZ4?
>

One idea could be that we can write something like "BACKUP COMPRESSION:
<method>" in the backup_label file, and
then probably recovery can take care of decompressing it.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com