Re: Protocol problem with GSSAPI encryption?

2020-04-05 Thread Michael Paquier
On Fri, Feb 21, 2020 at 12:35:03AM +, Andrew Gierth wrote:
> > "Stephen" == Stephen Frost  writes:
> 
>  >> I figure something along these lines for the fix. Anyone in a
>  >> position to test this?
> 
>  Stephen> At least at first blush, I tend to agree with your analysis
>  Stephen> and patch.
> 
>  Stephen> I'll see about getting this actually set up and tested in the
>  Stephen> next week or so (and maybe there's some way to also manage to
>  Stephen> have a regression test for it..).
> 
> *poke*

Second *poke*.
--
Michael


signature.asc
Description: PGP signature


001_rep_changes.pl stalls

2020-04-05 Thread Noah Misch
Executive summary: the "MyWalSnd->write < sentPtr" check in WalSndWaitForWal() is
important for promptly updating pg_stat_replication.  When caught up, we
should impose that logic before every sleep.  The one-line fix is to sleep in
WalSndLoop() only when pq_is_send_pending(), not when caught up.


On my regular development machine, src/test/subscription/t/001_rep_changes.pl
stalls for ~10s at this wait_for_catchup:

  $node_publisher->safe_psql('postgres', "DELETE FROM tab_rep");

  # Restart the publisher and check the state of the subscriber which
  # should be in a streaming state after catching up.
  $node_publisher->stop('fast');
  $node_publisher->start;

  $node_publisher->wait_for_catchup('tap_sub');

That snippet emits three notable physical WAL records.  There's a
Transaction/COMMIT at the end of the DELETE, an XLOG/CHECKPOINT_SHUTDOWN, and
an XLOG/FPI_FOR_HINT.

The buildfarm has stalled there, but it happens probably less than half the
time.  Examples[1] showing the stall:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mandrill&dt=2020-03-20%2017%3A09%3A53&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=thorntail&dt=2020-03-22%2019%3A51%3A38&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-19%2003%3A35%3A01&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-20%2015%3A15%3A01&stg=subscription-check

Here's the most-relevant walsender call tree:

WalSndLoop
  XLogSendLogical (caller invokes once per loop iteration, via send_data callback)
    XLogReadRecord (caller invokes once)
      ReadPageInternal (caller invokes twice in this test; more calls are possible)
        logical_read_xlog_page (caller skips when page is same as last call, else invokes 1-2 times via state->read_page() callback, registered in StartLogicalReplication)
          WalSndWaitForWal (caller invokes once; has fast path)


The cause is a race involving the flow of reply messages (send_feedback()
messages) from logical apply worker to walsender.  Here are two sequencing
patterns; the more-indented parts are what differ.  Stalling pattern:

  sender reads Transaction/COMMIT and sends the changes
  receiver applies the changes
  receiver send_feedback() reports progress up to Transaction/COMMIT
    sender accepts the report
    sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are no-ops for logical rep
    sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop()
    receiver wal_receiver_status_interval elapses; receiver reports progress up to Transaction/COMMIT
  sender wakes up, accepts the report
  sender calls WalSndWaitForWal(), which sends a keepalive due to "MyWalSnd->write < sentPtr"
  receiver gets keepalive, send_feedback() reports progress up to XLOG/FPI_FOR_HINT

Non-stalling pattern (more prevalent with lower machine performance):

  sender reads Transaction/COMMIT and sends the changes
    sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are no-ops for logical rep
    sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop()
  receiver applies the changes
  receiver send_feedback() reports progress up to Transaction/COMMIT
  sender wakes up, accepts the report
  sender calls WalSndWaitForWal(), which sends a keepalive due to "MyWalSnd->write < sentPtr"
  receiver gets keepalive, send_feedback() reports progress up to XLOG/FPI_FOR_HINT


The fix is to test "MyWalSnd->write < sentPtr" before more sleeps.  The test
is unnecessary when sleeping due to pq_is_send_pending(); in that case, the
receiver is not idle and will reply before idling.  I changed WalSndLoop() to
sleep only for pq_is_send_pending().  For all other sleep reasons, the sleep
will happen in WalSndWaitForWal().  Attached.  I don't know whether this is
important outside of testing scenarios.  I lean against back-patching, but I
will back-patch if someone thinks this qualifies as a performance bug.
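
A sketch of the idea (illustrative only, not the attached patch; the wait
call is abbreviated from walsender.c):

/*
 * In WalSndLoop(): block here only while the output buffer has unsent
 * data.  When merely caught up, fall through instead, so that a logical
 * walsender reaches WalSndWaitForWal(), whose "MyWalSnd->write < sentPtr"
 * check sends a keepalive before sleeping.
 */
if (pq_is_send_pending())
	(void) WaitLatchOrSocket(MyLatch,
							 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH |
							 WL_SOCKET_WRITEABLE | WL_TIMEOUT,
							 MyProcPort->sock,
							 sleeptime,	/* from WalSndComputeSleeptime() */
							 WAIT_EVENT_WAL_SENDER_MAIN);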

Thanks,
nm

[1] I spot-checked only my animals, since I wanted to experiment on an
affected animal.
Author: Noah Misch 
Commit: Noah Misch 

When WalSndCaughtUp, sleep only in WalSndWaitForWal().

Before sleeping, WalSndWaitForWal() sends a keepalive if MyWalSnd->write
< sentPtr.  That is important in logical replication.  When the latest
physical LSN yields no logical replication messages (a common case),
that keepalive elicits a reply, and processing the reply updates
pg_stat_replication.replay_lsn.  WalSndLoop() lacks that; when
WalSndLoop() slept, replay_lsn advancement could stall until
wal_receiver_status_interval elapsed.  This sometimes stalled
src/test/subscription/t/001_rep_changes.pl for up to 10s.

Reviewed by FIXME.

Discussion: https://postgr.es/m/FIXME

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9e56115..d9c

Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-04-05 Thread Masahiko Sawada
On Tue, 31 Mar 2020 at 14:13, Masahiko Sawada
 wrote:
>
> On Tue, 31 Mar 2020 at 12:58, Amit Kapila  wrote:
> >
> > On Mon, Mar 30, 2020 at 12:31 PM Masahiko Sawada
> >  wrote:
> > >
> > > The patch for vacuum conflicts with recent changes in vacuum. So I've
> > > attached rebased one.
> > >
> >
> > + /*
> > + * Next, accumulate buffer usage.  (This must wait for the workers to
> > + * finish, or we might get incomplete data.)
> > + */
> > + for (i = 0; i < nworkers; i++)
> > + InstrAccumParallelQuery(&lps->buffer_usage[i]);
> > +
> >
> > This should be done for launched workers aka
> > lps->pcxt->nworkers_launched.  I think a similar problem exists in
> > create index related patch.
>
> You're right. Fixed in the new patches.
>
> On Mon, 30 Mar 2020 at 17:00, Julien Rouhaud  wrote:
> >
> > Just minor nitpicking:
> >
> > +   int i;
> >
> > Assert(!IsParallelWorker());
> > Assert(ParallelVacuumIsActive(lps));
> > @@ -2166,6 +2172,13 @@ lazy_parallel_vacuum_indexes(Relation *Irel, 
> > IndexBulkDeleteResult **stats,
> > /* Wait for all vacuum workers to finish */
> > WaitForParallelWorkersToFinish(lps->pcxt);
> >
> > +   /*
> > +* Next, accumulate buffer usage.  (This must wait for the workers to
> > +* finish, or we might get incomplete data.)
> > +*/
> > +   for (i = 0; i < nworkers; i++)
> > +   InstrAccumParallelQuery(&lps->buffer_usage[i]);
> >
> > We now allow declaring a variable in those loops, so it may be better to 
> > avoid
> > declaring i outside the for scope?
>
> We can do that but I was not sure if it's good since other codes
> around there don't use that. So I'd like to leave it for committers.
> It's a trivial change.
>
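
For illustration, the suggested C99-style form would look roughly like this
(sketch only, using the nworkers_launched count per the earlier review
comment):

	for (int i = 0; i < lps->pcxt->nworkers_launched; i++)
		InstrAccumParallelQuery(&lps->buffer_usage[i]);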

I've updated the buffer usage patch for parallel index creation as the
previous patch conflicts with commit df3b181499b40.

This comment in commit df3b181499b40 seems to be the comment that Amit
had replaced with a better sentence when introducing buffer usage
to parallel vacuum.

+   /*
+* Estimate space for WalUsage -- PARALLEL_KEY_WAL_USAGE
+*
+* WalUsage during execution of maintenance command can be used by an
+* extension that reports the WAL usage, such as pg_stat_statements. We
+* have no way of knowing whether anyone's looking at pgWalUsage, so do it
+* unconditionally.
+*/

Would the following sentence in vacuumlazy.c also be better for
parallel create index?

* If there are no extensions loaded that care, we could skip this.  We
* have no way of knowing whether anyone's looking at pgBufferUsage or
* pgWalUsage, so do it unconditionally.

The attached patch applies the above comment change and removes the code
that was used to un-support only buffer usage accumulation.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


bufferusage_create_index_v3.patch
Description: Binary data


Re: 2pc leaks fds

2020-04-05 Thread Andres Freund
Hi,

On 2020-04-06 14:26:48 +0900, Michael Paquier wrote:
> 2PC shines with the code of xlogreader.c in this case because it keeps
> opening and closing XLogReaderState for a short amount of time.  So it
> is not surprising to me to see this error only months after the fact
> because recovery or pg_waldump just use one XLogReaderState.

Well, it doesn't exactly signal that people (including me, up to just
now) are testing their changes all that carefully...


> From what I can see, the error is that the code only bothers closing
> WALOpenSegment->seg when switching to a new segment, but we need also
> to close it when finishing the business in XLogReaderFree().

Yea, I came to the same conclusion and locally fixed it the same way
(except having the close a bit earlier in XLogReaderFree()).


> diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
> index f3fea5132f..7e25e2050a 100644
> --- a/src/backend/access/transam/xlogreader.c
> +++ b/src/backend/access/transam/xlogreader.c
> @@ -144,6 +144,9 @@ XLogReaderFree(XLogReaderState *state)
>   if (state->main_data)
>   pfree(state->main_data);
>  
> + if (state->seg.ws_file >= 0)
> + close(state->seg.ws_file);
> +
>   pfree(state->errormsg_buf);
>   if (state->readRecordBuf)
>   pfree(state->readRecordBuf);

But I'm not sure it's quite the right idea. I'm not sure I fully
understand the design of 0dc8ead46, but it looks to me like it's
intended to allow users of the interface to have different ways of
opening files. If we just close() the fd that'd be a bit more limited.

OTOH, I think all but one (XLogPageRead()) of the current users of
XLogReader use WALRead(), which also close()s the fd (before calling the
WALSegmentOpen callback).


The XLogReader code flow has gotten quite complicated
:(. XLogReaderReadRecord()-> state->read_page() ->
logical_read_xlog_page etc -> WALRead() -> wal_segment_open callback etc.

There's been a fair bit of change, making the interface more generic /
powerful / reducing duplication, but not a lot of added / adapted
comments in the header...

Greetings,

Andres Freund




Re: 2pc leaks fds

2020-04-05 Thread Michael Paquier
On Sun, Apr 05, 2020 at 07:56:51PM -0700, Andres Freund wrote:
> I found this while trying to benchmark the effect of my snapshot changes
> on 2pc. I just used the attached pgbench file.
> 
> I've not yet reviewed the change sufficiently to pinpoint the issue.

Indeed.  It takes seconds to show up.

> It's a bit sad that nobody has hit this in the last few months :(.

2PC shines with the code of xlogreader.c in this case because it keeps
opening and closing XLogReaderState for a short amount of time.  So it
is not surprising to me to see this error only months after the fact
because recovery or pg_waldump just use one XLogReaderState.  From
what I can see, the error is that the code only bothers closing
WALOpenSegment->seg when switching to a new segment, but we need also
to close it when finishing the business in XLogReaderFree().

I am adding an open item, and attached is a patch to take care of the
problem.  Thoughts?
--
Michael
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f3fea5132f..7e25e2050a 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -144,6 +144,9 @@ XLogReaderFree(XLogReaderState *state)
 	if (state->main_data)
 		pfree(state->main_data);
 
+	if (state->seg.ws_file >= 0)
+		close(state->seg.ws_file);
+
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);


signature.asc
Description: PGP signature


Re: backup manifests and contemporaneous buildfarm failures

2020-04-05 Thread Fabien COELHO



Hello,


>> Do I need to precede those with some recursive chmod commands? Perhaps
>> the client should refuse to run if there is still something left after
>> these.
>
> I think the latter would be a very good idea, just so that this sort of
> failure is less obscure.  Not sure about whether a recursive chmod is
> really going to be worth the cycles.  (On the other hand, the normal
> case should be that there's nothing there anyway, so maybe it's not
> going to be costly.)


Could it be a two-stage process to minimize cost but still be resilient?

  rmtree($DIR);
  if (-d $DIR) {
      warn "could not remove $DIR; retrying after recursive chmod";
      chmodtree($DIR);        # recursive chmod helper, to be written
      rmtree($DIR);
      if (-d $DIR) {
          die "could not remove $DIR even after chmod";
      }
  }

--
Fabien.




SyncRepLock acquired exclusively in default configuration

2020-04-05 Thread Andres Freund
Hi,

Due to the change below, when using the default postgres configuration
of synchronous_commit = on and max_wal_senders = 10, we will now acquire a new
exclusive lwlock after writing a commit record.

commit 48c9f4926562278a2fd2b85e7486c6d11705f177
Author: Simon Riggs 
Date:   2017-12-29 14:30:33 +

Fix race condition when changing synchronous_standby_names

A momentary window exists when synchronous_standby_names
changes that allows commands issued after the change to
continue to act as async until the change becomes visible.
Remove the race by using more appropriate test in syncrep.c

Author: Asim Rama Praveen and Ashwin Agrawal
Reported-by: Xin Zhang, Ashwin Agrawal, and Asim Rama Praveen
Reviewed-by: Michael Paquier, Masahiko Sawada

As far as I can tell there was no discussion about the added contention
due to this change in the relevant thread [1].

The default configuration has an empty synchronous_standby_names. Before
this change we'd fall out of SyncRepWaitForLSN() before acquiring
SyncRepLock in exclusive mode. Now we don't anymore.


I'm really not ok with unnecessarily adding an exclusive lock
acquisition to such a crucial path.
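
For illustration, the kind of unlocked fast path this implies might look
roughly like the following (a sketch only; it ignores the memory-ordering
concern that 48c9f4926 set out to address):

/*
 * Bail out of SyncRepWaitForLSN() before touching SyncRepLock when sync
 * replication cannot possibly be in use.
 */
if (!SyncRepRequested() || !WalSndCtl->sync_standbys_defined)
	return;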

Greetings,

Andres Freund

[1] 
https://postgr.es/m/CABrsG8j3kPD%2Bkbbsx_isEpFvAgaOBNGyGpsqSjQ6L8vwVUaZAQ%40mail.gmail.com




Re: Reinitialize stack base after fork (for the benefit of rr)?

2020-04-05 Thread Peter Geoghegan
On Sun, Apr 5, 2020 at 8:56 PM Andres Freund  wrote:
> Perhaps put it on a wiki page?

I added a new major section to the "getting a stack trace" wiki page:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Recording_Postgres_using_rr_Record_and_Replay_Framework

Feel free to add to and edit this section yourself.

> Were you doing this because of occasional failures in autovacuum
> workers?  If so, that shouldn't be necessary after the stack base change
> (previously workers IIRC also could start with the wrong stack base -
> but didn't end up checking stack depth except for expression indexes).

No, just a personal preference for things like this.

-- 
Peter Geoghegan




Re: Index Skip Scan

2020-04-05 Thread Dilip Kumar
On Sun, Apr 5, 2020 at 9:39 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
> >
> > I was just wondering how the distinct will work with the "skip scan"
> > if we have some filter? I mean every time we select the unique row
> > based on the prefix key and that might get rejected by an external
> > filter right?
>
> Not exactly. In the case of an index-only scan, we skip to the first
> unique position, and then use already existing functionality
> (_bt_readpage with stepping to the next pages) to filter out those
> records that do not pass the condition.

I agree, but that will work only if we have a valid index clause, and the
"b%100=0" condition will not create an index clause, right?  However,
if we change the query to
select distinct(a) from t where b=100 then it works fine, because this
condition will create an index clause.

> There are even a couple of tests
> in the patch for this. In the case of an index scan, when there are some
> conditions, the current implementation does not consider skipping.
>
> > So I tried an example to check this.
>
> Can you tell on which version of the patch you were testing?

I have tested on v33.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Reinitialize stack base after fork (for the benefit of rr)?

2020-04-05 Thread Andres Freund
Hi,

On 2020-04-05 20:35:50 -0700, Peter Geoghegan wrote:
> I have found that it's useful to use rr to debug Postgres by following
> certain recipes. I'll share some of the details now, in case anybody
> else wants to start using rr and isn't sure where to start.

Perhaps put it on a wiki page?

> I have a script that records a postgres session using rr with these options:
> 
> rr record -M /code/postgresql/$BRANCH/install/bin/postgres \
>   -D /code/postgresql/$BRANCH/data \
>   --log_line_prefix="%m %p " \
>   --autovacuum=off \

Were you doing this because of occasional failures in autovacuum
workers?  If so, that shouldn't be necessary after the stack base change
(previously workers IIRC also could start with the wrong stack base -
but didn't end up checking stack depth except for expression indexes).

Greetings,

Andres Freund




Re: Reinitialize stack base after fork (for the benefit of rr)?

2020-04-05 Thread Peter Geoghegan
On Sun, Apr 5, 2020 at 6:54 PM Andres Freund  wrote:
> I just pushed that.

Great!

I have found that it's useful to use rr to debug Postgres by following
certain recipes. I'll share some of the details now, in case anybody
else wants to start using rr and isn't sure where to start.

I have a script that records a postgres session using rr with these options:

rr record -M /code/postgresql/$BRANCH/install/bin/postgres \
  -D /code/postgresql/$BRANCH/data \
  --log_line_prefix="%m %p " \
  --autovacuum=off \
  --effective_cache_size=1GB \
  --random_page_cost=4.0 \
  --work_mem=4MB \
  --maintenance_work_mem=64MB \
  --fsync=off \
  --log_statement=all \
  --log_min_messages=DEBUG5 \
  --max_connections=50 \
  --shared_buffers=32MB

Most of these settings were taken from a similar script that I use to
run Valgrind, so the particulars may not matter much -- though it's
useful to make the server logs as verbose as possible (you'll see why
in a minute).

I find it quite practical to run "make installcheck" against the
server, recording the entire execution. I find that it's not that much
slower than just running the tests against a regular debug build of
Postgres. It's still much faster than Valgrind, for example.
(Replaying the recording seems to be where having a high end machine
helps a lot.)

Once the tests are done, I stop Postgres in the usual way (Ctrl + C).
The recording is saved to the $HOME/.local/share/rr/ directory on my
Linux distro -- rr creates a directory for each distinct recording in
this parent directory. rr also maintains a symlink (latest-trace) that
points to the latest recording directory, which I rely on most of the
time when replaying a recording (it's the default). I am careful to
not leave too many recordings around, since they're large enough that
that could become a concern.

The record/Postgres terminal has output that looks like this:

[rr 1786705 1241867]2020-04-04 21:55:05.018 PDT 1786705 DEBUG:
CommitTransaction(1) name: unnamed; blockState: STARTED; state:
INPROGRESS, xid/subid/cid: 63992/1/2
[rr 1786705 1241898]2020-04-04 21:55:05.019 PDT 1786705 DEBUG:
StartTransaction(1) name: unnamed; blockState: DEFAULT; state:
INPROGRESS, xid/subid/cid: 0/1/0
[rr 1786705 1241902]2020-04-04 21:55:05.019 PDT 1786705 LOG:
statement: CREATE TYPE test_type_empty AS ();
[rr 1786705 1241906]2020-04-04 21:55:05.020 PDT 1786705 DEBUG:
CommitTransaction(1) name: unnamed; blockState: STARTED; state:
INPROGRESS, xid/subid/cid: 63993/1/1
[rr 1786705 1241936]2020-04-04 21:55:05.020 PDT 1786705 DEBUG:
StartTransaction(1) name: unnamed; blockState: DEFAULT; state:
INPROGRESS, xid/subid/cid: 0/1/0
[rr 1786705 1241940]2020-04-04 21:55:05.020 PDT 1786705 LOG:
statement: DROP TYPE test_type_empty;
[rr 1786705 1241944]2020-04-04 21:55:05.021 PDT 1786705 DEBUG:  drop
auto-cascades to composite type test_type_empty
[rr 1786705 1241948]2020-04-04 21:55:05.021 PDT 1786705 DEBUG:  drop
auto-cascades to type test_type_empty[]
[rr 1786705 1241952]2020-04-04 21:55:05.021 PDT 1786705 DEBUG:
MultiXact: setting OldestMember[2] = 9
[rr 1786705 1241956]2020-04-04 21:55:05.021 PDT 1786705 DEBUG:
CommitTransaction(1) name: unnamed; blockState: STARTED; state:
INPROGRESS, xid/subid/cid: 63994/1/3

The part of each log line in square brackets comes from rr (since we
used -M when recording) -- the first number is a PID, the second an
event number. I usually don't care about the PIDs, though, since the
event number alone unambiguously identifies a particular "event" in a
particular backend (rr recordings are single threaded, even when there
are multiple threads or processes). Suppose I want to get to the
"CREATE TYPE test_type_empty AS ();" query -- I can get to the end of
the query by replaying the recording with this option:

$ rr replay -M -g 1241902

Replaying the recording like this takes me to the point where the
Postgres backend prints the log message at the end of executing the
query I mentioned -- I get a familiar gdb debug server (rr implements
a gdb backend). This isn't precisely the point of execution that
interests me, but it's close enough. I can easily set a breakpoint to
the precise function I'm interested in, and then "reverse-continue" to
get there by going backwards.

I can also find the point where a particular backend starts by using
the fork option instead. So for the PID 1786705, that would look like:

$ rr replay -M -f 1786705

(Don't try to use the similar -p option, since that starts a debug
server when the pid has been exec'd.)

rr really shines when debugging things like tap tests, where there is
complex scaffolding that may run multiple Postgres servers. You can
run an entire "rr record make check", without having to worry about
how that scaffolding works. Once you have useful event numbers to work
off of, it doesn't take too long to get an interactive debugging
session in the backend of interest by applying the same techniques.

Note that saving the output of a recording using standard tools like

Re: WAL usage calculation patch

2020-04-05 Thread Amit Kapila
On Sat, Apr 4, 2020 at 2:50 PM Julien Rouhaud  wrote:
>
> On Sat, Apr 04, 2020 at 02:39:32PM +0530, Amit Kapila wrote:
> > On Sat, Apr 4, 2020 at 2:24 PM Julien Rouhaud  wrote:
> > > >
> > > > We can add if we want but I am not able to convince myself for that.
> > > > Do you have any use case in mind?  I think in most of the cases
> > > > (except for hint-bit WAL) it will be zero. If we are not sure of this
> > > > we can also discuss it separately in a new thread once this
> > > > patch-series is committed and see if anybody else sees the value of it
> > > > and if so adding the code should be easy.
> > >
> > >
> > > I'm mostly thinking of people trying to investigate possible slowdowns on 
> > > a
> > > hot-standby replica with a primary without wal_log_hints.  If they 
> > > explicitly
> > > ask for WAL information, we should provide them, even if it's quite 
> > > unlikely to
> > > happen.
> > >
> >
> > Yeah, possible but I am not completely sure.  I would like to hear the
> > opinion of others if any before adding code for this.  How about if we
> > first commit pg_stat_statements and wait for this till Monday and if
> > nobody responds we can commit the current patch but would start a new
> > thread and try to get the opinion of others?
>
>
> I'm fine with it.
>

I have pushed pg_stat_statements and Explain related patches.  I am
now looking into (auto)vacuum patch and have few comments.

@@ -614,6 +616,9 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,

  TimestampDifference(starttime, endtime, &secs, &usecs);

+ memset(&walusage, 0, sizeof(WalUsage));
+ WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
+
  read_rate = 0;
  write_rate = 0;
  if ((secs > 0) || (usecs > 0))
@@ -666,7 +671,13 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
  (long long) VacuumPageDirty);
  appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate:
%.3f MB/s\n"),
  read_rate, write_rate);
- appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));
+ appendStringInfo(&buf, _("system usage: %s\n"), pg_rusage_show(&ru0));
+ appendStringInfo(&buf,
+ _("WAL usage: %ld records, %ld full page writes, "
+UINT64_FORMAT " bytes"),
+ walusage.wal_records,
+ walusage.wal_num_fpw,
+ walusage.wal_bytes);

Here, we are not displaying Buffers-related data, so why do we think
it is important to display WAL data?  I see some point in displaying
Buffers and WAL data in a vacuum (verbose), but I feel it is better to
make a case for both statistics together rather than just
displaying one and leaving out the other.  The other change related to
autovacuum stats seems okay to me.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




2pc leaks fds

2020-04-05 Thread Andres Freund
Hi,

Using 2PC with master very quickly leads to:

2020-04-05 19:42:18.368 PDT [2298126][5/2009:0] LOG:  out of file descriptors: 
Too many open files; release and retry
2020-04-05 19:42:18.368 PDT [2298126][5/2009:0] STATEMENT:  COMMIT PREPARED 
'ptx_2';

This started with:

commit 0dc8ead46363fec6f621a12c7e1f889ba73b55a9 (HEAD -> master)
Author: Alvaro Herrera 
Date:   2019-11-25 15:04:54 -0300

Refactor WAL file-reading code into WALRead()


I found this while trying to benchmark the effect of my snapshot changes
on 2pc. I just used the attached pgbench file.

andres@awork3:~/build/postgres/dev-assert/vpath$ pgbench -n -s 500 -c 4 -j 4 -T 
10 -P1 -f ~/tmp/pgbench-write-2pc.sql
progress: 1.0 s, 3723.8 tps, lat 1.068 ms stddev 0.305
client 2 script 0 aborted in command 8 query 0: ERROR:  could not seek to end 
of file "base/14036/16396": Too many open files
client 1 script 0 aborted in command 8 query 0: ERROR:  could not seek to end 
of file "base/14036/16396": Too many open files
client 3 script 0 aborted in command 8 query 0: ERROR:  could not seek to end 
of file "base/14036/16396": Too many open files
client 0 script 0 aborted in command 8 query 0: ERROR:  could not seek to end 
of file "base/14036/16396": Too many open files
transaction type: /home/andres/tmp/pgbench-write-2pc.sql

I've not yet reviewed the change sufficiently to pinpoint the issue.


It's a bit sad that nobody has hit this in the last few months :(.

Greetings,

Andres Freund


pgbench-write-2pc.sql
Description: application/sql


Re: ALTER tbl rewrite loses CLUSTER ON index

2020-04-05 Thread Michael Paquier
On Fri, Apr 03, 2020 at 03:54:38PM +0900, Michael Paquier wrote:
> Now, would it be better to apply the refactoring patch for HEAD before
> feature freeze, or are people fine if this is committed a bit after?
> Patch 0002 is neither a new feature, nor an actual bug, and just some
> code cleanup, but I am a bit worried about applying that cleanup on
> HEAD just after the freeze.

I have worked more on this one this morning and just applied the bug
fix down to 9.5, and the refactoring on HEAD.  Thanks!
--
Michael


signature.asc
Description: PGP signature


Re: Reinitialize stack base after fork (for the benefit of rr)?

2020-04-05 Thread Andres Freund
On 2020-04-04 21:02:56 -0700, Peter Geoghegan wrote:
> On Fri, Mar 27, 2020 at 11:22 AM Andres Freund  wrote:
> > I've found rr [1] very useful to debug issues in postgres. The ability
> > to hit a bug, and then e.g. identify a pointer with problematic
> > contents, set a watchpoint on its contents, and reverse-continue is
> > extremely powerful.
> 
> I agree that rr is very useful. It would be great if we had a totally
> smooth workflow for debugging using rr.

I just pushed that.




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-04-05 Thread Kyotaro Horiguchi
At Sat, 4 Apr 2020 15:32:12 -0700, Noah Misch  wrote in 
> On Sat, Apr 04, 2020 at 06:24:34PM -0400, Tom Lane wrote:
> > Shouldn't the CF entry get closed?
> 
> Once the buildfarm is clean for a day, sure.  The buildfarm has already
> revealed a missing perl2host call.

Thank you for (re-) committing this and the following fix. I hope this
doesn't bring in another failure.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: nbtree: assertion failure in _bt_killitems() for posting tuple

2020-04-05 Thread Peter Geoghegan
On Tue, Mar 24, 2020 at 1:00 AM Anastasia Lubennikova
 wrote:
> > I think you're right. However, it still seems like we should check
> > that "kitem->indexOffset" is consistent among all of the BTScanPosItem
> > entries that we have for each TID that we believe to be from the same
> > posting list tuple.

The assertion failure happens in the logical replication worker
because it uses a dirty snapshot, which cannot release the pin per
commit 2ed5b87f. This means that the leaf page can change between the
time that we observe an item is dead, and the time we reach
_bt_killitems(), even though _bt_killitems() does get to kill items.

I am thinking about pushing a fix along the lines of the attached
patch. This preserves the assertion, while avoiding the check in cases
where it doesn't apply, such as when a dirty snapshot is in use.

-- 
Peter Geoghegan


v1-0001-Fix-assertion-failure-in-_bt_killitems.patch
Description: Binary data


Re: CLOBBER_CACHE_ALWAYS regression instability

2020-04-05 Thread Andres Freund
Hi,

On 2020-04-05 19:54:19 -0400, Alvaro Herrera wrote:
> Isn't it the case that you can create an inner block with a constant
> whose size is determined by a containing block's variable?  I mean as in
> the attached, which refuses to compile because of our -Werror=vla -- but
> if I remove it, it compiles fine and works in my system.

IIRC msvc doesn't support VLAs. And there's generally a slow push
towards deprecating them (they've e.g. been moved to optional in C11).

They don't tend to make a lot of sense for sizes that aren't tightly
bound. In contrast to palloc etc, there's no good way to catch
allocation errors. Most of the time you'll just get a SIGBUS or such,
but sometimes you'll just end up overwriting data (if the allocation is
large enough to not touch the guard pages).

Both alloca/vlas also add some per-call overhead.

Allocating the common size on-stack, and the uncommon ones on heap
should be cheaper, and handles the cases of large allocations much
better.
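
A standalone illustration of that pattern (plain C, not PostgreSQL code;
error handling omitted):

#include <stdlib.h>
#include <string.h>

#define COMMON_MAX 16

static void
process_args(const int *args, int nargs)
{
	int			stack_buf[COMMON_MAX];
	int		   *buf = stack_buf;

	/* common, small case stays on the stack; rare, large case goes to the heap */
	if (nargs > COMMON_MAX)
		buf = malloc(sizeof(int) * nargs);

	memcpy(buf, args, sizeof(int) * nargs);
	/* ... work with buf[0..nargs-1] ... */

	if (buf != stack_buf)
		free(buf);
}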

Greetings,

Andres Freund




archive recovery fetching wrong segments

2020-04-05 Thread Grigory Smolkin

Hello, hackers!

I'm investigating complaints from our clients about archive recovery
speed being very slow, and I've noticed a really strange and, I think, a
very dangerous recovery behavior.


When running multi-timeline archive recovery, for every requested segno the
startup process iterates through every timeline in the restore target
timeline history, starting from the highest timeline and ending at the
current one, and tries to fetch the segno in question from each timeline.

Consider the following example.
Timelines:
ARCHIVE INSTANCE 'node'
 

 TLI  Parent TLI  Switchpoint  Min Segno  Max Segno  N segments  Size   Zratio  N backups  Status
 3    2           0/AEFFEDE0   000300AE   000300D5   40          41MB   15.47   0          OK
 2    1           0/A08768D0   000200A0   000200AE   15          14MB   17.24   0          OK
 1    0           0/0          00010001   000100BB   187         159MB  18.77   1          OK


Backup:

 Instance  Version  ID      Recovery Time           Mode  WAL Mode  TLI  Time  Data  WAL   Zratio  Start LSN  Stop LSN  Status
 node      11       Q8C8IH  2020-04-06 02:13:31+03  FULL  ARCHIVE   1/0  3s    23MB  16MB  1.00    0/228      0/3B8     OK



So when we are trying to restore this backup, located on Timeline 1, to 
the restore target on Timeline 3, we are getting this in the PostgreSQL 
log:


2020-04-05 23:24:36 GMT [28508]: [5-1] LOG:  restored log file "0003.history" from archive
INFO: PID [28511]: pg_probackup archive-get WAL file: 00030002, remote: none, threads: 1/1, batch: 20
ERROR: PID [28511]: pg_probackup archive-get failed to deliver WAL file 00030002, prefetched: 0/20, time elapsed: 0ms
INFO: PID [28512]: pg_probackup archive-get WAL file: 00020002, remote: none, threads: 1/1, batch: 20
ERROR: PID [28512]: pg_probackup archive-get failed to deliver WAL file 00020002, prefetched: 0/20, time elapsed: 0ms
INFO: PID [28513]: pg_probackup archive-get WAL file: 00010002, remote: none, threads: 1/1, batch: 20
INFO: PID [28513]: pg_probackup archive-get copied WAL file 00010002
2020-04-05 23:24:36 GMT [28508]: [6-1] LOG:  restored log file "00010002" from archive

...

Before requesting 00010002, recovery tries to fetch
00030002 and 00020002, and that goes for
every segment restored from the archive.
This tremendously slows down recovery, especially if the archive is
located on a remote machine with a high-latency network.
It may also lead to feeding recovery a wrong WAL segment, located
on the next timeline.


Is there a reason behind this behavior?

I've also attached a patch, which fixes this issue for me, but I'm not
sure that the chosen approach is sound and doesn't break something.


--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6cd4fde2b..b4ddb246c2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3714,12 +3714,30 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
 	foreach(cell, tles)
 	{
 		TimeLineID	tli = ((TimeLineHistoryEntry *) lfirst(cell))->tli;
+		XLogRecPtr	switchpoint = ((TimeLineHistoryEntry *) lfirst(cell))->begin;
 
 		if (tli < curFileTLI)
 			break;/* don't bother looking at too-old TLIs */
 
 		if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
 		{
+			/* If current timeline is from the future ...  */
+			if (tli > curFileTLI && switchpoint != InvalidXLogRecPtr)
+			{
+				XLogSegNo	switchSegno;
+
+				XLByteToSeg(switchpoint, switchSegno, wal_segment_size);
+
+				/*
+				 * ... and requested segno is not the segment with switchpoint,
+				 * then skip current timeline.
+				 */
+				if (segno < switchSegno)
+				{
+					elog(WARNING, "Skipping timeline: %u, switch segno: %u, current segno: %u", tli, switchSegno, segno);
+					continue;
+				}
+			}
+
+
 			fd = XLogFileRead(segno, emode, tli,
 			  XLOG_FROM_ARCHIVE, true);
 			if (fd != -1)


Re: CLOBBER_CACHE_ALWAYS regression instability

2020-04-05 Thread Alvaro Herrera
On 2020-Apr-05, Tom Lane wrote:

> What I wish we had was alloca(), so you don't need a FUNC_MAX_ARGS-sized
> array to parse a two-argument function call.  Too bad C99 didn't add
> that.  (But some sniffing around suggests that an awful lot of systems
> have it anyway ... even MSVC.  Hmmm.)

Isn't it the case that you can create an inner block with a constant
whose size is determined by a containing block's variable?  I mean as in
the attached, which refuses to compile because of our -Werror=vla -- but
if I remove it, it compiles fine and works in my system.


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9c3b6ad916..415ea17c68 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -103,7 +103,6 @@ ParseFuncOrColumn(ParseState *pstate, List *funcname, List *fargs,
 	Node	   *first_arg = NULL;
 	int			nargs;
 	int			nargsplusdefs;
-	Oid			actual_arg_types[FUNC_MAX_ARGS];
 	Oid		   *declared_arg_types;
 	List	   *argnames;
 	List	   *argdefaults;
@@ -115,6 +114,9 @@ ParseFuncOrColumn(ParseState *pstate, List *funcname, List *fargs,
 	char		aggkind = 0;
 	ParseCallbackState pcbstate;
 
+	{
+		Oid		actual_arg_types[list_length(fargs)];
+
 	/*
 	 * If there's an aggregate filter, transform it using transformWhereClause
 	 */
@@ -888,6 +890,7 @@ ParseFuncOrColumn(ParseState *pstate, List *funcname, List *fargs,
 	/* if it returns a set, remember it for error checks at higher levels */
 	if (retset)
 		pstate->p_last_srf = retval;
+	}
 
 	return retval;
 }


Re: Make MemoryContextMemAllocated() more precise

2020-04-05 Thread Andres Freund
Hi,

On 2020-03-27 17:21:10 -0700, Jeff Davis wrote:
> Attached refactoring patch. There's enough in here that warrants
> discussion that I don't think this makes sense for v13 and I'm adding
> it to the July commitfest.

IDK, adding a commit to v13 that we know we should do architecturally
differently in v14, when the difference in complexity between the two
patches isn't actually *that* big...

I'd like to see others jump in here...


> I still think we should do something for v13, such as the originally-
> proposed patch[1]. It's not critical, but it simply reports a better
> number for memory consumption. Currently, the memory usage appears to
> jump, often right past work mem (by a reasonable but noticable amount),
> which could be confusing.

Is that really a significant issue for most work mem sizes? Shouldn't
the way we increase sizes lead to the max difference between the
measurements to be somewhat limited?


>   * there's a new MemoryContextCount() that simply calculates the
> statistics without printing anything, and returns a struct
> - it supports flags to indicate which stats should be
>   calculated, so that some callers can avoid walking through
>   blocks/freelists
>   * it adds a new statistic for "new space" (i.e. untouched)
>   * it eliminates specialization of the memory context printing
> - the only specialization was for generation.c to output the
>   number of chunks, which can be done easily enough for the
>   other types, too

That sounds like a good direction.



> + if (flags & MCXT_STAT_NBLOCKS)
> + counters.nblocks = nblocks;
> + if (flags & MCXT_STAT_NCHUNKS)
> + counters.nchunks = set->nChunks;
> + if (flags & MCXT_STAT_FREECHUNKS)
> + counters.freechunks = freechunks;
> + if (flags & MCXT_STAT_TOTALSPACE)
> + counters.totalspace = set->memAllocated;
> + if (flags & MCXT_STAT_FREESPACE)
> + counters.freespace = freespace;
> + if (flags & MCXT_STAT_NEWSPACE)
> + counters.newspace = set->blocks->endptr - set->blocks->freeptr;

I'd spec it so that context implementations are allowed to
unconditionally fill fields, even when the flag isn't specified. The
branches quoted don't buy us anything...



> diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
> index c9f2bbcb367..cc545852968 100644
> --- a/src/include/nodes/memnodes.h
> +++ b/src/include/nodes/memnodes.h
> @@ -29,11 +29,21 @@
>  typedef struct MemoryContextCounters
>  {
>  	Size		nblocks;		/* Total number of malloc blocks */
> +	Size		nchunks;		/* Total number of chunks (used+free) */
>  	Size		freechunks;		/* Total number of free chunks */
>  	Size		totalspace;		/* Total bytes requested from malloc */
>  	Size		freespace;		/* The unused portion of totalspace */
> +	Size		newspace;		/* Allocated but never held any chunks */

I'd add some reasoning as to why this is useful.


>  } MemoryContextCounters;
>  
> +#define MCXT_STAT_NBLOCKS		(1 << 0)
> +#define MCXT_STAT_NCHUNKS		(1 << 1)
> +#define MCXT_STAT_FREECHUNKS	(1 << 2)
> +#define MCXT_STAT_TOTALSPACE	(1 << 3)
> +#define MCXT_STAT_FREESPACE		(1 << 4)
> +#define MCXT_STAT_NEWSPACE		(1 << 5)

s/MCXT_STAT/MCXT_STAT_NEED/?


> +#define MCXT_STAT_ALL			((1 << 6) - 1)

Hm, why not go for ~0 or such?


Greetings,

Andres Freund




WAL page magic errors (and plenty others) got hard to debug.

2020-04-05 Thread Andres Freund
Hi,

When starting on a data directory with an older WAL page magic we
currently make that hard to debug.  E.g.:

2020-04-05 15:31:04.314 PDT [1896669][:0] LOG:  database system was shut down 
at 2020-04-05 15:24:56 PDT
2020-04-05 15:31:04.314 PDT [1896669][:0] LOG:  invalid primary checkpoint 
record
2020-04-05 15:31:04.314 PDT [1896669][:0] PANIC:  could not locate a valid 
checkpoint record
2020-04-05 15:31:04.315 PDT [1896668][:0] LOG:  startup process (PID 1896669) 
was terminated by signal 6: Aborted
2020-04-05 15:31:04.315 PDT [1896668][:0] LOG:  aborting startup due to startup 
process failure
2020-04-05 15:31:04.316 PDT [1896668][:0] LOG:  database system is shut down

As far as I can tell this is not just the case for a wrong page magic,
but for all page level validation errors.

I think this largely originates in:

commit 0668719801838aa6a8bda330ff9b3d20097ea844
Author: Heikki Linnakangas 
Date:   2018-05-05 01:34:53 +0300

Fix scenario where streaming standby gets stuck at a continuation record.

If a continuation record is split so that its first half has already been
removed from the master, and is only present in pg_wal, and there is a
recycled WAL segment in the standby server that looks like it would
contain the second half, recovery would get stuck. The code in
XLogPageRead() incorrectly started streaming at the beginning of the
WAL record, even if we had already read the first page.

Backpatch to 9.4. In principle, older versions have the same problem, but
without replication slots, there was no straightforward mechanism to
prevent the master from recycling old WAL that was still needed by standby.
Without such a mechanism, I think it's reasonable to assume that there's
enough slack in how many old segments are kept around to not run into this,
or you have a WAL archive.

Reported by Jonathon Nelson. Analysis and patch by Kyotaro HORIGUCHI, with
some extra comments by me.

Discussion: 
https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com

which added the following to XLogPageRead():

+/*
+ * Check the page header immediately, so that we can retry immediately if
+ * it's not valid. This may seem unnecessary, because XLogReadRecord()
+ * validates the page header anyway, and would propagate the failure up to
+ * ReadRecord(), which would retry. However, there's a corner case with
+ * continuation records, if a record is split across two pages such that
+ * we would need to read the two pages from different sources. For
+ * example, imagine a scenario where a streaming replica is started up,
+ * and replay reaches a record that's split across two WAL segments. The
+ * first page is only available locally, in pg_wal, because it's already
+ * been recycled in the master. The second page, however, is not present
+ * in pg_wal, and we should stream it from the master. There is a recycled
+ * WAL segment present in pg_wal, with garbage contents, however. We would
+ * read the first page from the local WAL segment, but when reading the
+ * second page, we would read the bogus, recycled, WAL segment. If we
+ * didn't catch that case here, we would never recover, because
+ * ReadRecord() would retry reading the whole record from the beginning.
+ *
+ * Of course, this only catches errors in the page header, which is what
+ * happens in the case of a recycled WAL segment. Other kinds of errors or
+ * corruption still has the same problem. But this at least fixes the
+ * common case, which can happen as part of normal operation.
+ *
+ * Validating the page header is cheap enough that doing it twice
+ * shouldn't be a big deal from a performance point of view.
+ */
+	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
+	{
+		/* reset any error XLogReaderValidatePageHeader() might have set */
+		xlogreader->errormsg_buf[0] = '\0';
+		goto next_record_is_invalid;
+	}
+

I really can't follow the logic of just intentionally and silently
throwing the error message away here. Isn't this basically hiding *all*
page level error messages?

And even in the scenarios where this were the right thing, I feel like
not even outputting a debugging message makes debugging situations in
which this is encountered unnecessarily hard.
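
At the very least I'd have expected something like this (illustrative
sketch only, untested): keep the retry, but surface the discarded message
at DEBUG level:

if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
{
	/* surface the page-header complaint instead of discarding it silently */
	if (xlogreader->errormsg_buf[0] != '\0')
		elog(DEBUG1, "invalid page header: %s", xlogreader->errormsg_buf);

	/* then reset the error and retry, as the current code does */
	xlogreader->errormsg_buf[0] = '\0';
	goto next_record_is_invalid;
}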

Greetings,

Andres Freund




Re: d25ea01275 and partitionwise join

2020-04-05 Thread Tom Lane
Amit Langote  writes:
> Okay, I tried that in the updated patch. I didn't try to come up with
> examples that might break it though.

I looked through this.

* Wasn't excited about inventing makeCoalesceExpr(); the fact that it only
had two potential call sites seemed to make it not worth the trouble.
Plus, as defined, it could not handle the general case of COALESCE, which
can have N arguments; so that seemed like a recipe for confusion.

* I think your logic for building the coalesce combinations was just
wrong.  We need combinations of left-hand inputs with right-hand inputs,
not left-hand with left-hand or right-hand with right-hand.  Also the
nesting should already have happened in the inputs, we don't need to
try to generate it here.  The looping code was pretty messy as well.

* I don't think using partopcintype is necessarily right; that could be
a polymorphic type, for instance.  Safer to copy the type of the input
expressions.  Likely we could have got away with using partcollation,
but for consistency I copied that as well.

* You really need to update the data structure definitional comments
when you make a change like this.

* I did not like putting a test case that requires
enable_partitionwise_aggregate in the partition_join test; that seems
misplaced.  But it's not necessary to the test, is it?

* I did not follow the point of your second test case.  The WHERE
constraint on p1.a allows the planner to strength-reduce the joins,
which is why there's no full join in that explain result, but then
we aren't going to get to this code at all.

I repaired the above in the attached.

I'm actually sort of pleasantly surprised that this worked; I was
not sure that building COALESCEs like this would provide the result
we needed.  But it seems okay -- it does fix the behavior in this
3-way test case, as well as the 4-way join you showed at the top
of the thread.  It's fairly dependent on the fact that the planner
won't rearrange full joins; otherwise, the COALESCE structures we
build here might not match those made at parse time.  But that's
not likely to change anytime soon; and this is hardly the only
place that would break, so I'm not sweating about it.  (I have
some vague ideas about getting rid of the COALESCEs as part of
the Var redefinition I've been muttering about, anyway; there
might be a cleaner fix for this if that happens.)

Anyway, I think this is probably OK for now.  Given that the
nullable_partexprs lists are only used in one place, it's pretty
hard to see how it would break anything.

regards, tom lane

diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index af1fb48..e1cc11c 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -17,6 +17,7 @@
 #include 
 
 #include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
 #include "optimizer/appendinfo.h"
 #include "optimizer/clauses.h"
 #include "optimizer/cost.h"
@@ -1890,7 +1891,8 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel,
 RelOptInfo *outer_rel, RelOptInfo *inner_rel,
 JoinType jointype)
 {
-	int			partnatts = joinrel->part_scheme->partnatts;
+	PartitionScheme part_scheme = joinrel->part_scheme;
+	int			partnatts = part_scheme->partnatts;
 
 	joinrel->partexprs = (List **) palloc0(sizeof(List *) * partnatts);
 	joinrel->nullable_partexprs =
@@ -1899,7 +1901,8 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel,
 	/*
 	 * The joinrel's partition expressions are the same as those of the input
 	 * rels, but we must properly classify them as nullable or not in the
-	 * joinrel's output.
+	 * joinrel's output.  (Also, we add some more partition expressions if
+	 * it's a FULL JOIN.)
 	 */
 	for (int cnt = 0; cnt < partnatts; cnt++)
 	{
@@ -1910,6 +1913,7 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel,
 		const List *inner_null_expr = inner_rel->nullable_partexprs[cnt];
 		List	   *partexpr = NIL;
 		List	   *nullable_partexpr = NIL;
+		ListCell   *lc;
 
 		switch (jointype)
 		{
@@ -1969,6 +1973,31 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel,
 outer_null_expr);
 nullable_partexpr = list_concat(nullable_partexpr,
 inner_null_expr);
+
+/*
+ * Also add CoalesceExprs corresponding to each possible
+ * full-join output variable (that is, left side coalesced to
+ * right side), so that we can match equijoin expressions
+ * using those variables.  We assume no type coercions are
+ * needed to make the join outputs.
+ */
+foreach(lc, list_concat_copy(outer_expr, outer_null_expr))
+{
+	Node	   *larg = (Node *) lfirst(lc);
+	ListCell   *lc2;
+
+	foreach(lc2, list_concat_copy(inner_expr, inner_null_expr))
+	{
+		Node	   *rarg = (Node *) lfirst(lc2);
+		CoalesceExpr *c = makeNode(CoalesceExpr);
+
+		c->coalescetype = exprType(larg);
+		c->coalescecollid = exprCollation(larg);
+		c->args = list_make2(larg, rar

Re: range_agg

2020-04-05 Thread Alvaro Herrera
v17 is a rebase fixing a minor parse_coerce.c edit; v16 lasted little
:-(  I chose to change the wording of the conflicting comment in
enforce_generic_type_consistency():

 * 3) Similarly, if return type is ANYRANGE or ANYMULTIRANGE, and any
 *argument is ANYRANGE or ANYMULTIRANGE, use that argument's
 *actual type, range type or multirange type as the function's return
 *type.

This wording is less precise, in that it doesn't say exactly which of
the three types is the actual result for each of the possible four cases
(r->r, r->m, m->m, m->r) but I think it should be straightforward.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Add A Glossary

2020-04-05 Thread Alvaro Herrera
On 2020-Apr-05, Fabien COELHO wrote:

> > > As the definitions are short and to the point, maybe the HTML display
> > > could (also) "hover" the definitions when the mouse passes over the word,
> > > using the "title" attribute?
> > 
> > I like that idea, if it doesn't conflict with accessibility standards
> > (maybe that's just titles on images, not sure).
> 
> The following worked fine:
> 
>   Title Tag Test
>   The ACID
>   property is great.
>   

I don't see myself patching the stylesheet as would be needed to do
this.

> > I suggest we pursue this idea in another thread, as we'd probably want to
> > do it for acronyms as well.
> 
> Or not. I'd test committer temperature before investing time because it
> would mean that backpatching the doc would be a little harder.

TBH I can't get very excited about this idea.  Maybe other documentation
champions would be happier about doing that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: CLOBBER_CACHE_ALWAYS regression instability

2020-04-05 Thread Andres Freund
On 2020-04-05 15:38:29 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > We could print the error using :LAST_ERROR_MESSAGE after removing a
> > potential trailing "at character ..." if we're worried about the loss of
> > specificity.
> 
> Oh, actually it seems that :LAST_ERROR_MESSAGE is already just the
> primary message, without any "at character N" addon, so this would be
> a very easy way to ameliorate that complaint.  ("at character N" is
> added by libpq's pqBuildErrorMessage3 in TERSE mode, but psql does
> not use that when filling LAST_ERROR_MESSAGE.)

Heh. I thought it worked differently because I had just typed some
gibberish and got an error in :LAST_ERROR_MESSAGE that ended in "at or
near ...". But that's scan.l adding that explicitly...




Re: VACUUM memory management

2020-04-05 Thread Justin Pryzby
On Wed, Dec 11, 2019 at 09:29:17PM +0500, Ibrar Ahmed wrote:
> > Did you modify Claudio's patch or write a totally new one?
> 
> I wrote completely new patch. I tried multiple techniques like using a list
> instead of fixed size array which I thought was most suitable here, but
> leave that because of conflict with Parallel Vacuum.

Using a list will hardly work, or certainly not well, since it needs to be
searched by the ambulkdelete callback.

> >> If you wrote a totally new one, have you compared your work with
> >> Claudio's, to see if he covered
> >> anything you might need to cover?
> >
> > I checked the patch, and it does not do anything special which my patch is
> not doing except one thing. The patch is claiming to increase the limit of
> 1GB along with that, but I have not touched that. In my case, we are still
> under the limit of maintenance_work_mem but allocate memory in chunks. In
> that case, you have the leverage to set a big value of maintenance_work_mem
> (even if you don't need that)  because it will not allocate all the memory
> at the start.

After spending a bunch of time comparing them, I disagree.  Claudio's patch
does these:

 - avoid using multiple chunks if there are no indexes, and therefore no
   high cost of index scans to avoid;
 - rather than doing an index scan for each chunk (bad), the callback function
   lazy_tid_reaped() does a custom binary search *over* chunks of different
   sizes and then *within* each chunk.  That's maybe slightly over-engineered,
   I'm not convinced that's needed (but I thought it was pretty clever), but
   someone thought that was important.
 - properly keep track of *total* number of dead tuples, eg for progress
   reporting, and for prev_dead_count for pages with no dead tuples;
 - lazy_record_dead_tuple() doubles allocation when running out of space for
   dead tuples; some people disagree with that (myself included) but I'm
   including it here since that's what it does.  This still seems nontrivial
   (to me) to adapt to work with parallel query.
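
To make the second point concrete, here is a self-contained sketch of that
two-level binary search (illustrative only, not Claudio's actual code;
plain int64_t values stand in for ItemPointers, and chunks are assumed to
hold disjoint, ascending ranges):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct DeadTidChunk
{
	int64_t    *tids;			/* sorted values in this chunk */
	int			ntids;			/* number of values used */
} DeadTidChunk;

typedef struct DeadTidStore
{
	DeadTidChunk *chunks;		/* chunks cover disjoint, ascending ranges */
	int			nchunks;
} DeadTidStore;

/* First binary-search for the chunk whose range could contain "tid",
 * then binary-search within that chunk. */
static bool
dead_tid_exists(const DeadTidStore *store, int64_t tid)
{
	const DeadTidChunk *chunk = NULL;
	int			lo = 0,
				hi = store->nchunks - 1;

	while (lo <= hi)
	{
		int			mid = (lo + hi) / 2;
		const DeadTidChunk *c = &store->chunks[mid];

		if (tid < c->tids[0])
			hi = mid - 1;
		else if (tid > c->tids[c->ntids - 1])
			lo = mid + 1;
		else
		{
			chunk = c;
			break;
		}
	}
	if (chunk == NULL)
		return false;

	lo = 0;
	hi = chunk->ntids - 1;
	while (lo <= hi)
	{
		int			mid = (lo + hi) / 2;

		if (chunk->tids[mid] == tid)
			return true;
		if (chunk->tids[mid] < tid)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return false;
}

int
main(void)
{
	int64_t		c0[] = {3, 7, 11};
	int64_t		c1[] = {20, 21, 22, 40};
	DeadTidChunk chunks[] = {{c0, 3}, {c1, 4}};
	DeadTidStore store = {chunks, 2};

	printf("%d %d\n", (int) dead_tid_exists(&store, 21),
		   (int) dead_tid_exists(&store, 12));	/* prints "1 0" */
	return 0;
}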

-- 
Justin




Re: backup manifests and contemporaneous buildfarm failures

2020-04-05 Thread Tom Lane
Andrew Dunstan  writes:
> Hmm, the buildfarm client does this at the beginning of each run to
> remove anything that might be left over from a previous run:

> rmtree("inst");
> rmtree("$pgsql") unless ($from_source && !$use_vpath);

Right, the point is precisely that some versions of rmtree() fail
to remove a mode-0 subdirectory.

> Do I need to precede those with some recursive chmod commands? Perhaps
> the client should refuse to run if there is still something left after
> these.

I think the latter would be a very good idea, just so that this sort of
failure is less obscure.  Not sure about whether a recursive chmod is
really going to be worth the cycles.  (On the other hand, the normal
case should be that there's nothing there anyway, so maybe it's not
going to be costly.)  

regards, tom lane




Re: backup manifests and contemporaneous buildfarm failures

2020-04-05 Thread Andrew Dunstan


On 4/5/20 9:10 AM, Mikael Kjellström wrote:
> On 2020-04-04 04:43, Robert Haas wrote:
>
>> I think I've done about as much as I can do for tonight, though. Most
>> things are green now, and the ones that aren't are failing because of
>> stuff that is at least plausibly fixed. By morning it should be
>> clearer how much broken stuff is left, although that will be somewhat
>> complicated by at least sidewinder and seawasp needing manual
>> intervention to get back on track.
>
> I fixed sidewinder I think.  Should clear up the next time it runs.
>
> It was the mode on the directory it couldn't handle.  A regular rm -rf
> didn't work; I had to do a chmod -R 700 on all directories to be able
> to manually remove it.
>
>


Hmm, the buildfarm client does this at the beginning of each run to
remove anything that might be left over from a previous run:


rmtree("inst");
rmtree("$pgsql") unless ($from_source && !$use_vpath);


Do I need to precede those with some recursive chmod commands? Perhaps
the client should refuse to run if there is still something left after
these.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Thoughts on "killed tuples" index hint bits support on standby

2020-04-05 Thread Peter Geoghegan
Hi Michail,

On Fri, Jan 24, 2020 at 6:16 AM Michail Nikolaev
 wrote:
> Some of issues your mentioned (reporting feedback to the another
> cascade standby, processing queries after restart and newer xid
> already reported) could be fixed in provided design, but your
> intention to have "independent correctness backstop" is a right thing
> to do.

Attached is a very rough POC patch of my own, which makes item
deletion occur "non-opportunistically" in unique indexes. The idea is
that we exploit the uniqueness property of unique indexes to identify
"version churn" from non-HOT updates. If any single value on a leaf
page has several duplicates, then there is a good chance that we can
safely delete some of them. It's worth going to the heap to check
whether that's safe selectively, at the point where we'd usually have
to split the page. We only have to free one or two items to avoid
splitting the page. If we can avoid splitting the page immediately, we
may well avoid splitting it indefinitely, or forever.

This approach seems to be super effective. It can leave the PK on
pgbench_accounts in pristine condition (no page splits) after many
hours with a pgbench-like workload that makes all updates non-HOT
updates. Even with many clients, and a skewed distribution. Provided
the index isn't tiny to begin with, we can always keep up with
controlling index bloat -- once the client backends themselves begin
to take active responsibility for garbage collection, rather than just
treating it as a nice-to-have. I'm pretty sure that I'm going to be
spending a lot of time developing this approach, because it really
works.

This seems fairly relevant to what you're doing. It makes almost all
index cleanup occur using the new delete infrastructure for some of
the most interesting workloads where deletion takes place in client
backends. In practice, a standby will almost be in the same position
as the primary in a workload that this approach really helps with,
since setting the LP_DEAD bit itself doesn't really need to happen (we
can go straight to deleting the items in the new deletion path).

To address the questions you've asked: I don't really like the idea of
introducing new rules around tuple visibility and WAL logging to set
more LP_DEAD bits like this at all. It seems very complicated. I
suspect that we'd be better off introducing ways of making the actual
deletes occur sooner on the primary, possibly much sooner, avoiding
any need for special infrastructure on the standby. This is probably
not limited to the special unique index case that my patch focuses on
-- we can probably push this general approach forward in a number of
different ways. I just started with unique indexes because that seemed
most promising. I have only worked on the project for a few days. I
don't really know how it will evolve.

-- 
Peter Geoghegan


0001-Non-opportunistically-delete-B-Tree-items.patch
Description: Binary data


Re: CLOBBER_CACHE_ALWAYS regression instability

2020-04-05 Thread Tom Lane
Andres Freund  writes:
> On 2020-04-05 15:04:30 -0400, Tom Lane wrote:
>> That would only reduce the chance of getting a stack overflow there,
>> and not by that much, especially not for a CLOBBER_CACHE_ALWAYS animal
>> which is going to be doing catalog accesses inside there too.

> It'd certainly not be bullet proof. But I don't think we ever were? If I
> understood you correctly we were just not noticing the stack overflow
> danger before? We did catalog accesses from within there before too,
> that's not changed by the addition of equal(), no?

Ah, you're right, the CCA aspect is not such a problem as long as there
are not check_stack_depth() calls inside the code that's run to load a
syscache or relcache entry.  Which there probably aren't, at least not
for system catalogs.  The reason we're seeing this on a CCA animal is
simply that a cache flush has occurred to force recomputeNamespacePath
to do some work.  (In theory it could happen on a non-CCA animal, given
unlucky timing of an sinval overrun.)

My point here though is that it's basically been blind luck that we've
not seen this before.  There's certainly no good reason to assume that
a check_stack_depth() call shouldn't happen while parsing a function
call, or within some other chunk of the parser that happens to set up
a transient error-position callback.  And it's only going to get more
likely in future, seeing for example my ambitions to extend the
executor so that run-time expression failures can also report error
cursors.  So I think that we should be looking for a permanent fix,
not a reduce-the-odds band-aid.

>> Seems like that adds a lot of potential for memory leakage?

> Depends on the case, I'd say. Certainly might be useful to add a helper
> for a corresponding conditional free.
> For parsing cases like this it could be better to bulk free at the
> end. Compared to the memory needed for all the transformed arguments etc
> it'd probably not matter in the short term (especially if only done for
> 4+ args).

What I wish we had was alloca(), so you don't need a FUNC_MAX_ARGS-sized
array to parse a two-argument function call.  Too bad C99 didn't add
that.  (But some sniffing around suggests that an awful lot of systems
have it anyway ... even MSVC.  Hmmm.)

regards, tom lane




Re: CLOBBER_CACHE_ALWAYS regression instability

2020-04-05 Thread Tom Lane
Andres Freund  writes:
> On 2020-04-05 14:33:19 -0400, Tom Lane wrote:
>> 1. Change the test to do "\set VERBOSITY sqlstate" so that all that
>> is printed is
>> ERROR:  54001
>> ERRCODE_STATEMENT_TOO_COMPLEX is used in few enough places that
>> this wouldn't be too much of a loss of specificity.  (Or we could
>> give stack overflow its very own ERRCODE.)

> We could print the error using :LAST_ERROR_MESSAGE after removing a
> potential trailing "at character ..." if we're worried about the loss of
> specificity.

Oh, actually it seems that :LAST_ERROR_MESSAGE is already just the
primary message, without any "at character N" addon, so this would be
a very easy way to ameliorate that complaint.  ("at character N" is
added by libpq's pqBuildErrorMessage3 in TERSE mode, but psql does
not use that when filling LAST_ERROR_MESSAGE.)

regards, tom lane




Re: backup manifests

2020-04-05 Thread Andres Freund
Hi,

On 2020-04-03 15:22:23 -0400, Robert Haas wrote:
> I've pushed all the patches.

Seeing new warnings in an optimized build

/home/andres/src/postgresql-master/src/bin/pg_validatebackup/parse_manifest.c: 
In function 'json_manifest_object_end':
/home/andres/src/postgresql-master/src/bin/pg_validatebackup/parse_manifest.c:591:2:
 warning: 'end_lsn' may be used uninitialized in this function 
[-Wmaybe-uninitialized]
  591 |  context->perwalrange_cb(context, tli, start_lsn, end_lsn);
  |  ^
/home/andres/src/postgresql-master/src/bin/pg_validatebackup/parse_manifest.c:567:5:
 note: 'end_lsn' was declared here
  567 | end_lsn;
  | ^~~
/home/andres/src/postgresql-master/src/bin/pg_validatebackup/parse_manifest.c:591:2:
 warning: 'start_lsn' may be used uninitialized in this function 
[-Wmaybe-uninitialized]
  591 |  context->perwalrange_cb(context, tli, start_lsn, end_lsn);
  |  ^
/home/andres/src/postgresql-master/src/bin/pg_validatebackup/parse_manifest.c:566:13:
 note: 'start_lsn' was declared here
  566 |  XLogRecPtr start_lsn,
  | ^

The warnings don't seem too unreasonable. The compiler can't see that
the error_cb inside json_manifest_parse_failure() is not expected to
return. Probably worth adding a wrapper around the calls to
context->error_cb and mark that as noreturn.
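
For illustration, a minimal sketch of that suggestion (not the committed
fix; the wrapper name and the callback signature are assumptions based on
the warning output above):

static void manifest_error_noreturn(JsonManifestParseContext *context,
                                    const char *msg) pg_attribute_noreturn();

static void
manifest_error_noreturn(JsonManifestParseContext *context, const char *msg)
{
    context->error_cb(context, msg);

    /* error_cb is required not to return; make that visible to the compiler */
    pg_unreachable();
}

Once every parse failure goes through a helper declared noreturn, gcc can see
that the path reaching the perwalrange_cb call always has start_lsn and
end_lsn initialized.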

- Andres




Re: CLOBBER_CACHE_ALWAYS regression instability

2020-04-05 Thread Andres Freund
Hi,

On 2020-04-05 15:04:30 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Another avenue could be to make ParseFuncOrColumn et al use less stack,
> > and hope that it avoids the problem. It's a bit insane that we use this
> > much.
> 
> That would only reduce the chance of getting a stack overflow there,
> and not by that much, especially not for a CLOBBER_CACHE_ALWAYS animal
> which is going to be doing catalog accesses inside there too.

It'd certainly not be bullet proof. But I don't think we ever were? If I
understood you correctly we were just not noticing the stack overflow
danger before? We did catalog accesses from within there before too,
that's not changed by the addition of equal(), no?


Reminds me: I'll try to dust up my patch to make cache invalidation
processing non-recursive for 14 (I wrote an initial version as part of a
bugfix that we ended up fixing differently). Besides making
CLOBBER_CACHE_ALWAYS vastly less expensive, it also reduces the cost of
logical decoding substantially.


> > We don't have to go there in this case, but I've before wondered about
> > adding helpers that use an on-stack var for small allocations, and falls
> > back to palloc otherwise. Something boiling down to:
> 
> Seems like that adds a lot of potential for memory leakage?

Depends on the case, I'd say. Certainly might be useful to add a helper
for a corresponding conditional free.
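
A minimal sketch of what that conditional free could boil down to, paired
with the stack-array-or-palloc pattern mentioned above (backend code, so
pfree() is assumed; the name is made up for illustration):

static inline void
pfree_if_not_stack(void *ptr, void *stackbuf)
{
    /* only free buffers that actually came from palloc */
    if (ptr != stackbuf)
        pfree(ptr);
}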

For parsing cases like this it could be better to bulk free at the
end. Compared to the memory needed for all the transformed arguments etc
it'd probably not matter in the short term (especially if only done for
4+ args).

Greetings,

Andres Freund




Re: CLOBBER_CACHE_ALWAYS regression instability

2020-04-05 Thread Tom Lane
Andres Freund  writes:
> Another avenue could be to make ParseFuncOrColumn et al use less stack,
> and hope that it avoids the problem. It's a bit insane that we use this
> much.

That would only reduce the chance of getting a stack overflow there,
and not by that much, especially not for a CLOBBER_CACHE_ALWAYS animal
which is going to be doing catalog accesses inside there too.

> We don't have to go there in this case, but I've before wondered about
> adding helpers that use an on-stack var for small allocations, and falls
> back to palloc otherwise. Something boiling down to:

Seems like that adds a lot of potential for memory leakage?

regards, tom lane




Re: Add A Glossary

2020-04-05 Thread Alvaro Herrera
On 2020-Apr-05, Jürgen Purtz wrote:

> a) Some rearrangements of the sequence of terms to meet alphabetical order.

Thanks, will get this pushed.

> b)   -->   in
> two cases. Or should it be a ?

Ah, yeah, those should be linkend.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: CLOBBER_CACHE_ALWAYS regression instability

2020-04-05 Thread Andres Freund
Hi,

On 2020-04-05 14:33:19 -0400, Tom Lane wrote:
> It's a bit surprising perhaps that we run out of stack here and not
> somewhere else; but ParseFuncOrColumn and its subroutines consume
> quite a lot of stack, because of FUNC_MAX_ARGS-sized local arrays,
> so it's not *that* surprising.
> 
> So, what to do to re-stabilize the regression tests?  Even if
> we wanted to revert 8f59f6b9c0, that'd be kind of a band-aid fix,
> because there are lots of other possible ways that a parser error
> callback could be active at the point of the error.  A few other
> possibilities are:

> 
> 1. Change the test to do "\set VERBOSITY sqlstate" so that all that
> is printed is
>   ERROR:  54001
> ERRCODE_STATEMENT_TOO_COMPLEX is used in few enough places that
> this wouldn't be too much of a loss of specificity.  (Or we could
> give stack overflow its very own ERRCODE.)

We could print the error using :LAST_ERROR_MESSAGE after removing a
potential trailing "at character ..." if we're worried about the loss of
specificity.


> On the whole I find #1 the least unpleasant, as well as the most
> likely to forestall future variants of this issue.  It won't dodge
> the PPC64 problem, but I'm willing to keep living with that one
> for now.

Another avenue could be to make ParseFuncOrColumn et al use less stack,
and hope that it avoids the problem. It's a bit insane that we use this
much.


We don't have to go there in this case, but I've wondered before about
adding helpers that use an on-stack var for small allocations, and fall
back to palloc otherwise. Something boiling down to:

Oid actual_arg_types_stack[3];
Oid *actual_arg_types;

if (list_length(fargs) <= lengthof(actual_arg_types_stack))
    actual_arg_types = actual_arg_types_stack;
else
    actual_arg_types = palloc(sizeof(*actual_arg_types) * list_length(fargs));

Greetings,

Andres Freund




CLOBBER_CACHE_ALWAYS regression instability

2020-04-05 Thread Tom Lane
Over in the thread at [1] we were wondering how come buildfarm member
hyrax has suddenly started to fail like this:

diff -U3 /home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/expected/errors.out /home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/results/errors.out
--- /home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/expected/errors.out	2018-11-21 13:48:48.34000 -0500
+++ /home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/results/errors.out	2020-04-04 04:48:16.704699045 -0400
@@ -446,4 +446,4 @@
 'select infinite_recurse()' language sql;
 \set VERBOSITY terse
 select infinite_recurse();
-ERROR:  stack depth limit exceeded
+ERROR:  stack depth limit exceeded at character 8

I've now looked into this, and found that it's not at all hard to
duplicate; compile HEAD with -DCLOBBER_CACHE_ALWAYS, and run "select
infinite_recurse()", and you'll likely get the changed error message.
(The lack of other buildfarm failures is probably just because we
have so few animals doing CLOBBER_CACHE_ALWAYS builds frequently.)

The issue seems indeed to have been triggered by 8f59f6b9c0, because
that inserted an equal() call into recomputeNamespacePath().  equal()
includes a check_stack_depth() call.  We get the observed message if
this call is the one where the stack limit is hit, and it is invoked
inside ParseFuncOrColumn(), which has set up a parser error callback
to point at the infinite_recurse() call that it's trying to resolve.
That callback's been there a long time of course, so we may conclude
that no other code path reached from ParseFuncOrColumn contains a
stack depth check, or we'd likely have seen this before.

It's a bit surprising perhaps that we run out of stack here and not
somewhere else; but ParseFuncOrColumn and its subroutines consume
quite a lot of stack, because of FUNC_MAX_ARGS-sized local arrays,
so it's not *that* surprising.

So, what to do to re-stabilize the regression tests?  Even if
we wanted to revert 8f59f6b9c0, that'd be kind of a band-aid fix,
because there are lots of other possible ways that a parser error
callback could be active at the point of the error.  A few other
possibilities are:

1. Change the test to do "\set VERBOSITY sqlstate" so that all that
is printed is
ERROR:  54001
ERRCODE_STATEMENT_TOO_COMPLEX is used in few enough places that
this wouldn't be too much of a loss of specificity.  (Or we could
give stack overflow its very own ERRCODE.)

2. Hack pcb_error_callback so that it suppresses the error position
report for ERRCODE_STATEMENT_TOO_COMPLEX, as it already does
for ERRCODE_QUERY_CANCELED.  That seems pretty unpleasant though.

3. Create a separate expected-file to match the variant output.
This would be a maintenance problem, but we could ameliorate that
by moving the test to its own regression script, which was something
that'd already been proposed to get around the PPC64 Linux kernel
signal-handling bug that's been causing intermittent failures on
most of the PPC64 buildfarm animals [2].

On the whole I find #1 the least unpleasant, as well as the most
likely to forestall future variants of this issue.  It won't dodge
the PPC64 problem, but I'm willing to keep living with that one
for now.

Thoughts?

regards, tom lane

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoaUOS5X64nKgFxNV7JHN4sRkNAJYW2gHz-LMb0Ej4xHig%40mail.gmail.com
[2] https://www.postgresql.org/message-id/27924.1571068231%40sss.pgh.pa.us




Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

2020-04-05 Thread Michail Nikolaev
Hello, Peter.

>  I added
>  something about this to the nbtree README in commit 9f83468b353.

I have added some updates to your notes in the updated patch version.

I was also trying to keep the original wrapping of the paragraph, so
the patch looks wordier than it really is.

Thanks,
Michail.


btree-race-and-docs.patch
Description: Binary data


Re: WIP: BRIN multi-range indexes

2020-04-05 Thread Alexander Korotkov
On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
 wrote:
> On Sun, Apr 05, 2020 at 07:33:40PM +0300, Alexander Korotkov wrote:
> >On Sun, Apr 5, 2020 at 6:53 PM Tomas Vondra
> > wrote:
> >> On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote:
> >> >On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra
> >> > wrote:
> >> >> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
> >> >> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
> >> >> >> Yeah, the opclass params patches got broken by 773df883e adding enum
> >> >> >> reloptions. The breakage is somewhat extensive so I'll leave it up to
> >> >> >> Nikita to fix it in [1]. Until that happens, apply the patches on
> >> >> >> top of caba97a9d9 for review.
> >> >> >
> >> >> >This has been close to two months now, so I have the patch as RwF.
> >> >> >Feel free to update if you think that's incorrect.
> >> >>
> >> >> I see the opclass parameters patch got committed a couple days ago, so
> >> >> I've rebased the patch series on top of it. The pach was marked RwF
> >> >> since 2019-11, so I'll add it to the next CF.
> >> >
> >> >I think this patchset was marked RwF mainly because slow progress on
> >> >opclass parameters.  Now we got opclass parameters committed, and I
> >> >think this patchset is in a pretty good shape.  Moreover, opclass
> >> >parameters patch comes with very small examples.  This patchset would
> >> >be great showcase for opclass parameters.
> >> >
> >> >I'd like to give this patchset a chance for v13.  I'm going to make
> >> >another pass trough this patchset.  If I wouldn't find serious issues,
> >> >I'm going to commit it.  Any objections?
> >> >
> >>
> >> I'm an author of the patchset and I'd love to see it committed, but I
> >> think that might be a bit too rushed and unfair (considering it was not
> >> included in the current CF at all).
> >>
> >> I think the code is correct and I'm not aware of any bugs, but I'm not
> >> sure there was sufficient discussion about things like costing, choosing
> >> parameter values (e.g.  number of values in the multi-minmax or bloom
> >> filter parameters).
> >
> >Ok!
> >
> >> That being said, I think the first couple of patches (that modify how
> >> BRIN deals with multi-key scans and IS NULL clauses) are simple enough
> >> and non-controversial, so maybe we could get 0001-0003 committed, and
> >> leave the bloom/multi-minmax opclasses for v14.
> >
> >Regarding 0001-0003 I've couple of notes:
> >1) They should revise BRIN extensibility documentation section.
> >2) I think 0002 and 0003 should be merged.  NULL ScanKeys should be
> >still passed to consistent function when oi_regular_nulls == false.
> >
> >Assuming we're not going to get 0001-0003 into v13, I'm not so
> >inclined to rush on these three as well.  But you're willing to commit
> >them, you can count round of review on me.
> >
>
> I have no intention to get 0001-0003 committed. I think those changes
> are beneficial on their own, but the primary reason was to support the
> new opclasses (which require those changes). And those parts are not
> going to make it into v13 ...

OK, no problem.
Let's do this for v14.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: BRIN multi-range indexes

2020-04-05 Thread Tomas Vondra

On Sun, Apr 05, 2020 at 07:33:40PM +0300, Alexander Korotkov wrote:

On Sun, Apr 5, 2020 at 6:53 PM Tomas Vondra
 wrote:

On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote:
>On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra
> wrote:
>> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
>> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
>> >> Yeah, the opclass params patches got broken by 773df883e adding enum
>> >> reloptions. The breakage is somewhat extensive so I'll leave it up to
>> >> Nikita to fix it in [1]. Until that happens, apply the patches on
>> >> top of caba97a9d9 for review.
>> >
>> >This has been close to two months now, so I have the patch as RwF.
>> >Feel free to update if you think that's incorrect.
>>
>> I see the opclass parameters patch got committed a couple days ago, so
>> I've rebased the patch series on top of it. The pach was marked RwF
>> since 2019-11, so I'll add it to the next CF.
>
>I think this patchset was marked RwF mainly because slow progress on
>opclass parameters.  Now we got opclass parameters committed, and I
>think this patchset is in a pretty good shape.  Moreover, opclass
>parameters patch comes with very small examples.  This patchset would
>be great showcase for opclass parameters.
>
>I'd like to give this patchset a chance for v13.  I'm going to make
>another pass trough this patchset.  If I wouldn't find serious issues,
>I'm going to commit it.  Any objections?
>

I'm an author of the patchset and I'd love to see it committed, but I
think that might be a bit too rushed and unfair (considering it was not
included in the current CF at all).

I think the code is correct and I'm not aware of any bugs, but I'm not
sure there was sufficient discussion about things like costing, choosing
parameter values (e.g.  number of values in the multi-minmax or bloom
filter parameters).


Ok!


That being said, I think the first couple of patches (that modify how
BRIN deals with multi-key scans and IS NULL clauses) are simple enough
and non-controversial, so maybe we could get 0001-0003 committed, and
leave the bloom/multi-minmax opclasses for v14.


Regarding 0001-0003 I've couple of notes:
1) They should revise BRIN extensibility documentation section.
2) I think 0002 and 0003 should be merged.  NULL ScanKeys should be
still passed to consistent function when oi_regular_nulls == false.

Assuming we're not going to get 0001-0003 into v13, I'm not so
inclined to rush on these three as well.  But you're willing to commit
them, you can count round of review on me.



I have no intention to get 0001-0003 committed. I think those changes
are beneficial on their own, but the primary reason was to support the
new opclasses (which require those changes). And those parts are not
going to make it into v13 ...

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: segmentation fault using currtid and partitioned tables

2020-04-05 Thread Tom Lane
Jaime Casanova  writes:
> Another one caught by sqlsmith, on the regression database run this
> query (using any non-partitioned table works fine):
> select currtid('pagg_tab'::regclass::oid, '(0,156)'::tid) >= '(1,158)'::tid;

Hm, so

(1) currtid_byreloid and currtid_byrelname lack any check to see
if they're dealing with a relkind that lacks storage.

(2) The proximate cause of the crash is that rd_tableam is zero,
so that the interface functions in tableam.h just crash hard.
This seems like a pretty bad thing; does anyone want to promise
that there are no other oversights of the same ilk elsewhere,
and never will be?

I think it might be a good idea to make relations-without-storage
set up rd_tableam as a vector of dummy functions that will throw
some suitable complaint about "relation lacks storage".  NULL is
a horrible default for this.
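
A rough sketch of the shape that could take (illustrative only, not an
actual fix; just one of the many TableAmRoutine callbacks is shown, and the
error wording is made up):

static const TupleTableSlotOps *
no_storage_slot_callbacks(Relation relation)
{
    ereport(ERROR,
            (errcode(ERRCODE_WRONG_OBJECT_TYPE),
             errmsg("relation \"%s\" does not have storage",
                    RelationGetRelationName(relation))));
    return NULL;                /* keep compiler quiet */
}

Every other member of the dummy TableAmRoutine would point at a similar
stub, so any table-AM call on a storage-less relation fails with a clear
error instead of dereferencing a NULL rd_tableam.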

regards, tom lane




Re: SQL/JSON: functions

2020-04-05 Thread Alexander Korotkov
On Mon, Mar 23, 2020 at 8:28 PM Nikita Glukhov  wrote:
> Attached 47th version of the patches.
>
> On 21.03.2020 22:38, Pavel Stehule wrote:
>
>
> On 21. 3. 2020 v 11:07 Nikita Glukhov  wrote:
>>
>> Attached 46th version of the patches.
>>
>> On 20.03.2020 22:34, Pavel Stehule wrote:
>>
>>
>> On 19.03.2020 23:57 Nikita Glukhov  wrote:
>>>
>>> Attached 45th version of the patches.
>>>
>>> Nodes JsonFormat, JsonReturning, JsonPassing, JsonBehavior were fixed.
>>>
>>> On 17.03.2020 21:35, Pavel Stehule wrote:

 User functions json[b]_build_object_ext() and json[b]_build_array_ext() 
 also
 can be easily removed.   But it seems harder to remove new aggregate 
 functions
 json[b]_objectagg() and json[b]_agg_strict(), because they can't be called
 directly from JsonCtorExpr node.
>>>
>>>
>>> I don't see reasons for another reduction now. Can be great if you can 
>>> finalize work what you plan for pg13.
>>>
>> I have removed json[b]_build_object_ext() and json[b]_build_array_ext().
>>
>> But json[b]_objectagg() and json[b]_agg_strict() are still present.
>> It seems that removing them requires majors refactoring of the execution
>> of Aggref and WindowFunc nodes.
>
> I have replaced aggregate function
>
> json[b]_objectagg(key any, val any, absent_on_null boolean, unique_keys 
> boolean)
>
> with three separate functions:
>
> json[b]_object_agg_strict(any, any)
> json[b]_object_agg_unique(any, any)
> json[b]_object_agg_unique_strict(any, any)
>
>
> This should be more correct than single aggregate with additional parameters.

I have the following notes about this patchset.

1) Uniqueness checks using JsonbUniqueCheckContext and
JsonUniqueCheckContext have quadratic complexity over the number of keys.
That doesn't look good, especially for jsonb, which sorts object keys
before serialization anyway.
2) We have two uniqueness checks for the json type, which use
JsonbUniqueCheckContext and JsonUniqueState.  JsonUniqueState uses a
stack of hashes, while JsonbUniqueCheckContext has just a plain array
of keys.  I think we can make JsonUniqueState use a single hash, where
the object identity would be part of the hash key (see the standalone
sketch after these notes).  And we should replace JsonbUniqueCheckContext
with JsonUniqueState.  That would eliminate extra entities and provide
reasonable complexity for the cases that now use JsonbUniqueCheckContext.
3) New SQL/JSON clauses don't use the timezone and are considered immutable,
assuming all the children are immutable.  Immutability is good, but
ignoring the timezone in all cases is plain wrong.  The first thing we
can do is to use the timezone and make SQL/JSON clauses stable.  But that
limits their usage in functional and partial indexes.  I see a couple of
things we can do next (one of them, or probably both).
3.1) Provide the user a way to specify that we should ignore the timezone in
a particular case (an IGNORE TIMEZONE clause or something like that).  Then
the SQL/JSON clause will be considered immutable.
3.2) Automatically detect whether the jsonpath might use the timezone.  If
the jsonpath doesn't use the .datetime() method, it doesn't need the
timezone for sure.  Also, from the datetime format specifiers we can tell
that we don't compare non-timezoned values to timezoned values.  So, if we
detect that the jsonpath never uses the timezone, we can consider the
SQL/JSON clause immutable.
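
As mentioned in note 2, a standalone sketch of the single-hash idea: one
hash table keyed on (object identity, key name) detects duplicate keys
across all nested objects in O(n) total instead of O(n^2).  This is generic
illustration code, not the patch's data structures; a real implementation
would also grow the table rather than use a fixed size.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct UniqueEntry
{
    bool        used;
    int         object_id;      /* identifies the containing object */
    const char *key;            /* caller's pointer; must outlive the check */
} UniqueEntry;

#define UNIQUE_NSLOTS 64        /* power of two, fixed for the sketch */

static uint64_t
unique_hash(int object_id, const char *key)
{
    uint64_t    h = UINT64_C(1469598103934665603);  /* FNV-1a offset basis */

    h = (h ^ (uint64_t) object_id) * UINT64_C(1099511628211);
    for (; *key; key++)
        h = (h ^ (unsigned char) *key) * UINT64_C(1099511628211);
    return h;
}

/* Returns false if (object_id, key) was seen before, true otherwise. */
static bool
unique_check_insert(UniqueEntry *slots, int object_id, const char *key)
{
    uint64_t    i = unique_hash(object_id, key) & (UNIQUE_NSLOTS - 1);

    for (;;)
    {
        UniqueEntry *e = &slots[i];

        if (!e->used)
        {
            e->used = true;
            e->object_id = object_id;
            e->key = key;
            return true;
        }
        if (e->object_id == object_id && strcmp(e->key, key) == 0)
            return false;       /* duplicate key within the same object */
        i = (i + 1) & (UNIQUE_NSLOTS - 1);  /* linear probing */
    }
}

Calling unique_check_insert() once per parsed key reports a duplicate the
moment the same (object, key) pair shows up again, so the whole check stays
linear in the number of keys.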

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: BRIN multi-range indexes

2020-04-05 Thread Alexander Korotkov
On Sun, Apr 5, 2020 at 6:53 PM Tomas Vondra
 wrote:
> On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote:
> >On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra
> > wrote:
> >> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
> >> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
> >> >> Yeah, the opclass params patches got broken by 773df883e adding enum
> >> >> reloptions. The breakage is somewhat extensive so I'll leave it up to
> >> >> Nikita to fix it in [1]. Until that happens, apply the patches on
> >> >> top of caba97a9d9 for review.
> >> >
> >> >This has been close to two months now, so I have the patch as RwF.
> >> >Feel free to update if you think that's incorrect.
> >>
> >> I see the opclass parameters patch got committed a couple days ago, so
> >> I've rebased the patch series on top of it. The pach was marked RwF
> >> since 2019-11, so I'll add it to the next CF.
> >
> >I think this patchset was marked RwF mainly because slow progress on
> >opclass parameters.  Now we got opclass parameters committed, and I
> >think this patchset is in a pretty good shape.  Moreover, opclass
> >parameters patch comes with very small examples.  This patchset would
> >be great showcase for opclass parameters.
> >
> >I'd like to give this patchset a chance for v13.  I'm going to make
> >another pass trough this patchset.  If I wouldn't find serious issues,
> >I'm going to commit it.  Any objections?
> >
>
> I'm an author of the patchset and I'd love to see it committed, but I
> think that might be a bit too rushed and unfair (considering it was not
> included in the current CF at all).
>
> I think the code is correct and I'm not aware of any bugs, but I'm not
> sure there was sufficient discussion about things like costing, choosing
> parameter values (e.g.  number of values in the multi-minmax or bloom
> filter parameters).

Ok!

> That being said, I think the first couple of patches (that modify how
> BRIN deals with multi-key scans and IS NULL clauses) are simple enough
> and non-controversial, so maybe we could get 0001-0003 committed, and
> leave the bloom/multi-minmax opclasses for v14.

Regarding 0001-0003 I have a couple of notes:
1) They should revise the BRIN extensibility documentation section.
2) I think 0002 and 0003 should be merged.  NULL ScanKeys should still be
passed to the consistent function when oi_regular_nulls == false.

Assuming we're not going to get 0001-0003 into v13, I'm not so
inclined to rush on these three either.  But if you're willing to commit
them, you can count on a round of review from me.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: BRIN multi-range indexes

2020-04-05 Thread Alexander Korotkov
On Sun, Apr 5, 2020 at 6:51 PM Tom Lane  wrote:
> Alexander Korotkov  writes:
> > I'd like to give this patchset a chance for v13.  I'm going to make
> > another pass trough this patchset.  If I wouldn't find serious issues,
> > I'm going to commit it.  Any objections?
>
> I think it is way too late to be reviving major features that nobody
> has been looking at for months, that indeed were never even listed
> in the final CF.  At this point in the cycle I think we should just be
> trying to get small stuff over the line, not shove in major patches
> and figure they can be stabilized later.
>
> In this particular case, the last serious work on the patchset seems
> to have been Tomas' revision of 2019-09-14, and he specifically stated
> then that the APIs still needed work.  That doesn't sound like
> "it's about ready to commit" to me.

OK, got it.  Thank you for the feedback.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Index Skip Scan

2020-04-05 Thread Dmitry Dolgov
> On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
>
> I was just wondering how the distinct will work with the "skip scan"
> if we have some filter? I mean every time we select the unique row
> based on the prefix key and that might get rejected by an external
> filter right?

Not exactly. In the case of an index-only scan, we skip to the first
unique position, and then use already existing functionality
(_bt_readpage, stepping to the next pages) to filter out those
records that do not pass the condition. There are even a couple of tests
for this in the patch. In the case of an index scan, when there are
conditions, the current implementation does not consider skipping.

> So I tried an example to check this.

Can you tell on which version of the patch you were testing?




Re: WIP: BRIN multi-range indexes

2020-04-05 Thread Tomas Vondra

On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote:

Hi!

On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra
 wrote:

On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
>On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
>> Yeah, the opclass params patches got broken by 773df883e adding enum
>> reloptions. The breakage is somewhat extensive so I'll leave it up to
>> Nikita to fix it in [1]. Until that happens, apply the patches on
>> top of caba97a9d9 for review.
>
>This has been close to two months now, so I have the patch as RwF.
>Feel free to update if you think that's incorrect.

I see the opclass parameters patch got committed a couple days ago, so
I've rebased the patch series on top of it. The pach was marked RwF
since 2019-11, so I'll add it to the next CF.


I think this patchset was marked RwF mainly because slow progress on
opclass parameters.  Now we got opclass parameters committed, and I
think this patchset is in a pretty good shape.  Moreover, opclass
parameters patch comes with very small examples.  This patchset would
be great showcase for opclass parameters.

I'd like to give this patchset a chance for v13.  I'm going to make
another pass trough this patchset.  If I wouldn't find serious issues,
I'm going to commit it.  Any objections?



I'm an author of the patchset and I'd love to see it committed, but I
think that might be a bit too rushed and unfair (considering it was not
included in the current CF at all).

I think the code is correct and I'm not aware of any bugs, but I'm not
sure there was sufficient discussion about things like costing, choosing
parameter values (e.g.  number of values in the multi-minmax or bloom
filter parameters).

That being said, I think the first couple of patches (that modify how
BRIN deals with multi-key scans and IS NULL clauses) are simple enough
and non-controversial, so maybe we could get 0001-0003 committed, and
leave the bloom/multi-minmax opclasses for v14.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-04-05 Thread Tom Lane
Alexander Korotkov  writes:
> I'd like to give this patchset a chance for v13.  I'm going to make
> another pass trough this patchset.  If I wouldn't find serious issues,
> I'm going to commit it.  Any objections?

I think it is way too late to be reviving major features that nobody
has been looking at for months, that indeed were never even listed
in the final CF.  At this point in the cycle I think we should just be
trying to get small stuff over the line, not shove in major patches
and figure they can be stabilized later.

In this particular case, the last serious work on the patchset seems
to have been Tomas' revision of 2019-09-14, and he specifically stated
then that the APIs still needed work.  That doesn't sound like
"it's about ready to commit" to me.

regards, tom lane




Re: WIP: BRIN multi-range indexes

2020-04-05 Thread Alexander Korotkov
Hi!

On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra
 wrote:
> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
> >> Yeah, the opclass params patches got broken by 773df883e adding enum
> >> reloptions. The breakage is somewhat extensive so I'll leave it up to
> >> Nikita to fix it in [1]. Until that happens, apply the patches on
> >> top of caba97a9d9 for review.
> >
> >This has been close to two months now, so I have the patch as RwF.
> >Feel free to update if you think that's incorrect.
>
> I see the opclass parameters patch got committed a couple days ago, so
> I've rebased the patch series on top of it. The pach was marked RwF
> since 2019-11, so I'll add it to the next CF.

I think this patchset was marked RwF mainly because of slow progress on
opclass parameters.  Now we have opclass parameters committed, and I
think this patchset is in pretty good shape.  Moreover, the opclass
parameters patch comes with only very small examples.  This patchset would
be a great showcase for opclass parameters.

I'd like to give this patchset a chance for v13.  I'm going to make
another pass through this patchset.  If I don't find serious issues,
I'm going to commit it.  Any objections?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Consider low startup cost in add_partial_path

2020-04-05 Thread Tomas Vondra

Hi,

For the record, here is the relevant part of the Incremental Sort patch
series, updating add_partial_path and add_partial_path_precheck to also
consider startup cost.

The changes in the first two patches are pretty straightforward, plus
there's a proposed optimization in the precheck function to only run
compare_pathkeys when it's really necessary. I'm currently evaluating those
changes and I'll post the results to the incremental sort thread.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 761b935584229243ecc6fd47d83e86d6b1b382c7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH 1/5] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds, and, as a result, a higher cost plan
ends up being chosen: a low startup cost partial path is ignored
in favor of a lower total cost partial path, and a limit is applied on
top of that which would normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 65 +--
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c 
b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *   Because we don't consider parameterized paths here, we also don't
  *   need to consider the row counts as a measure of quality: every path 
will
- *   produce the same number of rows.  Neither do we need to consider 
startup
- *   costs: parallelism is only used for plans that will be run to 
completion.
- *   Therefore, this routine is much simpler than add_path: it needs to
- *   consider only pathkeys and total cost.
+ *   produce the same number of rows.  It may however matter how much the
+ *   path ordering matches the final ordering, needed by upper parts of the
+ *   plan. Because that will affect how expensive the incremental sort is,
+ *   we need to consider both the total and startup path, in addition to
+ *   pathkeys.
  *
  *   As with add_path, we pfree paths that are found to be dominated by
  *   another partial path; this requires that there be no other references 
to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
/* Compare pathkeys. */
keyscmp = compare_pathkeys(new_path->pathkeys, 
old_path->pathkeys);
 
-   /* Unless pathkeys are incompatible, keep just one of the two 
paths. */
+   /*
+* Unless pathkeys are incompatible, see if one of the paths 
dominates
+* the other (both in startup and total cost). It may happen 
that one
+* path has lower startup cost, the other has lower total cost.
+*
+* XXX Perhaps we could do this only when incremental sort is 
enabled,
+* and use the simpler version (comparing just total cost) 
otherwise?
+*/
if (keyscmp != PATHKEYS_DIFFERENT)
{
-   if (new_path->total_cost > old_path->total_cost * 
STD_FUZZ_FACTOR)
-   {
-   /* New path costs more; keep it only if 
pathkeys are better. */
-   if (keyscmp != PATHKEYS_BETTER1)
-   accept_new = false;
-   }
-   else if (old_path->total_cost > new_path->total_cost
-* STD_FUZZ_FACTOR)
+   PathCostComparison costcmp;
+
+   /*
+* Do a fuzzy cost comparison with standard fuzziness 
limit.
+*/
+   costcmp = compare_path_costs_fuzzily(new_path, old_path,
+   
 STD_FUZZ_FACTOR);
+
+   if (costcmp == COSTS_BETTER1)
{
-   /* Old path costs more; keep it only if 
pathkeys are better. */
-   if (keyscmp !=

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-04-05 Thread Tomas Vondra

On Sun, Apr 05, 2020 at 03:01:10PM +0200, Tomas Vondra wrote:

On Thu, Apr 02, 2020 at 09:40:45PM -0400, James Coleman wrote:

On Thu, Apr 2, 2020 at 8:46 PM James Coleman  wrote:


On Thu, Apr 2, 2020 at 8:20 PM Tomas Vondra
 wrote:

...
5) Overall, I think the costing is OK. I'm sure we'll find cases that
will need improvements, but that's fine. However, we now have

- cost_tuplesort (used to be cost_sort)
- cost_full_sort
- cost_incremental_sort
- cost_sort

I find it a bit confusing that we have cost_sort and cost_full_sort. Why
don't we just keep using the dummy path in label_sort_with_costsize?
That seems to be the only external caller outside costsize.c. Then we
could either make cost_full_sort static or get rid of it entirely.


This another area of the patch I haven't really modified.


See attached for a cleanup of this; it removed cost_fullsort so
label_sort_with_costsize is back to how it was.

I've directly merged this into the patch series; if you'd like to see
the diff I can send that along.



Thanks. Attached is v54 of the patch, with some minor changes. The two main
changes are in add_partial_path_precheck(): first, to also consider
startup_cost, as discussed before. The second change (in 0003) is a bit
of an experiment to make add_partial_path_precheck() cheaper by only calling
compare_pathkeys after checking the costs first (which should be cheaper
than the function call). add_path_precheck already does it in that order
anyway.



Oh, I forgot to mention a change in add_partial_path - I've removed the
reference/dependency on enable_incrementalsort. It seemed rather ugly,
and the results without it seem fine (I'm benchmarking only the case
with incremental sort enabled anyway). I also plan to look at the other
optimization we bolted on last week, i.e. checking length of pathkeys.
I'll see if that actually makes measurable difference.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: backup manifests and contemporaneous buildfarm failures

2020-04-05 Thread Mikael Kjellström

On 2020-04-04 04:43, Robert Haas wrote:


I think I've done about as much as I can do for tonight, though. Most
things are green now, and the ones that aren't are failing because of
stuff that is at least plausibly fixed. By morning it should be
clearer how much broken stuff is left, although that will be somewhat
complicated by at least sidewinder and seawasp needing manual
intervention to get back on track.


I fixed sidewinder I think.  Should clear up the next time it runs.

It was the mode on the directory it couldn't handle.  A regular rm -rf
didn't work; I had to do a chmod -R 700 on all directories to be able to
manually remove it.


/Mikael




Re: pgbench - add pseudo-random permutation function

2020-04-05 Thread Fabien COELHO



Attached is an attempt at improving things. I have added an explicit note and
hijacked an existing example to better illustrate the purpose of the
function.


A significant part of the complexity of the patch is the overflow-handling
implementation of (a * b % c) for 64-bit integers.


However, this implementation is probably never used: either 128-bit integers
are available and the one-line implementation takes precedence, or the size is
small enough (less than 48 bits) that there are no overflows with the
primes involved, so the operation is done directly on 64-bit integers.
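
For reference, the 128-bit one-liner is essentially this (a minimal
standalone sketch assuming the compiler provides __int128; names are
illustrative, not the patch's actual code):

#include <stdint.h>

/* (a * b) % m on 64-bit operands, without intermediate overflow. */
static uint64_t
mulmod64(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t) ((__uint128_t) a * b % m);
}

The hand-rolled fallback only matters where no 128-bit type exists and the
operands can exceed 48 bits.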


I could remove the implementation and replace it with a "not available on
this architecture" message, which would seldom be triggered: you would
have to use a 32-bit arch (?) and test against a table with about 262
Trows, which I guess does not exist anywhere. This approach would remove
about 40% of the code & comments.


Thoughts?

--
Fabien.




Re: Improving connection scalability: GetSnapshotData()

2020-04-05 Thread David Rowley
On Sun, 1 Mar 2020 at 21:47, Andres Freund  wrote:
> On 2020-03-01 00:36:01 -0800, Andres Freund wrote:
> > conns   tps master      tps pgxact-split
> >
> > 1   26842.492845    26524.194821
> > 10  246923.158682   249224.782661
> > 50  695956.539704   709833.746374
> > 100 1054727.043139  1903616.306028
> > 200 964795.282957   1949200.338012
> > 300 906029.377539   1927881.231478
> > 400 845696.690912   1911065.369776
> > 500 812295.222497   1926237.255856
> > 600 888030.104213   1903047.236273
> > 700 866896.532490   1886537.202142
> > 800 863407.341506   1883768.592610
> > 900 871386.608563   1874638.012128
> > 1000887668.277133   1876402.391502
> > 1500860051.361395   1815103.564241
> > 2000890900.098657   1775435.271018
> > 3000874184.980039   1653953.817997
> > 4000845023.080703   1582582.316043
> > 5000817100.195728   1512260.802371
> >
> > I think these are pretty nice results.

FWIW, I took this for a spin on an AMD 3990x:

# setup
pgbench -i postgres

#benchmark
#!/bin/bash

for i in 1 10 50 100 200 300 400 500 600 700 800 900 1000 1500 2000
3000 4000 5000;
do
echo Testing with $i connections >> bench.log
pgbench2 -M prepared -c $i -j $i -S -n -T 60 postgres >> bench.log
done

pgbench2 is your patched version of pgbench. I got some pretty strange
results with the unpatched version: up to about 50 million tps for the
"excluding connections establishing" number, which seems pretty far-fetched.

connections    Unpatched      Patched
1              49062.24413    49834.64983
10             428673.1027    453290.5985
50             1552413.084    1849233.821
100            2039675.027    2261437.1
200            3139648.845    3632008.991
300            3091248.316    3597748.942
400            3056453.5      3567888.293
500            3019571.47     3574009.053
600            2991052.393    3537518.903
700            2952484.763    3553252.603
800            2910976.875    3539404.865
900            2873929.989    3514353.776
1000           2846859.499    3490006.026
1500           2540003.038    3370093.934
2000           2361799.107    3197556.738
3000           2056973.778    2949740.692
4000           1751418.117    2627174.81
5000           1464786.461    2334586.042

> Attached as a graph as well.

Likewise.

David


Re: Online checksums verification in the backend

2020-04-05 Thread Julien Rouhaud
On Sun, Apr 05, 2020 at 08:01:36PM +0900, Masahiko Sawada wrote:
> On Sun, 5 Apr 2020 at 18:45, Julien Rouhaud  wrote:
> >
> > On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote:
> > >
> > > Why do we need two rows in the doc? For instance, replication slot
> > > functions have some optional arguments but there is only one row in
> > > the doc. So I think we don't need to change the doc from the previous
> > > version patch.
> > >
> >
> > I thought that if we document the function as pg_check_relation(regclass [,
> > fork]) users could think that the 2nd argument is optional, so that
> > pg_check_relation('something', NULL) could be a valid alias for the 
> > 1-argument
> > form, which it isn't.  After checking, I see that e.g. current_setting has 
> > the
> > same semantics and is documented the way you suggest, so fixed back to 
> > previous
> > version.
> >
> > > And I think these are not necessary as we already defined in
> > > include/catalog/pg_proc.dat:
> > >
> > > +CREATE OR REPLACE FUNCTION pg_check_relation(
> > > +  IN relation regclass,
> > > +  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
> > > +  OUT expected_checksum integer, OUT found_checksum integer)
> > > +  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 
> > > 'pg_check_relation'
> > > +  PARALLEL RESTRICTED;
> > > +
> > > +CREATE OR REPLACE FUNCTION pg_check_relation(
> > > +  IN relation regclass, IN fork text,
> > > +  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
> > > +  OUT expected_checksum integer, OUT found_checksum integer)
> > > +  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal
> > > +  AS 'pg_check_relation_fork'
> > > +  PARALLEL RESTRICTED;
> > >
> >
> > Oh right this isn't required since there's no default value anymore, fixed.
> >
> > v9 attached.
> 
> Thank you for updating the patch! The patch looks good to me.
> 
> I've marked this patch as Ready for Committer. I hope this patch will
> get committed to PG13.
> 
Thanks a lot!




Re: Online checksums verification in the backend

2020-04-05 Thread Masahiko Sawada
On Sun, 5 Apr 2020 at 18:45, Julien Rouhaud  wrote:
>
> On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote:
> >
> > Why do we need two rows in the doc? For instance, replication slot
> > functions have some optional arguments but there is only one row in
> > the doc. So I think we don't need to change the doc from the previous
> > version patch.
> >
>
> I thought that if we document the function as pg_check_relation(regclass [,
> fork]) users could think that the 2nd argument is optional, so that
> pg_check_relation('something', NULL) could be a valid alias for the 1-argument
> form, which it isn't.  After checking, I see that e.g. current_setting has the
> same semantics and is documented the way you suggest, so fixed back to 
> previous
> version.
>
> > And I think these are not necessary as we already defined in
> > include/catalog/pg_proc.dat:
> >
> > +CREATE OR REPLACE FUNCTION pg_check_relation(
> > +  IN relation regclass,
> > +  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
> > +  OUT expected_checksum integer, OUT found_checksum integer)
> > +  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 
> > 'pg_check_relation'
> > +  PARALLEL RESTRICTED;
> > +
> > +CREATE OR REPLACE FUNCTION pg_check_relation(
> > +  IN relation regclass, IN fork text,
> > +  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
> > +  OUT expected_checksum integer, OUT found_checksum integer)
> > +  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal
> > +  AS 'pg_check_relation_fork'
> > +  PARALLEL RESTRICTED;
> >
>
> Oh right this isn't required since there's no default value anymore, fixed.
>
> v9 attached.

Thank you for updating the patch! The patch looks good to me.

I've marked this patch as Ready for Committer. I hope this patch will
get committed to PG13.

Regards,

-- 
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Index Skip Scan

2020-04-05 Thread Dilip Kumar
On Wed, Mar 25, 2020 at 2:19 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Wed, Mar 25, 2020 at 11:31:56AM +0530, Dilip Kumar wrote:
> >
> > Seems like you forgot to add the uniquekey.c file in the
> > v33-0001-Unique-key.patch.
>
> Oh, you're right, thanks. Here is the corrected patch.

I was just wondering how DISTINCT will work with the "skip scan"
if we have some filter. I mean, every time we select the unique row
based on the prefix key, that row might get rejected by an external
filter, right?  So I tried an example to check this.

postgres[50006]=# \d t
 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 a  | integer |   |  |
 b  | integer |   |  |
Indexes:
"idx" btree (a, b)

postgres[50006]=# insert into t select 2, i from generate_series(1, 200)i;
INSERT 0 200
postgres[50006]=# insert into t select 1, i from generate_series(1, 200)i;
INSERT 0 200

postgres[50006]=# set enable_indexskipscan =off;
SET
postgres[50006]=# select distinct(a) from t where b%100=0;
 a
---
 1
 2
(2 rows)

postgres[50006]=# set enable_indexskipscan =on;
SET
postgres[50006]=# select distinct(a) from t where b%100=0;
 a
---
(0 rows)

postgres[50006]=# explain select distinct(a) from t where b%100=0;
QUERY PLAN
---
 Index Only Scan using idx on t  (cost=0.15..1.55 rows=10 width=4)
   Skip scan: true
   Filter: ((b % 100) = 0)
(3 rows)

I think in such cases we should not select the skip scan.  This should
behave like we have a filter on the non-index field.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Allow cluster owner to bypass authentication

2020-04-05 Thread Peter Eisentraut

On 2020-03-27 15:58, David Steele wrote:

Hi Peter,

On 12/27/19 3:22 PM, Stephen Frost wrote:

* Tom Lane (t...@sss.pgh.pa.us) wrote:


I think it'd be great if this behavior could be implemented
within the notation, because we could then just set up a
non-empty default pg_ident.conf with useful behavioral
examples in the form of prefab maps.  In particular, we
should think about how hard it is to do "I want the default
behavior plus allow joe to connect as charlie".  If the
default is a one-liner that you can copy and add to,
that's a lot better than if you have to reverse-engineer
what to write.


This direction certainly sounds more appealing to me.


Any thoughts on the discussion between Stephen and Tom?


It appears that the whole discussion of what a new default security 
configuration could or should be hasn't really moved to a new consensus, 
so given the time, I think it's best that we leave things as they are 
and continue the exploration at some future time.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Yet another fast GiST build

2020-04-05 Thread Komяpa
Hello Yuri,

PDEP is indeed the first thing that comes up when you start googling
z-curve and bit interleaving :)
We had z-curve code generating the PDEP instruction in PostGIS,
and have since dropped it. For sorting, we now utilize sort support / prefix
search, with the key generated as a Hilbert curve, fine-tuned for
different projections' geometric properties.
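
For readers not deep in this area, a minimal standalone sketch of the two
interleaving approaches being discussed (illustrative only; this is neither
the PostGIS code nor the patch's code):

#include <stdint.h>
#if defined(__BMI2__)
#include <immintrin.h>
#endif

/* Spread the 32 bits of v out to the even bit positions of a 64-bit word. */
static uint64_t
spread_bits32(uint32_t v)
{
    uint64_t    x = v;

    x = (x | (x << 16)) & UINT64_C(0x0000FFFF0000FFFF);
    x = (x | (x << 8))  & UINT64_C(0x00FF00FF00FF00FF);
    x = (x | (x << 4))  & UINT64_C(0x0F0F0F0F0F0F0F0F);
    x = (x | (x << 2))  & UINT64_C(0x3333333333333333);
    x = (x | (x << 1))  & UINT64_C(0x5555555555555555);
    return x;
}

/* Z-order (Morton) key: interleave the bits of x and y. */
static uint64_t
zorder_key(uint32_t x, uint32_t y)
{
#if defined(__BMI2__)
    /* PDEP deposits x's bits into the even positions and y's into the odd. */
    return _pdep_u64(x, UINT64_C(0x5555555555555555)) |
           _pdep_u64(y, UINT64_C(0xAAAAAAAAAAAAAAAA));
#else
    return spread_bits32(x) | (spread_bits32(y) << 1);
#endif
}

The portable branch is the usual magic-number spread, so dropping the PDEP
path costs only a handful of shifts and masks per key.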

From this patch the most valuable thing for us is the sorting build
infrastructure itself. Maybe, to make it a bit more understandable for
people not deep in geometry, it makes sense to first expose the
btree_gist datatypes to this machinery? That way a btree_gist index on
integer would be built exactly the same way as a btree index on integer.
This would also give everyone a reference point on the bottlenecks and
optimality of the patch.

On Fri, Apr 3, 2020 at 10:56 AM Yuri Astrakhan  wrote:
>
> Awesome addition!  Would it make sense to use x86's BMI2's PDEP instruction, 
> or is the interleave computation too small of a percentage to introduce 
> not-so-easy-to-port code?  Also, I think it needs a bit more documentation to 
> explain the logic, i.e. a link to 
> https://stackoverflow.com/questions/39490345/interleave-bits-efficiently ?  
> Thx for making it faster :)



-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa




Re: Online checksums verification in the backend

2020-04-05 Thread Julien Rouhaud
On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote:
> 
> Why do we need two rows in the doc? For instance, replication slot
> functions have some optional arguments but there is only one row in
> the doc. So I think we don't need to change the doc from the previous
> version patch.
> 

I thought that if we document the function as pg_check_relation(regclass [,
fork]) users could think that the 2nd argument is optional, so that
pg_check_relation('something', NULL) could be a valid alias for the 1-argument
form, which it isn't.  After checking, I see that e.g. current_setting has the
same semantics and is documented the way you suggest, so fixed back to previous
version.

> And I think these are not necessary as we already defined in
> include/catalog/pg_proc.dat:
> 
> +CREATE OR REPLACE FUNCTION pg_check_relation(
> +  IN relation regclass,
> +  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
> +  OUT expected_checksum integer, OUT found_checksum integer)
> +  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 
> 'pg_check_relation'
> +  PARALLEL RESTRICTED;
> +
> +CREATE OR REPLACE FUNCTION pg_check_relation(
> +  IN relation regclass, IN fork text,
> +  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
> +  OUT expected_checksum integer, OUT found_checksum integer)
> +  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal
> +  AS 'pg_check_relation_fork'
> +  PARALLEL RESTRICTED;
> 

Oh right this isn't required since there's no default value anymore, fixed.

v9 attached.
>From 66e0ed1c3b12c42c4d52b2b27e79e16e82730b4b Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v9] Add a pg_check_relation() function.

This functions checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: 
https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |  85 +
 doc/src/sgml/func.sgml|  51 +++
 src/backend/storage/page/checksum.c   | 318 +-
 src/backend/utils/adt/Makefile|   1 +
 src/backend/utils/adt/checksumfuncs.c | 217 
 src/backend/utils/init/globals.c  |   7 +
 src/backend/utils/misc/guc.c  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat   |  16 +
 src/include/miscadmin.h   |   7 +
 src/include/storage/checksum.h|  13 +
 src/include/utils/guc_tables.h|   1 +
 src/test/Makefile |   3 +-
 src/test/check_relation/.gitignore|   2 +
 src/test/check_relation/Makefile  |  23 ++
 src/test/check_relation/README|  23 ++
 .../check_relation/t/01_checksums_check.pl| 276 +++
 17 files changed, 1078 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/test/check_relation/.gitignore
 create mode 100644 src/test/check_relation/Makefile
 create mode 100644 src/test/check_relation/README
 create mode 100644 src/test/check_relation/t/01_checksums_check.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 114db38116..a20501bb85 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2038,6 +2038,91 @@ include_dir 'conf.d'
  
 
 
+
+ Cost-based Checksum Verification Delay
+
+ 
+  During the execution of 
+  function, the system maintains an internal counter that keeps track of
+  the estimated cost of the various I/O operations that are performed.
+  When the accumulated cost reaches a limit (specified by
+  checksum_cost_limit), the process performing the
+  operation will sleep for a short period of time, as specified by
+  checksum_cost_delay. Then it will reset the counter
+  and continue execution.
+ 
+
+ 
+  This feature is disabled by default. To enable it, set the
+  checksum_cost_delay variable to a nonzero
+  value.
+ 
+
+ 
+  
+   checksum_cost_delay (floating 
point)
+   
+checksum_cost_delay configuration 
parameter
+   
+   
+   
+
+ The amount of time that the process will sleep
+ when the cost limit has been exceeded.
+ If this value is specified without units, it is taken as milliseconds.
+ The default value is zero, which disables the cost-based checksum
+ verification delay feature.  Positive values enable cost-based
+ checksum verification.
+
+   
+  
+
+  
+   checksum_cost_page (integer)
+   
+ch

Re: Online checksums verification in the backend

2020-04-05 Thread Masahiko Sawada
On Sun, 5 Apr 2020 at 17:44, Julien Rouhaud  wrote:
>
> On Sun, Apr 05, 2020 at 01:13:30PM +0900, Masahiko Sawada wrote:
> >
> > Thank you for updating the patch! The patch looks good to me. Here are
> > some random comments mostly about cosmetic changes.
> >
>
> Thanks a lot for the review!

Thank you for updating the patch.

>
> >
> > 1.
> > I think we can have two separate SQL functions:
> > pg_check_relation(regclass, text) and pg_check_relation(regclass),
> > instead of setting NULL by default to the second argument.
> >
>
> I'm fine with it, so implemented this way with the required documentation
> changes.

Why do we need two rows in the doc? For instance, replication slot
functions have some optional arguments but there is only one row in
the doc. So I think we don't need to change the doc from the previous
version patch.

And I think these are not necessary as we already defined in
include/catalog/pg_proc.dat:

+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;
+
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass, IN fork text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal
+  AS 'pg_check_relation_fork'
+  PARALLEL RESTRICTED;

>
> >
> > 2.
> > + * Check data sanity for a specific block in the given fork of the given
> > + * relation, always retrieved locally with smgrred even if a version 
> > exists in
> >
> > s/smgrred/smgrread/
>
> Fixed.
>
> >
> > 3.
> > +   /* The buffer will have to check checked. */
> > +   Assert(checkit);
> >
> > Should it be "The buffer will have to be checked"?
> >
>
> Oops indeed, fixed.
>
> >
> > 4.
> > +   if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
> > +   ereport(ERROR,
> > +   (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> > +errmsg("only superuser or a member of the
> > pg_read_server_files role may use this function")));
> >
> > Looking at the definition of pg_stat_read_server_files role, this role
> > seems to be for operations that could read non-database files such as
> > csv files. Therefore, currently this role is used by file_fdw and COPY
> > command. I personally think pg_stat_scan_tables would be more
> > appropriate for this function but I'm not sure.
> >
>
> That's a very good point, especially since the documentation of this default
> role is quite relevant for those functions:
>
> "Execute monitoring functions that may take ACCESS SHARE locks on tables,
> potentially for a long time."
>
> So changed!
>
> >
> > 5.
> > +   /* Set cost-based vacuum delay */
> > +   ChecksumCostActive = (ChecksumCostDelay > 0);
> > +   ChecksumCostBalance = 0;
> >
> > s/vacuum/checksum verification/
> >
>
> Fixed.
>
> >
> > 6.
> > +   ereport(WARNING,
> > +   (errcode(ERRCODE_DATA_CORRUPTED),
> > +errmsg("invalid page in block %u of relation %s",
> > +   blkno,
> > +   relpath(relation->rd_smgr->smgr_rnode, forknum;
> >
> > I think it's better to show the relation name instead of the relation path 
> > here.
> >
>
> I'm here using the same pattern as what ReadBuffer_common() would display if a
> corrupted block is read.  I think it's better to keep the format for both, so
> any existing log analyzer will keep working with those new functions.

Ok, I agree with you.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Online checksums verification in the backend

2020-04-05 Thread Julien Rouhaud
On Sun, Apr 05, 2020 at 01:13:30PM +0900, Masahiko Sawada wrote:
> 
> Thank you for updating the patch! The patch looks good to me. Here are
> some random comments mostly about cosmetic changes.
> 

Thanks a lot for the review!

> 
> 1.
> I think we can have two separate SQL functions:
> pg_check_relation(regclass, text) and pg_check_relation(regclass),
> instead of setting NULL by default to the second argument.
> 

I'm fine with it, so implemented this way with the required documentation
changes.

> 
> 2.
> + * Check data sanity for a specific block in the given fork of the given
> + * relation, always retrieved locally with smgrred even if a version exists 
> in
> 
> s/smgrred/smgrread/

Fixed.

> 
> 3.
> +   /* The buffer will have to check checked. */
> +   Assert(checkit);
> 
> Should it be "The buffer will have to be checked"?
> 

Oops indeed, fixed.

> 
> 4.
> +   if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
> +   ereport(ERROR,
> +   (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> +errmsg("only superuser or a member of the
> pg_read_server_files role may use this function")));
> 
> Looking at the definition of pg_stat_read_server_files role, this role
> seems to be for operations that could read non-database files such as
> csv files. Therefore, currently this role is used by file_fdw and COPY
> command. I personally think pg_stat_scan_tables would be more
> appropriate for this function but I'm not sure.
> 

That's a very good point, especially since the documentation of this default
role is quite relevant for those functions:

"Execute monitoring functions that may take ACCESS SHARE locks on tables,
potentially for a long time."

So changed!
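
With that change, a monitoring setup only needs something like the
following (the role name is just an example) to let a non-superuser run
the checks:

  GRANT pg_stat_scan_tables TO checksum_monitor;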

> 
> 5.
> +   /* Set cost-based vacuum delay */
> +   ChecksumCostActive = (ChecksumCostDelay > 0);
> +   ChecksumCostBalance = 0;
> 
> s/vacuum/checksum verification/
> 

Fixed.

> 
> 6.
> +   ereport(WARNING,
> +   (errcode(ERRCODE_DATA_CORRUPTED),
> +errmsg("invalid page in block %u of relation %s",
> +   blkno,
> +   relpath(relation->rd_smgr->smgr_rnode, forknum;
> 
> I think it's better to show the relation name instead of the relation path 
> here.
> 

I'm here using the same pattern as what ReadBuffer_common() would display if a
corrupted block is read.  I think it's better to keep the format for both, so
any existing log analyzer will keep working with those new functions.

> 
> 7.
> +   ereport(ERROR,
> +   (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> +errmsg("relation \"%s\" does not have storage to be checked",
> +quote_qualified_identifier(
> +get_namespace_name(get_rel_namespace(relid)),
> +get_rel_name(relid);
> 
> Looking at other similar error messages we don't show qualified
> relation name but the relation name gotten by
> RelationGetRelationName(relation). Can we do that here as well for
> consistency?
> 

Indeed, fixed.

> 
> 8.
> +   if (!(rsinfo->allowedModes & SFRM_Materialize))
> +   ereport(ERROR,
> +   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +errmsg("materialize mode required, but it is not " \
> +   "allowed in this context")));
> 
> I think it's better to have this error message in one line for easy grepping.

Fixed.

I also fixed a missing leading tab in the perl TAP tests.
From a9634bf02c0e7e1f9b0e8cf4e1ad79a72ea80ec8 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v8] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: 
https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |  85 +
 doc/src/sgml/func.sgml|  60 
 src/backend/catalog/system_views.sql  |  15 +
 src/backend/storage/page/checksum.c   | 318 +-
 src/backend/utils/adt/Makefile|   1 +
 src/backend/utils/adt/checksumfuncs.c | 217 
 src/backend/utils/init/globals.c  |   7 +
 src/backend/utils/misc/guc.c  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat   |  16 +
 src/include/miscadmin.h   |   7 +
 src/include/storage/checksum.h|  13 +
 src/include/utils/guc_tables.h|   1 +
 src/test/Makefile |   3 +-
 src/test/check_relation/.gitignore|   2 +
 src/test/check_relation/Makefile  |  23 ++
 

Re: Add A Glossary

2020-04-05 Thread Jürgen Purtz

a) Some rearrangements of the sequence of terms to meet alphabetical order.

b)   -->   
in two cases. Or should it be a ?



Kind regards, Jürgen


diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 8c6cb6e942..25762b7c3a 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -208,15 +208,6 @@

   
 
-  
-   Checkpointer (process)
-   
-
- A specialized process responsible for executing checkpoints.
-
-   
-  
-
   
Checkpoint

@@ -244,6 +235,15 @@

   
 
+  
+   Checkpointer (process)
+   
+
+ A specialized process responsible for executing checkpoints.
+
+   
+  
+
   
Class (archaic)

@@ -761,25 +761,6 @@

   
 
-  
-   Logger (process)
-   
-
- If activated, the
- Logger process
- writes information about database events into the current
- log file.
- When reaching certain time- or
- volume-dependent criteria, a new log file is created.
- Also called syslogger.
-
-
- For more information, see
- .
-
-   
-  
-
   
Log Record
 
@@ -803,6 +784,25 @@

   
 
+  
+   Logger (process)
+   
+
+ If activated, the
+ Logger process
+ writes information about database events into the current
+ log file.
+ When reaching certain time- or
+ volume-dependent criteria, a new log file is created.
+ Also called syslogger.
+
+
+ For more information, see
+ .
+
+   
+  
+
   
Master (server)

@@ -1651,6 +1651,11 @@

   
 
+  
+   WAL
+   
+  
+
   
WAL Archiver (process)

@@ -1696,11 +1701,6 @@

   
 
-  
-   WAL
-   
-  
-
   
WAL Record

@@ -1728,8 +1728,8 @@
   

 A process that writes WAL records
-from shared memory to
-WAL files.
+from shared memory to
+WAL files.


 For more information, see


segmentation fault using currtid and partitioned tables

2020-04-05 Thread Jaime Casanova
Hi,

Another one caught by sqlsmith. On the regression database, run this
query (any non-partitioned table works fine):

"""
select currtid('pagg_tab'::regclass::oid, '(0,156)'::tid) >= '(1,158)'::tid;
"""

This works on 11 (well, it gives an error because the file doesn't
exist) but crashes the server on 12+.
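
For context, pagg_tab is a partitioned table, so it has no storage of its
own for the tid scan to open, which is presumably what trips up the newer
code path:

  -- pagg_tab in the regression database is partitioned (relkind 'p'),
  -- i.e. it has no storage of its own
  select relkind from pg_class where relname = 'pagg_tab';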

Attached is the stack trace from master.

-- 
Jaime Casanova  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
#0  0x00a3a3f1 in table_beginscan_tid (rel=0x7ff8ad4d01b8, 
snapshot=0x1336430) at ../../../../src/include/access/tableam.h:842
flags = 8
#1  0x00a3b1e3 in currtid_byreloid (fcinfo=0x130c298) at tid.c:384
reloid = 37534
tid = 0x128f498
result = 0x135b960
rel = 0x7ff8ad4d01b8
aclresult = ACLCHECK_OK
snapshot = 0x1336430
scan = 0x10
#2  0x006fc889 in ExecInterpExpr (state=0x130c100, econtext=0x130be00, 
isnull=0x7fff1f9659e7) at execExprInterp.c:698
fcinfo = 0x130c298
args = 0x130c2b8
nargs = 2
d = 140733723337632
op = 0x130c2f0
resultslot = 0x130c008
innerslot = 0x0
outerslot = 0x0
scanslot = 0x0
dispatch_table = {0x6fc140 , 0x6fc165 
, 0x6fc19b , 0x6fc1d4 
,
  0x6fc20d , 0x6fc296 , 
0x6fc31f , 0x6fc3a8 , 0x6fc3d7 
,
  0x6fc406 , 0x6fc435 , 
0x6fc463 , 0x6fc50a , 0x6fc5b1 
,
  0x6fc658 , 0x6fc6b3 , 
0x6fc756 , 0x6fc78c ,
  0x6fc7f9 , 0x6fc8c8 , 
0x6fc8f6 , 0x6fc924 ,
  0x6fc92f , 0x6fc997 , 
0x6fc9f0 , 0x6fc9fb ,
  0x6fca63 , 0x6fcabc , 
0x6fcaf4 , 0x6fcb69 ,
  0x6fcb94 , 0x6fcbdf , 
0x6fcc2d , 0x6fcc88 ,
  0x6fccca , 0x6fcd0f , 
0x6fcd3d , 0x6fcd6b ,
  0x6fcda5 , 0x6fce08 , 
0x6fce6b , 0x6fcea5 ,
  0x6fced3 , 0x6fcf01 , 
0x6fcf31 , 0x6fd01d ,
  0x6fd073 , 0x6fd25b , 
0x6fd34e , 0x6fd432 ,
  0x6fd50e , 0x6fd535 , 
0x6fd55c , 0x6fd583 ,
  0x6fd5aa , 0x6fd5d8 , 
0x6fd5ff , 0x6fd745 ,
  0x6fd849 , 0x6fd870 , 
0x6fd89e , 0x6fd8cc ,
  0x6fd8fa , 0x6fd950 , 
0x6fd977 , 0x6fd99e ,
  0x6fcfa7 , 0x6fda1a , 
0x6fda41 , 0x6fd9c5 ,
  0x6fd9f3 , 0x6fda68 , 
0x6fda8f , 0x6fdb2d ,
  0x6fdb54 , 0x6fdbf2 , 
0x6fdc20 , 0x6fdc4e ,
  0x6fdc89 , 0x6fdd3c , 
0x6fddca , 0x6fde51 ,
  0x6fdede , 0x6fdffe , 
0x6fe0e8 , 0x6fe1bc ,
  0x6fe2d9 , 0x6fe3c0 , 
0x6fe491 , 0x6fe4bf ,
  0x6fe4ed }
#3  0x006fe563 in ExecInterpExprStillValid (state=0x130c100, 
econtext=0x130be00, isNull=0x7fff1f9659e7) at execExprInterp.c:1807
No locals.
#4  0x0074a22c in ExecEvalExprSwitchContext (state=0x130c100, 
econtext=0x130be00, isNull=0x7fff1f9659e7) at 
../../../src/include/executor/executor.h:313
retDatum = 18446462598752812812
oldContext = 0x130b990
#5  0x0074a295 in ExecProject (projInfo=0x130c0f8) at 
../../../src/include/executor/executor.h:347
econtext = 0x130be00
state = 0x130c100
slot = 0x130c008
isnull = false
#6  0x0074a47b in ExecResult (pstate=0x130bce8) at nodeResult.c:136
node = 0x130bce8
outerTupleSlot = 0x779642 
outerPlan = 0x0
econtext = 0x130be00
#7  0x00712656 in ExecProcNodeFirst (node=0x130bce8) at 
execProcnode.c:444
No locals.
#8  0x00707085 in ExecProcNode (node=0x130bce8) at 
../../../src/include/executor/executor.h:245
No locals.
#9  0x00709b39 in ExecutePlan (estate=0x130bab0, planstate=0x130bce8, 
use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0,
direction=ForwardScanDirection, dest=0x12b30c0, execute_once=true) at 
execMain.c:1646
slot = 0x0
current_tuple_count = 0
#10 0x007076a9 in standard_ExecutorRun (queryDesc=0x12ae280, 
direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:364
estate = 0x130bab0
operation = CMD_SELECT
dest = 0x12b30c0
sendTuples = true
oldcontext = 0x12ae160
__func__ = "standard_ExecutorRun"
#11 0x007074d1 in ExecutorRun (queryDesc=0x12ae280, 
direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:308
No locals.
#12 0x00921d77 in PortalRunSelect (portal=0x12f3e70, forward=true, 
count=0, dest=0x12b30c0) at pquery.c:912
queryDesc = 0x12ae280
direction = ForwardScanDirection
nprocessed = 0
__func__ = "PortalRunSelect"
#13 0x00921a30 in PortalRun (portal=0x12f3e70, 
count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x12b30c0, 
altdest=0x12b30c0, qc=0x7fff1f965d60)
at pquery.c:756
_save_exception_stack = 0x7fff1f965e70
_save_context_stack = 0x0
_local_sigjmp_buf = {{__jmpbuf = {0, 60194379429867500, 4706656, 
140733723337632, 0, 0, 60194379352272876, -59702997874741268}, __mask_was_saved 
= 0,
__saved_mask = 

Re: potential stuck lock in SaveSlotToPath()

2020-04-05 Thread Peter Eisentraut

On 2020-04-02 08:21, Michael Paquier wrote:
> On Wed, Apr 01, 2020 at 04:26:25PM +0200, Peter Eisentraut wrote:
>> Good catch.  How about the attached patch?
>
> WFM.  Another trick would be to call LWLockRelease() after generating
> the log, but I find your patch more consistent with the surroundings.

done

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Add A Glossary

2020-04-05 Thread Fabien COELHO



Hi Corey,


>> ISTM that occurrences of these words elsewhere in the documentation should
>> link to the glossary definitions?
>
> Yes, that's a big project. I was considering writing a script to compile
> all the terms as search terms, paired with their glossary ids, and then
> invoke git grep to identify all pages that have term FOO but don't have
> glossary-foo. We would then go about gloss-linking those pages as
> appropriate, but only a few pages at a time to keep scope sane.

I'd go for scripting the thing.

Should the glossary be backpatched, to possibly ease doc backpatches?

> Also, I'm unclear about the circumstances under which we should _not_
> tag a term.

At least when they are explained locally.

> I remember hearing that we should only tag it on the first usage, but is
> that per section or per page?

Page?

>> As the definitions are short and to the point, maybe the HTML display
>> could (also) "hover" the definitions when the mouse passes over the word,
>> using the "title" attribute?
>
> I like that idea, if it doesn't conflict with accessibility standards
> (maybe that's just titles on images, not sure).

The following worked fine:

  Title Tag Test
  The ACID
  property is great.

So basically the def can be put on the glossary link; however, retrieving
the definition should be automatic.

> I suspect we would want to just carry over the first sentence or so with a
> ... to avoid cluttering the screen with my overblown definition of a
> sequence.

Dunno. The definitions are quite short, maybe they can fit whole.

> I suggest we pursue this idea in another thread, as we'd probably want to
> do it for acronyms as well.

Or not. I'd test committer temperature before investing time, because it
would mean that backpatching the doc would be a little harder.

>> Entries could link to relevant wikipedia pages, like the acronyms section
>> does?
>
> They could. I opted not to do that because each external link invites
> debate about how authoritative that link is, which is easier to do with
> acronyms. Now that the glossary is a reality, it's easier to have those
> discussions.

Ok.

--
Fabien.




Comment explaining why vacuum needs to push snapshot seems insufficient.

2020-04-05 Thread Andres Freund
Hi,

vacuum_rel() has the following comment:
/*
 * Functions in indexes may want a snapshot set.  Also, setting a snapshot
 * ensures that RecentGlobalXmin is kept truly recent.
 */
PushActiveSnapshot(GetTransactionSnapshot());

which was added quite a while ago in

commit d53a56687f3d4772d17ffa0013a33231b7163731
Author: Alvaro Herrera 
Date:   2008-09-11 14:01:10 +


But to me that's understating the issue. Don't we e.g. need a snapshot
to ensure that pg_subtrans won't be truncated away?  I thought xlog.c
doesn't pass PROCARRAY_FLAGS_VACUUM to GetOldestXmin when truncating?
TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));

It's fine for rows that vacuum could see according to its xmin to be
pruned away, since that won't happen while it has a page locked. But we
can't just have a pg_subtrans access error out, and there's no page
level interlock against that.

Also, without an xmin set, it seems possible that vacuum could see some
of the transaction ids it uses (in local memory) wrap around. While not
likely, it doesn't seem that unlikely either, since autovacuum will be
running full throttle if there's a 2 billion xid old transaction hanging
around. And if that super old transaction finishes, e.g. vacuum's
OldestXmin value could end up being in the future in the middle of
vacuuming the table (if that table has a new relfrozenxid).


How about replacing it with something like
/*
 * Need to acquire a snapshot to prevent pg_subtrans from being truncated,
 * cutoff xids in local memory wrapping around, and to have updated xmin
 * horizons.
 */

Greetings,

Andres Freund