Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-08-11 Thread Justin Pryzby
On Tue, Aug 11, 2020 at 02:39:45PM +0900, Michael Paquier wrote:
> so I have removed that part, and applied the patch.

Thank you

> > The rest of your patch looks fine.  In my mind, REINDEX(CONCURRENTLY) was 
> > the
> > "new way" to write things, and it's what's easy to support, so I think I 
> > didn't
> > put special effort into making tab completion itself complete.
> 
> The grammar that has been committed was the one that got the most
> support, so we need to live with that.  I wonder if we should simplify
> ReindexStmt and move the "concurrent" flag to be under "options", but
> that may not be worth the time spent on it as long as we don't have
> CONCURRENTLY as part of the parenthesized grammar.

I think it's kind of a good idea, since the next patch does exactly that
(parenthesize (CONCURRENTLY)).

I included that as a new 0002, but it doesn't save anything, so maybe
it's not a win.

$ git diff --stat
 src/backend/commands/indexcmds.c | 20 +++-
 src/backend/nodes/copyfuncs.c    |  1 -
 src/backend/nodes/equalfuncs.c   |  1 -
 src/backend/parser/gram.y        | 16 
 src/backend/tcop/utility.c       |  6 +++---
 src/include/nodes/parsenodes.h   |  2 +-
 6 files changed, 27 insertions(+), 19 deletions(-)

-- 
Justin
>From fed84348477785f8b0b2ae3a1caf2e927e405f96 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 27 Mar 2020 17:50:46 -0500
Subject: [PATCH v22 1/5] Change REINDEX/CLUSTER to accept an option list..

..like EXPLAIN (..), VACUUM (..), and ANALYZE (..).

Change docs in the style of VACUUM.  See also: 52dcfda48778d16683c64ca4372299a099a15b96
---
 doc/src/sgml/ref/cluster.sgml    | 29 ++---
 doc/src/sgml/ref/reindex.sgml    | 43 +---
 src/backend/commands/cluster.c   | 23 -
 src/backend/commands/indexcmds.c | 41 --
 src/backend/nodes/copyfuncs.c    |  2 ++
 src/backend/nodes/equalfuncs.c   |  2 ++
 src/backend/parser/gram.y        | 35 +++---
 src/backend/tcop/utility.c       | 36 +++---
 src/bin/psql/tab-complete.c      | 23 +
 src/include/commands/cluster.h   |  3 ++-
 src/include/commands/defrem.h    |  7 +++---
 src/include/nodes/parsenodes.h   |  2 ++
 12 files changed, 180 insertions(+), 66 deletions(-)

diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 4da60d8d56..a6df8a3d81 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -21,8 +21,13 @@ PostgreSQL documentation
 
  
 
-CLUSTER [VERBOSE] table_name [ USING index_name ]
-CLUSTER [VERBOSE]
+CLUSTER [VERBOSE] [ ( option [, ...] ) ] table_name [ USING index_name ]
+CLUSTER [VERBOSE] [ ( option [, ...] ) ]
+
+where option can be one of:
+
+VERBOSE [ boolean ]
+
 
  
 
@@ -81,6 +86,16 @@ CLUSTER [VERBOSE]
   Parameters
 
   
+   
+VERBOSE
+
+ 
+  Prints a progress report as each table is clustered.
+
+ 
+
+   
+

 table_name
 
@@ -100,13 +115,19 @@ CLUSTER [VERBOSE]

 

-VERBOSE
+boolean
 
  
-  Prints a progress report as each table is clustered.
+  Specifies whether the selected option should be turned on or off.
+  You can write TRUE, ON, or
+  1 to enable the option, and FALSE,
+  OFF, or 0 to disable it.  The
+  boolean value can also
+  be omitted, in which case TRUE is assumed.
  
 

+
   
  
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index aac5d5be23..ff369a7313 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -25,7 +25,7 @@ REINDEX [ ( option [, ...] ) ] { IN
 
 where option can be one of:
 
-VERBOSE
+VERBOSE [ boolean ]
 
  
 
@@ -141,19 +141,6 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
-   
-name
-
- 
-  The name of the specific index, table, or database to be
-  reindexed.  Index and table names can be schema-qualified.
-  Presently, REINDEX DATABASE and REINDEX SYSTEM
-  can only reindex the current database, so their parameter must match
-  the current database's name.
- 
-
-   
-

 CONCURRENTLY
 
@@ -181,6 +168,34 @@ REINDEX [ ( option [, ...] ) ] { IN
  
 

+
+   
+name
+
+ 
+  The name of the specific index, table, or database to be
+  reindexed.  Index and table names can be schema-qualified.
+  Presently, REINDEX DATABASE and REINDEX SYSTEM
+  can only reindex the current database, so their parameter must match
+  the current database's name.
+ 
+
+   
+
+   
+boolean
+
+ 
+  Specifies whether the selected option should be turned on or off.
+  You can write TRUE, ON, or
+  1 to enable the option, and FALSE,
+  OFF, or 0 to disable it.  The
+  boolean value can also
+  be omitted, in which case TRUE is assumed.
+ 
+
+   
+
   
  
 
diff --git a/src/b

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-11 Thread Ashutosh Sharma
On Fri, Aug 7, 2020 at 9:21 PM Robert Haas  wrote:
>
> On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma  wrote:
> There's a certain inconsistency to these messages:
>
> rhaas=# create table foo (a int);
> CREATE TABLE
> rhaas=# insert into foo values (1);
> INSERT 0 1
> rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
> NOTICE:  skipping tid (0, 2) because it contains an invalid offset
>  heap_force_kill
> -
>
> (1 row)
>
> rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
> ERROR:  invalid item pointer
> LOCATION:  tids_same_page_fetch_offnums, heap_surgery.c:347
> rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
> ERROR:  block number 1 is out of range for relation "foo"
>
> From a user perspective it seems like I've made three very similar
> mistakes: in the first case, the offset is too high, in the second
> case it's too low, and in the third case the block number is out of
> range. But in one case I get a NOTICE and in the other two cases I get
> an ERROR. In one case I get the relation name and in the other two
> cases I don't. The two complaints about an invalid offset are phrased
> completely differently from each other. For example, suppose you phrase
> them all like this:
>
> ERROR: tid (%u, %u) is invalid for relation "%s" because the block
> number is out of range (%u..%u)
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item
> number is out of range for this block (%u..%u)
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
>

Thank you for your suggestions. To make this consistent, I am planning
to do the following changes:

Remove the "invalid item pointer" error message from
tids_same_page_fetch_offnums() and expand the if-check that detects
invalid offset numbers in the CRITICAL section so that it not only
checks whether the offset number is > maxoffset, but also whether the
offset number is equal to 0. That way it would also do the job that
"if (!ItemPointerIsValid)" was doing for us.

Further, if an invalid block number is detected, I am planning to skip
all the tids associated with that block and move to the next block.
Hence, instead of reporting an error, I would report a NOTICE message
to the user.

The other two messages, for reporting unused items and dead items,
remain the same. Hence, with the above changes, we would be reporting
the following 4 messages:

NOTICE:  skipping all the tids in block %u for relation "%s" because
the block number is out of range

NOTICE:  skipping tid (%u, %u) for relation "%s" because the item
number is out of range for this block

NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked dead

NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked unused

Please let me know if you are okay with the above changes or not?

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




RE: Libpq support to connect to standby server as priority

2020-08-11 Thread Smith, Peter
Hi Greg,

I was able to successfully execute all the tests of the v17-0001 patch.

But I do have a couple of additional review comments about the test code.



COMMENT - missing "any" tests

In my earlier code review (previous email) I suggested that "any" should be
added as a valid option for the target_server_type parameter.

But this now means there are some missing test cases for

target_server_type = "any"



COMMENT - negative tests?

IIUC when "standby" (aka "secondary") is specified, and there is no in_recovery 
server available, then the result should be an error like "could not make a 
readonly connection to server "

But I did not find any such error combination tests.

e.g. Where are these test cases?

target_session_attrs = "standby", when no standby is available
target_session_attrs = "secondary", when no standby is available
target_server_type = "standby", when no standby is available
target_server_type = "secondary", when no standby is available

--

And, similarly for "could not make a writable connection to server ".

e.g. Where are these test cases?

target_session_attrs = "primary", when no primary is available
target_session_attrs = "read-write", when no primary is available
target_server_type = "primary", when no primary is available
target_server_type = "read-write", when no primary is available


Kind Regards,
Peter Smith
---
Fujitsu Australia


Re: [PATCH] Covering SPGiST index

2020-08-11 Thread Pavel Borisov
I added changes in documentation into the patch.


-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


v4-0001-Covering-SP-GiST-index-support-for-INCLUDE-column.patch
Description: Binary data


Re: Add information to rm_redo_error_callback()

2020-08-11 Thread Masahiko Sawada
On Tue, 11 Aug 2020 at 15:30, Michael Paquier  wrote:
>
> On Tue, Aug 11, 2020 at 02:45:50PM +0900, Masahiko Sawada wrote:
> > Thank you for updating the patch!
> >
> > The patch looks good to me. I've set this patch as Ready for Committer.
>
> +   for (block_id = 0; block_id <= record->max_block_id; block_id++)
> +   {
> +   RelFileNode rnode;
> +   ForkNumber forknum;
> +   BlockNumber blknum;
>
> Doesn't this potentially create duplicate information in some of the
> RM's desc() callbacks, and are we sure that the information provided
> is worth having for all the RMs?  As one example, gin_desc() looks at
> some of the block information, so there are overlaps.

Yeah, there is duplicate information in some RMs. I thought we could
change the individual RMs’ desc() functions to output the block
information, but as far as I can see from the pg_waldump output the
duplicates are not annoying, and many RMs’ desc() functions don’t show
the block information.

>  It may be
> worth thinking about showing more information for has_image and
> apply_image if a block is in_use?

Yes, I’m okay with adding information for has_image and apply_image,
but I'm not sure how showing these in errcontext would help. If an
error related to has_image or apply_image happens, errmsg should show
detailed information about the FPI.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: psql: add \si, \sm, \st and \sr functions to show CREATE commands for indexes, matviews, triggers and tables

2020-08-11 Thread Anna Akenteva

On 2020-07-28 20:46, a.pervush...@postgrespro.ru wrote:

I've attached a patch that implements \si, \sm, \st and \sr functions
that show the CREATE command for indexes, matviews, triggers and
tables. The functions are implemented similarly to the existing sf/sv
functions with some modifications.


To me these functions seem useful.
As for adding them to the server side, I don't see a big need for it. It
feels more logical to follow the already established pattern for the
\s[...] commands.


About the patch:

1) There is some code duplication for the exec_command_[sm|si|st|sr] 
functions. Plus, it seems weird to separate sm (show matview) from sv 
(show view). Perhaps it would be more convenient to combine some of the 
code? Maybe by editing the already-existing exec_command_sf_sv() 
function.


2) Seeing how \s and \e functions were added together, I'm wondering - 
should there be \e functions too for any objects affected by this patch?


--
Anna Akenteva
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Make contrib modules' installation scripts more secure.

2020-08-11 Thread Christoph Berg
I think this change neglected to add plpgsql to the extension
dependencies in the .control file:

12:53:51 #   Failed test 'psql -qc 'CREATE EXTENSION "cube"''
12:53:51 #   at t/TestLib.pm line 213.
12:53:51 not ok 68 - psql -qc 'CREATE EXTENSION "cube"'
12:53:51 #  got: '1'
12:53:51 # expected: '0'
12:53:51 not ok 69 - extension cube installs without error
12:53:51 #   Failed test 'extension cube installs without error'
12:53:51 #   at t/TestLib.pm line 214.
12:53:51 #  got: 'ERROR:  language "plpgsql" does not exist
12:53:51 # HINT:  Use CREATE EXTENSION to load the language into the database.
12:53:51 # '

(The Debian regression tests remove plpgsql before testing all
extensions in turn.)

Christoph




Re: Add session statistics to pg_stat_database

2020-08-11 Thread Laurenz Albe
On Thu, 2020-07-23 at 18:16 +0500, Ahsan Hadi wrote:
> On Wed, Jul 8, 2020 at 4:17 PM Laurenz Albe  wrote:
> > Here is a patch that adds the following to pg_stat_database:
> > - number of connections
> 
> Is it expected behaviour to not count idle connections? The connection is 
> included after it is aborted but not while it was idle.

Thanks for looking.

Currently, the patch counts connections when they close.
I could change the behavior so that they are counted immediately.

Yours,
Laurenz Albe





Re: Make contrib modules' installation scripts more secure.

2020-08-11 Thread Christoph Berg
Re: To PostgreSQL Hackers
> I think this change neglected to add plpgsql to the extension
> dependencies in the .control file:
> 
> 12:53:51 #   Failed test 'psql -qc 'CREATE EXTENSION "cube"''
> 12:53:51 #   at t/TestLib.pm line 213.
> 12:53:51 not ok 68 - psql -qc 'CREATE EXTENSION "cube"'
> 12:53:51 #  got: '1'
> 12:53:51 # expected: '0'
> 12:53:51 not ok 69 - extension cube installs without error
> 12:53:51 #   Failed test 'extension cube installs without error'
> 12:53:51 #   at t/TestLib.pm line 214.
> 12:53:51 #  got: 'ERROR:  language "plpgsql" does not exist
> 12:53:51 # HINT:  Use CREATE EXTENSION to load the language into the database.
> 12:53:51 # '

Or maybe the argument is that the extension needs plpgsql only at
install time, and not to run, and you could really remove it after the
CREATE EXTENSION has been done. But that argument feels pretty icky.
And dump-restore would fail.

At least the error message is very explicit.

Christoph




Re: SyncRepLock acquired exclusively in default configuration

2020-08-11 Thread Asim Praveen
The proposed fix looks good; it resolves the lock contention problem as
intended. +1 from my side.

> On 09-Jul-2020, at 4:22 PM, Fujii Masao  wrote:
> 
> 
> Regarding how to fix, don't we need memory barrier when reading
> sync_standbys_defined? Without that, after SyncRepUpdateSyncStandbysDefined()
> updates it to true, SyncRepWaitForLSN() can see the previous value,
> i.e., false, and then exit out of the function. Is this right?
> If this is right, we need memory barrier to avoid this issue?
> 

There is no out-of-order execution hazard in the scenario you are describing, 
memory barriers don’t seem to fit.  Using locks to synchronise checkpointer 
process and a committing backend process is the right way.  We have made a 
conscious decision to bypass the lock, which looks correct in this case.

As an aside, there is a small (?) window where a change to 
synchronous_standby_names GUC is partially propagated among committing 
backends, checkpointer and walsender.  Such a window may result in walsender 
declaring a standby as synchronous while a commit backend fails to wait for it 
in SyncRepWaitForLSN.  The root cause is walsender uses sync_standby_priority, 
a per-walsender variable to tell if a standby is synchronous.  It is updated 
when walsender processes a config change.  Whereas sync_standbys_defined, a 
variable updated by checkpointer, is used by committing backends to determine 
if they need to wait.  If checkpointer is busy flushing buffers, it may take 
longer than walsender to reflect a change in sync_standbys_defined.  This is a 
low impact problem, should be ok to live with it.

Asim



Re: EDB builds Postgres 13 with an obsolete ICU version

2020-08-11 Thread Magnus Hagander
On Tue, Aug 4, 2020 at 11:42 AM Dave Page  wrote:

>
>
> On Tue, Aug 4, 2020 at 10:29 AM Magnus Hagander 
> wrote:
>
>> On Tue, Aug 4, 2020 at 10:07 AM Dave Page  wrote:
>>
>>>
>>>
>>> On Tue, Aug 4, 2020 at 1:04 AM Bruce Momjian  wrote:
>>>
 On Mon, Aug  3, 2020 at 08:56:06PM +0200, Daniel Verite wrote:
 >  Hi,
 >
 > As a follow-up to bug #16570 [1] and other previous discussions
 > on the mailing-lists, I'm checking out PG13 beta for Windows
 > from:
 >  https://www.enterprisedb.com/postgresql-early-experience
 > and it ships with the same obsolete ICU 53 that was used
 > for PG 10,11,12.
 > Besides not having the latest Unicode features and fixes, ICU 53
 > ignores the BCP 47 tags syntax in collations used as examples
 > in Postgres documentation, which leads to confusion and
 > false bug reports.
 > The current version is ICU 67.
 >
 > I don't see where the suggestion to upgrade it before the
 > next PG release should be addressed but maybe some people on
 > this list do know or have the leverage to make it happen?

 Well, you can ask EDB about this, but perhaps they have kept the same ICU
 version so indexes will not need to be reindexed.

>>>
>>> Correct - updating ICU would mean a reindex is required following any
>>> upgrade, major or minor.
>>>
>>> I would really like to find an acceptable solution to this however as it
>>> really would be good to be able to update ICU.
>>>
>>
>> It certainly couldn't and shouldn't be done in a minor.
>>
>> But doing so in v13 doesn't seem entirely unreasonable, especially given
>> that I believe we will detect the requirement to reindex thanks to the
>> versioning, and not just start returning invalid results (like, say, with
>> those glibc updates).
>>
>> Would it be possible to have the installer even check if there are any
>> icu indexes in the database. If there aren't, just put in the new version
>> of icu. If there are, give the user a choice of the old version or new
>> version and reindex?
>>
>
> That would require fairly large changes to the installer to allow it to
> login to the database server (whether that would work would be dependent on
> how pg_hba.conf is configured), and also assumes that the ICU ABI hasn't
> changed between releases. It would also require some hacky renaming of
> DLLs, as they have the version number in them.
>

I assumed it had code for that stuff already. Mainly because I assumed it
supported doing pg_upgrade, which requires similar things no?



>
> The chances of designing, building and testing that thoroughly before v13
> is released are about zero, I'd say.
>

Yeah, I can see how it would be for 13 -- unfortunately. But I definitely
think it's something that should go high on the list of things to get fixed
for 14.

//Magnus


Re: Can I test Extended Query in core test framework

2020-08-11 Thread Ashutosh Bapat
You could run PREPARE and EXECUTE as SQL commands from psql. Please
take a look at the documentation of those two commands. I haven't
looked at the TAP infrastructure, but you could open a psql session to
a running server and send an arbitrary number of SQL queries through it.

That said, the server starts caching plans only after it sees a certain
number of EXECUTEs. So if you are testing cached plans, that's
something to worry about.

On Tue, Aug 11, 2020 at 8:13 AM Andy Fan  wrote:
>
> I want to write some test cases with extended query in core test system. 
> basically it looks like
>
> PreparedStatement preparedStatement  = conn.prepareStatement("select * from 
> bigtable");
> preparedStatement.setFetchSize(4);
> ResultSet rs = preparedStatement.executeQuery();
> while(rs.next())
> {
> System.out.println(rs.getInt(1));
> // conn.commit();
> conn.rollback();
> }
>
>
> However I don't find a way to do that after checking the example in 
> src/test/xxx/t/xxx.pl
> where most often used object is PostgresNode, which don't have such abilities.
>
> Can I do that in core system, I tried grep '\->prepare'  and  '\->execute' 
> and get nothing.
> am I miss something?
>
>
> --
> Best Regards
> Andy Fan



-- 
Best Wishes,
Ashutosh Bapat




Inconsistent behavior of smart shutdown handling for queries with and without parallel workers

2020-08-11 Thread Bharath Rupireddy
Hi,

In case of smart shutdown, the postmaster receives SIGTERM from pg_ctl
and "disallows new connections, but lets existing sessions end their
work normally". This means that it doesn't abort any ongoing txns in
any of the sessions; it lets the sessions exit on their own, and only
then is the postmaster shut down. It looks like this behavior holds
only if the sessions are executing non-parallel queries. Parallel
queries are getting aborted, see [1].

Although the postmaster receives two different signals for
smart (SIGTERM) and fast (SIGINT) shutdowns, it only sends SIGTERM to
bgworkers in both cases (see pmdie() -> SignalSomeChildren() in
postmaster.c). In StartBackgroundWorker(), bgworkers have
bgworker_die() as the default handler for SIGTERM, which just reports
a FATAL error. Is this handler correct for both fast and smart
shutdowns for all types of bgworkers?

For parallel workers, in ParallelWorkerMain() the SIGTERM handler gets
changed to die() (which means the same handler gets used for both
smart and fast shutdowns), which sets ProcDiePending = true; later,
when the parallel workers reach CHECK_FOR_INTERRUPTS() (for a parallel
seq scan, that happens in ExecScanFetch()), since ProcDiePending was
set to true, they throw the error "terminating connection due to
administrator command" in ProcessInterrupts().
Having die() as the handler for fast shutdown may be correct, but for
smart shutdown, as mentioned in $subject, it looks inconsistent.

1. In general, do we need to allow the postmaster to send different
signals to bgworkers for fast and smart shutdowns and let them
differentiate the two modes (if needed)?
2. Do we need to handle smart shutdown for dynamic bgworkers (parallel
workers) with a different signal than SIGTERM and retain the SIGTERM
handler die() as is, since SIGTERM is being sent to bgworkers from
other places as well? If we do so, we can block that new signal until
the parallel workers finish the current query execution, or completely
ignore it in ParallelWorkerMain(). If the bgw_flags flag is
BGWORKER_CLASS_PARALLEL, we can make some changes in the postmaster's
SignalSomeChildren() to detect and send that new signal. It looks like
SIGUSR2 currently remains ignored for dynamic bgworkers and could be
used for this purpose (a rough sketch follows this list).
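
To make option (2) above a bit more concrete, a hypothetical dedicated
handler for parallel workers could simply record the request instead of
raising FATAL. All names below are assumptions, not an actual patch, and
the fragment assumes the usual backend environment (postgres.h,
miscadmin.h, storage/latch.h):

    /* Hypothetical sketch: a dedicated smart-shutdown handler for parallel workers. */
    static volatile sig_atomic_t smart_shutdown_requested = false;

    static void
    ParallelWorkerSmartShutdownHandler(SIGNAL_ARGS)
    {
        /* Remember the request; let the worker finish its current query first. */
        smart_shutdown_requested = true;
        SetLatch(MyLatch);
    }

The worker would then check the flag at a convenient point, for example
after the current query completes, instead of erroring out in
ProcessInterrupts().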

Thoughts?

Thanks @vignesh C  for inputs and writeup review.

[1] (smart shutdown issued with pg_ctl, while the parallel query is running).
postgres=# EXPLAIN ANALYZE SELECT COUNT(*) FROM t_test t1, t_test t2
WHERE t1.many = t2.many;
ERROR:  terminating connection due to administrator command
CONTEXT:  parallel worker

[2] The usual behavior: smart shutdown lets existing sessions end
their work normally, while fast shutdown aborts their current
transactions and exits promptly.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-11 Thread Robert Haas
On Tue, Aug 11, 2020 at 3:39 AM Ashutosh Sharma  wrote:
> The other two messages for reporting unused items and dead items
> remain the same. Hence, with above change, we would be reporting the
> following 4 messages:
>
> NOTICE:  skipping all the tids in block %u for relation "%s" because
> the block number is out of range
>
> NOTICE:  skipping tid (%u, %u) for relation "%s" because the item
> number is out of range for this block
>
> NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked dead
>
> NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked unused
>
> Please let me know if you are okay with the above changes or not?

That seems broadly reasonable, but I would suggest phrasing the first
message like this:

skipping block %u for relation "%s" because the block number is out of range

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel worker hangs while handling errors.

2020-08-11 Thread vignesh C
On Fri, Aug 7, 2020 at 1:34 AM Robert Haas  wrote:
>
> On Tue, Jul 28, 2020 at 5:35 AM Bharath Rupireddy
>  wrote:
> > The v4 patch looks good to me. Hang is not seen, make check and make
> > check-world passes. I moved this to the committer for further review
> > in https://commitfest.postgresql.org/29/2636/.
>
> I don't think I agree with this approach. In particular, I don't
> understand the rationale for unblocking only SIGUSR1. Above, Vignesh
> says that he feels that unblocking only that signal would be the right
> approach, but no reason is given. I have two reasons why I suspect
> it's not the right approach. One, it doesn't seem to be what we do
> elsewhere; the only existing cases where we have special handling for
> particular signals are SIGQUIT and SIGPIPE, and those places have
> comments explaining the reason why they are handled in a special way.
> Two, SIGUSR1 is used for a LOT of things: look at all the different
> cases procsignal_sigusr1_handler() checks. If the intention is to only
> allow the things we know are safe, rather than all the signals there
> are, I think this coding utterly fails to achieve that - and for
> reasons that I don't think are really fixable.
>

My intention behind unblocking only SIGUSR1 rather than all signals was
mainly that we are already in the error path and are about to exit
after emitting the error report. I was not sure whether we intended to
receive any other signal just before exiting.
The solution Robert & Tom are suggesting, calling
BackgroundWorkerUnblockSignals, fixes the actual problem.
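
For reference, a minimal sketch of that suggestion; the surrounding
error-handling block is paraphrased, not copied from the actual patch,
and it assumes the usual bgworker environment (postgres.h,
postmaster/bgworker.h, setjmp.h):

    /* Sketch only: unblock signals before the worker reports the error,
     * so the error path cannot get stuck behind a blocked signal. */
    if (sigsetjmp(local_sigjmp_buf, 1) != 0)
    {
        BackgroundWorkerUnblockSignals();
        EmitErrorReport();
        proc_exit(1);
    }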

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-11 Thread Ashutosh Sharma
On Tue, Aug 11, 2020 at 7:33 PM Robert Haas  wrote:
>
> On Tue, Aug 11, 2020 at 3:39 AM Ashutosh Sharma  wrote:
> > The other two messages for reporting unused items and dead items
> > remain the same. Hence, with above change, we would be reporting the
> > following 4 messages:
> >
> > NOTICE:  skipping all the tids in block %u for relation "%s" because
> > the block number is out of range
> >
> > NOTICE:  skipping tid (%u, %u) for relation "%s" because the item
> > number is out of range for this block
> >
> > NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked dead
> >
> > NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked unused
> >
> > Please let me know if you are okay with the above changes or not?
>
> That seems broadly reasonable, but I would suggest phrasing the first
> message like this:
>
> skipping block %u for relation "%s" because the block number is out of range
>

Okay, thanks for the confirmation. I'll do that.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: Switch to multi-inserts for pg_depend

2020-08-11 Thread Robert Haas
On Tue, Aug 11, 2020 at 1:59 AM Michael Paquier  wrote:
> On Mon, Aug 10, 2020 at 05:32:21PM -0700, Andres Freund wrote:
> > Do we really want to end up with several separate defines for different
> > type of catalog batch inserts? That doesn't seem like a good
> > thing. Think there should be a single define for all catalog bulk
> > inserts.
>
> Unlikely so, but I kept them separate to potentially lower the
> threshold of 64kB for catalog rows that have a lower average size than
> pg_attribute.

Uh, why would we want to do that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-08-11 Thread Tom Lane
James Coleman  writes:
> Why is a CIC in active index-building something we need to wait for?
> Wouldn't it fall under a similar kind of logic to the other snapshot
> types we can explicitly ignore? CIC can't be run in a manual
> transaction, so the snapshot it holds won't be used to perform
> arbitrary operations (i.e., the reason why a manual ANALYZE can't be
> ignored).

Expression indexes that call user-defined functions seem like a
pretty serious risk factor for that argument.  Those are exactly
the same expressions that ANALYZE will evaluate, as a result of
which we judge it unsafe to ignore.  Why would CIC be different?

regards, tom lane




Re: Can I test Extended Query in core test framework

2020-08-11 Thread Tom Lane
Andy Fan  writes:
> I want to write some test cases with extended query in core test system.

Why?  (That is, what is it you need to test exactly?)

psql has no ability to issue extended queries AFAIR, so the normal
regression test scripts can't exercise this.  We haven't built anything
for it in the TAP infrastructure either.  We do have test coverage
via pgbench and ecpg, though I concede that's pretty indirect.

I recall someone (Andres, possibly) speculating about building a tool
specifically to exercise low-level protocol issues, but that hasn't
been done either.

None of these are necessarily germane to any particular test requirement,
which is why I'm wondering.  The JDBC fragment you show seems like it's
something that should be tested by, well, JDBC.  What's interesting about
it for any other client?

regards, tom lane




Re: SyncRepLock acquired exclusively in default configuration

2020-08-11 Thread Robert Haas
On Tue, Aug 11, 2020 at 7:55 AM Asim Praveen  wrote:
> There is no out-of-order execution hazard in the scenario you are describing, 
> memory barriers don’t seem to fit.  Using locks to synchronise checkpointer 
> process and a committing backend process is the right way.  We have made a 
> conscious decision to bypass the lock, which looks correct in this case.

Yeah, I am not immediately seeing why a memory barrier would help anything here.

> As an aside, there is a small (?) window where a change to 
> synchronous_standby_names GUC is partially propagated among committing 
> backends, checkpointer and walsender.  Such a window may result in walsender 
> declaring a standby as synchronous while a commit backend fails to wait for 
> it in SyncRepWaitForLSN.  The root cause is walsender uses 
> sync_standby_priority, a per-walsender variable to tell if a standby is 
> synchronous.  It is updated when walsender processes a config change.  
> Whereas sync_standbys_defined, a variable updated by checkpointer, is used by 
> committing backends to determine if they need to wait.  If checkpointer is 
> busy flushing buffers, it may take longer than walsender to reflect a change 
> in sync_standbys_defined.  This is a low impact problem, should be ok to live 
> with it.

I think this gets to the root of the issue. If we check the flag
without a lock, we might see a slightly stale value. But, considering
that there's no particular amount of time within which configuration
changes are guaranteed to take effect, maybe that's OK. However, there
is one potential gotcha here: if the walsender declares the standby to
be synchronous, a user can see that, right? So maybe there's this
problem: a user sees that the standby is synchronous and expects a
transaction committing afterward to provoke a wait, but really it
doesn't. Now the user is unhappy, feeling that the system didn't
perform according to expectations.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Make contrib modules' installation scripts more secure.

2020-08-11 Thread Tom Lane
Christoph Berg  writes:
> I think this change neglected to add plpgsql to the extension
> dependencies in the .control file:

Adding plpgsql to the extension's dependencies would be a cure worse
than the disease: it'd mean that you could not remove plpgsql from the
system after installing cube, either.  That is surely unhelpful from
the standpoint of someone who would like to have cube without plpgsql.

> (The Debian regression tests remove plpgsql before testing all
> extensions in turn.)

Meh.  I think that's testing a case that we don't guarantee to work.
There was already a plpgsql dependency in hstore--1.1--1.2.sql,
which I just cribbed from to make these fixes.

In the long term, perhaps it'd be worth inventing a concept of an
"install-time dependency", whereby an extension could name something
it needs to have to run its script, but not necessarily afterwards.
But if you're someone who's afraid to have plpgsql installed, the
idea that it can be sucked in on-demand, behind the scenes, might not
make you feel better either.

A band-aid sort of fix would be to roll up the base install scripts
for these modules to the latest version, so that a plain install from
scratch doesn't need to execute any of the catalog adjustments in
their update scripts.  That's not terribly attractive from a maintenance
or testing standpoint, though.

regards, tom lane




Re: Can I test Extended Query in core test framework

2020-08-11 Thread Andres Freund
Hi,

On 2020-08-11 11:22:49 -0400, Tom Lane wrote:
> I recall someone (Andres, possibly) speculating about building a tool
> specifically to exercise low-level protocol issues, but that hasn't
> been done either.

Yea, I mentioned the possibility, but didn't plan to work on it. I am
not a perl person by any stretch (even though that's where I
started...).  But we can (and do iirc) have tests that just use libpq,
so it should be possible to test things like this at a bit higher cost.
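
As an illustration, a small standalone libpq program along these lines could
exercise the extended-query (parse/bind/execute) path; the connection string
and the query are just placeholders:

    #include <stdio.h>
    #include "libpq-fe.h"

    int
    main(void)
    {
        const char *params[1] = {"41"};
        PGconn     *conn = PQconnectdb("dbname=postgres");  /* placeholder conninfo */
        PGresult   *res;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        /* Parse message of the extended protocol */
        res = PQprepare(conn, "q1", "SELECT $1::int + 1", 1, NULL);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "prepare failed: %s", PQerrorMessage(conn));
        PQclear(res);

        /* Bind/Execute/Sync messages */
        res = PQexecPrepared(conn, "q1", 1, params, NULL, NULL, 0);
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            printf("result: %s\n", PQgetvalue(res, 0, 0));
        PQclear(res);

        PQfinish(conn);
        return 0;
    }

Note that the row-by-row fetching in the JDBC snippet (setFetchSize) relies
on the Execute message's row-count limit, which these libpq calls don't
expose directly, so that particular aspect would still need something
lower-level.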

Greetings,

Andres Freund




Re: massive FPI_FOR_HINT load after promote

2020-08-11 Thread James Coleman
On Tue, Aug 11, 2020 at 2:55 AM Masahiko Sawada
 wrote:
>
> On Tue, 11 Aug 2020 at 07:56, Alvaro Herrera  wrote:
> >
> > Last week, James reported to us that after promoting a replica, some
> > seqscan was taking a huge amount of time; on investigation he saw that
> > there was a high rate of FPI_FOR_HINT wal messages by the seqscan.
> > Looking closely at the generated traffic, HEAP_XMIN_COMMITTED was being
> > set on some tuples.
> >
> > Now this may seem obvious to some as a drawback of the current system,
> > but I was taken by surprise.  The problem was simply that when a page is
> > examined by a seqscan, we do HeapTupleSatisfiesVisibility of each tuple
> > in isolation; and for each tuple we call SetHintBits().  And only the
> > first time the FPI happens; by the time we get to the second tuple, the
> > page is already dirty, so there's no need to emit an FPI.  But the FPI
> > we sent only had the bit on the first tuple ... so the standby will not
> > have the bit set for any subsequent tuple.  And on promotion, the
> > standby will have to have the bits set for all those tuples, unless you
> > happened to dirty the page again later for other reasons.
> >
> > So if you have some table where tuples gain hint bits in bulk, and
> > rarely modify the pages afterwards, and promote before those pages are
> > frozen, then you may end up with a massive amount of pages that will
> > need hinting after the promote, which can become troublesome.
>
> Did the case you observed not use hot standby? I thought the impact of
> this issue could be somewhat alleviated in hot standby cases since
> read queries on the hot standby can set hint bits.

We do have hot standby enabled, and there are sometimes large queries
that may do seq scans that run against a replica, but there are
multiple replicas (and each one would have to have the bits set), and
a given replica that gets promoted in our topology isn't guaranteed to
be one that's seen those reads.

James




Re: massive FPI_FOR_HINT load after promote

2020-08-11 Thread Alvaro Herrera
On 2020-Aug-11, Masahiko Sawada wrote:

> On Tue, 11 Aug 2020 at 07:56, Alvaro Herrera  wrote:

> > So if you have some table where tuples gain hint bits in bulk, and
> > rarely modify the pages afterwards, and promote before those pages are
> > frozen, then you may end up with a massive amount of pages that will
> > need hinting after the promote, which can become troublesome.
> 
> Did the case you observed not use hot standby? I thought the impact of
> this issue could be somewhat alleviated in hot standby cases since
> read queries on the hot standby can set hint bits.

Oh, interesting, I didn't know that.  However, it's not 100% true: the
standby can set the bit in shared buffers, but it does not mark the page
dirty.  So when the page is evicted, those bits that were set are lost.
That's not great.  See MarkBufferDirtyHint:

    /*
     * If we need to protect hint bit updates from torn writes, WAL-log a
     * full page image of the page. This full page image is only necessary
     * if the hint bit update is the first change to the page since the
     * last checkpoint.
     *
     * We don't check full_page_writes here because that logic is included
     * when we call XLogInsert() since the value changes dynamically.
     */
    if (XLogHintBitIsNeeded() &&
        (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
    {
        /*
         * If we must not write WAL, due to a relfilenode-specific
         * condition or being in recovery, don't dirty the page.  We can
         * set the hint, just not dirty the page as a result so the hint
         * is lost when we evict the page or shutdown.
         *
         * See src/backend/storage/page/README for longer discussion.
         */
        if (RecoveryInProgress() ||
            RelFileNodeSkippingWAL(bufHdr->tag.rnode))
            return;


> > One simple idea to try to forestall this problem would be to modify the
> > algorithm so that all tuples are scanned and hinted if the page is going
> > to be dirtied -- then send a single FPI setting bits for all tuples,
> > instead of just on the first tuple.
> 
> This idea seems good to me but I'm concerned a bit that the
> probability of concurrent processes writing FPI for the same page
> might get higher since concurrent processes could set hint bits at the
> same time. If it's true, I wonder if we can advertise hint bits are
> being updated to prevent concurrent FPI writes for the same page.

Hmm, a very good point.  Sounds like we would need to obtain an
exclusive lock on the page .. but that would be very problematic.

I don't have a concrete proposal to solve this problem ATM, but it's
more and more looking like it's a serious problem.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: autovac issue with large number of tables

2020-08-11 Thread Tom Lane
Jim Nasby  writes:
> Without reading the 100+ emails or the 260k patch, I'm guessing that it 
> won't help because the problem I observed was spending most of its time in
>    42.62% postgres  [.] hash_search_with_hash_value
> I don't see how moving things to shared memory would help that at all.

So I'm a bit mystified as to why that would show up as the primary cost.
It looks to me like we force a re-read of the pgstats data each time
through table_recheck_autovac(), and it seems like the costs associated
with that would swamp everything else in the case you're worried about.

I suspect that the bulk of the hash_search_with_hash_value costs are
HASH_ENTER calls caused by repopulating the pgstats hash table, rather
than the single read probe that table_recheck_autovac itself will do.
It's still surprising that that would dominate the other costs of reading
the data, but maybe those costs just aren't as well localized in the code.

So I think Kasahara-san's point is that the shared memory stats collector
might wipe out those costs, depending on how it's implemented.  (I've not
looked at that patch in a long time either, so I don't know how much it'd
cut the reader-side costs.  But maybe it'd be substantial.)

In the meantime, though, do we want to do something else to alleviate
the issue?  I realize you only described your patch as a PoC, but I
can't say I like it much:

* Giving up after we've wasted 1000 pgstats re-reads seems like locking
the barn door only after the horse is well across the state line.

* I'm not convinced that the business with skipping N entries at a time
buys anything.  You'd have to make pretty strong assumptions about the
workers all processing tables at about the same rate to believe it will
help.  In the worst case, it might lead to all the workers ignoring the
same table(s).

I think the real issue here is autovac_refresh_stats's insistence that it
shouldn't throttle pgstats re-reads in workers.  I see the point about not
wanting to repeat vacuum work on the basis of stale data, but still ...
I wonder if we could have table_recheck_autovac do two probes of the stats
data.  First probe the existing stats data, and if it shows the table to
be already vacuumed, return immediately.  If not, *then* force a stats
re-read, and check a second time.
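
In rough pseudocode, that two-probe approach might look like the following;
stats_say_table_needs_vacuum() is a placeholder, and autovac_refresh_stats()
just stands in for the forced pgstats re-read the workers do today:

    /* Hypothetical sketch of the two-probe idea, not actual autovacuum.c code. */
    static bool
    recheck_table_needs_vacuum(Oid relid)
    {
        /* First probe: trust whatever pgstats snapshot we already have. */
        if (!stats_say_table_needs_vacuum(relid))
            return false;       /* some other worker presumably got to it */

        /* Only now pay for forcing a fresh read of the pgstats data. */
        autovac_refresh_stats();    /* stand-in for the forced re-read */
        return stats_say_table_needs_vacuum(relid);
    }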

BTW, can you provide a test script that reproduces the problem you're
looking at?  The rest of us are kind of guessing at what's happening.

regards, tom lane




Re: EDB builds Postgres 13 with an obsolete ICU version

2020-08-11 Thread Jaime Casanova
On Mon, 3 Aug 2020 at 13:56, Daniel Verite  wrote:
>
>  Hi,
>
> As a follow-up to bug #16570 [1] and other previous discussions
> on the mailing-lists, I'm checking out PG13 beta for Windows
> from:
>  https://www.enterprisedb.com/postgresql-early-experience
> and it ships with the same obsolete ICU 53 that was used
> for PG 10,11,12.
> Besides not having the latest Unicode features and fixes, ICU 53
> ignores the BCP 47 tags syntax in collations used as examples
> in Postgres documentation, which leads to confusion and
> false bug reports.
> The current version is ICU 67.
>

Hi,

Sadly, that is managed by EDB and not by the community.

You can try 
https://www.2ndquadrant.com/en/resources/postgresql-installer-2ndquadrant/
which uses ICU-62.2, is not the latest but should allow you to follow
the examples in the documentation.

-- 
Jaime Casanova  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-08-11 Thread James Coleman
On Tue, Aug 11, 2020 at 11:14 AM Tom Lane  wrote:
>
> James Coleman  writes:
> > Why is a CIC in active index-building something we need to wait for?
> > Wouldn't it fall under a similar kind of logic to the other snapshot
> > types we can explicitly ignore? CIC can't be run in a manual
> > transaction, so the snapshot it holds won't be used to perform
> > arbitrary operations (i.e., the reason why a manual ANALYZE can't be
> > ignored).
>
> Expression indexes that call user-defined functions seem like a
> pretty serious risk factor for that argument.  Those are exactly
> the same expressions that ANALYZE will evaluate, as a result of
> which we judge it unsafe to ignore.  Why would CIC be different?

The comments for WaitForOlderSnapshots() don't discuss that problem at
all; as far as ANALYZE goes they only say:

* Manual ANALYZEs, however, can't be excluded because they
* might be within transactions that are going to do arbitrary operations
* later.

But nonetheless over in the thread Álvaro linked to we'd discussed
these issues already. Andres in [1] and [2] believed that:

> For the snapshot waits we can add a procarray flag
> (alongside PROCARRAY_VACUUM_FLAG) indicating that the backend is
> doing CIC, which WaitForOlderSnapshots() can then use to ignore those CICs,
> which is safe, because those transactions definitely don't insert into
> relations targeted by CIC. The change to WaitForOlderSnapshots() would
> just be to pass the new flag to GetCurrentVirtualXIDs, I think.

and

> What I was thinking of was a new flag, with a distinct value from
> PROC_IN_VACUUM. It'd currently just be specified in the
> GetCurrentVirtualXIDs() calls in WaitForOlderSnapshots(). That'd avoid
> needing to wait for other CICs on different relations. Since CIC is not
> permitted on system tables, and CIC doesn't do DML on normal tables, it
> seems fairly obviously correct to exclude other CICs.

In [3] I'd brought up that a function index can do arbitrary
operations (including, terribly, a query of another table), and Andres
(in [4]) noted that:

> Well, even if we consider this an actual problem, we could still use
> PROC_IN_CIC for non-expression non-partial indexes (index operator
> themselves better ensure this isn't a problem, or they're ridiculously
> broken already - they can get called during vacuum).

But he went on to describe how this is a problem with all expression
indexes (even if those expressions don't do dangerous things) because
of syscache lookups, but that ideally for expression indexes we'd have
a way to use a different (or more frequently taken) snapshot for the
purpose of computing those expressions. That's a significantly more
involved patch though.

So from what I understand, everything that I'd claimed in my previous
message still holds true for non-expression/non-partial indexes. Is
there something else I'm missing?

James

1: 
https://www.postgresql.org/message-id/20200325191935.jjhdg2zy5k7ath5v%40alap3.anarazel.de
2: 
https://www.postgresql.org/message-id/20200325195841.gq4hpl25t6pxv3gl%40alap3.anarazel.de
3: 
https://www.postgresql.org/message-id/CAAaqYe_fveT_tvKonVt1imujOURUUMrydGeaBGahqLLy9D-REw%40mail.gmail.com
4: 
https://www.postgresql.org/message-id/20200416221207.wmnzbxvjqqpazeob%40alap3.anarazel.de




Re: EDB builds Postgres 13 with an obsolete ICU version

2020-08-11 Thread Thomas Kellerer

Jaime Casanova wrote on 11.08.2020 at 20:39:

As a follow-up to bug #16570 [1] and other previous discussions
on the mailing-lists, I'm checking out PG13 beta for Windows
from:
  https://www.enterprisedb.com/postgresql-early-experience
and it ships with the same obsolete ICU 53 that was used
for PG 10,11,12.
Besides not having the latest Unicode features and fixes, ICU 53
ignores the BCP 47 tags syntax in collations used as examples
in Postgres documentation, which leads to confusion and
false bug reports.
The current version is ICU 67.



Sadly, that is managed by EDB and not by the community.

You can try 
https://www.2ndquadrant.com/en/resources/postgresql-installer-2ndquadrant/
which uses ICU-62.2, is not the latest but should allow you to follow
the examples in the documentation.



One of the reasons I prefer the EDB builds is that they provide a ZIP file
without the installer overhead.
Any chance 2ndQuadrant can supply something like that as well?

Thomas




Re: [PATCH] Covering SPGiST index

2020-08-11 Thread Pavel Borisov
Tue, 11 Aug 2020 at 12:11, Pavel Borisov :

> I added changes in documentation into the patch.
>
>
> --
> Best regards,
> Pavel Borisov
>
> Postgres Professional: http://postgrespro.com 
>


v5-0001-Covering-SP-GiST-index-support-for-INCLUDE-column.patch
Description: Binary data


Re: doc examples for pghandler

2020-08-11 Thread Mark Wong
On Mon, Jun 15, 2020 at 04:47:01PM +0900, Michael Paquier wrote:
> On Sun, Jun 14, 2020 at 08:45:17PM -0700, Mark Wong wrote:
> > Sounds good to me.  Something more like the attached patch?
> 
> That's the idea.  I have not gone in details into what you have here,
> but perhaps it would make sense to do a bit more and show how things
> are done in the context of a PL function called in a trigger?  Your
> patch removes from the docs a code block that outlined that.

Ah, right.  For the moment I've added some empty conditionals for
trigger and event trigger handling.

I've created a new entry in the commitfest app. [1]  I'll keep at it. :)

Regards,
Mark

[1] https://commitfest.postgresql.org/29/2678/

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/
diff --git a/doc/src/sgml/plhandler.sgml b/doc/src/sgml/plhandler.sgml
index e1b0af7a60..7b2c5624c0 100644
--- a/doc/src/sgml/plhandler.sgml
+++ b/doc/src/sgml/plhandler.sgml
@@ -96,62 +96,12 @@

 

-This is a template for a procedural-language handler written in C:
-
-#include "postgres.h"
-#include "executor/spi.h"
-#include "commands/trigger.h"
-#include "fmgr.h"
-#include "access/heapam.h"
-#include "utils/syscache.h"
-#include "catalog/pg_proc.h"
-#include "catalog/pg_type.h"
-
-PG_MODULE_MAGIC;
-
-PG_FUNCTION_INFO_V1(plsample_call_handler);
-
-Datum
-plsample_call_handler(PG_FUNCTION_ARGS)
-{
-    Datum       retval;
-
-    if (CALLED_AS_TRIGGER(fcinfo))
-    {
-        /*
-         * Called as a trigger function
-         */
-        TriggerData    *trigdata = (TriggerData *) fcinfo->context;
-
-        retval = ...
-    }
-    else
-    {
-        /*
-         * Called as a function
-         */
-
-        retval = ...
-    }
-
-    return retval;
-}
-
-Only a few thousand lines of code have to be added instead of the
-dots to complete the call handler.
-   
-
-   
-After having compiled the handler function into a loadable module
-(see ), the following commands then
-register the sample procedural language:
-
-CREATE FUNCTION plsample_call_handler() RETURNS language_handler
-AS 'filename'
-LANGUAGE C;
-CREATE LANGUAGE plsample
-HANDLER plsample_call_handler;
-
+A template for a procedural-language handler written as a C extension is
+provided in src/test/modules/plsample.  This is a
+working sample demonstrating one way to create a procedural-language
+handler, process parameters, and return a value.  A few thousand lines of
+additional code may have to be added to complete a fully functional
+handler.

 

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 29de73c060..95144d8d7c 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -9,6 +9,7 @@ SUBDIRS = \
 		  commit_ts \
 		  dummy_index_am \
 		  dummy_seclabel \
+		  plsample \
 		  snapshot_too_old \
 		  test_bloomfilter \
 		  test_ddl_deparse \
diff --git a/src/test/modules/plsample/Makefile b/src/test/modules/plsample/Makefile
new file mode 100644
index 00..757b47c785
--- /dev/null
+++ b/src/test/modules/plsample/Makefile
@@ -0,0 +1,20 @@
+# src/test/modules/plsample/Makefile
+
+PGFILEDESC = "PL/Sample - procedural language"
+
+REGRESS = create_pl create_func select_func
+
+EXTENSION = plsample
+EXTVERSION = 0.1
+
+MODULE_big = plsample
+
+OBJS = plsample.o
+
+DATA = plsample.control plsample--0.1.sql
+
+plsample.o: plsample.c
+
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
diff --git a/src/test/modules/plsample/README b/src/test/modules/plsample/README
new file mode 100644
index 00..7ee213700b
--- /dev/null
+++ b/src/test/modules/plsample/README
@@ -0,0 +1,3 @@
+plsample is an example procedural-language handler.  It is a simple functional
+template that demonstrates some of the things that need to be done in order to
+build a fully functional procedural-language handler.
diff --git a/src/test/modules/plsample/expected/create_func.out b/src/test/modules/plsample/expected/create_func.out
new file mode 100644
index 00..df2b915a97
--- /dev/null
+++ b/src/test/modules/plsample/expected/create_func.out
@@ -0,0 +1,5 @@
+CREATE FUNCTION plsample_func(a1 NUMERIC, a2 TEXT, a3 INTEGER[])
+RETURNS TEXT
+AS $$
+  This is function's source text.
+$$ LANGUAGE plsample;
diff --git a/src/test/modules/plsample/expected/create_pl.out b/src/test/modules/plsample/expected/create_pl.out
new file mode 100644
index 00..5365391284
--- /dev/null
+++ b/src/test/modules/plsample/expected/create_pl.out
@@ -0,0 +1,8 @@
+CREATE FUNCTION plsample_call_handler()
+RETURNS language_handler
+AS '$libdir/plsample'
+LANGUAGE C;
+CREATE LANGUAGE plsample
+HANDLER plsample_call_handler;
+COMMENT ON LANGUAGE plsample
+IS 'PL/Sample procedural language';
diff --git a/src/test/modules/plsample/expected/select_func.out b/src/test/modules/plsample/expected/select_func.out
new file mode 100644
index 0

Re: Inconsistent behavior of smart shutdown handling for queries with and without parallel workers

2020-08-11 Thread Tom Lane
Bharath Rupireddy  writes:
> In case of smart shutdown postmaster receives SIGTERM from the pg_ctl,
> it "disallows new connections, but lets existing sessions end their
> work normally". Which means that it doesn't abort any ongoing txns in
> any of the sessions and it lets the sessions to exit(on their own) and
> then the postmaster is shut down. Looks like this behavior is true
> only if the sessions are executing non-parallel queries. Parallel
> queries are getting aborted, see [1].

Hm.  I kind of wonder why we're killing *anything* early in the
smart-shutdown case.  postmaster.c has (in pmdie()):

/* autovac workers are told to shut down immediately */
/* and bgworkers too; does this need tweaking? */
SignalSomeChildren(SIGTERM,
                   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
/* and the autovac launcher too */
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGTERM);
/* and the bgwriter too */
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);

and it seems like every one of those actions is pretty damfool if we want
the existing sessions to continue to have normal performance, quite aside
from whether we're breaking parallel queries.  Indeed, to the extent that
this is hurting performance, we can expect the existing sessions to take
*longer* to finish, making this pretty counterproductive.

So I'm thinking we should move all of these actions to happen only after
the regular children are all gone.

regards, tom lane




Re: [BUG] Error in BRIN summarization

2020-08-11 Thread Alvaro Herrera
On 2020-Jul-30, Anastasia Lubennikova wrote:

> While testing this fix, Alexander Lakhin spotted another problem. I
> simplified  the test case to this:

Ah, good catch.  I think a cleaner way to fix this problem is to just
consider the range as not summarized and return NULL from there, as in
the attached patch.  Running your test case with a telltale WARNING
added at that point, it's clear that it's being hit.

By returning NULL, we're forcing the caller to scan the heap, which is
not great.  But note that if you retry, and your VACUUM hasn't run yet
by the time we go through the loop again, the same thing would happen.
So it seems to me a good enough answer.

A much more troubling thought is what happens if the range is
desummarized, then the index item is used for the summary of a different
range.  Then the index might end up returning corrupt results.

> At first, I tried to fix it by holding the lock on revmap->rm_currBuf until
> we locked the regular page, but it causes a deadlock with brinsummarize(),
> It can be easily reproduced with the same test as above.
> Is there any rule about the order of locking revmap and regular pages in
> brin? I haven't found anything in README.

Umm, I thought that stuff was in the README, but it seems I didn't add
it there.  I think I had a .org file with my notes on that ... must be
in an older laptop disk, because it's not in my worktree for that.  I'll
see if I can fish it out.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index e8b8308f82..35746714a7 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -282,10 +282,17 @@ brinGetTupleForHeapBlock(BrinRevmap *revmap, BlockNumber heapBlk,
 		/* If we land on a revmap page, start over */
 		if (BRIN_IS_REGULAR_PAGE(page))
 		{
+			/*
+			 * If the offset number is greater than what's in the page, it's
+			 * possible that the range was desummarized concurrently. Just
+			 * return NULL to handle that case.
+			 */
 			if (*off > PageGetMaxOffsetNumber(page))
-				ereport(ERROR,
-						(errcode(ERRCODE_INDEX_CORRUPTED),
-						 errmsg_internal("corrupted BRIN index: inconsistent range map")));
+			{
+				LockBuffer(*buf, BUFFER_LOCK_UNLOCK);
+				return NULL;
+			}
+
 			lp = PageGetItemId(page, *off);
 			if (ItemIdIsUsed(lp))
 			{


Re: [BUG] Error in BRIN summarization

2020-08-11 Thread Alvaro Herrera
On 2020-Jul-23, Anastasia Lubennikova wrote:

> This error is caused by the problem with root_offsets array bounds. It
> occurs if a new HOT tuple was inserted after we've collected root_offsets,
> and thus we don't have root_offset for tuple's offnum. Concurrent insertions
> are possible, because brin_summarize_new_values() only holds ShareUpdateLock
> on table and no lock (only pin) on the page.

Excellent detective work, thanks.

> The draft fix is in the attachments. It saves root_offsets_size and checks
> that we only access valid fields.

I think this is more complicated than necessary.  It seems easier to
solve this problem by just checking whether the given root pointer is
set to InvalidOffsetNumber, which is already done in the existing coding
of heap_get_root_tuples (only they spell it "0" rather than
InvalidOffsetNumber, which I propose to change).  AFAIR this should only
happen in the 'anyvisible' mode, so I added that in an assert.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 267a6ee25a..08252bc0c0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1324,6 +1324,12 @@ heapam_index_build_range_scan(Relation heapRelation,
 		 * buffer continuously while visiting the page, so no pruning
 		 * operation can occur either.
 		 *
+		 * In the case where we don't hold ShareUpdateExclusiveLock on the
+		 * table, it's possible for some HOT tuples to appear that we didn't
+		 * know about when we first read the page.  To handle that case, we
+		 * re-obtain the list of root offsets when a HOT tuple points to a
+		 * root item that we don't know about.
+		 *
 		 * Also, although our opinions about tuple liveness could change while
 		 * we scan the page (due to concurrent transaction commits/aborts),
 		 * the chain root locations won't, so this info doesn't need to be
@@ -1625,6 +1631,23 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 			offnum = ItemPointerGetOffsetNumber(&heapTuple->t_self);
 
+			/*
+			 * If a HOT tuple points to a root that we don't know
+			 * about, obtain root items afresh.  If that still fails,
+			 * report it as corruption.
+			 */
+			if (root_offsets[offnum - 1] == InvalidOffsetNumber)
+			{
+Page	page = BufferGetPage(hscan->rs_cbuf);
+
+/* This can only happen in 'anyvisible' mode */
+Assert(anyvisible);
+
+LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+heap_get_root_tuples(page, root_offsets);
+LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
+			}
+
 			if (!OffsetNumberIsValid(root_offsets[offnum - 1]))
 ereport(ERROR,
 		(errcode(ERRCODE_DATA_CORRUPTED),
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 256df4de10..7e3d44dfd6 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -732,7 +732,7 @@ heap_page_prune_execute(Buffer buffer,
  * root_offsets[k - 1] = j.
  *
  * The passed-in root_offsets array must have MaxHeapTuplesPerPage entries.
- * We zero out all unused entries.
+ * Unused entries are filled with InvalidOffsetNumber (zero).
  *
  * The function must be called with at least share lock on the buffer, to
  * prevent concurrent prune operations.
@@ -747,7 +747,8 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
 	OffsetNumber offnum,
 maxoff;
 
-	MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));
+	MemSet(root_offsets, InvalidOffsetNumber,
+		   MaxHeapTuplesPerPage * sizeof(OffsetNumber));
 
 	maxoff = PageGetMaxOffsetNumber(page);
 	for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum))


Re: Improving connection scalability: GetSnapshotData()

2020-08-11 Thread Andres Freund
Hi,

On 2020-07-29 19:20:04 +1200, Thomas Munro wrote:
> On Wed, Jul 29, 2020 at 6:15 PM Thomas Munro  wrote:
> > +static inline FullTransactionId
> > +FullXidViaRelative(FullTransactionId rel, TransactionId xid)
> >
> > I'm struggling to find a better word for this than "relative".
> 
> The best I've got is "anchor" xid.  It is an xid that is known to
> limit nextFullXid's range while the receiving function runs.

Thinking about it, I think that relative is a good descriptor. It's just
that 'via' is weird. How about: FullXidRelativeTo?

Greetings,

Andres Freund
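
For readers following along, a minimal sketch of what such a "relative to"
conversion could look like, purely as an illustration and not the committed
code; it assumes the anchor full xid is within 2^31 transactions of the
passed-in xid, and it assumes the FullTransactionId helper macros from
access/transam.h:

    /* Sketch only: widen a 32-bit xid using an anchor FullTransactionId. */
    static inline FullTransactionId
    FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
    {
        uint64          rel_val = U64FromFullTransactionId(rel);
        TransactionId   rel_xid = XidFromFullTransactionId(rel);

        Assert(TransactionIdIsValid(xid));

        /* a signed 32-bit difference resolves the epoch relative to the anchor */
        return FullTransactionIdFromU64(rel_val + (int32) (xid - rel_xid));
    }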




Re: Improving connection scalability: GetSnapshotData()

2020-08-11 Thread Thomas Munro
On Wed, Aug 12, 2020 at 12:19 PM Andres Freund  wrote:
> On 2020-07-29 19:20:04 +1200, Thomas Munro wrote:
> > On Wed, Jul 29, 2020 at 6:15 PM Thomas Munro  wrote:
> > > +static inline FullTransactionId
> > > +FullXidViaRelative(FullTransactionId rel, TransactionId xid)
> > >
> > > I'm struggling to find a better word for this than "relative".
> >
> > The best I've got is "anchor" xid.  It is an xid that is known to
> > limit nextFullXid's range while the receiving function runs.
>
> Thinking about it, I think that relative is a good descriptor. It's just
> that 'via' is weird. How about: FullXidRelativeTo?

WFM.




Re: Can I test Extended Query in core test framework

2020-08-11 Thread Andy Fan
Thank you Ashutosh for your reply.

On Tue, Aug 11, 2020 at 9:06 PM Ashutosh Bapat 
wrote:

> You could run PREPARE and EXECUTE as SQL commands from psql. Please
> take a look at the documentation of those two commands. I haven't
> looked at TAP infrastructure, but you could open a psql session to a
> running server and send an arbitrary number of SQL queries through it.
>
>
PREPARE & EXECUTE don't go through the extended query path; they are still
handled by exec_simple_query.  What I did is hack some of the
exec_bind_message logic [1], which is why I want to test extended queries.

[1]
https://www.postgresql.org/message-id/CAKU4AWqvwmo=nlpga_ohxb4f+u4ts1_3yry9m6xtjlt9dkh...@mail.gmail.com


-- 
Best Regards
Andy Fan
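
As an aside, for anyone wanting to poke at the extended protocol mentioned
above without psql: libpq's PQprepare()/PQexecPrepared() go through
Parse/Bind/Execute rather than a simple query, so a tiny C client is enough
to drive that path.  A rough sketch, where the connection string, statement
name and query are placeholders:

    #include <stdio.h>
    #include "libpq-fe.h"

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("dbname=postgres");
        const char *values[1] = {"41"};
        PGresult   *res;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        /* sends a Parse message */
        res = PQprepare(conn, "p1", "SELECT $1::int + 1", 1, NULL);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "prepare failed: %s", PQerrorMessage(conn));
        PQclear(res);

        /* sends Bind/Describe/Execute messages */
        res = PQexecPrepared(conn, "p1", 1, values, NULL, NULL, 0);
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            printf("result: %s\n", PQgetvalue(res, 0, 0));
        PQclear(res);

        PQfinish(conn);
        return 0;
    }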


Re: public schema default ACL

2020-08-11 Thread Noah Misch
On Mon, Aug 10, 2020 at 10:21:06AM +0200, Magnus Hagander wrote:
> On Thu, Aug 6, 2020 at 3:34 PM Stephen Frost  wrote:
> > Not sure how much it happens in these days of docker and containers, but
> > certainly it was common at one point to have home directories
> > automatically created on login.  There's one particularly large
> > difference here though- home directories go in /home/ (or whatever) and
> > have a specific namespace, which our schemas don't.  That is to say, if
> > someone has CREATE rights on the database they can create an 'sfrost'
> > schema that they own, dump whatever they want into it, and then it's in
> > my default search_path when I log in, even if this feature to
> > auto-create role schemas exists.  Sure, you could argue that in the unix
> > case, that would have been an 'admin' user to be able to make a
> > directory in /home/, but we haven't got any other way to make
> > 'directories', so perhaps the analogy just doesn't fit close enough.
> 
> Yeah, the fact that a owner can just create a schema called "postgres" and
> thereby sticking things in the search path of postgres is not great. And
> that's not fixed by changing how "public" works, per any of the suggested
> methods I think. Only the database owner can do mean things there, but
> database owner != superuser (at least in theory).

https://www.postgresql.org/docs/devel/ddl-schemas.html#DDL-SCHEMAS-PATTERNS
does document the power of untrusted database owners.  Unfortunately, I've not
seen or thought of a specification of database owner powers that included
enough power to be useful yet not enough power to cause mischief.




Re: Allows Extend Protocol support CURSOR_OPT_HOLD with prepared stmt.

2020-08-11 Thread Andy Fan
On Mon, Jul 27, 2020 at 11:57 AM Andy Fan  wrote:

>
>> 2. Currently I want to add a new GUC parameter, if set it to true, server
>> will
>> create a holdable portal, or else nothing changed.  Then let the user set
>> it to true in the above case and reset it to false afterward.  Is there
>> any issue
>> with this method?
>>
>>
> I forget to say in this case, the user has to drop the holdable
> portal  explicitly.
>
>
>
After some days of hacking and testing, I found more issues with supporting
the following case:

rs = prepared_stmt.execute(1);
while (rs.next())
{
    // do something with the result (mainly DML)
    conn.commit();   // or conn.rollback();

    // commit / rollback to avoid holding locks for a long time
}

The holdable portal is still dropped when the transaction is aborted or
rolled back, because HoldPortal hasn't happened before that point, and
"abort/rollback" means something went wrong, so it is risky to hold the
portal again.  What I did to fix this is to hold the portal just after a
holdable portal is defined.  However, that's bad for performance:
originally we only needed to scan the result when it was requested, but now
we have to materialize all the results up front and then fetch the data one
by one.

The use case above looks reasonable to me; I would say this is a kind of
"tech debt" in Postgres.  To support it completely, it looks like we would
have to decouple snapshot/locking management from the transaction?  If so,
that would be a huge change.  I wonder if anybody has tried to solve this
before, and how far they got.

-- 
Best Regards
Andy Fan


Re: Can I test Extended Query in core test framework

2020-08-11 Thread Andy Fan
On Tue, Aug 11, 2020 at 11:22 PM Tom Lane  wrote:

> Andy Fan  writes:
> > I want to write some test cases with extended query in core test system.
>
> Why?  (That is, what is it you need to test exactly?)
>
>
Thanks for your attention.  The background is that I hacked
exec_bind_message [1], and I want to add some test cases so that the logic
can be tested automatically in the core system.  I can't tell whether the
logic is straightforward enough that it doesn't deserve a test in practice.

> psql has no ability to issue extended queries AFAIR, so the normal
> regression test scripts can't exercise this.  We haven't built anything
> for it in the TAP infrastructure either.  We do have test coverage
> via pgbench and ecpg, though I concede that's pretty indirect.
>
> I recall someone (Andres, possibly) speculating about building a tool
> specifically to exercise low-level protocol issues, but that hasn't
> been done either.
>

Thanks for this information, and thanks Andres for the idea and practice.


>
> None of these are necessarily germane to any particular test requirement,
> which is why I'm wondering.  The JDBC fragment you show seems like it's
> something that should be tested by, well, JDBC.  What's interesting about
> it for any other client?
>
>
The main purpose is that I want to test it in core without involving other
infrastructure.  I have added a Python script to do that for now, so the
issue is not blocking, but what I am working on [1] is still challenging
for me :(

[1]
https://www.postgresql.org/message-id/CAKU4AWqvwmo=nlpga_ohxb4f+u4ts1_3yry9m6xtjlt9dkh...@mail.gmail.com


-- 
Best Regards
Andy Fan


RE: Can I test Extended Query in core test framework

2020-08-11 Thread tsunakawa.ta...@fujitsu.com
Tatsuo Ishii san, a committer, proposed this tool for testing the extended
query protocol.  Can it be included in Postgres core?

A tool to test programs by issuing frontend/backend protocol messages
https://github.com/tatsuo-ishii/pgproto


Regards
Takayuki Tsunakawa



Re: Parallel copy

2020-08-11 Thread Greg Nancarrow
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, failed

Hi,

I don't claim to yet understand all of the Postgres internals that this patch 
is updating and interacting with, so I'm still testing and debugging portions 
of this patch, but would like to give feedback on what I've noticed so far.
I have done some ad-hoc testing of the patch using parallel copies from 
text/csv/binary files and have not yet struck any execution problems other than 
some option validation and associated error messages on boundary cases.

One general question that I have: is there a user benefit (over the normal 
non-parallel COPY) to allowing "COPY ... FROM ... WITH (PARALLEL 1)"?


My following comments are broken down by patch:

(1) v2-0001-Copy-code-readjustment-to-support-parallel-copy.patch

(i) Whilst I can't entirely blame these patches for it (as they are following 
what is already there), I can't help noticing the use of numerous macros in 
src/backend/commands/copy.c which paste in multiple lines of code in various 
places.
It's getting a little out of hand.  Surely the majority of these would be
better as inline functions instead?
Perhaps it hasn't been done because too many parameters would need to be
passed - thoughts?
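
To make the suggestion concrete, here is a generic before/after sketch; the
names and fields are invented for illustration, not taken from copy.c:

    /* macro style: pastes multiple statements into every call site */
    #define CONVERT_LINE_BUF(cstate) \
    do { \
        if ((cstate)->need_transcoding) \
            (cstate)->line_buf_converted = true; \
    } while (0)

    /* inline-function style: type-checked and debuggable, same generated code */
    static inline void
    ConvertLineBuf(CopyState cstate)
    {
        if (cstate->need_transcoding)
            cstate->line_buf_converted = true;
    }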


(2) v2-0002-Framework-for-leader-worker-in-parallel-copy.patch

(i) minor point: there are some tabbing/spacing issues in this patch (and the 
other patches), affecting alignment.
e.g. mixed tabs/spaces and misalignment in PARALLEL_COPY_KEY_xxx definitions

(ii)

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. This value should be mode of
+ * RINGSIZE, as wrap around cases is currently not handled while selecting the
+ * WORKER_CHUNK_COUNT by the worker.
+ */
+#define WORKER_CHUNK_COUNT 50


"This value should be mode of RINGSIZE ..."

-> typo: mode  (mod?  should evenly divide into RINGSIZE?)


(iii)
+ *using pg_atomic_compare_exchange_u32, worker will change the sate to

->typo: sate  (should be "state")


(iv)

+errmsg("parallel option supported only for copy from"),

-> suggest change to:   errmsg("parallel option is supported only for COPY FROM"),

(v)

+   errno = 0;      /* To distinguish success/failure after call */
+   val = strtol(str, &endptr, 10);
+
+   /* Check for various possible errors */
+   if ((errno == ERANGE && (val == LONG_MAX || val == LONG_MIN))
+       || (errno != 0 && val == 0) ||
+       *endptr)
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("improper use of argument to option \"%s\"",
+                       defel->defname),
+                parser_errposition(pstate, defel->location)));
+
+   if (endptr == str)
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("no digits were found in argument to option \"%s\"",
+                       defel->defname),
+                parser_errposition(pstate, defel->location)));
+
+   cstate->nworkers = (int) val;
+
+   if (cstate->nworkers <= 0)
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("argument to option \"%s\" must be a positive integer greater than zero",
+                       defel->defname),
+                parser_errposition(pstate, defel->location)));


I think this validation code needs to be improved, including the error messages 
(e.g. when can a "positive integer" NOT be greater than zero?)

There is some overlap in the "no digits were found" case between the two
conditions above, depending, for example, on whether the argument is quoted.
Also, "improper use of argument to option" sounds a bit odd and vague to me.
Finally, not range-checking before casting long to int can allow
out-of-range int values, as in the following case:

test=# copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2147483648');
ERROR:  argument to option "parallel" must be a positive integer greater than 
zero
LINE 1: copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2...
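
For what it's worth, one possible shape for the validation, shown only as a
sketch reusing the variables from the quoted hunk (error wording and limits
to taste): check endptr separately from the errno-based range error,
range-check against INT_MAX (from limits.h) before the cast, and reject
non-positive values with a single message:

    errno = 0;
    val = strtol(str, &endptr, 10);

    if (endptr == str || *endptr != '\0')
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("value for option \"%s\" must be an integer",
                        defel->defname),
                 parser_errposition(pstate, defel->location)));

    if (errno == ERANGE || val < 1 || val > INT_MAX)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("value for option \"%s\" must be between 1 and %d",
                        defel->defname, INT_MAX),
                 parser_errposition(pstate, defel->location)));

    cstate->nworkers = (int) val;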
   

Re: 回复:how to create index concurrently on partitioned table

2020-08-11 Thread Michael Paquier
On Sun, Aug 09, 2020 at 06:44:23PM -0500, Justin Pryzby wrote:
> On Sun, Aug 09, 2020 at 02:00:09PM +0900, Michael Paquier wrote:
>> For now, I would recommend to focus first on 0001 to add support for
>> partitioned tables and indexes to REINDEX.  CIC is much more
>> complicated btw, but I am not entering in the details now.
>> 
>> +   /* Avoid erroring out */
>> if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>> {
>> This comment does not help, and actually this becomes incorrect as
>> reindex for this relkind becomes supported once 0001 is done.
> 
> I made a minimal change to avoid forgetting to eventually change
> that part.

Why not change it then?  We already filter out per relkind in all
the code paths calling reindex_relation(), be it in indexcmds.c for
schema-level reindex or even tablecmds.c, so I have switched this part
to an elog().

>> - We should *not* handle directly partitioned index and/or table in
>> ReindexRelationConcurrently() to not complicate the logic where we
>> gather all the indexes of a table/matview.  So I think that the list
>> of partition indexes/tables to work on should be built directly in
>> ReindexIndex() and ReindexTable(), and then this should call the
>> second part of ReindexMultipleTables() refactored in the previous
>> point.
> 
> I think I addressed these mostly as you intended.

Mostly.  I have been hacking on this patch, and basically rewrote it
as the attached.  The handling of the memory context used to keep the
list of partitions intact across transactions was rather clunky: the
context was not reset when we are done, and we would call more APIs
than necessary while switching to it, like find_all_inheritors() which
could do many more allocations.  I have fixed that by minimizing the
areas where the private context is used, switching to it only when
saving a new OID in the list of partitions, or a session lock (see
below for this part).
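
In code form, the pattern amounts to something like this (a sketch with
invented variable names): do the catalog lookups in the caller's short-lived
context, and switch to the private context only around the allocation that
has to survive across transactions:

    ListCell   *lc;

    foreach(lc, partitions_from_catalog)    /* e.g. from find_all_inheritors() */
    {
        Oid             partoid = lfirst_oid(lc);
        MemoryContext   oldcontext;

        /* only the list cell that must outlive this transaction goes in here */
        oldcontext = MemoryContextSwitchTo(private_context);
        partition_list = lappend_oid(partition_list, partoid);
        MemoryContextSwitchTo(oldcontext);
    }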

While on it, I found that the test coverage was not enough, so I have
extended the set of tests to make sure any concurrent and
non-concurrent operation for partitioned tables and indexes change the
correct set of relfilenodes for each operation.  I have written some
custom functions to minimize the duplication (the whole thing cannot
be grouped as those commands cannot run in a transaction block).

Speaking of which, the patch missed that REINDEX INDEX/TABLE should
not run in a transaction block when working on a partitioned
relation.  And the documentation needs to be clear about the
limitation of each operation, so I have written more about all that.
The patch also has commented out areas with slashes or such, and I
have added some elog() and some asserts to make sure that we don't
cross any areas that should not work with partitioned relations.

While hacking on this patch, I have found an old bug in the REINDEX
logic: we build a list of relations to reindex in
ReindexMultipleTables() for schema and database reindexes, but it
happens that we don't recheck whether the relations listed actually exist,
so dropping a relation during a large reindex can cause spurious failures
because of relations that cannot be found anymore.  In
the case of this thread, the problem is different though (the proposed
patch was full of holes regarding that) and we need to use session
locks on the parent *table* partitions (not the indexes) to avoid any
issues within the first transaction building the list of relations to 
work on, similarly to REINDEX CONCURRENTLY.  So I fixed this problem
this way. For the schema and database cases, I think that we would
need to do something similar to VACUUM, aka have an extra code path to
skip relations not defined.  I'll leave that for another thread.

One last thing.  I think that the patch is in a rather good shape, but
there is one error message I am not happy with when running some
commands in a transaction block.  Say, this sequence: 
CREATE TABLE parent_tab (id int) PARTITION BY RANGE (id);
CREATE INDEX parent_index ON parent_tab (id);
BEGIN;
REINDEX INDEX parent_index; -- error
ERROR:  25001: REINDEX INDEX cannot run inside a transaction block
LOCATION:  PreventInTransactionBlock, xact.c:3386

This error can be confusing, because we don't tell directly that the
relation involved here is partitioned, and REINDEX INDEX/TABLE are
fine when doing their stuff on non-partitions.  For other code paths,
we have leveraged such errors to use the grammar specific to
partitions, for example "CREATE TABLE .. PARTITION OF" or such as
these don't cause translation issues, but we don't have a specific
syntax of REINDEX for partitioned relations, and I don't think that we
need more grammar just for that.  The simplest idea I have here is to
just use an error callback to set an errcontext(), saying roughly:
"while reindexing partitioned table/index %s" while we go through
PreventInTransactionBlock().  I have done nothing about that yet but
adding an errcallback is simple enough.  Perhaps somebody ha

Re: SyncRepLock acquired exclusively in default configuration

2020-08-11 Thread Asim Praveen


> On 11-Aug-2020, at 8:57 PM, Robert Haas  wrote:
> 
> I think this gets to the root of the issue. If we check the flag
> without a lock, we might see a slightly stale value. But, considering
> that there's no particular amount of time within which configuration
> changes are guaranteed to take effect, maybe that's OK. However, there
> is one potential gotcha here: if the walsender declares the standby to
> be synchronous, a user can see that, right? So maybe there's this
> problem: a user sees that the standby is synchronous and expects a
> transaction committing afterward to provoke a wait, but really it
> doesn't. Now the user is unhappy, feeling that the system didn't
> perform according to expectations.

Yes, pg_stat_replication reports a standby as in sync as soon as the walsender
updates the priority of the standby to something other than 0.

The potential gotcha referred above doesn’t seem too severe.  What is the 
likelihood of someone setting synchronous_standby_names GUC with either “*” or 
a standby name and then immediately promoting that standby?  If the standby is 
promoted before the checkpointer on master gets a chance to update 
sync_standbys_defined in shared memory, commits made during this interval on 
master may not make it to standby.  Upon promotion, those commits may be lost.

Asim
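
For illustration, the lock-free check under discussion might look roughly
like the following in SyncRepWaitForLSN() (a sketch; the actual fix may
differ), accepting that the value can lag a change to
synchronous_standby_names until the checkpointer updates shared memory:

    /* read the flag without acquiring SyncRepLock; a stale value is tolerated */
    if (!SyncRepRequested() ||
        !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
        return;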

Re: 回复:how to create index concurrently on partitioned table

2020-08-11 Thread Justin Pryzby
Thanks for helping with this.

On Wed, Aug 12, 2020 at 01:54:38PM +0900, Michael Paquier wrote:
> +++ b/src/backend/catalog/index.c
> @@ -3661,20 +3662,12 @@ reindex_relation(Oid relid, int flags, int options)
> + elog(ERROR, "unsupported relation kind for relation \"%s\"",
> +  RelationGetRelationName(rel));

I guess it should show the relkind(%c) in the message, like these:

src/backend/commands/tablecmds.c:   elog(ERROR, "unexpected relkind: %d", (int) relkind);
src/backend/tcop/utility.c:         elog(ERROR, "unexpected relkind \"%c\" on partition \"%s\"",

ISTM reindex_index is missing that, too:

8b08f7d4820fd7a8ef6152a9dd8c6e3cb01e5f99
+   if (iRel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+   elog(ERROR, "unsupported relation kind for index \"%s\"",
+RelationGetRelationName(iRel));
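
i.e. something along these lines (illustrative):

    elog(ERROR, "unsupported relation kind %d for index \"%s\"",
         (int) iRel->rd_rel->relkind, RelationGetRelationName(iRel));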


> diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
> @@ -259,8 +263,12 @@ REINDEX [ (  class="parameter">option [, ...] ) ] { IN
>
>  
>
> -   Reindexing partitioned tables or partitioned indexes is not supported.
> -   Each individual partition can be reindexed separately instead.
> +   Reindexing partitioned indexes or partitioned tables is supported
> +   with respectively REINDEX INDEX or
> +   REINDEX TABLE.

Should say "..with REINDEX INDEX or REINDEX TABLE, respectively".

> + Each partition of the partitioned
> +   relation defined is rebuilt in its own transaction.

=> Each partition of the specified partitioned relation is reindexed in a
separate transaction.

-- 
Justin




Re: Handing off SLRU fsyncs to the checkpointer

2020-08-11 Thread Thomas Munro
On Sat, Aug 8, 2020 at 2:44 AM Robert Haas  wrote:
> On Wed, Aug 5, 2020 at 2:01 AM Thomas Munro  wrote:
> > * Master is around 11% faster than last week before commit c5315f4f
> > "Cache smgrnblocks() results in recovery."
> > * This patch gives a similar speedup, bringing the total to around 25%
> > faster than last week (the time is ~20% less, the WAL processing speed
> > is ~1.25x).
>
> Dang, that's pretty nice, especially for the relatively small amount
> of code that it seems to require.

Yeah, the combined effect of these two patches is better than I
expected.  To be clear though, I was only measuring the time between
the "redo starts at ..." and "redo done at ..." messages, since I've
been staring at the main recovery code, but there are also some more
fsyncs before (SyncDataDirectory()) and after (RemoveOldXlogFiles())
that are unaffected.  I think it's probably possible to do something
about those too, but that's another topic.

I spotted a small problem: if the transaction IDs wrap all the way
around between checkpoints, then you might have cancelled requests for
a removed SLRU segment from the previous epoch, so we'd better
uncancel them if we see that.  That's a one line fix, done in the
attached.  I also adjusted the commit message to be a little clearer
(this work deferment/collapsing scheme works in crash recovery too,
not just when there is a checkpointer process to hand the work to).
From 63235aca6b2525716f8f29caf07e0a0e6a26965e Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 4 Aug 2020 17:57:18 +1200
Subject: [PATCH] Defer flushing of SLRU files.

Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages, leading to an I/O stall in user backends and recovery.
Collapse requests for the same file into a single system call as part of
the next checkpoint, as we do for relation files.

Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c  |  13 +++-
 src/backend/access/transam/commit_ts.c |  12 ++-
 src/backend/access/transam/multixact.c |  24 +-
 src/backend/access/transam/slru.c  | 101 +++--
 src/backend/access/transam/subtrans.c  |   4 +-
 src/backend/commands/async.c   |   5 +-
 src/backend/storage/lmgr/predicate.c   |   4 +-
 src/backend/storage/sync/sync.c|  28 ++-
 src/include/access/clog.h  |   3 +
 src/include/access/commit_ts.h |   3 +
 src/include/access/multixact.h |   4 +
 src/include/access/slru.h  |  12 ++-
 src/include/storage/sync.h |   7 +-
 13 files changed, 174 insertions(+), 46 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f3da40ae01..0c5b7a525e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -692,7 +693,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -1034,3 +1036,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 903280ae92..b4edbdb4e3 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
   CommitTsSLRULock, "pg_commit_ts",
-  LWTRANCHE_COMMITTS_BUFFER);
+  LWTRANCHE_COMMITTS_BUFFER,
+  SYNC_HANDLER_COMMIT_TS);
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 	 sizeof(CommitTimestampShared),
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 475f5ed861..27ae2edbdc 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
 	SimpleLruInit(MultiXactOffsetCtl,
   "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
   MultiXactOffse

Re: SyncRepLock acquired exclusively in default configuration

2020-08-11 Thread Masahiko Sawada
On Wed, 12 Aug 2020 at 14:06, Asim Praveen  wrote:
>
>
>
> > On 11-Aug-2020, at 8:57 PM, Robert Haas  wrote:
> >
> > I think this gets to the root of the issue. If we check the flag
> > without a lock, we might see a slightly stale value. But, considering
> > that there's no particular amount of time within which configuration
> > changes are guaranteed to take effect, maybe that's OK. However, there
> > is one potential gotcha here: if the walsender declares the standby to
> > be synchronous, a user can see that, right? So maybe there's this
> > problem: a user sees that the standby is synchronous and expects a
> > transaction committing afterward to provoke a wait, but really it
> > doesn't. Now the user is unhappy, feeling that the system didn't
> > perform according to expectations.
>
> Yes, pg_stat_replication reports a standby in sync as soon as walsender 
> updates priority of the standby to something other than 0.
>
> The potential gotcha referred above doesn’t seem too severe.  What is the 
> likelihood of someone setting synchronous_standby_names GUC with either “*” 
> or a standby name and then immediately promoting that standby?  If the 
> standby is promoted before the checkpointer on master gets a chance to update 
> sync_standbys_defined in shared memory, commits made during this interval on 
> master may not make it to standby.  Upon promotion, those commits may be lost.

I think that if the standby is far behind the primary and the primary
crashes, the likelihood of losing commits gets higher.  The user can see
that the standby has become a synchronous standby via pg_stat_replication,
but a commit completes without waiting because the checkpointer hasn't
updated sync_standbys_defined yet.  If the primary crashes before the
standby catches up and the user fails over, the committed transaction will
be lost, even though the user expects that the commit has been replicated
to the standby synchronously.  And this can happen even without the patch,
right?

Regards,

-- 
Masahiko Sawada              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services